Post mortem: 9 Jul - 12 Jul '21 nim-waku prod fleet incident

haelius · July 15, 2021, 5:18am

Background

On Friday, 9 Jul 2021, the state of the wakuv2.prod fleet was as follows:

Nodes were running nim-waku release v0.4
nim-waku chat2 clients reported issues with accessing store functionality when attempting to connect to prod. Issue reported here. Importantly, the js-waku client did not have any similar issues.
All prod fleet nodes reported violations of the GossipSub backoff period that other peers have to respect before attempting a reconnection. The effect was that prod nodes failed to connect to each other and form a mesh.
There were indications of possible SQLite DB corruption (logs indicated message storage failure, select * queries returned unexpected results). This has not been fully investigated yet.

As far as can be established, the above-mentioned issues were present on the prod fleet for at least a week.
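To make the backoff violations mentioned above more concrete, here is a minimal sketch of how a node might decide that a peer reconnected too early. It is not taken from nim-waku or nim-libp2p: the `BackoffTracker` class, its method names, and the 60-second backoff value are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BackoffTracker:
    """Hypothetical helper: remembers when each peer was dropped from the mesh."""

    # Assumed illustrative backoff period, not read from any real configuration.
    backoff_seconds: float = 60.0
    pruned_at: Dict[str, float] = field(default_factory=dict, init=False)

    def record_prune(self, peer_id: str, now: float) -> None:
        # Remember the time at which the peer was removed from the mesh.
        self.pruned_at[peer_id] = now

    def is_violation(self, peer_id: str, now: float) -> bool:
        # A reconnection attempt is a violation if the peer was pruned
        # and the backoff period has not fully elapsed yet.
        pruned = self.pruned_at.get(peer_id)
        return pruned is not None and (now - pruned) < self.backoff_seconds


tracker = BackoffTracker()
tracker.record_prune("peerA", now=0.0)
early_attempt = tracker.is_violation("peerA", now=10.0)   # still inside backoff
late_attempt = tracker.is_violation("peerA", now=60.0)    # backoff has elapsed
unknown_peer = tracker.is_violation("peerB", now=5.0)     # never pruned
```

If every node in a fleet keeps flagging its neighbours this way, no reconnection attempt is accepted, which is consistent with the failure to form a mesh described above.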

On the day, the chat2bridge to Matterbridge/Discord, which had been offline for about a week, was also redeployed off the latest nim-waku master. This meant that the prod fleet nodes, on top of their failure to connect to each other, didn’t support the ping keep-alive mechanism used by chat2bridge.

Steps taken on 9 Jul

It seemed likely that either DB corruption, unexpected behaviour of the deprecated keep-alive mechanism, or both caused the emergent issues. The exact cause is still the topic of an ongoing debugging investigation.

Based on the above, the following was done at around 8 AM UTC:

As a first priority, attempted to get the prod fleet into a stable state by redeploying from the latest master. Jenkins job here.
In parallel, tried to debug the issues.

Impact on `prod` fleet

The redeployment had the following effects on prod:

nim-waku clients could again connect to the prod fleet and access store functionality.
Error logs related to possible DB corruption, backoff violations, and keep-alive issues disappeared.
Connection to chat2bridge was not restored (this may have been related to the inconsistent Peer table).

Overall, the stability of the prod fleet was restored after the upgrade. The plan then was to continue debugging the cause of the original issues, fix connectivity to chat2bridge and communicate to clients that the fleet is usable again.

Impact on `js-waku` client

The upgrade changed the relay protocol ID advertised by the prod fleet to the stable /vac/waku/relay/2.0.0. Since the released version of the js-waku client does not support this protocol ID, the upgrade caused js-waku clients to fail to connect to prod. The protocol ID issue is tracked here.
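The mismatch can be illustrated with a small sketch of protocol negotiation. This is a simplification that assumes protocol IDs are compared as exact strings; the `negotiate` function is hypothetical and is not the js-waku or nim-waku implementation. The two protocol ID strings are the ones mentioned in this post.

```python
from typing import Optional, Sequence

# Relay protocol IDs mentioned in this post.
STABLE_RELAY_ID = "/vac/waku/relay/2.0.0"
BETA_RELAY_ID = "/vac/waku/relay/2.0.0-beta2"


def negotiate(dialer_protocols: Sequence[str],
              listener_protocols: Sequence[str]) -> Optional[str]:
    """Return the first protocol ID proposed by the dialer that the
    listener also supports (exact string match), or None if there is none."""
    supported = set(listener_protocols)
    for protocol_id in dialer_protocols:
        if protocol_id in supported:
            return protocol_id
    return None


# A released js-waku client (beta2 only) dialing prod redeployed off master.
after_upgrade = negotiate([BETA_RELAY_ID], [STABLE_RELAY_ID])

# The same client dialing prod running release v0.4.
on_v0_4 = negotiate([BETA_RELAY_ID], [BETA_RELAY_ID])
```

Under this assumption `after_upgrade` is `None`, meaning no common relay protocol is found, while `on_v0_4` resolves to the beta2 ID.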

Franck and external users of the js-waku client reported the regression around 8 AM UTC on Monday, 12 Jul. This blocked their progress, as they had previously been able to connect to prod nodes despite the issues.

Steps taken on 12 Jul

The following steps were taken to revert the changes to the prod fleet:

Hanno: Redeploy release v0.4 to prod
Arthur: Recreate the SQLite DB (both Peer and Message table)
Arthur: Restore connectivity between prod fleet nodes

Current state of `prod` fleet

The current state of the wakuv2.prod fleet:

Nodes are running nim-waku release v0.4, with relay protocol ID /vac/waku/relay/2.0.0-beta2
Connectivity between nodes has been restored
Connectivity to the chat2bridge has not been restored. This will require either an upgrade of prod or a downgrade of chat2bridge.

The redeployment and recreation of the DBs seem to have fixed the earlier keep-alive and connectivity issues. js-waku clients report that they can connect to prod as before.

Lessons learned

Waku incident channel: prod incidents and status updates should be clearly communicated. The #waku-network Discord channel could be used as a “command centre” for incidents.
Strict upgrade procedure: prod upgrades should always be done in a coordinated fashion. It requires general agreement from all clients after informing them of possible impact.
Only run releases on prod: prod should only run released versions of nim-waku, unless there is an urgent reason not to (e.g. unforeseen and critical bugs in a release).

Next steps

Determine scope for next nim-waku release. Discuss impact with other Waku v2 clients.
Upgrade prod with release version.
Verify that:

[ ] All clients connect as expected to the upgraded prod fleet
[ ] Connectivity between prod fleet nodes is stable
[ ] prod nodes correctly connect and relay to the chat2bridge to Discord

Continue investigating the original causal issues, e.g. "Error: unhandled exception: Stream EOF! [LPStreamEOFError]" (Issue #659, status-im/nim-waku) and "Some nodes in prod fleet seem to not relay messages" (Issue #637, status-im/nim-waku).