On Friday, 9 Jul 2021, the state of the wakuv2.prod fleet was as follows:
- Nodes were running
- chat2 clients reported issues with accessing store functionality when attempting to connect to prod. Issue reported here. Importantly, the js-waku client did not have any similar issues.
- prod fleet nodes reported violations of the GossipSub backoff period that other clients have to respect before attempting a reconnection. The effect was that prod nodes failed to connect to each other and form a mesh.
- There were indications of possible SQLite DB corruption (logs indicated message storage failure, select * queries returned unexpected results). This has not been fully investigated yet.
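To make the backoff violation concrete, here is a minimal sketch of the rule involved: once a peer has been pruned from a mesh, it is expected to wait out a backoff window before trying to rejoin, and an earlier attempt counts as a violation. This is an illustration only, not nim-waku or nim-libp2p code; the class name, method names, and the 60-second window are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed backoff window in seconds; an illustrative parameter,
# not a value read from the prod fleet configuration.
PRUNE_BACKOFF_SECONDS = 60.0


@dataclass
class BackoffTracker:
    """Tracks, per (peer, topic), the earliest time a rejoin attempt is acceptable."""

    backoff_seconds: float = PRUNE_BACKOFF_SECONDS
    _expiry: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def on_prune(self, peer: str, topic: str, now: float) -> None:
        # Pruning a peer from the mesh starts its backoff window.
        self._expiry[(peer, topic)] = now + self.backoff_seconds

    def on_graft(self, peer: str, topic: str, now: float) -> bool:
        """Return True if the rejoin attempt is acceptable, False if it violates the backoff."""
        expiry = self._expiry.get((peer, topic))
        if expiry is not None and now < expiry:
            return False  # attempt arrived before the backoff window elapsed
        # Window elapsed (or peer was never pruned): clear state and accept.
        self._expiry.pop((peer, topic), None)
        return True
```

If every node rejects its peers in this way, no mesh forms, which matches the symptom described above.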
As far as can be established, the above-mentioned issues were present on the prod fleet for at least a week.
On the day, the chat2bridge to Matterbridge/Discord, which had been offline for about a week, was also redeployed off the latest master. This meant that the prod fleet nodes, on top of their failure to connect to each other, didn’t support the ping keep-alive mechanism used by
Steps taken on 9 Jul
It seemed likely that either DB corruption, unexpected behaviour of the deprecated keep-alive mechanism, or both caused the emergent issues. The exact cause is still the topic of an ongoing debugging investigation.
Based on the above, the following was done at around 8 AM UTC:
- As first priority, attempted to get the prod fleet into a stable state by redeploying off the latest master. Jenkins job here.
- In parallel, tried to debug the issues.
The redeployment had the following effects on
- nim-waku clients could again connect to the prod fleet and access
- Error logs related to possible DB corruption, backoff violations, and keep-alive issues disappeared.
- Connection to chat2bridge was not restored (this may have been related to the inconsistent
Overall, the stability of the prod fleet was restored after the upgrade. The plan then was to continue debugging the cause of the original issues, fix connectivity to chat2bridge and communicate to clients that the fleet is usable again.
The upgrade changed the relay protocol ID advertised by the prod fleet to the stable /vac/waku/relay/2.0.0. Since the released version of the js-waku client does not support this protocol ID, the upgrade caused js-waku clients to fail to connect to prod. The protocol ID issue is tracked here.
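The failure mode can be shown with a simplified model of protocol negotiation, in which a connection only succeeds when both sides share an exactly matching protocol ID. This is a sketch under stated assumptions, not the libp2p implementation: the `negotiate` function is hypothetical, and the pre-release ID below is a placeholder chosen for illustration, not a value taken from this report.

```python
from typing import Optional, Sequence

# Stable relay protocol ID advertised by prod after the upgrade (from the text above).
STABLE_RELAY_ID = "/vac/waku/relay/2.0.0"
# Hypothetical pre-release ID, used only to illustrate a mismatch.
PRE_RELEASE_RELAY_ID = "/vac/waku/relay/2.0.0-beta2"


def negotiate(
    dialer_supported: Sequence[str], listener_supported: Sequence[str]
) -> Optional[str]:
    """Return the first protocol ID proposed by the dialer that the listener
    also supports, or None when the two sides have no ID in common."""
    listener_set = set(listener_supported)
    for protocol_id in dialer_supported:
        # IDs are compared as exact strings; a version suffix makes them differ.
        if protocol_id in listener_set:
            return protocol_id
    return None
```

In this model a client that only knows the older ID gets `None` from a node advertising only the stable ID, while a node that advertises both IDs during a transition period would keep such clients working.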
Franck and external users of the js-waku client reported the regression around 8 AM UTC on Monday, 12 Jul. This blocked their progress, as they had previously been able to connect to prod nodes despite the issues.
Steps taken on 12 Jul
The following steps were taken to revert the changes to the
- Hanno: Redeploy release
- Arthur: Recreate the SQLite DB (both
- Arthur: Restore connectivity between
Current state of
The current state of the
- Nodes are running
- Connectivity between nodes has been restored
- Connectivity to the chat2bridge has not been restored. This will require either an upgrade of prod or a downgrade of
The redeployment and recreation of the DBs seem to have fixed the earlier keep-alive and connectivity issues.
js-waku clients report that they can connect to prod as before.
Waku incident channel: prod incidents and status updates should be clearly communicated. The #waku-network Discord channel could be used as a “command centre” for incidents.
Strict upgrade procedure: prod upgrades should always be done in a coordinated fashion. This requires general agreement from all clients after informing them of the possible impact.
Only run releases on
prod should only run released versions of nim-waku, unless there is an urgent reason not to (e.g. unforeseen and critical bugs in a release).
- Determine scope for next nim-waku release. Discuss impact with other Waku v2 clients.
prod with release version.
- Verify that:
- [ ] All clients connect as expected to the upgraded
- [ ] Connectivity between prod fleet nodes is stable
- [ ] prod nodes correctly connect and relay to the
- Continue investigating the original causal issues, e.g. "Error: unhandled exception: Stream EOF! [LPStreamEOFError]" (status-im/nim-waku issue #659) and "Some nodes in prod fleet seem to not relay messages" (status-im/nim-waku issue #637).
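The fleet connectivity item in the verification list above could be checked mechanically. The sketch below assumes each node's current peer list has already been collected somehow (how that is done is out of scope here) and tests whether the fleet nodes form a single connected group. The function name and the node names used with it are hypothetical.

```python
from collections import deque
from typing import Dict, Set


def is_fully_connected(peers: Dict[str, Set[str]]) -> bool:
    """Return True if every fleet node can reach every other fleet node.

    `peers` maps each fleet node to the set of peers it reports.
    Connections are treated as undirected, and reported peers that are
    not fleet nodes themselves (e.g. external clients) are ignored.
    """
    if not peers:
        return True
    # Build a symmetric adjacency map restricted to fleet nodes.
    adjacency: Dict[str, Set[str]] = {node: set() for node in peers}
    for node, neighbours in peers.items():
        for other in neighbours:
            if other in adjacency:
                adjacency[node].add(other)
                adjacency[other].add(node)
    # Breadth-first search from an arbitrary node.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return len(seen) == len(adjacency)
```

Running such a check repeatedly over a period of time, rather than once, would be closer to the "stable" wording of the checklist.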