Background
On Friday, 9 Jul 2021, the state of the wakuv2.prod fleet was as follows:
- Nodes were running nim-waku release v0.4
- nim-waku chat2 clients reported issues with accessing store functionality when attempting to connect to prod. Issue reported here. Importantly, the js-waku client did not have any similar issues.
- All prod fleet nodes reported violations of the GossipSub backoff period that other clients have to respect before attempting a reconnection. The effect was that prod nodes failed to connect to each other and form a mesh.
- There were indications of possible SQLite DB corruption (logs indicated message storage failure, select * queries returned unexpected results). This has not been fully investigated yet.
As far as can be established, the above-mentioned issues were present on the prod fleet for at least a week.
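One way to follow up on the suspected SQLite corruption is SQLite's built-in PRAGMA integrity_check. The sketch below is only an illustration, not the procedure used during the incident. The function name and the database path in the usage note are hypothetical, and it would be safer to run it against a copy of the store database than against a live node.

```python
import sqlite3
from typing import List


def check_sqlite_integrity(db_path: str) -> List[str]:
    """Run SQLite's integrity check and return the problems it reports.

    An empty list means SQLite answered with the single row "ok".
    db_path is a hypothetical path to a copy of the node's database file.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

For example, check_sqlite_integrity("store-copy.sqlite3") (a made-up file name) returns an empty list for a healthy file and a list of SQLite's own problem descriptions otherwise. A clean result does not rule out application-level inconsistencies such as unexpected rows in the Peer or Message tables.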
On the day, the chat2bridge to Matterbridge/Discord, which had been offline for about a week, was also redeployed off the latest nim-waku master. This meant that the prod fleet nodes, on top of their failure to connect to each other, didn’t support the ping keep-alive mechanism used by chat2bridge.
Steps taken on 9 Jul
It seemed likely that either DB corruption, unexpected behaviour of the deprecated keep-alive mechanism, or both caused the emergent issues. The exact cause is still the topic of an ongoing debugging investigation.
Based on the above, the following was done at around 8 AM UTC:
- As first priority, attempted to get the prod fleet in a stable state by redeploying off latest master. Jenkins job here.
- In parallel, tried to debug the issues.
Impact on prod fleet
The redeployment had the following effects on prod:
- nim-waku clients could again connect to the prod fleet and access store functionality.
- Error logs related to possible DB corruption, backoff violations, and keep-alive issues disappeared.
- Connection to chat2bridge was not restored (this may have been related to the inconsistent Peer table).
Overall, the stability of the prod fleet was restored after the upgrade. The plan then was to continue debugging the cause of the original issues, fix connectivity to chat2bridge and communicate to clients that the fleet is usable again.
Impact on js-waku client
The upgrade changed the relay protocol ID advertised by the prod fleet to the stable /vac/waku/relay/2.0.0. Since the released version of the js-waku client does not support this protocol ID, the upgrade caused js-waku clients to fail to connect to prod. The protocol ID issue is tracked here.
Franck and external users of the js-waku client reported the regression around 8 AM UTC on Monday, 12 Jul. This blocked their progress, as they had previously been able to connect to prod nodes, despite the issues.
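The protocol ID regression described in this section can be illustrated with a small model of protocol negotiation. The sketch below is a simplification and not the actual nim-waku or js-waku code: it assumes that negotiation succeeds only on an exact string match of protocol IDs, and that the released js-waku client speaks only the beta ID that release v0.4 advertises. The negotiate function and the js_waku_released list are hypothetical names introduced for illustration.

```python
from typing import Iterable, Optional

# Protocol IDs taken from this report.
STABLE_RELAY_ID = "/vac/waku/relay/2.0.0"
BETA_RELAY_ID = "/vac/waku/relay/2.0.0-beta2"


def negotiate(client_supported: Iterable[str],
              node_advertised: Iterable[str]) -> Optional[str]:
    """Return the first client protocol ID the node also advertises.

    Matching is by exact string comparison, so "2.0.0-beta2" is not
    treated as compatible with "2.0.0". Returns None when nothing matches.
    """
    advertised = set(node_advertised)
    for proto in client_supported:
        if proto in advertised:
            return proto
    return None


# Assumption: the released js-waku client supports only the beta ID.
js_waku_released = [BETA_RELAY_ID]

# prod after the 9 Jul redeployment off master: only the stable ID.
after_upgrade = negotiate(js_waku_released, [STABLE_RELAY_ID])

# prod after the 12 Jul revert to release v0.4: the beta ID again.
after_revert = negotiate(js_waku_released, [BETA_RELAY_ID])
```

Under these assumptions, after_upgrade is None (no common relay protocol, so the client cannot use relay on prod) while after_revert is the beta ID, which matches the behaviour reported by js-waku users before and after the revert.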
Steps taken on 12 Jul
The following steps were taken to revert the changes to the prod fleet:
- Hanno: Redeploy release v0.4 to prod
- Arthur: Recreate the SQLite DB (both Peer and Message tables)
- Arthur: Restore connectivity between prod fleet nodes
Current state of prod fleet
The current state of the wakuv2.prod fleet:
- Nodes are running nim-waku release v0.4, with relay protocol ID /vac/waku/relay/2.0.0-beta2
- Connectivity between nodes has been restored
- Connectivity to the chat2bridge has not been restored. This will require either an upgrade of prod or a downgrade of chat2bridge.
The redeployment and recreation of the DBs seem to have fixed the earlier keep-alive and connectivity issues. js-waku clients report that they can connect to prod as before.
Lessons learned
- Waku incident channel: prod incidents and status updates should be clearly communicated. The #waku-network Discord channel could be used as “command centre” for incidents.
- Strict upgrade procedure: prod upgrades should always be done in a coordinated fashion. It requires general agreement from all clients after informing them of possible impact.
- Only run releases on prod: prod should only run released versions of nim-waku, unless there is an urgent reason not to (e.g. unforeseen and critical bugs in a release).
Next steps
- Determine scope for next nim-waku release. Discuss impact with other Waku v2 clients.
- Upgrade prod with release version.
- Verify that:
  - [ ] All clients connect as expected to the upgraded prod fleet
  - [ ] Connectivity between prod fleet nodes is stable
  - [ ] prod nodes correctly connect and relay to the chat2bridge to Discord
- Continue investigating the original causal issues, e.g. "Error: unhandled exception: Stream EOF! [LPStreamEOFError]" (Issue #659, status-im/nim-waku) and "Some nodes in prod fleet seem to not relay messages" (Issue #637, status-im/nim-waku).