Receive message reliability for Status desktop

pablo · June 28, 2024, 1:22pm

We receive messages in Status Desktop via two Waku protocols: Relay and Store. Neither offers 100% reliability. Store might be a bit more reliable, since store nodes are generally high-availability nodes and therefore more likely to have received the Relay messages.

We have other protocols on top of those to ensure E2E deliverability of messages:

MVDS: the receiver doesn’t know whether it is missing messages; the sender will resend if no acknowledgement is received.
New E2E protocol: the receiver might become aware that some messages were missed, based on causal histories and bloom filters.

Connection management

Status Desktop will have connections to peers for each pubsub topic. If I start Status connected to 3 pubsub topics, then I’ll need to connect to peers for each of those topics. Status checks those connections every 15 seconds and updates its connection status per topic.
See the following table as an example of peer connection health status:

pubsub topic	number of connected peers	connection status
`/waku/2/rs/16/32`	0	UnHealthy
`/waku/2/rs/16/64`	3 (less than D=4)	MinimallyHealthy
`/waku/2/rs/16/128`	4 (at least D=4)	SufficientlyHealthy
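
The mapping in the table can be sketched as a small function. This is only an illustration: the names `topic_health`, `Health`, and `D` are hypothetical and not taken from the Status code base, and the thresholds are simply the ones shown in the table.

```python
from enum import Enum

# Assumed threshold, taken from the "D=4" note in the table above.
D = 4


class Health(Enum):
    UNHEALTHY = "UnHealthy"
    MINIMALLY_HEALTHY = "MinimallyHealthy"
    SUFFICIENTLY_HEALTHY = "SufficientlyHealthy"


def topic_health(connected_peers: int, d: int = D) -> Health:
    """Classify one pubsub topic by its number of connected peers."""
    if connected_peers <= 0:
        return Health.UNHEALTHY
    if connected_peers < d:
        return Health.MINIMALLY_HEALTHY
    return Health.SUFFICIENTLY_HEALTHY


# Peer counts from the example table, keyed by pubsub topic.
peers_per_topic = {
    "/waku/2/rs/16/32": 0,
    "/waku/2/rs/16/64": 3,
    "/waku/2/rs/16/128": 4,
}
status_per_topic = {t: topic_health(n) for t, n in peers_per_topic.items()}
```

In this sketch the status would be recomputed on every periodic connection check, once per topic.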

Status Desktop most likely won’t have high availability. It might be running on a laptop connected to unreliable networks, the wifi might disconnect, the laptop might go to sleep or hibernate, etc. All those problems can be reduced to the following two main issues that Status needs to handle:

Offline: no peers connected on a pubsub topic, either because there are no other peers in this pubsub topic or because we have no connection ourselves. If we have no connection ourselves, two things can happen:
1. We don’t know we are offline. We have observed that if the disconnection lasts less than 25-30 seconds Status won’t notice it, since we ping peers every 15 seconds and then have to wait for the timeout.
2. We know we are offline. After those 25 seconds the ping will fail and Status will notice the disconnection.
Computer Sleep/Hibernate: again, we currently might not notice if the app is suspended for less than about 25 seconds; longer suspensions will be noticed.

In either case we have the issue of not noticing that we are offline or asleep for short spans of up to 30 seconds, BUT if the span is longer than 30 seconds we will notice and can therefore request that time range using Store. The problem remains for the short 30 second periods, which might be quite common on a laptop.
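
As a rough sketch of this behaviour, the tracker below turns the results of the periodic connection check into time ranges to request from Store. The class `OfflineTracker` and its method names are hypothetical, timestamps are plain seconds, and the window is conservatively started at the last successful check, since the disconnection could have begun any time after it.

```python
from typing import Optional, Tuple


class OfflineTracker:
    """Derives Store query windows from periodic connection checks."""

    def __init__(self) -> None:
        self.last_ok: Optional[float] = None        # time of last successful check
        self.offline_since: Optional[float] = None  # start of current offline window

    def on_check(self, now: float, online: bool) -> Optional[Tuple[float, float]]:
        """Record one check; return (start, end) to query when we come back online."""
        if online:
            window = None
            if self.offline_since is not None:
                window = (self.offline_since, now)
                self.offline_since = None
            self.last_ok = now
            return window
        if self.offline_since is None:
            # First failed check: we may have been offline since the last success.
            self.offline_since = self.last_ok if self.last_ok is not None else now
        return None


tracker = OfflineTracker()
# A short outage between t=2 and t=12 falls between two successful checks
# and therefore produces no window at all.
results = [tracker.on_check(t, ok) for t, ok in
           [(0, True), (15, True), (30, False), (45, False), (60, True)]]
```

Only the last check in this example yields a window, covering the time from the last successful check to the moment we are back online. The undetected short outage is exactly the gap discussed above.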

Possible solutions

In the following picture we see the connection states of the app and how it receives messages (blue: Store, green: Relay). When we are online (i.e. minimally connected to peers for the pubsub topic), we get the messages in real time via Relay. For the periods when we are offline we are not receiving messages at all (obviously), but when we detect that we are back online we do a Store query for the time that we know we were offline. See the diagram below:

This would be great BUT we have those 30 second spans in which we cannot detect whether we were offline, and therefore we would be missing those messages.

The reality would look more like the following, and maybe it is a good enough solution.

Maybe good enough

This figure adds grey question marks for those 30 second periods in which we don’t know whether we are offline. If we actually are offline, we will miss any Relay messages sent during those periods, and we wouldn’t request them with Store.

In this scenario, can MVDS or the new E2E protocol help? Yes. MVDS will retry the send: if we are online at the moment of the retry we can get the message via Relay, and if we are offline (and we know it) we will receive it via Store when we go back online. The problem is that we might get it later than expected. And we might still miss messages that do not use any other reliability protocol.

Periodic-store approach (aka: Schwarzenegger approach)

The way to maximise message reception is to use Store all the time (e.g. request messages periodically every 1 minute), together with Relay for real-time reception. I’m unsure how this would compare to just using Filter, at least from the point of view of receiving messages.

This approach is merged in master and will be used in Desktop shortly. The obvious problem is that it will be very chatty with store nodes, but it will sort out the issue of not receiving messages during short 30 second offline periods.
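
A minimal sketch of the periodic-store idea, under stated assumptions: `PeriodicStoreFetcher`, the `(hash, timestamp, payload)` message tuple, and the `query(start, end)` callable are all hypothetical stand-ins, not the Waku Store API. The 60 second interval comes from the example above, while the 30 second overlap is an assumption chosen to cover the undetected offline spans. Messages already received via Relay are deduplicated by hash.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

Message = Tuple[str, float, str]  # (message_hash, timestamp, payload), hypothetical
StoreQuery = Callable[[float, float], Iterable[Message]]


class PeriodicStoreFetcher:
    """Polls a store node on a timer and keeps only messages not seen yet."""

    def __init__(self, query: StoreQuery, interval: float = 60.0,
                 overlap: float = 30.0) -> None:
        self.query = query
        self.interval = interval
        self.overlap = overlap  # re-query a little of the past on every poll
        self.seen: Set[str] = set()
        self.last_query_end: Optional[float] = None

    def on_relay_message(self, msg: Message) -> None:
        """Remember messages that already arrived in real time via Relay."""
        self.seen.add(msg[0])

    def poll(self, now: float) -> List[Message]:
        """Run one periodic query and return only the messages that are new."""
        base = self.last_query_end if self.last_query_end is not None else now - self.interval
        fresh = []
        for msg in self.query(base - self.overlap, now):
            if msg[0] not in self.seen:
                self.seen.add(msg[0])
                fresh.append(msg)
        self.last_query_end = now
        return fresh


# In-memory stand-in for a store node, used only for this example.
archive: List[Message] = [("m1", 10.0, "hello"), ("m2", 40.0, "missed on relay")]
fetcher = PeriodicStoreFetcher(lambda s, e: [m for m in archive if s <= m[1] <= e])
fetcher.on_relay_message(archive[0])  # m1 arrived via Relay, m2 did not
first = fetcher.poll(now=60.0)
second = fetcher.poll(now=120.0)
```

In this example the first poll recovers the message that Relay missed, and the second poll returns nothing even though its window overlaps the first. The chattiness mentioned above shows up here as one query per interval per topic, regardless of whether anything was missed.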

Any comments are welcome.

fryorcraken · July 2, 2024, 6:19am

Thanks for this @pablo

This gives us real-world insight into why an e2e reliability protocol is needed: it is not possible to detect disconnections shorter than 30s, due to TCP timeout parameters, without bombarding other peers with pings.

I would suggest the following approach:

For now, adopt the periodic-store approach. This is to increase the reliability of message reception on the communities side.
In terms of message sending, a store check is done for each message sent over relay to verify that the message was actually sent over the network.

As stated multiple times, the regular store query can give us a good view of whether other reliability strategies work, and it can act as a safety net.

Once e2e reliability protocol is deployed, we can move away from:

regular store query
checking a message is sent over relay via store query.

This means relying on the e2e message reliability protocol for these sub-30-second disconnections, for both outgoing and incoming messages.

Keen to know if @haelius agrees on this requirement on the e2e reliability protocol.

haelius · July 2, 2024, 10:17am

Indeed. A good summary of how undetected disconnection periods necessitate an e2e approach. The new e2e protocol will use a combination of a causal history DAG (to retrieve missing messages) and a bloom filter (to detect messages that failed to publish) for nodes to reach eventual consistency after undetected offline periods. This should be sufficient for sub-30 second disconnections. Note however that these “eventual consistency” methods are relatively slow and, where possible, should always be combined with much faster time-based queries to retrieve messages over any disconnection windows we can detect. We may therefore scale down the number of regular queries once we have e2e protocols in place, but will perhaps not disable them completely in all circumstances.
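
To illustrate only the causal-history half of this idea, here is a deliberately simplified sketch. It assumes each message carries the IDs of a few messages that preceded it; the `Msg` type and `missing_from_history` are hypothetical names, and neither the bloom filter nor the actual protocol wire format is modelled.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Msg:
    msg_id: str
    # IDs of earlier messages the sender had seen (assumed field).
    causal_history: Tuple[str, ...] = ()


def missing_from_history(known_ids: Set[str], incoming: Msg) -> List[str]:
    """Return the referenced message IDs that this receiver never saw."""
    return [dep for dep in incoming.causal_history if dep not in known_ids]


# The receiver saw "a" and "b", then silently missed "c" while offline.
known = {"a", "b"}
incoming = Msg("d", causal_history=("b", "c"))
gaps = missing_from_history(known, incoming)
```

Here the receiver learns about the gap only when a later message referencing the missed one arrives, which is one reason this kind of detection is slower than a time-based Store query over a known offline window.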

We’re also considering adding faster sync mechanisms for smaller groups - e.g. detecting you were offline by comparing a CRDT hash with a neighbour. The latter doesn’t scale well, but can provide a faster resume mechanism when group size allows.

fryorcraken · November 4, 2024, 1:44am

Wouldn’t the TCP session (message sequencing) help in the case where a micro-disconnection happens and messages may not have been sent over the wire?

Cc @haelius @prem

prem · November 4, 2024, 4:52am

Well, it depends on what a micro-disconnection means. We mostly detect a disconnection either because a signal comes up from a lower level of the OS or because we are notified that the connection was closed.
If either kind of disconnection is detected, the TCP layer marks the connection as closed and drops its local send buffers, which might hold packets still pending transmission to other nodes. When a new connection is established, new buffers are created.

If the disconnection was so short that it was not even noticed by the TCP layer, then the send buffers are intact and there would be no packet loss. But this would rarely be the case.