IFT Research Call, December 4th 2024 - Reliability over Waku

Minutes

YouTube recording.

This is a transcript of the discussion after the call. Feel free to continue the discussion in this topic.

  • UX within Status

Q: What does the process look like for verifying and implementing this within Status so that the user experience within Status is reliable?

A: First, we want to work on message status. For Status specifically, we will try to retrieve messages in a cheaper way (similar to what we used in Waku store) so that we can determine whether messages are missing. With the idea of a distributed log, however, each participant should already have the information on missing messages (both received and published). We will tune various remedial actions to improve on that, which in our opinion will make that information available with more certainty. A user will be able to see in the chat log whether they are waiting on messages that precede one already received, and whether their own messages are yet to be delivered. In short, the information is available in each client.
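As a rough illustration of how a client could derive "waiting on earlier messages" from its own log, here is a minimal sketch. It assumes each message carries the IDs of the messages that precede it; the names `Message`, `LocalLog`, `causal_history` and `missing_dependencies` are hypothetical and not taken from the actual protocol or any Waku API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Hypothetical fields, chosen for illustration only.
    message_id: str
    causal_history: tuple = ()  # IDs of messages that precede this one


@dataclass
class LocalLog:
    received: dict = field(default_factory=dict)

    def receive(self, msg: Message) -> None:
        """Store a message that arrived at this client."""
        self.received[msg.message_id] = msg

    def missing_dependencies(self) -> set:
        """IDs referenced by received messages that this client has not received."""
        referenced = {
            dep for m in self.received.values() for dep in m.causal_history
        }
        return referenced - set(self.received)


log = LocalLog()
log.receive(Message("m1"))
log.receive(Message("m3", causal_history=("m1", "m2")))
print(log.missing_dependencies())  # {'m2'}
```

A chat UI could use the returned set to mark a gap before "m3" until "m2" arrives.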

  • Admin communication - de-MLS

Q: We want to use the local messaging for the between-admin communication in decentralized MLS. Are there numbers (of nodes, epochs, or time in a simulation of sorts, or something else) to measure the success of receiving messages? For example, a definitive number, or a figure for a certain time epoch.

A: Right now we are at a very early stage; there are some educated guesses based on various probabilistic models. The outcome depends on a complex interaction between how well the principles work and how the parameters are tuned, and we’re just not there yet. What we do have is a very simple implementation for a chat application, but we haven’t attempted scaling.

This will be put into the DST roadmap, which means that when this matures enough, we will put it into focus and have it tested at scale. Deriving the numbers from a mathematical model is definitely possible, but takes significantly more effort than simply measuring.

  • Guaranteed reliability possibilities

Q: You mentioned the ability to tune the reliability property. What would happen if we dropped the bloom filter completely for some groups and just added more information to the DAG? Would that be a way to get full reliability? Could we have just one pub-sub topic with guaranteed reliability and use that one for establishing de-MLS keys?

A: This is something we considered, but it is definitely not easy to do. In essence, you would always have to recompute your entire causal DAG and maybe modify past messages so that you can traverse all the way back through history. These DAGs are built from each individual node’s perspective, and we don’t cover the full history. So there can be messages that branch off your DAG in a certain direction but never appear in the causal history of other messages, because these are independently maintained logs that you are just merging. In conclusion, it might be possible, but it is quite difficult to do.
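The branching problem described above can be sketched as follows. This is a toy model with made-up message IDs, assuming each message lists the IDs it causally depends on; it is not the actual data structure used in the protocol.

```python
def reachable(heads, causal_history):
    """Return every message ID reachable by following causal-history links from heads."""
    seen = set()
    stack = list(heads)
    while stack:
        message_id = stack.pop()
        if message_id in seen:
            continue
        seen.add(message_id)
        stack.extend(causal_history.get(message_id, ()))
    return seen


# Hypothetical merged log: "b2" branched off "a1", but no later message
# ever referenced it in its causal history.
history = {
    "a1": [],
    "a2": ["a1"],
    "b2": ["a1"],
    "a3": ["a2"],
}

covered = reachable(["a3"], history)
orphaned = set(history) - covered
print(sorted(covered))  # ['a1', 'a2', 'a3']
print(orphaned)         # {'b2'}
```

A participant that only walks back from "a3" never learns about "b2", which is why full reliability would require covering the complete history rather than each node's partial view.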

Even if it’s too much work for most use cases, it might be worth it for very specific pub-sub topics, e.g. de-MLS key establishment (if it could be pulled off, we could get rid of the on-chain component). We could also potentially have a set of more stable nodes that help with the ordering; it would not become a federated system but more like a super-peer system. The whole point is eliminating the on-chain component while still having a subgroup of users as admins. A single admin is feasible, but a subgroup requires total ordering. This will also be in the ACZ roadmap.

Having both the causal history and the bloom filter is, in my opinion, probably not the best design choice, but it was difficult to get past that.

For chat applications it is nice to have both, especially since Waku’s protocols are tunable/configurable along various dimensions. For some specific pub-sub topics where you want guaranteed delivery, that can become similar to a TLS level guarantee of getting a message delivered. In a normal context, bloom filters can be quite nice: if a message is seen in the bloom filter, there is a high probability that it was delivered/received. For certain important messages, you can also look for them in the DAG and continue eagerly pushing them.

That is the main idea: if you want a high probability that a message is published, you should continue publishing it until it becomes a part of someone else’s DAG. That is the way of getting that certainty. The bloom filters are also a great fit for things such as negative acknowledgements (NACKs), which address the main problem of publishers failing to publish the message. In most cases, when a message reaches the network, it actually reaches the intended destination. But as we see from the Status perspective, when a message is missing, it’s usually because the message was not published (and there is no way to know it was never published).
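The publish-until-acknowledged idea can be sketched roughly as below. This is a toy illustration under stated assumptions: `BloomFilter` is a simplified filter written for this example, and `delivery_status` and its return values are hypothetical names, not part of any Waku API. It relies on the general property that a bloom filter can give false positives but never false negatives, so absence from the filter works as a negative acknowledgement.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored in a single integer

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def delivery_status(message_id, peer_causal_history, peer_bloom):
    """Classify our message based on what another participant's message carries."""
    if message_id in peer_causal_history:
        return "acknowledged"        # part of someone else's DAG
    if peer_bloom.might_contain(message_id):
        return "probably delivered"  # could be a false positive
    return "not seen"                # definite miss: keep republishing


peer_bloom = BloomFilter()
peer_bloom.add("m1")

print(delivery_status("m2", ["m2"], peer_bloom))  # acknowledged
print(delivery_status("m1", [], peer_bloom))      # probably delivered
print(delivery_status("m3", [], BloomFilter()))   # not seen
```

A publisher would keep republishing any message whose status is "not seen", and could stop once it is "acknowledged".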

  • Detection of disconnection

Q: How can this happen?

A: There is usually a short undetected disconnection period. That is the case we see most often: the application believes it communicated with Waku, Waku believes the information was published, but it was actually disconnected, just not for long enough to be detected.

Q: Shouldn’t your local node detect this?

A: Yes, but TCP has a timeout, so there is a period during which the disconnection cannot be detected. We saw this a lot. We could address it, but that wouldn’t solve the root cause: what we have seen is that sending the message is the point of failure.

It would be interesting to see if this can be observed on the client side - just for this specific case, to have faster feedback.