Hey, this is one of the last design issues behind de-MLS: a recovery mechanism against synchronicity issues such as state partitioning.
Problem: de-MLS is a stateful protocol, meaning all members operate under a single global state. The protocol protects this state under the assumption that message delivery works for the majority of members, but practical network issues can violate this assumption, and the group state can then fork into multiple partitions. In that case, we need a mechanism that allows the group to re-sync.
We can divide the problem into two events:
- Weak sync issue: Single partition. A few members are out of sync while the majority (>n/2) share a single state. This can be solved by a sync mechanism in which out-of-sync members request the missing commit messages from the synced majority, and the majority replies with them, until they align again.
- Hard sync issue: There is no state that the majority agrees on. This requires a hard reset, in which the group is recreated from scratch. One idea would be to define checkpoints and roll back to the latest checkpoint when a hard sync issue occurs, but MLS does not allow rollback due to its security features.
So, members must be able to determine whether they are synced or not. For this, two capabilities are required:
- Deterministic state identification: Each member computes a deterministic fingerprint of its local state. We can simply use the tree hash for that.
- Exchanging fingerprints among the members: Members gossip their fingerprints so they can compare them and detect whether a partition exists.
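To make the two capabilities concrete, here is a rough Python sketch. It is only an illustration, not the de-MLS implementation: `state_fingerprint`, `group_by_fingerprint`, and `partition_exists` are hypothetical names, and SHA-256 over some serialized state stands in for the MLS tree hash that would be used in practice.

```python
import hashlib


def state_fingerprint(serialized_state: bytes) -> str:
    """Deterministic fingerprint of a member's local state.

    Assumption: SHA-256 is only a stand-in here; in de-MLS the MLS
    tree hash would play this role.
    """
    return hashlib.sha256(serialized_state).hexdigest()


def group_by_fingerprint(reports: dict[str, str]) -> dict[str, set[str]]:
    """Group member ids by the fingerprint they gossiped."""
    groups: dict[str, set[str]] = {}
    for member, fp in reports.items():
        groups.setdefault(fp, set()).add(member)
    return groups


def partition_exists(reports: dict[str, str]) -> bool:
    """A partition exists as soon as two different fingerprints are seen."""
    return len(set(reports.values())) > 1
```

Each member would fill `reports` with the fingerprints it receives over gossip (plus its own) and then check `partition_exists(reports)`.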
After this process, each member knows whether there is a weak or hard sync issue in the network globally and, if so, its own situation (whether it is part of a minority partition).
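The classification could look roughly like the sketch below, using the >n/2 majority rule from the weak/hard definitions. The names `SyncStatus`, `classify`, and `in_minority` are made up for illustration, and treating members that did not report a fingerprint as "not part of the majority" is my assumption, not something the design has settled.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class SyncStatus(Enum):
    SYNCED = "synced"  # every member reported the same fingerprint
    WEAK = "weak"      # a strict majority agrees, some members lag
    HARD = "hard"      # no fingerprint is held by a strict majority


def classify(reports: dict[str, str], group_size: int) -> tuple[SyncStatus, Optional[str]]:
    """Return the global sync status and the majority fingerprint, if any.

    Assumption: members missing from `reports` count against the majority.
    """
    counts = Counter(reports.values())
    if not counts:
        return SyncStatus.HARD, None
    fingerprint, votes = counts.most_common(1)[0]
    if votes * 2 <= group_size:  # not strictly more than n/2
        return SyncStatus.HARD, None
    if votes == group_size:
        return SyncStatus.SYNCED, fingerprint
    return SyncStatus.WEAK, fingerprint


def in_minority(my_fingerprint: str, majority_fingerprint: Optional[str]) -> bool:
    """True if a majority state exists and this member is not on it."""
    return majority_fingerprint is not None and my_fingerprint != majority_fingerprint
```

Note that an exact 50/50 split is treated as a hard sync issue, since neither side holds more than n/2 of the group.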
Finally, each member acts on the collected info:
- If there is a hard sync issue, a member initiates a hard reset request proposal (yes, another consensus here. We can safely assume there is still a big partition among the members that can conduct consensus).
- If there is a weak sync issue, it initiates the state recovery procedure: members exchange data so that out-of-sync members retrieve the missing commit messages and reach the latest state (the one backed by the vote results), while synced members provide the latest commits to others.
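As a sketch of the decision and the weak-sync catch-up described above (all names here are hypothetical, commits are reduced to an epoch number plus an opaque payload, and the actual exchange transport is left out entirely):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    epoch: int      # epoch the group is in after applying this commit
    payload: bytes  # opaque commit message (placeholder)


def next_action(status: str, is_minority: bool) -> str:
    """Map the collected info ("synced", "weak", "hard") to a local action."""
    if status == "hard":
        return "propose_hard_reset"
    if status == "weak":
        return "request_missing_commits" if is_minority else "serve_commits"
    return "none"


def missing_commits(local_epoch: int, peer_log: list[Commit]) -> list[Commit]:
    """Commits a synced peer holds that are newer than our epoch, in order."""
    return sorted((c for c in peer_log if c.epoch > local_epoch), key=lambda c: c.epoch)


def catch_up(local_epoch: int, peer_log: list[Commit]) -> int:
    """Apply missing commits one epoch at a time and return the new epoch.

    Real code would process each commit through MLS; here we only
    advance the epoch counter and reject gaps in the sequence.
    """
    for commit in missing_commits(local_epoch, peer_log):
        if commit.epoch != local_epoch + 1:
            raise ValueError(f"gap in commit log before epoch {commit.epoch}")
        local_epoch = commit.epoch
    return local_epoch
```

After `catch_up`, the member would recompute its fingerprint and compare it with the majority fingerprint to confirm it is aligned again.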
Some discussion points:
- For a hard reset, the reset operator probably requires all members’ keyPackages, which raises the question of whether every keyPackage should be stored and synced among the members. It looks possible, but not efficient. Logos storage could be used here.
- For the commit message exchange phase of the weak sync issue, we can use SDS instead of re-implementing a custom exchange mechanism. cc: @jazzz and @haelius.