API Specification for End-to-end Reliability

This post outlines an integration proposal for the “e2e reliability protocol” (proposed here) into the status-go codebase, specifically focusing on its implementation within the encryption layer. The goal is to achieve e2e reliability in the simplest and quickest way - albeit with tradeoffs compared to an ideal approach such as pulling all encryption-related functionality into an API (which primarily has implementation challenges and requires a longer development timeframe). The idea is as follows:

  1. Reliability Library: We’ll create a new library that implements the core functions of our e2e reliability protocol. This library will operate on unencrypted application data, adding reliability metadata before encryption occurs.
  2. Message Wrapping: We’ll extend our message structure to include reliability information such as Lamport timestamps and causal history. This wrapping occurs before encryption, effectively extending the application protocol.
  3. Limited Caching: The library will maintain a short-term cache of recent message IDs, used for the rolling bloom filter, building causal history, and initial causal dependency checking. Fallback to longer-term history kept by the application could be used to increase the time period covered by reliability.
  4. Application Interaction: The library will provide methods for the application to interact with the reliability system, including marking causal dependencies as met and handling missing messages.

Reliability Manager

The ReliabilityManager will be the core of our new library, designed to work with application data and the reliability constructs.

Breakdown of Protocol Components

| Component | Functionality | Interaction with Application |
| --- | --- | --- |
| Lamport Timestamp | Maintain and increment Lamport timestamps internally | Application provides a clock value to initialize the Lamport timestamp |
| Causal History & Message ID Log | Maintain a unified log of recent message IDs (e.g., last 500) for acknowledged messages. The causal history of a message consists of the last few message IDs in this log. | Application provides long-term storage and retrieval for older messages |
| Bloom Filters | Maintain and update a rolling bloom filter over recent message IDs | Application doesn’t need to interact with this |
| Message Buffering | Short-term buffering of incoming message IDs with unmet causal dependencies and of outgoing unacknowledged message IDs | Application handles long-term storage and retrieval |
| Dependency Checking | Check dependencies against the library’s recent message history | Return a list of missing dependencies to the application |
| Acknowledgment Tracking | Track acknowledgments for outgoing messages based on bloom filters and causal histories in received messages | Application marks messages as “sent” based on library signals |
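
As a rough illustration of the first component, the Lamport clock rules the library would apply are the standard ones: increment on send, take max(local, received) + 1 on receive. A minimal sketch (type and method names are hypothetical, not the proposed API):

```go
package main

import "fmt"

type MessageID string

// Sketch of the ReliabilityManager's internal state, loosely mapping the
// components in the table above to fields. Illustrative only.
type ReliabilityManager struct {
	lamport  int64
	log      []MessageID               // recent acknowledged message IDs (capped, e.g. 500)
	outgoing map[MessageID]struct{}    // unacknowledged outgoing messages
	incoming map[MessageID][]MessageID // buffered incoming ID -> unmet dependencies
}

// tickSend increments the clock before stamping an outgoing message.
func (rm *ReliabilityManager) tickSend() int64 {
	rm.lamport++
	return rm.lamport
}

// tickRecv merges a received timestamp: max(local, remote) + 1.
func (rm *ReliabilityManager) tickRecv(remote int64) int64 {
	if remote > rm.lamport {
		rm.lamport = remote
	}
	rm.lamport++
	return rm.lamport
}

func main() {
	rm := &ReliabilityManager{}
	fmt.Println(rm.tickSend())   // 1
	fmt.Println(rm.tickRecv(10)) // 11
}
```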

Note: In the context of Status, for the initial implementation, we could use a single channelID per Community, matching the proposed content topic usage. This simplifies the implementation while allowing for potential future expansion to sub-channels if needed.

Message History Management

The reliability library will maintain a short-term storage of recent message IDs (last 500 messages) for efficiency. The application will handle extended history storage and retrieval.
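
A minimal sketch of such a bounded short-term log, assuming simple FIFO eviction (the capacity of 3 here stands in for the proposed 500; anything evicted must be served from the application's long-term history instead):

```go
package main

import "fmt"

// recentLog is a bounded FIFO of message IDs: once full, the oldest IDs
// fall out and are only retrievable via the application's history store.
type recentLog struct {
	cap int
	ids []string
}

func (l *recentLog) add(id string) {
	l.ids = append(l.ids, id)
	if len(l.ids) > l.cap {
		l.ids = l.ids[len(l.ids)-l.cap:] // evict oldest
	}
}

func (l *recentLog) contains(id string) bool {
	for _, v := range l.ids {
		if v == id {
			return true
		}
	}
	return false
}

func main() {
	l := &recentLog{cap: 3}
	for _, id := range []string{"a", "b", "c", "d"} {
		l.add(id)
	}
	fmt.Println(l.contains("a")) // false — evicted; only app history has it now
	fmt.Println(l.contains("d")) // true
}
```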

API Description

Sending a Message

In Go:

func (rm *ReliabilityManager) WrapOutgoingMessage(message []byte, channelID string) ([]byte, error)

In C:

unsigned char* WrapOutgoingMessage(ReliabilityManager* rm, const unsigned char* message, size_t messageLen, const char* channelID, size_t* outLen);
  • Library wraps the unencrypted message with reliability metadata (Lamport timestamp, short causal history)
  • Library updates internal Lamport clock
  • Library adds this message to the unacknowledged outgoing buffer, where it remains until ACKed by an incoming message
    • Once the message is ACKed, it is removed from the outgoing buffer and its message ID is added to the message log
  • Returns wrapped message for application to encrypt and send
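
To make the steps above concrete, here is a toy stand-in for WrapOutgoingMessage. The JSON encoding, the explicit message-ID parameter, and all type names are illustrative assumptions, not the proposed API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type wrapped struct {
	Lamport       int64    `json:"lamport"`
	CausalHistory []string `json:"causalHistory"`
	ChannelID     string   `json:"channelId"`
	Payload       []byte   `json:"payload"`
}

type manager struct {
	lamport  int64
	log      []string          // recent acknowledged message IDs
	outgoing map[string][]byte // unacknowledged outgoing messages
}

// wrapOutgoing mirrors the steps listed above: bump the Lamport clock,
// attach causal history, buffer the message as unacknowledged, and return
// the wrapped bytes for the application to encrypt and send.
func (m *manager) wrapOutgoing(id string, payload []byte, channelID string) ([]byte, error) {
	m.lamport++
	w := wrapped{
		Lamport:       m.lamport,
		CausalHistory: m.log, // a real implementation would take only the last few IDs
		ChannelID:     channelID,
		Payload:       payload,
	}
	m.outgoing[id] = payload // kept until ACKed via an incoming message
	return json.Marshal(w)
}

func main() {
	m := &manager{outgoing: map[string][]byte{}, log: []string{"prev-id"}}
	b, err := m.wrapOutgoing("msg-1", []byte("hello"), "community-1")
	fmt.Println(err == nil, len(m.outgoing)) // true 1
	fmt.Println(len(b) > 0)                  // true — ready for encryption
}
```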

Receiving a Message

In Go:

func (rm *ReliabilityManager) UnwrapReceivedMessage(message []byte) ([]byte, []MessageID, error)

In C:

typedef struct {
    unsigned char* message;
    size_t messageLen;
    MessageID* missingDeps;
    size_t missingDepsCount;
} UnwrapResult;

UnwrapResult UnwrapReceivedMessage(ReliabilityManager* rm, const unsigned char* message, size_t messageLen);
  • Library unwraps reliability metadata
  • Library sweeps the outgoing buffer and marks messages as ACKed that match the causal dependencies and/or received bloom filter
  • Checks causal dependencies against recent history
  • If all dependencies are met: returns the unwrapped message and nil for missing dependencies
    • The message ID is then added to the message log and the message is removed from the incoming buffer
  • If dependencies not met: returns unwrapped message and list of missing MessageIDs
    • Then the application will check if these missing dependencies exist in longer-term app history or retrieve them from a history store
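
The receive path above can be sketched as follows, with the library's short-term history reduced to a simple set (all names are illustrative, not the proposed API):

```go
package main

import "fmt"

type manager struct {
	log map[string]bool // recent message IDs (met dependencies)
}

// unwrapReceived returns the payload plus any causal dependencies the
// library cannot find in its own short-term history. If everything is met,
// the new message ID joins the log; otherwise the application is expected
// to consult long-term history or a store node.
func (m *manager) unwrapReceived(id string, payload []byte, deps []string) ([]byte, []string) {
	var missing []string
	for _, d := range deps {
		if !m.log[d] {
			missing = append(missing, d)
		}
	}
	if len(missing) == 0 {
		m.log[id] = true // all dependencies met: add to log
		return payload, nil
	}
	return payload, missing
}

func main() {
	m := &manager{log: map[string]bool{"a": true}}
	_, missing := m.unwrapReceived("x", []byte("hi"), []string{"a", "b"})
	fmt.Println(missing) // [b]
	_, missing = m.unwrapReceived("y", []byte("yo"), []string{"a"})
	fmt.Println(missing) // []
}
```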

Marking Dependencies as Met

In Go:

func (rm *ReliabilityManager) MarkDependenciesMet(messageIDs []MessageID) error

In C:

int MarkDependenciesMet(ReliabilityManager* rm, const MessageID* messageIDs, size_t count);
  • The missing dependencies are “missing” only from the PoV of the library’s own short-term history, and are reported to the application.
  • The application can try to find these in its own history or via a query to a history store. It should then use this method to mark the dependencies as met within the library, so that the incoming buffer is appropriately maintained.
  • Library updates its internal state and processes any buffered messages whose dependencies are now met
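
A sketch of how marking dependencies as met could release buffered messages, under the assumption that the incoming buffer maps each message ID to its set of unmet dependencies (all names are hypothetical):

```go
package main

import "fmt"

type buffered struct {
	payload []byte
	unmet   map[string]bool // dependencies still missing for this message
}

type manager struct {
	incoming map[string]*buffered // buffered incoming messages by ID
}

// markDependenciesMet removes the given IDs from every buffered message's
// unmet set and returns the IDs of messages that became fully satisfied
// (these would then trigger the "message ready" signal).
func (m *manager) markDependenciesMet(ids []string) []string {
	var ready []string
	for msgID, b := range m.incoming {
		for _, id := range ids {
			delete(b.unmet, id)
		}
		if len(b.unmet) == 0 {
			ready = append(ready, msgID)
			delete(m.incoming, msgID)
		}
	}
	return ready
}

func main() {
	m := &manager{incoming: map[string]*buffered{
		"x": {payload: []byte("hi"), unmet: map[string]bool{"a": true, "b": true}},
	}}
	fmt.Println(m.markDependenciesMet([]string{"a"})) // [] — still waiting on "b"
	fmt.Println(m.markDependenciesMet([]string{"b"})) // [x]
}
```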

Signals

The library will emit the following signals:

  1. Message ready for processing
  2. Outgoing message marked as sent
  3. Periodic sync required

Go:

func (rm *ReliabilityManager) SetPeriodicSyncCallback(callback func())

Here are the C function pointer types for each signal:

// Signal for message ready for processing
typedef void (*MessageReadyCallback)(const char* messageID);

// Signal for outgoing message marked as sent
typedef void (*MessageSentCallback)(const char* messageID);

// Signal for periodic sync required
typedef void (*PeriodicSyncCallback)(void);

// Function to register callbacks
void RegisterCallbacks(ReliabilityManager* rm,
                       MessageReadyCallback messageReady,
                       MessageSentCallback messageSent,
                       PeriodicSyncCallback periodicSync);

The application can register these callbacks with the ReliabilityManager to handle the respective signals:

  1. MessageReadyCallback: Called when a message’s causal dependencies have been met and it is ready for processing; the message ID is then added to the internal log and the message is removed from the incoming buffer.
  2. MessageSentCallback: Called when an outgoing message has been acknowledged as sent. At this point, the message ID is added to the message log and removed from the outgoing buffer.
  3. PeriodicSyncCallback: Called when the library suggests a synchronization should be performed. The application can decide when and how to act on this signal.

Integration with status-go

The proposed reliability system would integrate with status-go primarily through the encryption layer and message handling processes. Here’s an overview of how it would work within the status-go context:

Integration Points

The likely integration points would be:

  1. In protocol/v1/status_message.go, particularly in the HandleTransportLayer, HandleEncryptionLayer, and HandleApplicationLayer methods.
  2. In the encryption layer, possibly in protocol/v1/status_message.go within the EncryptionLayer struct.
  3. In the message handling flow, likely in protocol/v1/message.go.

Please note that this reflects a preliminary and limited understanding of the codebase; any comments from status-go contributors are highly appreciated.

Outgoing Messages

  1. Message Creation: When a user sends a message, it would likely be handled in the v1 protocol package.

  2. Wrapping for Reliability: Before encryption, the message would be passed to the ReliabilityManager:

    wrappedMessage, err := m.reliabilityManager.WrapOutgoingMessage(message.Payload, message.ChatID)
    
  3. Encryption: The wrapped message would then be encrypted using the existing encryption protocol, likely in the EncryptionLayer struct:

    m.EncryptionLayer.Payload = encryptedWrappedMessage
    
  4. Sending: The encrypted message is sent using the existing transport layer, possibly through the TransportLayer struct.

Incoming Messages

  1. Message Reception: Incoming messages are likely received in the HandleTransportLayer method of the StatusMessage struct.

  2. Decryption: The message would be decrypted in the HandleEncryptionLayer method.

  3. Unwrapping for Reliability: After decryption, the message is passed to the ReliabilityManager:

    unwrappedMessage, missingDeps, err := m.reliabilityManager.UnwrapReceivedMessage(m.EncryptionLayer.Payload)
    
  4. Dependency Handling: If there are missing dependencies, the application would need to check its local history or retrieve them from a store:

    if len(missingDeps) > 0 {
        foundDeps := m.checkLocalHistory(missingDeps)
        m.reliabilityManager.MarkDependenciesMet(foundDeps)
        
        remainingDeps := m.getRemainingDeps(missingDeps, foundDeps)
        if len(remainingDeps) > 0 {
            err := m.retrieveMissingMessages(remainingDeps)
            // Handle any error or retry
        }
    }
    
  5. Message Processing: The message can be shown to the end user on the UI as soon as it arrives, but with an indication such as “possible messages missing prior to this one”. The number of missing messages can be specified based on the number of unmet causal dependencies. Once a message is marked as having all dependencies met (i.e., “processed”), the message would be processed in the HandleApplicationLayer method and UI can remove the missing messages display.

Acknowledgment Handling

Marking Messages as Sent: The application would receive signals from the ReliabilityManager when outgoing messages are acknowledged:

m.reliabilityManager.SetMessageSentCallback(func(messageID string) {
    m.markMessageAsSent(messageID)
})

Periodic Sync

  1. Sync Signal: The ReliabilityManager would periodically signal the need for synchronization:

    m.reliabilityManager.SetPeriodicSyncCallback(func() {
        m.initiateSync()
    })
    
  2. Sync Process: The application would implement the initiateSync function to handle the actual synchronization process, by sending an empty sync message.
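
Assuming an “empty sync message” is simply a wrapped message with an empty payload (so it carries only the Lamport timestamp, causal history, and bloom filter, giving peers enough information to detect gaps without adding application content), initiateSync could be sketched as below. The send callback stands in for the encryption and transport layers:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical empty-payload sync message; field names are illustrative.
type syncMessage struct {
	Lamport       int64    `json:"lamport"`
	CausalHistory []string `json:"causalHistory"`
	Payload       []byte   `json:"payload"` // empty for a pure sync message
}

// initiateSync builds an empty sync message and hands it to send, which
// in status-go would encrypt and broadcast it like any other message.
func initiateSync(lamport int64, history []string, send func([]byte) error) error {
	b, err := json.Marshal(syncMessage{Lamport: lamport, CausalHistory: history})
	if err != nil {
		return err
	}
	return send(b)
}

func main() {
	err := initiateSync(7, []string{"a", "b"}, func(b []byte) error {
		fmt.Println(len(b) > 0) // true — hand off to encryption + transport
		return nil
	})
	fmt.Println(err == nil) // true
}
```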

Next Steps

  1. Detailed technical specification of the library’s API
  2. Finalized integration plan with existing status-go message handling flow
  3. Prototype implementation focusing on core functionality

We look forward to your thoughts and feedback on this proposal!


Thanks for the detailed writeup @shash256!

Some points/queries come to my mind w.r.t. integration with Status, which I think would need recommendations/suggestions as to how the application should implement these in order for the protocol to be used effectively.

  • It would be good to suggest what could be used as a channel ID. Should each content topic have its own channel ID, or should some other method be used to match them?
  • How would the participants be determined with which a specific channel will be synced? Since this protocol is especially needed for communities, which currently use the store protocol to fetch/identify missing messages, it would be good to think about how each member would choose participant(s) to sync with. Is it the application’s responsibility to identify the participant group for syncing?
  • Wondering if there needs to be some mechanism to limit the number of participants to sync with, in order to make it less concentrated. E.g., if my node starts getting sync requests from 100 others in the community, I shouldn’t start syncing with all of them. Maybe a simple participant limit can be specified locally, e.g. my node only syncs with a max of 10-20 participants that are part of the same channel/group. Maybe some sort of userid-based approach can be taken here, i.e. the x userids that are closest to my userid and are online (and part of the community) would be synced with.
  • What if the nodes I am syncing with go offline (note that Status desktop instances run on users’ laptops, which may be switched off for part of the day)? That means there needs to be some way to know this and identify other nodes to sync with.
  • This is a very rare scenario, but what would happen if no one is online apart from me to sync with?

Maybe some of the above-mentioned problems were already addressed while integrating MVDS (another sync protocol, already used in 1:1 and group chats), and the same approach can be reused here. But there are definitely new scenarios to think about in the case of communities, since the number of users in a community can become very large and we don’t need to sync with all of them.

How will backward/forward compatibility be ensured between users that have a version that includes these changes and those that have not updated their clients yet?

Looks good from high-level PoV.

As @prem suggested, what is the proposal for channelID in the context of Status Communities?
Is there a relation or restriction between a channel ID and a content topic?
What about the context of 1:1 chat (even if we do not integrate this yet)?

I would suggest the following:

  1. Decide on a community marker that indicates usage of e2e reliability for the community
  2. Implement wrapping and unwrapping of messages for enabled communities (app version N)
  3. Add a toggle in options: “create community with e2e reliability enabled”. Could be a CLI/compilation option at first, to dogfood.
  4. Once stabilized, enable it by default at community creation for app version N+M. Status needs to define M (i.e., when an app version can be considered end-of-life).
  5. Then we can review whether there is a need to migrate pre-existing communities to it. Note this could be done in combination with migrating pre-existing communities to their own shard.



I wrote a bit more about handling roll out of breaking changes: Breaking changes and roll out strategies


@fryorcraken @prem re:

what is the proposal for channelID in the context of Status Communities?
Is there be a relation or restriction between a channel id and a content topic?

For now, I would suggest a single channelID per Community (matching the proposed content topic usage). The channelID concept was introduced to allow for possible future use case where filter clients only interested in certain sub-channels could participate in e2e reliability only for those channels. For now, since we don’t have such filter use cases and to build the simplest thing first, I’d just stick with one channelID for each “sync group” (i.e. each Community). We should make this clearer in the post, @shash256

@prem, re

it would be good to think of how each member would choose participant(s) to sync with

(and other points about the sync group)
The idea here is to exclude 1:1 sync interactions completely and simply add causality, ordering and ACK information to existing messages that will allow eventual consistency. The “sync group” is the entire community, and the only place where missing messages will be attempted to be retrieved will be from the store node. This comes with a bunch of assumptions, of course, that prioritises scalability above strong guarantees (e.g. what if the store node does not have the missing dependencies?, etc.). I think once we understand the behaviour of the protocol better we can add mechanisms that provide stronger guarantees. We may even introduce 1:1 interactions for smaller sync groups to speed up time to consistency.

There are some details omitted from the post on how this API will be implemented. For example, in order to save bandwidth, attempting to retrieve missing messages from the store should be done very lazily, in case the missing messages are still to be received via normal Relay broadcast. The first aim, in fact, would be to simply indicate on the UI that messages are lost if a best effort (but naive) store retrieval could not find the missing messages. We could also delay implementing the periodic sync message (this is unnecessary if the community generates enough traffic of its own).

this is a very rare scenario, but what would happen if no one is online apart from me to sync with?

Mmm. Quite a good point. The protocol mostly aims to provide publishers with information to understand whether their message has been ACKed by at least one other member of the sync group (i.e. the Community). That is because we want publishers to rebroadcast eagerly, rather than having everyone else individually pull messages that failed to be published. What will likely happen, therefore, is that this single online user will periodically rebroadcast their messages until some maximum number of attempts is reached or someone else comes online and sends a message. Of course, since the number of online users is known, it should be easy to encode some exceptional handling of this case to avoid the unnecessary retransmissions.


@shash256 one thing I think should be clarified is around the message causal dependencies and the message log.
Specifically, I think we should clarify that the causal history of a message is the last few message IDs in the message log. Then we should also clarify how message IDs get added to the message log:

  1. in the case of received messages - ID added to log once all dependencies are met and message removed from the “incoming buffer”.
  2. in the case of sent messages - ID added to log once message is ACKed and removed from “outgoing buffer”.

Point (2) is different from what we thought before and what is currently listed under “Sending a message”, but I think it might work better. This way we won’t include unacknowledged messages in the log. Otherwise a publisher could create an internal chain of causal dependencies for which no other node has received any of the messages, leading to an explosion of Store requests to retrieve these. WDYT?
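
Point (2) can be sketched as follows: a sent message's ID enters the message log (and hence future causal histories) only once it is ACKed, so a publisher never builds causal chains on messages no one else has seen. This is a toy illustration of the proposed rule, not the actual implementation:

```go
package main

import "fmt"

type manager struct {
	outgoing map[string]bool // unacknowledged sent messages
	log      []string        // IDs eligible to appear in causal histories
}

// onAck moves a sent message's ID from the outgoing buffer to the log,
// making it usable as a causal dependency from then on.
func (m *manager) onAck(id string) {
	if m.outgoing[id] {
		delete(m.outgoing, id)
		m.log = append(m.log, id)
	}
}

func main() {
	m := &manager{outgoing: map[string]bool{"x": true}}
	fmt.Println(len(m.log)) // 0 — "x" not yet usable as a causal dependency
	m.onAck("x")
	fmt.Println(m.log) // [x]
}
```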

Added a note in the post to clarify this

There is an exponential backoff strategy in place in the PoC, such that the rebroadcast happens with an added delay on each turn, but yes, this particular scenario could be handled as a special exception using the information that no one else is online.

Agree. Adding the message to the log after it’s ACKed is a relatively better approach. Updated the post to reflect this change.
