API Specification for End-to-end Reliability

This post outlines an integration proposal for the “e2e reliability protocol” (proposed here) into the status-go codebase, specifically focusing on its implementation within the encryption layer. The goal is to achieve e2e reliability in the simplest and quickest way - albeit with tradeoffs compared to an ideal approach such as pulling all encryption-related functionality into an API (which primarily has implementation challenges and requires a longer development timeframe). The idea is as follows:

  1. Reliability Library: We’ll create a new library that implements the core functions of our e2e reliability protocol. This library will operate on unencrypted application data, adding reliability metadata before encryption occurs.
  2. Message Wrapping: We’ll extend our message structure to include reliability information such as Lamport timestamps and causal history. This wrapping occurs before encryption, effectively extending the application protocol.
  3. Limited Caching: The library will maintain a short-term cache of recent message IDs, used for the rolling bloom filter, building causal history, and initial causal dependency checking. Fallback to longer-term history kept by the application could be used to increase the time period covered by reliability.
  4. Application Interaction: The library will provide methods for the application to interact with the reliability system, including marking causal dependencies as met and handling missing messages.

Reliability Manager

The ReliabilityManager will be the core of our new library, designed to work with application data and the reliability constructs.

Breakdown of Protocol Components

| Component | Functionality | Interaction with Application |
| --- | --- | --- |
| Lamport Timestamp | Maintain and increment Lamport timestamps internally | Application provides a clock value to initialize the Lamport timestamp |
| Causal History & Message ID Log | Maintain a unified log of recent message IDs (e.g., last 500) for acknowledged messages. The causal history of a message consists of the last few message IDs in this log. | Application provides long-term storage and retrieval for older messages |
| Bloom Filters | Maintain and update a rolling bloom filter over recent message IDs | Application doesn’t need to interact with this |
| Message Buffering | Short-term buffering of incoming message IDs with unmet causal dependencies and of outgoing unacknowledged message IDs | Application handles long-term storage and retrieval |
| Dependency Checking | Check dependencies against the library’s recent message history | Return a list of missing dependencies to the application |
| Acknowledgment Tracking | Track acknowledgments for outgoing messages based on bloom filters and causal histories in received messages | Application marks messages as “sent” based on library signals |
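
As a rough illustration of the first component, the Lamport clock rules the library would apply are the standard ones: increment on send, take max(local, received) + 1 on receive. A minimal sketch (type and method names are hypothetical, not the proposed API):

```go
package main

import "fmt"

type MessageID string

// Sketch of the ReliabilityManager's internal state, loosely mapping the
// components in the table above to fields. Illustrative only.
type ReliabilityManager struct {
	lamport  int64
	log      []MessageID               // recent acknowledged message IDs (capped, e.g. 500)
	outgoing map[MessageID]struct{}    // unacknowledged outgoing messages
	incoming map[MessageID][]MessageID // buffered incoming ID -> unmet dependencies
}

// tickSend increments the clock before stamping an outgoing message.
func (rm *ReliabilityManager) tickSend() int64 {
	rm.lamport++
	return rm.lamport
}

// tickRecv merges a received timestamp: max(local, remote) + 1.
func (rm *ReliabilityManager) tickRecv(remote int64) int64 {
	if remote > rm.lamport {
		rm.lamport = remote
	}
	rm.lamport++
	return rm.lamport
}

func main() {
	rm := &ReliabilityManager{}
	fmt.Println(rm.tickSend())   // 1
	fmt.Println(rm.tickRecv(10)) // 11
}
```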

Note: In the context of Status, for the initial implementation, we could use a single channelID per Community, matching the proposed content topic usage. This simplifies the implementation while allowing for potential future expansion to sub-channels if needed.

Message History Management

The reliability library will maintain a short-term storage of recent message IDs (last 500 messages) for efficiency. The application will handle extended history storage and retrieval.
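
A minimal sketch of such a bounded short-term log, assuming simple FIFO eviction (the capacity of 3 here stands in for the proposed 500; anything evicted must be served from the application's long-term history instead):

```go
package main

import "fmt"

// recentLog is a bounded FIFO of message IDs: once full, the oldest IDs
// fall out and are only retrievable via the application's history store.
type recentLog struct {
	cap int
	ids []string
}

func (l *recentLog) add(id string) {
	l.ids = append(l.ids, id)
	if len(l.ids) > l.cap {
		l.ids = l.ids[len(l.ids)-l.cap:] // evict oldest
	}
}

func (l *recentLog) contains(id string) bool {
	for _, v := range l.ids {
		if v == id {
			return true
		}
	}
	return false
}

func main() {
	l := &recentLog{cap: 3}
	for _, id := range []string{"a", "b", "c", "d"} {
		l.add(id)
	}
	fmt.Println(l.contains("a")) // false — evicted; only app history has it now
	fmt.Println(l.contains("d")) // true
}
```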

API Description

Sending a Message

In Go:

func (rm *ReliabilityManager) WrapOutgoingMessage(message []byte, channelID string) ([]byte, error)

In C:

unsigned char* WrapOutgoingMessage(ReliabilityManager* rm, const unsigned char* message, size_t messageLen, const char* channelID, size_t* outLen);
  • Library wraps the unencrypted message with reliability metadata (Lamport timestamp, short causal history)
  • Library updates internal Lamport clock
  • Library adds this message to the unacknowledged outgoing buffer, where it remains until ACKed by an incoming message
    • Once the message is ACKed, it is removed from the outgoing buffer and its message ID is added to the message log
  • Returns wrapped message for application to encrypt and send
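
To make the steps above concrete, here is a toy stand-in for WrapOutgoingMessage. The JSON encoding, the explicit message-ID parameter, and all type names are illustrative assumptions, not the proposed API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type wrapped struct {
	Lamport       int64    `json:"lamport"`
	CausalHistory []string `json:"causalHistory"`
	ChannelID     string   `json:"channelId"`
	Payload       []byte   `json:"payload"`
}

type manager struct {
	lamport  int64
	log      []string          // recent acknowledged message IDs
	outgoing map[string][]byte // unacknowledged outgoing messages
}

// wrapOutgoing mirrors the steps listed above: bump the Lamport clock,
// attach causal history, buffer the message as unacknowledged, and return
// the wrapped bytes for the application to encrypt and send.
func (m *manager) wrapOutgoing(id string, payload []byte, channelID string) ([]byte, error) {
	m.lamport++
	w := wrapped{
		Lamport:       m.lamport,
		CausalHistory: m.log, // a real implementation would take only the last few IDs
		ChannelID:     channelID,
		Payload:       payload,
	}
	m.outgoing[id] = payload // kept until ACKed via an incoming message
	return json.Marshal(w)
}

func main() {
	m := &manager{outgoing: map[string][]byte{}, log: []string{"prev-id"}}
	b, err := m.wrapOutgoing("msg-1", []byte("hello"), "community-1")
	fmt.Println(err == nil, len(m.outgoing)) // true 1
	fmt.Println(len(b) > 0)                  // true — ready for encryption
}
```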

Receiving a Message

In Go:

func (rm *ReliabilityManager) UnwrapReceivedMessage(message []byte) ([]byte, []MessageID, error)

In C:

typedef struct {
    unsigned char* message;
    size_t messageLen;
    MessageID* missingDeps;
    size_t missingDepsCount;
} UnwrapResult;

UnwrapResult UnwrapReceivedMessage(ReliabilityManager* rm, const unsigned char* message, size_t messageLen);
  • Library unwraps reliability metadata
  • Library sweeps the outgoing buffer and marks messages as ACKed that match the causal dependencies and/or received bloom filter
  • Checks causal dependencies against recent history
  • If all dependencies are met: returns the unwrapped message and nil for missing dependencies
    • The message ID is then added to the message log and the message is removed from the incoming buffer
  • If dependencies not met: returns unwrapped message and list of missing MessageIDs
    • Then the application will check if these missing dependencies exist in longer-term app history or retrieve them from a history store
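
The receive path above can be sketched as follows, with the library's short-term history reduced to a simple set (all names are illustrative, not the proposed API):

```go
package main

import "fmt"

type manager struct {
	log map[string]bool // recent message IDs (met dependencies)
}

// unwrapReceived returns the payload plus any causal dependencies the
// library cannot find in its own short-term history. If everything is met,
// the new message ID joins the log; otherwise the application is expected
// to consult long-term history or a store node.
func (m *manager) unwrapReceived(id string, payload []byte, deps []string) ([]byte, []string) {
	var missing []string
	for _, d := range deps {
		if !m.log[d] {
			missing = append(missing, d)
		}
	}
	if len(missing) == 0 {
		m.log[id] = true // all dependencies met: add to log
		return payload, nil
	}
	return payload, missing
}

func main() {
	m := &manager{log: map[string]bool{"a": true}}
	_, missing := m.unwrapReceived("x", []byte("hi"), []string{"a", "b"})
	fmt.Println(missing) // [b]
	_, missing = m.unwrapReceived("y", []byte("yo"), []string{"a"})
	fmt.Println(missing) // []
}
```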

Marking Dependencies as Met

In Go:

func (rm *ReliabilityManager) MarkDependenciesMet(messageIDs []MessageID) error

In C:

int MarkDependenciesMet(ReliabilityManager* rm, const MessageID* messageIDs, size_t count);
  • The missing dependencies are “missing” only from the PoV of the library’s own short-term history, and are reported to the application.
  • The application can try to find these in its own history or via a query to a history store. It should then use this method to mark the dependencies as met within the library, so that the incoming buffer is appropriately maintained.
  • Library updates its internal state and processes any buffered messages whose dependencies are now met
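
A sketch of how marking dependencies as met could release buffered messages, under the assumption that the incoming buffer maps each message ID to its set of unmet dependencies (all names are hypothetical):

```go
package main

import "fmt"

type buffered struct {
	payload []byte
	unmet   map[string]bool // dependencies still missing for this message
}

type manager struct {
	incoming map[string]*buffered // buffered incoming messages by ID
}

// markDependenciesMet removes the given IDs from every buffered message's
// unmet set and returns the IDs of messages that became fully satisfied
// (these would then trigger the "message ready" signal).
func (m *manager) markDependenciesMet(ids []string) []string {
	var ready []string
	for msgID, b := range m.incoming {
		for _, id := range ids {
			delete(b.unmet, id)
		}
		if len(b.unmet) == 0 {
			ready = append(ready, msgID)
			delete(m.incoming, msgID)
		}
	}
	return ready
}

func main() {
	m := &manager{incoming: map[string]*buffered{
		"x": {payload: []byte("hi"), unmet: map[string]bool{"a": true, "b": true}},
	}}
	fmt.Println(m.markDependenciesMet([]string{"a"})) // [] — still waiting on "b"
	fmt.Println(m.markDependenciesMet([]string{"b"})) // [x]
}
```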

Signals

The library will emit the following signals:

  1. Message ready for processing
  2. Outgoing message marked as sent
  3. Periodic sync required

Go:

func (rm *ReliabilityManager) SetPeriodicSyncCallback(callback func())

Here are the C function pointer types for each signal:

// Signal for message ready for processing
typedef void (*MessageReadyCallback)(const char* messageID);

// Signal for outgoing message marked as sent
typedef void (*MessageSentCallback)(const char* messageID);

// Signal for periodic sync required
typedef void (*PeriodicSyncCallback)(void);

// Function to register callbacks
void RegisterCallbacks(ReliabilityManager* rm,
                       MessageReadyCallback messageReady,
                       MessageSentCallback messageSent,
                       PeriodicSyncCallback periodicSync);

The application can register these callbacks with the ReliabilityManager to handle the respective signals:

  1. MessageReadyCallback: Called when a message’s causal dependencies have been met and it is ready for processing; the message ID is then added to the internal log and the message is removed from the incoming buffer.
  2. MessageSentCallback: Called when an outgoing message has been acknowledged as sent. At this point, the message ID is added to the message log and removed from the outgoing buffer.
  3. PeriodicSyncCallback: Called when the library suggests a synchronization should be performed. The application can decide when and how to act on this signal.

Integration with status-go

The proposed reliability system would integrate with status-go primarily through the encryption layer and message handling processes. Here’s an overview of how it would work within the status-go context:

Integration Points

The likely integration points would be:

  1. In protocol/v1/status_message.go, particularly in the HandleTransportLayer, HandleEncryptionLayer, and HandleApplicationLayer methods.
  2. In the encryption layer, possibly in protocol/v1/status_message.go within the EncryptionLayer struct.
  3. In the message handling flow, likely in protocol/v1/message.go.

Please note that this reflects a preliminary and limited understanding of the codebase; any comments from status-go contributors are highly appreciated.

Outgoing Messages

  1. Message Creation: When a user sends a message, it would likely be handled in the v1 protocol package.

  2. Wrapping for Reliability: Before encryption, the message would be passed to the ReliabilityManager:

    wrappedMessage, err := m.reliabilityManager.WrapOutgoingMessage(message.Payload, message.ChatID)
    
  3. Encryption: The wrapped message would then be encrypted using the existing encryption protocol, likely in the EncryptionLayer struct:

    m.EncryptionLayer.Payload = encryptedWrappedMessage
    
  4. Sending: The encrypted message is sent using the existing transport layer, possibly through the TransportLayer struct.

Incoming Messages

  1. Message Reception: Incoming messages are likely received in the HandleTransportLayer method of the StatusMessage struct.

  2. Decryption: The message would be decrypted in the HandleEncryptionLayer method.

  3. Unwrapping for Reliability: After decryption, the message is passed to the ReliabilityManager:

    unwrappedMessage, missingDeps, err := m.reliabilityManager.UnwrapReceivedMessage(m.EncryptionLayer.Payload)
    
  4. Dependency Handling: If there are missing dependencies, the application would need to check its local history or retrieve them from a store:

    if len(missingDeps) > 0 {
        foundDeps := m.checkLocalHistory(missingDeps)
        m.reliabilityManager.MarkDependenciesMet(foundDeps)
        
        remainingDeps := m.getRemainingDeps(missingDeps, foundDeps)
        if len(remainingDeps) > 0 {
            err := m.retrieveMissingMessages(remainingDeps)
            // Handle any error or retry
        }
    }
    
  5. Message Processing: The message can be shown to the end user on the UI as soon as it arrives, but with an indication such as “possible messages missing prior to this one”. The number of missing messages can be specified based on the number of unmet causal dependencies. Once a message is marked as having all dependencies met (i.e., “processed”), the message would be processed in the HandleApplicationLayer method and UI can remove the missing messages display.

Acknowledgment Handling

Marking Messages as Sent: The application would receive signals from the ReliabilityManager when outgoing messages are acknowledged:

m.reliabilityManager.SetMessageSentCallback(func(messageID string) {
    m.markMessageAsSent(messageID)
})

Periodic Sync

  1. Sync Signal: The ReliabilityManager would periodically signal the need for synchronization:

    m.reliabilityManager.SetPeriodicSyncCallback(func() {
        m.initiateSync()
    })
    
  2. Sync Process: The application would implement the initiateSync function to handle the actual synchronization process, by sending an empty sync message.
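
Assuming an “empty sync message” is simply a wrapped message with an empty payload (so it carries only the Lamport timestamp, causal history, and bloom filter, giving peers enough information to detect gaps without adding application content), initiateSync could be sketched as below. The send callback stands in for the encryption and transport layers:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical empty-payload sync message; field names are illustrative.
type syncMessage struct {
	Lamport       int64    `json:"lamport"`
	CausalHistory []string `json:"causalHistory"`
	Payload       []byte   `json:"payload"` // empty for a pure sync message
}

// initiateSync builds an empty sync message and hands it to send, which
// in status-go would encrypt and broadcast it like any other message.
func initiateSync(lamport int64, history []string, send func([]byte) error) error {
	b, err := json.Marshal(syncMessage{Lamport: lamport, CausalHistory: history})
	if err != nil {
		return err
	}
	return send(b)
}

func main() {
	err := initiateSync(7, []string{"a", "b"}, func(b []byte) error {
		fmt.Println(len(b) > 0) // true — hand off to encryption + transport
		return nil
	})
	fmt.Println(err == nil) // true
}
```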

Next Steps

  1. Detailed technical specification of the library’s API
  2. Finalized integration plan with existing status-go message handling flow
  3. Prototype implementation focusing on core functionality

We look forward to your thoughts and feedback on this proposal!


Thanks for the detailed writeup @shash256!

Some points/queries come to my mind w.r.t. integration with Status, which I think would need recommendations/suggestions as to how the application should implement these in order for the protocol to be used effectively.

  • It would be good to suggest what could be used as a channel ID. Should each content topic have its own channel ID, or should some other method be used to match them?
  • How would the participants be determined with which a specific channel will be synced? Since this protocol is especially needed for communities, which currently use the store protocol to fetch/identify missing messages, it would be good to think about how each member would choose participant(s) to sync with. Is it the application’s responsibility to identify the participant group for syncing?
  • Wondering if there needs to be some mechanism to limit the number of participants to sync with, in order to make it less concentrated. E.g., if my node starts getting sync requests from 100 others in the community, I shouldn’t start syncing with all of them. Maybe a simple participant limit can be specified locally, e.g. my node only syncs with a max of 10-20 participants that are part of the same channel/group. Maybe some sort of userid-based approach can be taken here, i.e. the x userids that are closest to my userid and are online (and part of the community) would be synced with.
  • What if the nodes I am syncing with go offline (note that Status desktop instances run on users’ laptops, which may be switched off for part of the day)? That means there needs to be some way to know this and identify other nodes to sync with.
  • This is a very rare scenario, but what would happen if no one is online apart from me to sync with?

Maybe some of the above-mentioned problems were already addressed while integrating MVDS (another sync protocol, already used in 1:1 and group chats), and the same approach can be reused here. But there are definitely new scenarios to think about in the case of communities, since the number of users in a community can become very large and we don’t need to sync with all of them.

How will backward/forward compatibility be ensured between users that have a version that includes these changes and those that have not updated their clients yet?

Looks good from high-level PoV.

As @prem suggested, what is the proposal for channelID in the context of Status Communities?
Is there a relation or restriction between a channel ID and a content topic?
What about the context of 1:1 chat (even if we do not integrate this yet)?

I would suggest the following:

  1. Decide on a community marker that indicates usage of e2e reliability for the community
  2. Implement wrapping and unwrapping of messages for enabled communities (app version N)
  3. Add a toggle in options: “create community with e2e reliability enabled”. Could be a CLI/compilation option at first, to dogfood.
  4. Once stabilized, enable it by default at community creation for app version N+M. Status needs to define M (i.e., when an app version can be considered end-of-life).
  5. Then we can review whether there is a need to migrate pre-existing communities to it. Note this could be done in combination with migrating pre-existing communities to their own shard.



I wrote a bit more about handling roll out of breaking changes: Breaking changes and roll out strategies


@fryorcraken @prem re:

what is the proposal for channelID in the context of Status Communities?
Is there be a relation or restriction between a channel id and a content topic?

For now, I would suggest a single channelID per Community (matching the proposed content topic usage). The channelID concept was introduced to allow for possible future use case where filter clients only interested in certain sub-channels could participate in e2e reliability only for those channels. For now, since we don’t have such filter use cases and to build the simplest thing first, I’d just stick with one channelID for each “sync group” (i.e. each Community). We should make this clearer in the post, @shash256

@prem, re

it would be good to think of how each member would choose participant(s) to sync with

(and other points about the sync group)
The idea here is to exclude 1:1 sync interactions completely and simply add causality, ordering and ACK information to existing messages that will allow eventual consistency. The “sync group” is the entire community, and the only place where missing messages will be attempted to be retrieved will be from the store node. This comes with a bunch of assumptions, of course, that prioritises scalability above strong guarantees (e.g. what if the store node does not have the missing dependencies?, etc.). I think once we understand the behaviour of the protocol better we can add mechanisms that provide stronger guarantees. We may even introduce 1:1 interactions for smaller sync groups to speed up time to consistency.

There are some details omitted from the post on how this API will be implemented. For example, in order to save bandwidth, attempting to retrieve missing messages from the store should be done very lazily, in case the missing messages are still to be received via normal Relay broadcast. The first aim, in fact, would be to simply indicate on the UI that messages are lost if a best effort (but naive) store retrieval could not find the missing messages. We could also delay implementing the periodic sync message (this is unnecessary if the community generates enough traffic of its own).

this is a very rare scenario, but what would happen if no one is online apart from me to sync with?

Mmm. Quite a good point. The protocol mostly aims to provide publishers with information to understand whether their message has been ACKed by at least one other member of the sync group (i.e. the Community). That is because we want publishers to rebroadcast eagerly, rather than having everyone else individually pull messages that failed to be published. What will likely happen, therefore, is that this single online user will periodically rebroadcast their messages until some maximum number of attempts is reached or someone else comes online and sends a message. Of course, since the number of online users is known, it should be easy to encode some exceptional handling of this case to avoid the unnecessary retransmissions.


@shash256 one thing I think should be clarified is around the message causal dependencies and the message log.
Specifically, I think we should clarify that the causal history of a message is the last few message IDs in the message log. Then we should also clarify how message IDs get added to the message log:

  1. in the case of received messages - ID added to log once all dependencies are met and message removed from the “incoming buffer”.
  2. in the case of sent messages - ID added to log once message is ACKed and removed from “outgoing buffer”.

Point (2) is different from what we thought before and what is currently listed under “Sending a message”, but I think it might work better. This way we won’t include unacknowledged messages in the log. Otherwise a publisher could create an internal chain of causal dependencies for which no other node has received any of the messages, leading to an explosion of Store requests to retrieve these. WDYT?
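
Point (2) can be sketched as follows: a sent message's ID enters the message log (and hence future causal histories) only once it is ACKed, so a publisher never builds causal chains on messages no one else has seen. This is a toy illustration of the proposed rule, not the actual implementation:

```go
package main

import "fmt"

type manager struct {
	outgoing map[string]bool // unacknowledged sent messages
	log      []string        // IDs eligible to appear in causal histories
}

// onAck moves a sent message's ID from the outgoing buffer to the log,
// making it usable as a causal dependency from then on.
func (m *manager) onAck(id string) {
	if m.outgoing[id] {
		delete(m.outgoing, id)
		m.log = append(m.log, id)
	}
}

func main() {
	m := &manager{outgoing: map[string]bool{"x": true}}
	fmt.Println(len(m.log)) // 0 — "x" not yet usable as a causal dependency
	m.onAck("x")
	fmt.Println(m.log) // [x]
}
```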

Added a note in the post to clarify this

There is an exponential backoff strategy in place in the PoC, such that the rebroadcast happens with an added delay on each turn, but yes, this particular scenario could be handled as a special exception using the information that no one else is online.

Agree. Adding the message to the log after it’s ACKed is a relatively better approach. Updated the post to reflect this change.
