Private Chat SDK: Roadmap Skeleton

fryorcraken · April 4, 2025, 12:41am

Chat SDK work is expected to ramp up in 2025 H2. While I do not have enough information yet to draft a roadmap, here is a skeleton that will be used when planning the work.

Also, thank you to the Waku and Status CCs who investigated the current codebase [1] and enabled me to write this.

Terminology

I define “private chat” as the combination of one-to-one chats and private group chats, in contrast to Communities.

Methodology

Bottom-Up approach

The bottom is Waku, and more specifically the Waku Messaging API [2]. The top is the Chat SDK API or, most likely, a collection of APIs and libraries maintained by the Waku Chat/App team and (eventually) integrated in Status by the Status team.

We start at the bottom, review the current code and logic, tidy up, and then move on to the next module/package upward.
A clean dependency tree (internal and external deps) allows a step away from technical debt.

A private chat SDK isn’t extracted if it brings the rest of status-go in because the dependencies aren’t clean. More specifically, this could be because it uses some code that is also used by Communities and that we did not take the time to extract from status-go.

The upper API becomes the boundary between Waku Chat/App team and Status team in terms of code ownership.

Top-Down approach

A top-down approach could enable carving out a Chat SDK API first, then extracting the code, and finally re-working the internals. Such an approach would have the benefit of defining the Chat SDK API early on, so that changes can proceed behind the API without further impacting Status code.

In our case, this seems risky and impractical:

We are unlikely to be able to define an API first; there are a number of leaks across the modules (e.g. chat key). Foundational changes may be necessary, which would mean modifying the Chat SDK API.
Intra-dependencies within status-go would prevent proper extraction of the Chat SDK out of status-go. While a Chat SDK may be usable, the code extraction needed to clearly define internal APIs and behaviour would be difficult.
The 2 points above mean that internal changes behind the API may have unpredictable side effects on other app functionalities. This can only be prevented by ensuring that the modules, and who uses them, are known and defined.

In conclusion, a bottom-up approach seems to be not only the best way to “clean up” the stack (see steps in next section), but also the least risky approach.

Layers

By “layer”, I mean a modular approach with separate packages/libraries responsible for separate functional behaviours and protocols. The “layering” term expresses the fact that some modules are lower in the dependency tree; see the previously explained bottom-up approach.

Several layers of the Chat SDK have already been identified, see a rough diagram below (thanks @pablo):

From bottom to top:

Waku (message routing)
Segmentation
Encryption and authentication mechanisms
Data Sync/end-to-end reliability: MVDS, or better, SDS
“Chat” logic, which is likely to contain identity management, group management, contact requests, etc. The clear boundary with the Status app is yet to be defined.
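
To make the dependency direction concrete, here is a minimal sketch of the layering above, assuming each layer only knows a narrow `send` interface of the layer directly below it. All class names, the `build_stack` helper and the placeholder behaviours are hypothetical and for illustration only: the "encryption" and "data sync" layers merely tag the payload, they do not implement any real protocol.

```python
from typing import List, Protocol, Tuple


class Transport(Protocol):
    """The only thing an upper layer may know about the layer below it."""

    def send(self, payload: bytes) -> None: ...


class FakeWaku:
    """Stand-in for message routing: records what would be published."""

    def __init__(self) -> None:
        self.published: List[bytes] = []

    def send(self, payload: bytes) -> None:
        self.published.append(payload)


class Segmentation:
    """Splits payloads so that each piece fits the lower layer's size limit."""

    def __init__(self, lower: Transport, max_size: int) -> None:
        self._lower = lower
        self._max_size = max_size

    def send(self, payload: bytes) -> None:
        for start in range(0, len(payload), self._max_size):
            self._lower.send(payload[start:start + self._max_size])


class Encryption:
    """Placeholder only: wraps the payload in a marker, does NOT encrypt."""

    def __init__(self, lower: Transport) -> None:
        self._lower = lower

    def send(self, payload: bytes) -> None:
        self._lower.send(b"enc(" + payload + b")")


class DataSync:
    """Placeholder for MVDS/SDS: prefixes a sequence number."""

    def __init__(self, lower: Transport) -> None:
        self._lower = lower
        self._seq = 0

    def send(self, payload: bytes) -> None:
        self._seq += 1
        self._lower.send(str(self._seq).encode() + b":" + payload)


class Chat:
    """Top layer: the part an application would call."""

    def __init__(self, lower: Transport) -> None:
        self._lower = lower

    def send_text(self, text: str) -> None:
        self._lower.send(text.encode("utf-8"))


def build_stack(max_segment_size: int) -> Tuple[Chat, FakeWaku]:
    """Assemble the stack bottom-up, mirroring the list above."""
    waku = FakeWaku()
    chat = Chat(DataSync(Encryption(Segmentation(waku, max_segment_size))))
    return chat, waku
```

Because every layer depends only on `Transport`, any of them can be replaced or black-box tested in isolation, which is the property the bottom-up approach is after.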

A deliverable under a Chat SDK related milestone would consist of doing the following for one of the layers/modules/libraries:

1. Understand current behaviour (RFC)
2. Define new behaviour (RFC update)
3. Implement new behaviour
   a. Either brand new code (nim)
   b. Or modifying existing code (go)
4. Clean dependencies: only use IFT maintained or recommended dependencies
5. Define (RFC, optional) and implement API
6. Black box testing by Vac-QA (functional) and Vac-DST (performance), thanks to API and RFCs
7. Make upper layer use the API and dogfood (always)

(1) enables us to understand the interconnecting tissue of the codebase, identify the functionalities and draw boundaries around them.

(2) will be driven by product (Status) and technical (Waku) needs.

The driving force is to ensure that the protocol can operate over Waku in a secure and scalable manner, meaning RLN can be applied and produce reasonable properties in terms of bandwidth usage and app usability, Waku Store is not used as a CDN, etc.
The properties of chat features (PFS, plausible deniability, group scalability, etc) must match Status and IFT expectations, and the Chat SDK must enable new desired features (e.g. contact discovery).

(3) Whether a new library is written or current code is modified will depend on (2) and on how much of the current logic/code is re-usable.
If code is modified (3.b), then we will need to review whether a later re-write is needed or expected.

Breaking changes are expected. Breaking change management [3] is already in place. While it will not be a show stopper, it is likely to create some temporary mess during the switching [4] phase.

Either way, a review will be needed to understand how experimental features and PoCs will be dogfooded in the Status app [5].

(4) When writing a new library, consultation with Vac-ACZ will be needed to understand which libraries to use for specific encryption schemes. When changing existing code, a review will still be needed: some chat features depend on go-ethereum for hashing and discovery (hopefully the latter will go away with nwaku integration). This step ensures that those are cleaned up.

(5) An RFC that defines the API (e.g. [6], [7]) is not always needed. But it is important to consider it. It depends on the underlying protocol.
Using existing Waku protocols as an example, the API can often be deduced from the protocol (e.g. Waku Filter [8], Waku Store [9]), but not always (e.g. p2p reliability [10]).

(6) With specifications and a clean API, it then becomes possible for Vac-QA to proceed with thorough functional testing, as well as defining Reliability and Performance commitments for Vac-DST to check.

(7) and (5) may happen in parallel (3.b) or sequentially (3.a). Making use of a deliberate API may require refactoring and resolving technical debt [11] in the upper layer. Every piece of work should aim to be dogfooded as soon as possible. PoCs are to be integrated full stack, to get feedback and show working software.

Skeleton Roadmap

Milestone: Hardening and Scaling Foundations for Private Chats [12]

This milestone remains, as it does step (1) for the Chat SDK stack.
It is foundational work to understand the next steps. It includes a review of the usage of RLN, to better understand whether the rate limit can be applied with the current protocols, and the obstacles to clear.
It also covers implementing a rate limit in the chat protocols now, which may be higher than desired (e.g. 600 msgs/epoch instead of 100) to cope with the chattiness of the current protocols.
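
As an illustration of what a local rate limit could look like, here is a minimal fixed-window sketch. The class name `EpochRateLimiter`, the injected clock and the 600-second epoch used in the example are assumptions made for illustration; only the 600 msgs/epoch figure comes from the text above, and this is not RLN itself (no proofs or memberships are involved).

```python
import time
from typing import Callable, Optional


class EpochRateLimiter:
    """Allows at most `limit` messages per epoch (hypothetical helper)."""

    def __init__(
        self,
        limit: int,
        epoch_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if limit <= 0 or epoch_seconds <= 0:
            raise ValueError("limit and epoch_seconds must be positive")
        self.limit = limit
        self.epoch_seconds = epoch_seconds
        self._clock = clock
        self._epoch_index: Optional[int] = None
        self._sent = 0

    def try_send(self) -> bool:
        """Count one message and return True if the epoch has quota left."""
        epoch_index = int(self._clock() // self.epoch_seconds)
        if epoch_index != self._epoch_index:
            # A new epoch started: reset the counter.
            self._epoch_index = epoch_index
            self._sent = 0
        if self._sent >= self.limit:
            return False
        self._sent += 1
        return True


# Example with the generous limit mentioned above; the epoch length is assumed.
limiter = EpochRateLimiter(limit=600, epoch_seconds=600)
if limiter.try_send():
    pass  # hand the message to the lower layer here
```

Lowering the limit later (e.g. from 600 to 100) is then a parameter change, provided the protocols above have become less chatty.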

New Milestone (name pending)

Deliverable: Segmentation

Proceed with the steps above for segmentation. This is likely to be a useful library for any Waku user, either available in the Waku Messaging API or as an easy-to-use tool/library.
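
For readers unfamiliar with the idea, segmentation can be sketched as follows: a payload too large for a single Waku message is split into indexed segments, and the receiver reassembles them once all have arrived, in any order. The `Segment` fields, the SHA-256 message identifier and the function names are assumptions for illustration and do not reflect the current status-go wire format.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Segment:
    message_id: bytes  # SHA-256 of the full payload (assumed identifier)
    index: int         # position of this segment, starting at 0
    count: int         # total number of segments for the message
    payload: bytes


def segment(payload: bytes, max_segment_size: int) -> List[Segment]:
    """Split `payload` into segments of at most `max_segment_size` bytes."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    message_id = hashlib.sha256(payload).digest()
    chunks = [
        payload[start:start + max_segment_size]
        for start in range(0, len(payload), max_segment_size)
    ] or [b""]  # an empty payload still yields one (empty) segment
    return [
        Segment(message_id, index, len(chunks), chunk)
        for index, chunk in enumerate(chunks)
    ]


class Reassembler:
    """Collects segments per message and returns the payload when complete."""

    def __init__(self) -> None:
        self._pending: Dict[bytes, Dict[int, bytes]] = {}

    def add(self, seg: Segment) -> Optional[bytes]:
        if not 0 <= seg.index < seg.count:
            raise ValueError("segment index out of range")
        parts = self._pending.setdefault(seg.message_id, {})
        parts[seg.index] = seg.payload
        if any(index not in parts for index in range(seg.count)):
            return None  # still waiting for other segments
        payload = b"".join(parts[index] for index in range(seg.count))
        del self._pending[seg.message_id]
        if hashlib.sha256(payload).digest() != seg.message_id:
            raise ValueError("reassembled payload does not match message id")
        return payload
```

A real library would also need to bound the memory held for incomplete messages and expire them; that is left out of this sketch.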

Deliverable: RLN membership integration

This deliverable focuses on the integration of the RLN smart contract in the user flow. A first step would be to have an experimental mode where users deposit to get an RLN membership, as per the current RLN specs [13]. This involves engineering work to generate and secure RLN credentials, proof generation and verification in the app, as well as an API for UX concerns.

A second deliverable is likely needed to define future steps, such as diversification of entrypoints [14] (e.g. Status sponsored membership distribution) or proof generation performance improvement on mobile.

Deliverable: Thrifty Mutually Authenticated Session-Based Ratcheting Protocol - one-to-one chat PoC

The first step would be to review the encryption and authentication mechanism to provide a solid foundation to future work. This deliverable aims to:

Improve the application of the rate limit via RLN, to enable a limit high enough that most users aren’t aware of it, and low enough that it provides reasonable bandwidth usage/protection on mobile and desktop.
Provide modular and sustainable foundations, reducing the number of encryption mechanisms used across the application, and reconcile the desired properties with the actual implementation of chat (privacy, etc).

Due to its complexity and known implementation challenges, MLS may not be the first choice (yet).

The scope of this PoC would be limited to one-to-one chats, no private groups and no device pairing/syncing at first. FURPS should define reduced message production, in comparison to existing protocols. The output would be a new library (nim), specs and an integration in an experimental Status build (to be defined).

The path to instead improve current encryption mechanisms isn’t closed off. Further discussions to happen once the hardening and scaling foundations for private chats milestone is delivered [12].

Deliverable: Association-Based Identities

Another caveat is the pervasive use of a chat key as identity across the Status app. This creates coupling across the code base, reducing the modularity of the encryption layer.

Security problems also arise from this design: what happens if this key is compromised? There is no recourse to recover the identity. Finally, it prevents more advanced app spam protection mechanisms that can be efficient at the encryption layer (e.g. currently, a rejected contact request makes the rejecting party produce messages on the network and negotiate a handshake).

By moving to association-based identity, a number of building blocks are introduced:

Foundation to enable users to identify their chat via other external identities: eth wallet, farcaster, btc wallet, etc
Ability to recover an account after loss
Ability to bring other devices (installations) for pairing using a common stack (“installations” in a group chat can either be another device of the same user, or another user; it is the same from this layer's PoV)

The last point is about better separating the concerns of each layer, making the identity mechanism agnostic to the context (Status app). This allows for a more re-usable and flexible library.

Deliverables: Private Groups and Device Pairing

With the 2 previous deliverables, it becomes possible to review the protocol stack for private groups and device pairing.

The previous deliverables act as a foundation that enables those two features to use the same technology stack: several devices from the same or a different user are a set of installations that have a secure channel (group chat) to communicate in a thrifty (low message generation) manner.

Association-based identity helps handling whether a new installation joining the group is a new user, or a new device of an existing member.

For device pairing, FURPS should focus on simplicity and reliability in comparison to the current device pairing feature. Feature parity should not be expected out of the box, and a review of the actually desired properties needs to happen with the Status team.

For private groups, FURPS should focus on message generation, as well as performance and group size. Mechanisms such as sender key should enable a reduction of operation costs.

Next Steps

The work currently in progress, and future discussions to happen here (Vac Forum) and during the all hands, should focus on the following points:

Confirm the work needed to make Status chat securely scalable; understand what rate limit can be set and what rate limit we want to target, based on bandwidth usage targets and behaviour assumptions to be discussed with the Status team [12];
Ensure that priorities and desired properties are aligned with IFT and Status stakeholders;
Clarify that the proposed deliverables do bring those desired properties, and are indeed the best path forward;
Discuss team organization, and better define the boundaries between:
- chat/app dev (Pablo) and chat app research (Jazz), those boundaries should be modelled on the Waku core team (eg Hanno/Ivan);
- chat/app (Pablo/Jazz) and core (Hanno/Ivan/Sasha): Matters such as end-to-end reliability and encryption have historically been handled by the core team, but those domains are likely closer to chat/app expertise;
- chat/app (Pablo/Jazz) and Status (Volo/Icaro/Jo): I do not believe we will be able to get a definitive answer just yet, but some discussions can happen;
The current skeleton suggests new foundations over re-use. The question of nim competency in the chat/app team needs to be raised and discussed, including whether hiring on either of the chat/app teams should be requested. As a re-write approach is suggested, a strong justification will be required to support it over reusing existing code, and a well-documented case with compelling arguments must be presented;
How experimental changes should be handled in the Status app, and how those are dogfooded by CCs, but also by keen users;

References

fryorcraken · April 14, 2025, 11:50am

Please note that this roadmap does provide a view on what work would be needed to reach the situation of having a working Chat SDK that is scalable and secure, including DoS protection.

Having said that, the critical path remains the integration of RLN in the application.

The first step remains having an understanding of the current protocols, to enable the integration of RLN and setting of rate limit parameters.

Even if we have RLN parameters set very high due to the limitations of the current chat protocols (e.g. 1000 msgs per 10 min), it is a better situation than having a fancy new chat protocol running on a network that is not DoS-protected.

The output of Hardening and scaling foundations for private chats remains:

local rate limit
plan to setup full RLN integration

Then we execute RLN integration and saturate this work with chat/app team engineers.

And only then can the modification or re-writing of the Chat SDK, bottom up, happen.

This also means being clever in terms of blockers. I have put a lot of importance on UX for users that reach their limits. An interesting and different viewpoint is that the rate limit should be low enough to protect the network, and high enough to be ignored by most users.

In a situation where we know that the rate limit has to be excessively high due to the current protocols being greedy, some of the UX constraints and blockers are removed, as we could expect that no user would reach the limit.
This could remove some potential obstacles.

Once RLN is integrated, then the rate limit could be lowered as:

we make chat protocols more savvy in terms of message production (see first post in this thread)
we make the UX better when users reach the rate limit (dependency on Status)