The future of Waku Store

Store’s limitations

Waku Store has been the source of various issues, due to it being misused in place of actual decentralized storage (see Optimizing Community Description).

I have already identified that we should reduce the exposed functionality of store in the Messaging Waku API (API was renamed).

SDS is becoming an integral part of the Waku stack (see Introducing the Reliable Channel API - #14 by fryorcraken), and there has always been a question on how to best define the boundary and/or integration of Codex and Waku Store.

Do note that SDS defines the usage of store message hash queries to recover missing messages. This is an issue, as it creates potential linkability between store clients by revealing their interest in the same message content.
(The linkability is not absolute, as some clients may use message hash queries simply to retrieve missed messages spotted via p2p reliability, but statistical attacks become possible over long-running observations.)

Moreover, decentralizing store is one of the remaining open problems, with incentivization being a necessary building block towards it, in addition to capability discovery.

Finally, I have recently highlighted that store is over-engineered.

What problem does Store solve?

Store is a generic solution (let’s have a DB!) to a problem. First, we need to review the actual problem and needs.

Store generally solves the issue of devices missing messages while they are disconnected from the network. However, this is a very high-level description of the problem; we need to dig a bit deeper:

(1) I know I was offline, I am now back online and want to get the messages propagated through Waku Relay that I missed

(2) I did not know I was offline, and I want to get the messages I may have missed (good read)

(3) I sent a message over the wire, I want to confirm that some of my neighbouring peers did receive the message.

WAKU-P2P-RELIABILITY proposes clear solutions to these problems, leveraging store among other protocols:

P2P reliability

Let’s look at the usage of store in this protocol:

Detect and remedy losses > Failure to publish

See Store-based reliability for a specification of the strategy.

Detect and remedy losses > Failure to receive

Nodes using either 11/WAKU2-RELAY or 12/WAKU2-FILTER to receive messages MAY determine message losses by combining these protocols with 13/WAKU2-STORE to compare their local message history with the historical messages cached in the 13/WAKU2-STORE. The usual remedial action is to retrieve the missing messages from the 13/WAKU2-STORE. See Store-based reliability for a specification of this strategy.

Store-based reliability > Store-based reliability for publishing

The publisher MUST periodically perform a presence query to the 13/WAKU2-STORE service against the message hashes of all published messages in the outgoing buffer to verify their existence in the store.

Store-based reliability > Store-based reliability for receiving

  • The node MUST periodically perform a content filtered query to the 13/WAKU2-STORE service, spanning the time period for which reliability is required (“reliability window”) and including all content topics over which it is interested to receive messages. The include_data field SHOULD be set to false to retrieve only the matching message hashes from the 13/WAKU2-STORE service. If a connection loss is detected (e.g. if a query fails due to disconnection), the next query MAY span at least the time period since the last successful query if this is longer than the reliability window.
  • The node SHOULD perform a message hash lookup query for all missing message hashes to retrieve the full contents of the corresponding messages. It MAY do so either periodically (in batches) or upon reception of the content filtered query response.

Store is being used in 3 ways:

A. hash message queries to check message presence

  • include_data = false
  • message_hashes = <hashes of sent messages>

B. hash message queries to retrieve message content

  • include_data = true
  • message_hashes = <hashes of messages detected as missing>

C. content filter queries to detect missing messages

  • include_data = false
  • content_topics = <content topics subscribed to>
  • end_time = now()
  • start_time = min(<start of reliability window>, <time of last successful store query>)
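These three query shapes can be sketched as follows. This is a hedged Python sketch: the `StoreQuery` fields and helper names are illustrative stand-ins mirroring the A/B/C parameters above, not an actual Waku Store wire format or API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical query shape mirroring the A/B/C parameters above;
# not an actual Waku Store wire format.
@dataclass
class StoreQuery:
    include_data: bool
    message_hashes: list = field(default_factory=list)
    content_topics: list = field(default_factory=list)
    start_time: Optional[int] = None
    end_time: Optional[int] = None

def presence_query(sent_hashes):
    # A. check that published messages reached the store (include_data = false)
    return StoreQuery(include_data=False, message_hashes=sent_hashes)

def retrieval_query(missing_hashes):
    # B. fetch the full content of messages detected as missing
    return StoreQuery(include_data=True, message_hashes=missing_hashes)

def detection_query(topics, now, window_start, last_success):
    # C. list hashes seen on subscribed content topics; span at least the
    # reliability window, extended back to the last successful query if older
    return StoreQuery(
        include_data=False,
        content_topics=topics,
        start_time=min(window_start, last_success),
        end_time=now,
    )
```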

These strategies have several limitations:

  • A store node receiving a message does not mean the actual recipient received it
  • Heavy load on store nodes:
    • Periodic queries for every message sent
    • Periodic time range queries to check whether any message has been missed
  • When considering privacy and censorship-resistance:
    • It breaks the receiver anonymity provided by Waku Relay (revealing interests via content topics), unless store queries are done over mix
    • If sender anonymity is provided via light push over mix, then store queries must also be done over mix to retain it
  • It only provides reliability over a limited time range (the “reliability window”), as extensive time ranges are impractical: they would require store nodes to retain that much data
  • Exposing message hash queries to the application layer may lead to linkability between nodes (two store clients systematically asking for similar message hashes could mean they are in the same chat group)

e2e reliability

SDS was designed to overcome the limitations above by providing e2e reliability.
It means that reliability is provided between the actual sender and recipients of a message, instead of only among local nodes (i.e., nodes with a direct connection).

e2e reliability latency is greater, so a mix of p2p and e2e reliability is usually advised: it gives faster resolution of detected network losses, as well as better heuristics for ensuring a message actually reaches its recipients.

```mermaid
timeline
    title Reliability windows
    section Message archival (Codex?)
        Start of conversation
        Weeks ago
    section E2E Reliability (SDS)
        Days ago
    section P2P Reliability
        Hours ago
        Now
```

Note: The time ranges are not representative of best usage of each protocol. Further work is needed to better define them.

Let’s review the store reliability logic described above in the context of SDS:

(p2p reliability) Store-based reliability > Store-based reliability for publishing
While this may provide faster feedback to the application that a message may have not gotten through, SDS provides more certainty on whether a message was received by the intended recipient(s).

(p2p reliability) Store-based reliability > Store-based reliability for receiving
This may provide faster bootstrapping when getting back online (one store query to get recent messages). However, it has scalability issues (the oldest message the client can get depends on recent traffic and store DB size) and decentralization issues (heavier requirements on the store DB, in terms of size, mean more centralization of the service).
SDS, and SDS Repair, reduce the requirements on high-availability store nodes, at the cost of latency and message rate usage.

Message hash queries

Providing message hash queries to application logic leads to potential privacy leaks: users of the same application, in the same channel, are likely to request similar message IDs, enabling store providers to correlate client IPs over time.
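To make the leak concrete, here is a toy sketch (entirely hypothetical data and scoring; real attacks would be statistical over long observation periods) of how a store provider could correlate clients from their hash query logs:

```python
# Toy sketch: a store provider logging (client, queried_hash) pairs can
# score how correlated two clients' interests are. A high overlap sustained
# over time suggests the clients participate in the same chat/channel.
def overlap_score(queries_a: set, queries_b: set) -> float:
    """Jaccard similarity of two clients' queried message hashes."""
    if not queries_a or not queries_b:
        return 0.0
    return len(queries_a & queries_b) / len(queries_a | queries_b)

client_1 = {"h1", "h2", "h3", "h4"}   # hypothetical query log per client IP
client_2 = {"h2", "h3", "h4", "h5"}   # likely in the same channel as client_1
client_3 = {"h9"}                     # unrelated client

assert overlap_score(client_1, client_2) == 0.6   # 3 shared / 5 total
assert overlap_score(client_1, client_3) == 0.0
```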

Hence, we recommend the removal of:

  • hash message query usage from SDS
  • exposure of hash message queries in the Waku API

Hash message queries can remain at the p2p reliability layer, which does not reveal application-domain behaviour or information.

Required functionalities

Considering the above, the store protocol functionalities can be reduced to:

key-value: ability to retrieve message content (payload) given a message ID (message hash), aka message hash queries. Requests:

  • must include the message hash
  • should include shard/cluster information
  • could include the content topic (for server performance optimization), as the client always knows this information

key presence: ability to check the presence of a message, given a message ID (message hash). Similar to key-value but the message payload does not need to be sent in the response.
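A minimal sketch of the two reduced functionalities, assuming an in-memory stand-in for the store node; the request and class names are illustrative, not a proposed wire format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KeyValueRequest:
    message_hash: str                     # must include
    shard: Optional[str] = None           # should include (shard/cluster info)
    content_topic: Optional[str] = None   # could include (server optimization)

@dataclass
class KeyPresenceRequest:
    message_hash: str   # same lookup, but the response omits the payload

class ToyStore:
    """In-memory stand-in for a store node serving both functionalities."""
    def __init__(self, messages: dict):
        self.messages = messages  # message hash -> payload bytes

    def key_value(self, req: KeyValueRequest) -> Optional[bytes]:
        # key-value: return the payload for a known message hash, else None
        return self.messages.get(req.message_hash)

    def key_presence(self, req: KeyPresenceRequest) -> bool:
        # key presence: only report whether the hash is known
        return req.message_hash in self.messages
```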

Wayback machine: request for IDs/hashes of messages recently seen on a given content topic. From now (time of query), in backward order:

  • payload could never be included, delegating this functionality to key-value. This potentially increases the number of requests from the client, but the functionality separation should enable better performance and decentralization fine-tuning
  • max message hashes per page to be set by both client and server
  • pagination becomes a core feature that can be optimized for this paginate-backward-from-now usage
  • start/end time and direction parameters can be dropped, as a query is either:
    • from now to the past, backward
    • OR, next page of the above, backward

When using such queries, the client would compare returned message IDs with locally seen message IDs, proceeding with key-value queries to retrieve the payload of never-seen messages.

The client may paginate back until either:

  • the time of a previous successful wayback machine query is reached, giving the client assurance that they have likely seen all messages for the given content topic(s)
  • and/or the latest application-level relevant messages are retrieved, such as SDS messages with causal history, enabling a switch to a different recovery protocol such as SDS-Repair. Whether to switch may depend on window recoveries, bandwidth restrictions of the client, etc.
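The paginate-backward-and-diff flow described above could look like the following sketch (all names hypothetical; pages stand in for successive wayback-machine responses):

```python
def wayback_sync(server_pages, local_ids: set, stop_time: int):
    """Paginate backward from now, collecting hashes never seen locally.

    server_pages: newest-first pages of (timestamp, message_hash) tuples,
    standing in for successive wayback-machine responses.
    stop_time: time of the previous successful wayback query; once reached,
    the client has likely seen everything on this content topic.
    Returns the hashes to fetch via key-value queries.
    """
    missing = []
    for page in server_pages:
        for ts, h in page:
            if ts <= stop_time:
                return missing      # reached the previously synced point
            if h not in local_ids:
                missing.append(h)   # never seen locally: fetch via key-value
    return missing

pages = [
    [(100, "h5"), (90, "h4")],   # first (newest) page
    [(80, "h3"), (70, "h2")],    # next page, further in the past
]
assert wayback_sync(pages, local_ids={"h5", "h3"}, stop_time=65) == ["h4", "h2"]
```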

From the above, one can realise that the Wayback machine is actually a differential mechanism enabling a client to synchronize with a server in terms of seen message IDs, and to identify which ones are missing.

This is what Waku Sync does, meaning that the wayback machine could be fully replaced by Waku Sync (with support for a content topic dimension).

Proposed roadmap

Best done in the proposed order, except for the items whose timeline is TBD.

As usual, for every change we make, dogfooding, QA and DST testing need to be done, and FURPS need to be defined to understand what parameters are required to reach acceptable performance for UX.

1. Remove usage of hash message queries above Waku API, eg, by SDS

Timeline: Now/2026H1

Due to privacy concerns, it would be best to remove usage of message hash queries (key-value) and instead iterate on performance of SDS Repair + p2p reliability.

The potential issue with SDS using message hash queries is that it would positively impact latency and UX. But if we later decide to remove it over privacy concerns, we may have to “degrade” the performance only to then “re-improve” it, entrenching us in a solution that we know is not appropriate for our privacy goals.

Instead, it would be better to attempt to reach good UX without it, and to identify areas of improvement early on.
Re-introducing it temporarily could be considered, as a stop-gap solution, if it becomes critical.

2. Waku API/P2P reliability to align with key-value/key presence/wayback machine model

Timeline: 2025H2/2026H1

Ensure that the Waku API and p2p reliability implementations are aligned with the proposed model. This means moving towards clearer boundaries between the 3 functionalities, which will make future work easier by allowing each functionality to be worked on separately:

  • replacing the wayback machine with Waku Sync
  • optimizing one of the features above, if necessary (e.g. moving key-value to NoSQL, dropping payload data from PostgreSQL, etc.)
  • deploying a specific functionality in a lightweight manner for desktop instances: e.g. enabling key presence but not key-value
  • deploying incentivization differently for each protocol (or not at all; e.g. incentivize key-value and Wayback machine/sync, but not key presence)

Backwards compatibility and migration: These changes will be implemented behind the Waku API, meaning application developers will experience them as standard library version updates. The API abstraction layer will handle the transition, requiring only dependency version bumps in application code. Communication will be provided through release notes and migration guides, but no fundamental refactoring of application logic should be required.

3. Replace Wayback machine with Waku Sync

Timeline: TBD

This can be prioritized if:

  • we need lower bandwidth usage on the client side
  • we want lower resource usage on the server side
  • we want to better decentralise the feature
  • we want improved speed for an extended p2p reliability window

4. Rewrite implementation of these functionalities

Timeline: TBD

Rewrite the PostgreSQL implementation of store. This should be prioritized if:

  • we need to reduce resource usage on server side
  • we want to decentralise the server side, eg serve from desktop instances

5. Define new protocols

Timeline: TBD

If we need to do (4), then the first step would be to specify new protocols for each functionality, making each a separate libp2p protocol.

This would also be necessary if we want to make any of these protocols more decentralized (e.g. deploy them on desktop instances), which requires being able to differentiate each functionality from a discovery, libp2p and stream point of view.

Conclusion

Waku Store’s current implementation creates significant challenges around privacy, scalability, and decentralisation. By exposing message hash queries to the application layer, Store compromises the receiver anonymity that Waku Relay provides while placing unsustainable demands on store nodes.

The solution is to refine Store into three distinct functionalities: key-value retrieval, key presence checking, and synchronisation. This separation enables us to address each functionality’s unique requirements independently and resist the temptation to optimise for short-term UX at the expense of privacy guarantees.

The proposed five-phase roadmap prioritises privacy-preserving solutions first, with changes implemented behind the Waku API abstraction layer. This allows application developers to benefit from improvements through standard library updates without significant refactoring. By clearly delineating these functionalities, we create the flexibility to deploy, incentivise, and decentralise each component according to its specific requirements.


A missing item is that Waku sync can also help replace, and enhance, key presence and its usage. The following item can be added:

N. Replace and enhance key presence with Waku Sync

Timeline: TBD

Instead of the client asking whether a key is present on the server, the client could initiate a Waku Sync request that includes the recently sent message IDs.

This enables, in a few requests, to:

  • check whether recent messages were seen by the server but missed by the client
  • get the presence check for the client’s sent messages
  • encourage the server to fetch the outbound message from the client, if not seen before
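A toy sketch of the idea, with plain set reconciliation standing in for Waku Sync's actual range-based protocol (all names hypothetical):

```python
def sync_round(client_ids: set, server_ids: set):
    """One naive reconciliation round between client and server ID sets.

    Returns (presence, missed_by_client, wanted_by_server):
    - presence: which of the client's IDs the server has also seen
      (this subsumes the key presence check for recently sent messages)
    - missed_by_client: IDs the server has but the client lacks
    - wanted_by_server: IDs the client has but the server lacks; the server
      can then pull those outbound messages from the client.
    """
    presence = client_ids & server_ids
    missed_by_client = server_ids - client_ids
    wanted_by_server = client_ids - server_ids
    return presence, missed_by_client, wanted_by_server

p, m, w = sync_round({"sent1", "seen1"}, {"sent1", "seen1", "new1"})
assert p == {"sent1", "seen1"}   # server saw our sent message: presence check
assert m == {"new1"}             # we missed one message: fetch via key-value
assert w == set()                # nothing for the server to pull from us
```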

This can be prioritized if:

  • we need lower bandwidth usage on the client side
  • we want improved performance for an extended p2p reliability window

A broader usage of Waku Sync was proposed in Sync as a replacement for Filter and Light push

Waku Sync specs: specs/standards/core/sync.md at master · waku-org/specs · GitHub