Sync as a replacement for Filter and Light push

Hello!

In this post I will explain why I think it would be advantageous to completely replace the Light Push and Filter protocols with a range-based set reconciliation (RBSR) protocol similar to Waku Sync.

First, let me briefly describe RBSR. The algorithm efficiently finds the differences between two sets. It does so by recursively splitting a set into ranges and comparing them. Concretely, it operates on ranges of hashes that are ordered by time. More criteria can be used to define ranges, but only time and hash are required; more on this later.
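
To make the mechanism concrete, here is a minimal, self-contained sketch of the idea in Python. It is purely illustrative: the fingerprint (a plain XOR of hashes), the split rule, and the size threshold are simplifications of my own choosing, not the actual Waku Sync or Negentropy wire protocol.

```python
import hashlib

# Toy illustration of the RBSR idea: each side keeps (timestamp, hash) pairs
# ordered by time. A range is summarised by a cheap fingerprint (here a plain
# XOR of the hashes). If the fingerprints differ, the range is split and the
# comparison recurses; below a size threshold, raw hashes are exchanged.

def fingerprint(items):
    """XOR all message hashes in a range into a single fingerprint."""
    acc = bytes(32)
    for _, h in items:
        acc = bytes(a ^ b for a, b in zip(acc, h))
    return acc

def reconcile(local, remote, threshold=4):
    """Return hashes present on exactly one side.

    `local` and `remote` are lists of (timestamp, hash) sorted by timestamp.
    A real protocol exchanges fingerprints over the network; both sides are
    in memory here only to show the recursion.
    """
    if fingerprint(local) == fingerprint(remote):
        return set()                                   # range already in sync
    if len(local) <= threshold or len(remote) <= threshold:
        return {h for _, h in local} ^ {h for _, h in remote}
    mid_t = local[len(local) // 2][0]                  # toy split; assumes distinct timestamps

    def split(items):
        return ([i for i in items if i[0] < mid_t],
                [i for i in items if i[0] >= mid_t])

    l_lo, l_hi = split(local)
    r_lo, r_hi = split(remote)
    return reconcile(l_lo, r_lo, threshold) | reconcile(l_hi, r_hi, threshold)

def msg(t, payload):
    return (t, hashlib.sha256(payload).digest())

a = [msg(t, b"msg%d" % t) for t in range(10)]
b = a[:7] + [msg(7, b"other")]   # B misses msgs 7-9 and holds one extra
print(len(reconcile(a, b)))      # -> 4 differing hashes
```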

In the context of Waku, the Light Push and Filter protocols are used by “light” nodes for pushing and pulling messages to and from “full” nodes. For reliability purposes, knowing whether messages have been received is always required, which requires yet another protocol in the form of Waku Store. All of this logic can be replaced by a much simpler synchronization of variable-length caches.

With this new process, a light node wanting to “send” a message would first add it to the local cache and initiate a sync with a full node. The range of values to sync is up to the initiating (light) node, a useful property for bandwidth management. Another criterion, content topic, could also be added to further limit the number of values to sync. Conversely, a full node wanting to notify light nodes of new messages would initiate syncs with various ranges and criteria based on each client's preferences. Reliability becomes a function of the range of values synced and the frequency of syncs. Keep in mind that syncing a single value is also a valid strategy.
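
As a rough illustration, the light-node side could look something like the sketch below. The `sync_client` object, its `reconcile` call, and the parameter names are all made up for illustration; this is not an existing Waku API.

```python
# Hypothetical light-node send/receive path under the sync-only model.
import time

class LightNode:
    def __init__(self, sync_client, cache):
        self.sync = sync_client          # speaks the RBSR protocol with a full node
        self.cache = cache               # local message cache, ordered by time

    def send(self, message, content_topic):
        # "Sending" is just inserting locally, then reconciling a small,
        # recent range with a full node. The full node learns about the new
        # message as part of the diff and receives it during the exchange.
        self.cache.insert(message)
        now = int(time.time())
        self.sync.reconcile(
            time_range=(now - 60, now),          # the light node picks the range...
            content_topics=[content_topic],      # ...and optionally narrows by topic
        )

    def poll(self, content_topics, window_s=600):
        # Receiving works the same way: reconcile a window of interest and
        # let missing messages flow back into the local cache.
        now = int(time.time())
        self.sync.reconcile(
            time_range=(now - window_s, now),
            content_topics=content_topics,
        )
```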

The main disadvantage of this approach is an increase in latency and a larger overhead compared with the base protocols (Light Push, Filter), but once reliability is added to the mix, the simplicity of syncing is appealing. The privacy of syncing is comparable; the status quo is maintained.

Let me know what you think below!

P.S. Stay tuned for another post on how to replace Relay with Sync… :open_mouth:


Interesting thought, using RBSR for syncing lightClients.

But I do have some concerns with this approach.

Store-based reliability is used to determine whether messages sent by a lightClient are received by the network, i.e. beyond the service/full node the lightClient is interacting with. This basically means Store is not used to establish reliability for the last mile (i.e. lightClient ↔ serviceNode) alone, but rather up to the second-to-last mile, i.e. beyond the serviceNode, which increases the probability that a message has permeated into the network.

In order to achieve something similar with RBSR, a lightClient will have to initiate sync with multiple nodes in the network to ensure that messages it sends, and messages it is supposed to receive, are not missed. If we start to think in this direction, I wonder how this would be different from gossipsub, which also uses a type of sync mechanism with a short-term message cache, but with more nodes.

This may be a problem for lightClients as it gives a bad UX for end-users. What is the latency that we envision, though?

Will this be a storage overhead? If so, it would be good to have an estimate to understand the impact it may have on lightClients (especially as they are supposed to be light in nature).


I didn’t know that, makes sense!

The advantage of RBSR is that the state required to operate (a message cache) is already needed by a light client anyway. GossipSub is also more complex than a simple sync because of mesh formation and long-lived connections.

We can design the protocol to minimize this, but syncing requires at least one round-trip. If the light node also uses Filter, we gain efficiency because we fetch and push new messages at the same time.

There is a bandwidth overhead, as we need to sync and then send the messages. Storage-wise, we can keep as few messages as we need; I don’t think it would be a problem.

True, but if we look at browser-based light clients, they may not store state the way, say, status-mobile does (where the message cache is stored in a local DB). For lightClients not based on a browser, this will not be a problem.
Also, if we are looking at this as an alternative to LP+Filter to improve reliability, then a node will end up having to sync with more than one node to achieve the same.

With sync as well, don’t we have to maintain long-lived connections, or is the idea that peers would be selected randomly for each sync session?

One RTT should not add any more latency than what we have currently for lightpush, which also involves one RTT. But I did not mean that form of latency; rather, if sync is not triggered for each message sent but only at a particular interval, that is what might add latency compared to the current lightpush/Filter, which only have one RTT of latency between service node and client.

Got it, thanks!


IMO multiple random syncs would be the best strategy.

In my mind, a sync would be triggered for each message; batching happens automatically. Imagine Node A adds a message to its local cache, then triggers a sync. The first protocol payload would contain, for example, the XOR hash of the first 3 messages for a short recent range. Node B receives it; since it doesn’t match its own XOR hash of the same range, it sends back all hashes for that range. In the meantime, more messages can be added to A’s cache, and when the payload from B arrives and doesn’t match, the missing messages are sent to B. That’s how I envision the protocol working.
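
A toy simulation of that exchange, with invented payload shapes (this is not a wire format, and plain integers stand in for message hashes):

```python
# Toy trace of the exchange described above.

def xor(hashes):
    acc = 0
    for h in hashes:
        acc ^= h
    return acc

# Node A's cache after adding msg 42; Node B has not seen 42 yet.
cache_a = {40: "msg40", 41: "msg41", 42: "msg42"}
cache_b = {40: "msg40", 41: "msg41"}

# 1. A sends the fingerprint of its short recent range.
payload_1 = {"range": (40, 42), "fingerprint": xor(cache_a.keys())}

# 2. B's fingerprint differs, so it replies with all hashes it has in range.
if payload_1["fingerprint"] != xor(cache_b.keys()):
    payload_2 = {"range": (40, 42), "hashes": list(cache_b.keys())}

# 3. Meanwhile A appended msg 43. It now knows exactly what B is missing
#    and pushes those messages; B would do the same in the other direction.
cache_a[43] = "msg43"
missing_on_b = set(cache_a) - set(payload_2["hashes"])
payload_3 = {"messages": [cache_a[h] for h in sorted(missing_on_b)]}

cache_b.update({h: cache_a[h] for h in missing_on_b})
print(payload_3["messages"])   # ['msg42', 'msg43']
```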

Thanks a lot for this.

As an initial thought, I also envision the future of Sync as becoming an essential part of our routing mechanism (lightpush, filter, relay) as you describe it, with the main advantage being the integrated reliability which we now mimic with a clumsy workaround using store checks.

  • for Filter and Lightpush, it adds the necessary cache-and-sync for which Relay has at least an IHAVE/IWANT mechanism
  • for all protocols, including Relay, it adds efficient diffs and longer-term, programmable caching on top of the IHAVE/IWANT mechanism

My main comment would be that I would see this as an addition/augmentation of the existing Filter/Lightpush (and Relay, although this may belong to your other post) and not necessarily a replacement. The “eager push” characteristic of these protocols makes them suitable not only to maintain the low-latency path but also to keep resource requirements flexible for environments where caching is not possible. I would have to develop this thought a bit more, but I also think that for incentivised services, accounting may be a bit easier for lightpush/filter (with a clear service provider and client) vs Sync, where the beneficiary and provider of the service are ambiguous. The benefit of having this as a supplementary mechanism is also that, if in future we can prove that Sync on its own provides the propagation characteristics we need, it could replace the eager-push mechanisms.

Some questions:

  1. How do you view efficient state management? I.e. for different clients each having a different set of content topics they’re interested in and requiring sync over different time periods, wouldn’t the state (and processing) required to build client-specific ranges dynamically be rather large? It seems to me that we’re getting close to the Store Query complexity given the cardinality of content topics and dynamic time ranges required for this feature. I guess the easiest way to find out may be to build and test a PoC. Of course, we may end up replacing some (or all) store queries too.
  2. “Conversely, a full node wanting to notify light nodes of new messages would initiate syncs with various ranges and criteria based on each client's preferences.”

Presumably the fact that the service node initiates the action is to keep the latency reasonable for new messages? This seems to require fairly complex state on the server side. Wouldn’t it be easier for clients to initiate this sync periodically, since clients already know what set of content topics and time ranges they’re interested in? The best scenario seems to me to maintain the best-effort nature of Filter for low-latency propagation, with Sync as a client-initiated, periodic reliability layer (similar in function to IHAVE/IWANT in gossipsub).

Overall excited about this direction to have resource-restricted routing (and perhaps store querying?) benefit from Sync.
What would you say next steps may be? Perhaps:

  • A somewhat more thorough investigation/writeup of the devil in the details here (how much state do we expect? what data structure works best? how many nodes do we sync with at a time and how regularly? how do we scale?), followed by
  • a basic PoC that demonstrates the functionality

Yes, the textbook RBSR (Negentropy) may not be the best fit in our case. One idea I had was to mark messages as new and send them right away when a sync is initiated. The design would have RBSR and message exchange as parallel systems: as diffs are found, messages are sent right away, bidirectionally (Negentropy doesn’t do this).

I find it hard to imagine where caching would be impractical. Browsers and mobile can easily handle small caches. I’m confident Blake3 hashes and clever tricks can make the system very efficient.

It depends on what we want to optimize for. Some clients may want to compute everything on the fly so as to save as much storage space as possible; others may store the 3 full duplicated indexes for efficient range scoping. We also have the advantage that our protocol would be very specific to our use case, which can save us from the cost of “genericity”. One example: we don’t need the hardening of incremental hashing since messages are RLN verified, so simple XORing would suffice. Of course the devil is in the details, but in theory it should work.

The client preferences to save would be the sync range and content topics. The “server” would optimize for scoping speed and use the full indexes I mentioned. The server may also want to continuously update the XOR hash of all the ranges of all clients as new messages are added, to speed up sync queries.
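
A small sketch of what that continuous update could look like on the server side; the class names and preference fields are assumptions for illustration, not a proposed design.

```python
# Sketch of a "server" keeping per-client range fingerprints warm as new
# messages arrive, so a sync request can be answered from a precomputed
# value instead of scanning the cache.

from dataclasses import dataclass

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

@dataclass
class ClientPrefs:
    sync_window_s: int                 # how far back this client syncs
    content_topics: set                # topics the client cares about
    fingerprint: bytes = bytes(32)     # running XOR of matching message hashes

class SyncServer:
    def __init__(self):
        self.clients = {}              # client_id -> ClientPrefs

    def register(self, client_id, prefs):
        self.clients[client_id] = prefs

    def on_new_message(self, msg_hash, content_topic):
        # Fold the new message into every client fingerprint it falls into.
        # Expiring old messages out of a client's window would XOR them out
        # again (omitted here).
        for prefs in self.clients.values():
            if content_topic in prefs.content_topics:
                prefs.fingerprint = xor_bytes(prefs.fingerprint, msg_hash)
```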

IMO, having the server initiate syncs keeps the server in control of resource management and reduces latency.

The first step would be to, indeed, investigate whether Status usage would make sense in this new paradigm, then write the details down, and then perhaps a PoC.


Can you elaborate on this? I did not get the point as to how incremental hashing is a method of hardening. AFAIU, it is just a method of computing a combined unique identifier for a set of messages.

Looking at the Status app as an example, the number of contentTopics used (even after optimizing) would be large for each application. Considering we want a single Waku network serving multiple applications, I can easily see a large increase in them if the number of apps just becomes 10x. I envision that this approach will not scale on the server side, as the resource requirements to manage the state would increase.
Also, I agree with @haelius that this is in fact a responsibility of the client, who is interested in a specific subset of the state.

Another thought that comes to me: maybe it would be good to explore options where a subset of the overall state can be generated on the fly, quickly and on demand, rather than the server storing complete state segregated by content topic?


Finding two sets of hashes that, when XORed, result in the same fingerprint is trivial. We know this is a possible attack vector and can design counter-measures. I argue that in our case this is not a problem in practice because of RLN and encrypted messages.
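
A tiny illustration of why plain XOR is malleable (toy integers standing in for hashes):

```python
# Any two sets whose hashes XOR to the same value are indistinguishable by
# the fingerprint alone.

a = 0b1010
b = 0b0110
c = a ^ b          # = 0b1100

# The sets {a, b} and {c} share no element yet have the same XOR fingerprint,
# so an attacker free to choose hashes could craft collisions. The argument
# above is that with RLN-validated, encrypted messages, attackers don't get
# to choose hashes freely.
print(a ^ b == c)  # True
```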

That was just an idea. K-D trees could also be used to store multidimensional data so as to speed up the fingerprinting process when content topics are part of the query. The specifics are engineering details and can be tailored for each use case, which is the advantage of RBSR.
