Use Waku in a Bandwidth-Efficient Way

Abstract

Waku, as a real-time communication network, inherently uses network bandwidth for message routing. Its protocols are designed to facilitate the exchange of messages between peers in decentralized systems, but efficient bandwidth use is crucial to ensure scalability and resource management, especially for devices with limited connectivity or data plans.

Waku supports different protocols for routing messages, for example Relay and Lightpush / Filter. Each protocol has its own strengths and trade-offs, making it essential to choose the right one based on use case and resource constraints.

This specification aims to provide guidance on how to use Waku in a bandwidth-efficient way, including best practices and potential optimizations for existing applications like Status.

Best Practices

Favor the Lightpush/Filter Protocols on User Devices

The Relay protocol propagates messages in real time across the network but can result in significant bandwidth usage due to the broadcasting nature of its message routing.

The Lightpush and Filter protocols are more bandwidth-efficient, as they minimize unnecessary data transmission, either by pushing messages to specific peers or by filtering out irrelevant messages based on the topics a client is interested in.

To optimize bandwidth usage on user devices, it is recommended to favor the Lightpush and Filter protocols when possible. The shortcoming of this approach is that the application must provide access to service nodes to facilitate these protocols. Unlike the Relay protocol, which operates in a fully decentralized manner by broadcasting messages across the network, Lightpush and Filter require intermediary nodes to handle message forwarding and filtering.

In the long term, the Waku Network should implement an incentivization model to encourage the provision of Lightpush and Filter services, ensuring scalability, decentralization, and reliability. By incentivizing service node operators, the network can promote wider participation, distribute the operational load, and create a sustainable system for handling bandwidth-efficient protocols.

Potential Issues and Optimizations

Multiple factors can cause excessive bandwidth usage, impacting overall bandwidth consumption in the Status app. The following sections discuss some of the potential issues and options for optimization; different options can be combined to achieve the best result.

Global Shard Message Routing

Direct messages (DMs) and group chats are currently routed through the default shard /waku/2/rs/16/32, meaning every relay peer handles these messages. As user volume increases, this leads to exponential traffic growth, potentially overwhelming the network and making it unsustainable for users to bear the escalating bandwidth demands.

Option 1

A more practical approach would be to enforce the use of Lightpush and Filter protocols for global shard messages if such protocols are not enabled by default. This ensures that instead of relaying all messages across all peers, only relevant messages are pushed to nodes that have explicitly subscribed to them. This reduces overall network traffic and prevents unnecessary data from being routed to all peers, mitigating the risk of exponential traffic growth as the user base expands.

Concerns:

  • depends on service nodes to facilitate Lightpush and Filter protocols, which may introduce centralization and privacy concerns.

Option 2

Implement a dynamic sharding system for global message routing. When a user joins the network, they are assigned a shard, and this shard index is broadcast to all their contacts. Additionally, shard information is embedded in contact links and contact request messages. Users can switch to a lower-traffic shard as needed, and any shard changes are broadcast to all their contacts to maintain communication consistency.

Concerns:

  • handling the complexity of shard changes, e.g., when a user switches to a different shard, the application must ensure that all their contacts are aware of the change to avoid message loss.

  • traffic could grow exponentially as more contacts are added.

Option 3

Do not join the global shard Relay network; instead, fetch messages from a Store node periodically.

Concerns:

  • it will introduce additional latency and create a dependency on the availability and performance of the Store node, which results in a poor UX.

Message Retransmission in MVDS

Direct messages, group messages, and some community join request messages are sent via MVDS. When the recipient is offline, these messages are repeatedly resent until a predefined limit is reached, which increases bandwidth usage.

Option 1

Replace MVDS with a more reliable and bandwidth-efficient end-to-end (E2E) reliability protocol to reduce message retransmissions.

Concerns:

  • fully adopting an e2e reliability protocol requires significant time and resources for implementation.

Option 2

Disable MVDS retransmission for recipients that are not online/active.

Concerns:

  • user may set their status to offline but still want to receive messages. (TODO needs input from Status team)

Option 3

Increase the time interval between message retransmissions to reduce bandwidth usage. The current resend epoch calculation in the Status app is as follows:


next_epoch = current_epoch + (2^(send_count−1)×30×3) + rand(0,30)

The interval can be increased by adjusting the constant factor (30) to a higher value, e.g., 60 or 90, to reduce the frequency of message retransmissions.
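
For illustration only, here is a minimal Go sketch of this calculation with the constant factor exposed as a parameter; the function and variable names are hypothetical and not taken from status-go:

package main

import (
	"fmt"
	"math/rand"
)

// nextEpoch computes the epoch of the next retransmission.
// baseFactor is the constant from the formula above (currently 30);
// raising it to 60 or 90 stretches the backoff and reduces retransmission frequency.
func nextEpoch(currentEpoch, sendCount, baseFactor int64) int64 {
	backoff := (int64(1) << (sendCount - 1)) * baseFactor * 3 // 2^(send_count-1) × factor × 3
	jitter := rand.Int63n(30)                                 // rand(0, 30)
	return currentEpoch + backoff + jitter
}

func main() {
	// Example: third resend attempt with the default factor vs. a doubled factor.
	fmt.Println(nextEpoch(1000, 3, 30))
	fmt.Println(nextEpoch(1000, 3, 60))
}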

Concerns:

  • it may increase the latency of message delivery.

Community Description

Refer to Optimizing Community Description

Store Node Queries for Missing Messages and Sent-Message Checks

Regular queries to store nodes can create additional network traffic.

Option 1

Use e2e reliability for missing-message retrieval and sent-message checks.

Device Synchronization

To ensure a consistent user experience across multiple devices, a lot of messages are sent through the Waku global shard, for example user profile, contacts, communities, activities, etc.

Scope of user profile data: what we need to back up to Waku and sync.

Option 1

Change the product design to only allow sync through a negotiated shard when two devices are online at the same time.

Waku can allocate a range of shards specifically for syncing messages, allowing users to subscribe to a shard only when needed. From the user’s perspective, this operates as a transient shard, dynamically utilized for short-term tasks.
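
Purely as an illustration of the idea (the reserved shard range and the derivation below are assumptions, not an agreed design), two paired devices could deterministically derive the same transient sync shard from material they already share, avoiding extra negotiation messages:

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const (
	clusterID      = 16 // cluster used in the shard examples above (/waku/2/rs/16/...)
	syncShardStart = 64 // hypothetical first shard index reserved for device sync
	syncShardCount = 64 // hypothetical size of the reserved range
)

// syncShardTopic derives a transient pubsub topic for syncing from
// material both paired devices already share (e.g. a pairing secret).
func syncShardTopic(sharedSecret []byte) string {
	h := sha256.Sum256(sharedSecret)
	idx := syncShardStart + binary.BigEndian.Uint16(h[:2])%syncShardCount
	return fmt.Sprintf("/waku/2/rs/%d/%d", clusterID, idx)
}

func main() {
	fmt.Println(syncShardTopic([]byte("example-pairing-secret")))
}

Both devices compute the same topic, subscribe to it only for the duration of the sync, and unsubscribe afterwards.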

Concerns:

  • the UX does not match the current design.

Option 2

Do not sync messages through Waku; instead, use BitTorrent or another system to sync message history, and use the following backup flow for profile, contacts, and communities synchronization.

User Data Backup

Frequent backups of user data, such as profile information, contacts, and communities (e.g., ApplicationMetadataMessage_BACKUP), can be bandwidth-intensive. It’s also tightly coupled with device synchronization.

See: PR 2413 and PR 2976.

Option 1

Waku provides a new protocol (namely, a stateful store):

  • there needs to be another table in the store node’s database; let’s call it states for now

  • each record in the states table has the fields (id, pubkeys, content, pubsubTopic, contentTopic), where pubkeys is a list of public keys of the user’s devices (see the record sketch after this list).

  • we will assign a new shard to route the messages of state creation and updates.
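
As an illustration only (the field names come from the list above, the Go types are assumptions), such a record could look like:

package store

// StateRecord is a hypothetical row in the store node's states table.
type StateRecord struct {
	ID           string   // identifier of the state record
	PubKeys      [][]byte // public keys of the user's devices allowed to update it
	Content      []byte   // latest backup payload
	PubsubTopic  string   // shard on which state creation/update messages are routed
	ContentTopic string   // content topic identifying the kind of state
}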

How users back up their data:

  • the device sends the user profile along with the device’s pubkey and a signature of the message hash

  • the store node receives the message, verifies the content with the bundled pubkey (see the sign-and-verify sketch after this list), and saves it to the states table. (TODO Status app may use the same key for different devices, need to confirm)

  • when a new update event happens, the device sends a new message just like the previous one

  • the store node again verifies and updates the content accordingly
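
A minimal sketch of the sign-and-verify step, using ed25519 purely for illustration (Status devices use different key material, and all names here are hypothetical):

package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

func main() {
	// Device side: sign the hash of the backup payload with the device key.
	pub, priv, _ := ed25519.GenerateKey(nil)
	payload := []byte("serialized user profile backup") // backup content
	hash := sha256.Sum256(payload)
	sig := ed25519.Sign(priv, hash[:])

	// Store node side: verify the signature against the bundled pubkey
	// before creating or updating the record in the states table.
	if ed25519.Verify(pub, hash[:], sig) {
		fmt.Println("accepted: write/update the states table")
	} else {
		fmt.Println("rejected: signature does not match the bundled pubkey")
	}
}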

Note:

To handle conflicts between different devices, we use an LWW (Last Write Wins) strategy, which means the last update message is the one saved to the states table.

Each device should fetch the latest state from the store, update its local state, then compose and send the new backup message to the store.
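
A minimal sketch of this LWW conflict handling on the store side, under the same assumptions (a sender timestamp stands in for whatever clock is finally chosen; all names are hypothetical):

package main

import "fmt"

// stateUpdate is a hypothetical, already-verified update from one of the user's devices.
type stateUpdate struct {
	id        string // identifies the state record, e.g. one per account
	content   []byte // latest backup payload
	timestamp int64  // sender clock used for Last Write Wins
}

// states is an in-memory stand-in for the store node's states table.
var states = map[string]stateUpdate{}

// applyLWW keeps only the most recent update for each record.
func applyLWW(u stateUpdate) bool {
	if cur, ok := states[u.id]; ok && cur.timestamp >= u.timestamp {
		return false // the stored record is newer or equal: ignore the update
	}
	states[u.id] = u // last write wins
	return true
}

func main() {
	applyLWW(stateUpdate{id: "profile:alice", content: []byte("v1"), timestamp: 100})
	accepted := applyLWW(stateUpdate{id: "profile:alice", content: []byte("stale"), timestamp: 90})
	fmt.Println(accepted, string(states["profile:alice"].content)) // false v1
}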

Concerns:

  • it requires a significant amount of work to implement the stateful store protocol.

Option 2

The backup should be disabled by default, and manually triggered by the user when necessary.

Users can enable periodic backups and set the interval as needed. Currently, periodic backup is enabled by default.

Status Update Messages

Users send status update messages to the global shard every few minutes. As the number of users grows, this traffic can become heavy.

Option 1

Favor the Lightpush and Filter protocols for such messages.

Option 2

Do not broadcast the message via the Waku pubsub protocol; instead, use a direct p2p connection (WebRTC) to the contacts for routing messages.

If the status update message is for a community, it should be broadcast to the community’s assigned shard.

Concerns:

  • it may require a significant amount of work to implement the p2p connection between contacts.

Thanks! I added some comments below. :slight_smile:

In summary, I think what we need is (see [Deliverable] Status usage of Waku scaling and bandwidth optimization recommendation · Issue #197 · waku-org/pm · GitHub):

  1. an overview specification of Status app procedures (we can start with communities only) that is explicit on how we should think about the frequency/routing/reliability/encryption of each procedure. For example, that way we can use the spec as blueprint to start splitting traffic into separate shards for global, self-addressed, and community functions, as many of your options below suggest.
  2. telemetric analysis to understand where the largest bandwidth bottlenecks lie
  3. incremental work to improve bandwidth, lowest-hanging fruits first. For example, we can start by using lightpush/filter on the new self-addressed shard (i.e. a special shard for user backups and sync), fix Community Description issues, etc. depending on what our telemetry indicates, including considering the implementation of a stateful store as you suggest.

A major shortcoming is also that these light protocols sacrifice a lot in terms of privacy and censorship resistance. What decentralisation properties they have they derive from a strong, diversified Relay backbone network. If we were to switch to a lightpush/filter default model for Status clients right now, we will have to deploy a lot of service nodes, all managed by a single, centralised entity (us). Definitely agree with your point on service incentivisation playing an important role to strengthen this infrastructure in future.

A lot is also built on the intrinsic incentivisation model where participants/beneficiaries of the network (Status clients) also contribute to the network (by e.g. providing lightpush/filter for certain shards). The design goal of a successful Waku network for Status protocols is not simply to lower bandwidth usage as much as possible (for that we should not use p2p protocols), but to find the balance between anonymity, latency and bandwidth that works best. The final goal here is likely for Status clients to support a combination of using lightpush/filter for certain procedures on certain shards and full Relay support for other procedures on other shards. For that we need to follow a systematic approach where we properly categorise and shard different types of app traffic and procedures, use telemetry to see empirically where bandwidth bottlenecks lie and incrementally improve our bandwidth usage until we have a protocol specification that reaches a useful bandwidth-anonymity-latency balance.

Implement a dynamic sharding system for global message routing.

Our roadmapped plan for global message routing (1:1s, private group chats) is to use autosharding + RLN. I think we need telemetry to understand the priority of this. My feeling is that the lower hanging fruit would be to address the issues of user backups and device syncs.

Change the product design to only allow sync through a negotiated shard when two devices are online at the same time.

How about an intermediate solution/workaround where we use a separate shard(s) for device sync and user backups to which no clients have a Relay subscription? Only Store nodes are subscribed to these no-broadcast shards. For now, clients can backup/sync via the Store nodes by simply using lightpush to publish the sync messages (no bandwidth used for routing).

Waku provides a new protocol (namely, stateful store),

It seems to me the stateful store itself is more about saving storage space than saving bandwidth? From the point of view of the client, whether the store node stores multiple copies of the backup data for each user or updates a single entry should not necessarily affect bandwidth? I think we’re likely to design a stateful store at some point, as it’s very useful for many applications. :+1:

we will assign a new shard to route the messages of state creation and updates.

Similar to the workaround described for device sync, this new shard could be a lightpush-only shard which will immediately provide huge bandwidth savings without requiring any changes in our store protocol.

The backup should be disabled by default

Depending on app requirements, this could be a good idea in any case.


Glad we can keep discussing this topic here.

Imagine there are 3 or even more apps using relay mode on the same desktop; the bandwidth consumption will be unacceptable from the user’s perspective. If our design is not able to support 3 apps using the same protocol, we are headed in the wrong direction IMO.

Relay has better privacy/anonymity compared with lightpush/filter, but it’s not production ready. We need to push Relay into a production-ready mode, and this is the relay service network IMO; it’s much lighter compared with the store service, and easy to set up.

We also need to maintain privacy/anonymity at an acceptable level for the lightpush/filter protocols. I feel even the current design already fits 90% of users. Actually, it’s much better than Discord or Matrix in terms of privacy/anonymity, if I understand correctly.

I have been thinking about autosharding, pubsub topics and content topics recently; the current protocol seems restricted by fixing the mapping from content topic to pubsub topic.
The pubsub topic is about routing, while the content topic works like an identifier for services the app/user is interested in; for example, a store node may only want to store messages from contentTopic-1 but not contentTopic-2.
The pubsub topic is orthogonal to the content topic; in other words, the pubsub topic is about how messages reach some peers, and the content topic is about how full peers provide services to client peers.
I’m not sure if it all makes sense, but I’m keen to share my thoughts so that we can have an early discussion around it.

This approach makes sense for the backup process, but for the sync process, users may expect a real-time experience when all devices are online, and with a store node there is always a request interval.

Combining the stateful store with a dedicated shard for the update messages will reduce bandwidth, since users won’t subscribe to this shard; instead, users just query the state from the store or subscribe to the global shard, which routes the hash or id of the state update message.

Thank you for the detailed review. Great work.

Would be good to back this up with some data so we can understand the expected gain of rolling out MVDS. I don’t expect that now, but it is more work we will need to plan once the e2e reliability protocol is rolled out.

Indeed, we know when a user was previously online: even with “offline” status, they still broadcast their X3DH bundle on a regular basis. If the bundle has not been seen for a while, then it would make sense to stop retransmitting because we know they are offline.

There would be some edge cases:

  1. Alice sends a message to Bob
  2. No ack; Alice checks Bob’s most recent X3DH bundle: older than 5 days
  3. Alice stops retransmission and goes offline
  4. Bob goes online, emits a message
  5. Alice goes online, sees a recent X3DH bundle, retransmits.

The issue is that if both Alice and Bob are sporadic users, then they may miss each other. But the message should be in the store. I think it may be interesting to study this option further.

Is that in seconds?

When doing a hash query, do we end up redownloading our message (i.e., the store replies with the full message)? Could we update the store hash query to only return message presence, to save download bandwidth?

This seems to be a big chunk of work. I think it may make sense to have an iterative approach and deploy a solution for the biggest culprit (Community Description?), and then move other messages to the same solution if it makes sense.

This is back to Optimizing Community Description - #18 by fryorcraken

Are the requirements for user data backup and community description similar enough so that we could develop one solution to solve both problems?

The main issue I see is that all community members are interested in the community description, whereas only one user (several devices) is interested in backup data.

In the case of the community description, it could be sent to IPFS and all community members could pin the message to ensure it’s accessible even if the community owner’s node is offline.

But it does not really make sense to do that for backups. Also, we need to think long term: who is going to pay for those backups? Using Waku is unlikely to be the most cost-efficient way for a user.

I think this needs to be reviewed from a product pov first, and then the right technology can be selected/designed for this problem.

What is that? How frequent and how big?

This is enough to solve the problem here. Alice receives her messages on contentTopic-1, which is autosharded to shardA.
Alice can subscribe to this shard.
Bob is also on shardA, so Alice can send messages to Bob.
Charlie is on shardB, so Alice uses Lightpush to send a message to Charlie.

Yes, this is something we have in mind, we are just waiting to have a clear demand to plan the work for it.

I think I don’t understand the sync process well enough; back to some of my previous questions: when devices are paired, do they still need to communicate?

If so, then indeed it would make sense to investigate direct connections and ad hoc networks.


Yeah, actually each epoch is a 300ms interval, so 30 * 3 means 30s in this sense.

The message is only downloaded when needed/missing, but it still happens on a fast interval: 3 seconds (skipped if there is no outgoing message) for the sent-message check, and 1 minute for the missing-message query.

Agree, this is something that needs to be reviewed thoroughly.

They both share the need for stateful storage and authenticated updates, which is the reason why I propose the stateful store protocol.
It’s an unsolved problem in my experience, and Waku has a shot at providing a capable solution. The worst case is that when a better protocol comes out to replace it, we can always deprecate the stateful store protocol.

It’s sent by every user every 5 minutes if I remember correctly, and it’s not big. It is used to show the online status of a user, and subsystems like the UI and MVDS depend on it.

A lot of sync messages happen during different requests, for example ApplicationMetadataMessage_SYNC_INSTALLATION_COMMUNITY. But my knowledge of the sync use case is limited; it would be good to have the Status team chime in here. cc @Patryk @jonathanr


Ah yes, sorry I did not remember we had this option available on store queries:

bool include_data = 2; // Response should include full message content

I think it’s important that we do review current solutions to support whatever decision we make (use other protocol or create new).


That might not be possible for a link that is already shared. We had the same problem with links for communities, which do not include the shard information for this exact reason. The solution we came up with was to periodically advertise the shard info on the default shard.

Offline status should be unrelated to MVDS. Users with status set to offline will still receive direct messages and send ACKs. I understand it can be a bit confusing, but the offline status is purely a client-side setting; protocol-wise, everything functions as usual.

+1. Tweaking exponential backoff parameters seems to be a quick win indeed.

Please note that device syncing is also happening asynchronously, i.e., devices can be online alternately. A stateful store from the next point could also be used in this case; we would need a signal message to indicate that a sync message was pushed to the stateful store.

Please note that currently the backup is distributed not as one but as many messages: status-go/protocol/messenger_backup.go at 031b5342f1782a4245a120222c708449db3b4c00 · status-im/status-go · GitHub. I don’t know the rationale behind it. I can only guess that the concern was the message size, as the message segmentation layer had not been implemented at that time.

Devices use the same pubkey because they back up the same account. Devices are distinguished by installationID, but it is private among paired devices if I recall correctly.


Wouldn’t it be problematic in terms of privacy? Using direct connections can expose IP addresses of contacts.

I believe that extending the proposal with a logical clock, rather than using LWW, would make the potential solution more robust and less complex. Without this approach, numerous edge cases could arise. For example, one might fetch the latest state from the store, then go offline for a week (e.g., by suspending the PC), and subsequently push a backup that overwrites data from other devices. By incorporating a logical clock, we only need to handle the if newClock > oldClock condition within the stateful store and ensure this clock is incremented on the clients.

source: Optimizing Community Description - #25 by Patryk

I don’t think this can ever be solved by better bandwidth management. If/when Waku becomes popular and widely used, we will need to start thinking about how applications can share a single node on the device rather than each of them running their own node :) But that is for a separate and more complex discussion.


Heavy agree with this concept. The user should understand that they depend on some “heavy” node and know how to run one themselves (embedded in some app), and then have a number of “light” nodes that depend on that heavy node.

It should be easy to point your light node to a specific heavy one, or fallback to some default (bootnodes I imagine).

Anything outside of this experience would result in a specialized niche community of limited size, almost exclusively because of the bandwidth consumption it imposes on an individual user.


There could be two paths to running the “heavy” nodes. One is a trusted way based on the user’s network / social connections / communities, where the user may pay the nodes directly somehow.

The other is a trustless way with proper incentivization for such nodes, where the user pays the network somehow. RLN only covers Relay, not the other services provided by such heavy nodes.

Users can always run their own heavy node on a home server.

The two options may eventually merge to best serve users, but we need to discuss which path to take first.


Yeah, that’s a valid concern; there is a balance between privacy and usability, and I think the user should have an option to decide which one is more important.

This is the test result after running relay and light mode for 2 hours.
tl;dr: the bandwidth consumed in relay mode grew from 7x to 50x that of light mode over the 2 hours.

cc @fryorcraken @haelius @pablo @Patryk

Interesting, will look in more details.

The fact that relay consumes significantly more than light is good news, because it means light mode consumes significantly less than relay and achieves the objective of making Waku usable on mobile, despite the artificial redundancy and other reliability-related mechanisms.

At the end of the day, we need to expect that either:

  • a user runs a relay node, meaning that the resources consumed by Waku may be high, but they can use it for free/cheap
  • or, a user runs a light edge node, meaning that their resource consumption will be low, but they’ll have to pay in one way or another for the usage of resources, whether directly or by doing whatever action in the Status app makes it worth it for Status (or a Community owner) to subsidize their costs.

I wrote more about that in Cost related to Waku infrastructure - Messenger - Status.app

We are in the land of sovereignty. We cannot expect infrastructure to be subsidized by data harvesting and have comparable bandwidth usage to Discord. There is no free lunch.


Is this even needed? I would like the Status product team to consider dropping this feature until there is a clear need, and when re-introducing it, do it properly.

It is expected with most decentralized apps that your local profile is lost if you lose your device, as long as the seed is safe.

What information is in there? If it’s settings + profile data, then it would mean we are impacting all users’ bandwidth for the sake of a user saving 5 minutes to re-upload their profile picture when reinstalling on a new device?


My suggestion is to update this diagram to reflect the current Status <> Waku usage and start finding alternative, more efficient means of delivering the functionality.

I think this is an excellent idea: converting @kaichao’s original post to a visual.
This, in addition to splitting traffic per shard to get measurements and reporting them in a visual, will help us get an overview and communicate about it.