Waku v2 scalability studies

Introduction

Waku is being relied upon more and more every day, and efforts are now well underway to extend the use cases it serves.

In line with that, it stands to reason that as an organization that pushes for the adoption of a protocol, we should be well aware of the boundaries within which it operates, and the various ways and reasons it degrades or fails. For the lazy, here are the bullet points listed at the bottom w.r.t. “future work”:

  • What proportion of Waku v2’s bandwidth usage is used to propagate payload versus bandwidth spent on control messaging to maintain the mesh?
  • To what extent is message latency (time until a message is delivered to its destination) affected by network size and message rate?
  • How reliable is message delivery in Waku v2 for different network sizes and message rates?
  • What are the resource usage profiles of other Waku v2 protocols (e.g. 12/WAKU2-FILTER and 19/WAKU2-LIGHTPUSH)?
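To make the first question concrete, here is a back-of-envelope sketch of how the payload vs. control split could be modeled for a single gossipsub node. All parameters (mesh degree, heartbeat interval, IHAVE gossip window) are illustrative assumptions, not measured Waku values:

```python
# Back-of-envelope model of payload vs. control bandwidth for one gossipsub node.
# All parameters are illustrative assumptions, not measured Waku values.

def bandwidth_split(msg_rate, msg_size, mesh_degree=6, gossip_degree=6,
                    heartbeat_interval=1.0, msg_id_size=32, gossip_window=3):
    """Estimate bytes/sec spent on payload vs. control traffic.

    Payload: each message is forwarded to every mesh peer.
    Control: each heartbeat, IHAVE gossip advertises the message IDs seen in the
    last `gossip_window` heartbeats to `gossip_degree` non-mesh peers
    (GRAFT/PRUNE traffic is ignored for simplicity).
    """
    payload_bps = msg_rate * msg_size * mesh_degree
    ids_per_ihave = msg_rate * heartbeat_interval * gossip_window
    control_bps = (gossip_degree * ids_per_ihave * msg_id_size) / heartbeat_interval
    total = payload_bps + control_bps
    return payload_bps, control_bps, payload_bps / total

payload, control, share = bandwidth_split(msg_rate=10, msg_size=1024)
print(payload, control)  # payload dominates here because messages are large
```

Even this toy model suggests the answer depends heavily on message size and rate: for large, frequent messages payload dominates, while for tiny or rare messages the fixed control overhead takes over, which is exactly what the experiments should quantify.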

We have done work in this area before, and outlined potential future work in this article, but there remains a significant amount of additional confidence to be gained around the use and scale of Waku v2.

This post outlines some of the reasoning behind why I think this should be prioritized, and an initial proposal for how we can start doing ongoing testing of the distributed technology we (Vac + Status + Logos) build and maintain.

Previous and Similar Work

There is currently quite a bit of research and effort being put into understanding the scale of the underlying gossip layer (gossipsub) that Waku is built upon, by organizations like the Filecoin Foundation/Protocol Labs and the Ethereum Foundation. We can of course piggy-back off this work to help gain clarity around the fundamental limitations of what we build upon, but it only goes so far in giving insight into the scale of the abstractions and features we build on top.

Justification of Prioritization

Status Requirements

Arguably the largest current consumer of Waku as infrastructure is the Status clients and, very soon, the Communities product offering within them. During conversations with them, they emphasized the desire to better understand the following scenarios:

  • What are the current walls Status will hit as a function of user growth?
  • What are the main contributing factors to these walls, e.g.
    • number of total communities vs number of total users
    • specific features’ contributions to total message throughput
    • specific graph structure of the underlying p2p network and gossip parameters
    • payload size
    • required level of availability
    • required ratio of available network services vs clients that only consume
  • How can the application fail safely?
  • What are the key indicators of failure?
    • is it gradual or immediate?
  • Is a scaling wall global to the network, or can it be isolated to an individual community’s growth?

I’ll let the Status client teams add/edit any of that. More generally, Status would like to push the Communities feature as hard as possible while maintaining a threshold quality of service. As I understand it today, almost none of those thresholds are well understood.

Additional Waku Outreach

Recently, we have been spending significant resources to push Waku as infrastructure into different areas of the ecosystem, all of which have varied requirements w.r.t. expected quality. In other words, how they expect to scale out and strain Waku will be different.

A part of these engagements is the feedback loop they provide throughout their deployments and integrations. While this is incredibly valuable and integral to the success of Waku development, it is purely reactive and, in some cases, may hinder the actual production deployment of Waku into these programs because they hit a blocker from some current limitation of Waku that was not understood up front.

Summing up

Having a more fundamental understanding of Waku’s scalability will allow us the following:

  • Better identify current ideal project integrations and the limits of their usage throughout the ecosystem.
  • Help prioritize future research and development based on current bottlenecks of scale weighed against our desired protocol offerings.
  • Build data-driven confidence around the robustness of the current protocols (Maybe an additional tier of RFC specifications can be added that points to a level of battle-testing a given protocol spec has gone through, e.g. “Hardened”)
  • Cultivate a culture of rigor within the broader ecosystem around protocol development
  • Help users of Waku better understand the expectations and limits of an underlying protocol, which facilitates their overall product development and troubleshooting (is this my fault or a known limitation of what I build on?)
  • other stuff that could be named but this is getting long

An Initial Proposal

To be clear, I (program owner of Logos) am fully planning to build in-house resources for “distributed systems testing” to continuously be doing this for the products we offer under the collective. A JD for the initial hire along these lines should be out shortly (If you’re part of the community and this excites you, contact me). I expect that to eventually grow into a team. I am a firm believer that we should work to understand and gain as much confidence as we can around the tech we build and push out, both theoretically and experimentally (I am a physicist at heart still). It is my expectation that Vac will be a strong partner and contributor to this work as a research entity.

In the name of efficiency, I think that working with a platform like Kurtosis would not only allow us to get started quickly, but also develop organizational-wide competence in a platform that can be used in a myriad of other ways, namely:

  • shipping testnets and aiding developers
  • simulating complex distributed network operations
  • applying traditional standards testing
    • It is in their planned roadmap to issue a suite of tests that mimic Jepsen analysis.
  • easing the process of getting users bootstrapped and configured appropriately
  • hardening CI/CD and integration testing

I have talked with them quite a bit already, and they are excited to work with us. It was my initial plan to use them extensively in the Logos work, but integrating them at this point is not feasible while we are in such deep research. Since we plan to use them, I think it is useful to start gaining internal competence and producing useful results with the other products that are ready to be integrated into their platform, and Waku is the obvious choice here.

I would suggest we develop a few key experiments that either help us harden our understanding of fundamental network scale or validate our security assumptions of what the network provides. My immediate lean is towards understanding how far Waku can scale as a function of users within the Status Communities product (obviously, more details of what that experiment entails are required). We would then work with Kurtosis to get integrated and perform these experiments.

The immediate next step would be scheduling a call with them with a desired experiment to run to map out the steps to get integrated and how we would perform this experiment using them. Their main requirement is that all services needed within an experiment or test are dockerized, which we have (but may need a few updates to get current?).

Discuss

So what I’d like is some thoughts on what you all think about this: what should be prioritized, false assumptions or errors I’ve made, your level of desire to participate, information around this topic that I’ve missed on your side, and whatever else you think should be added.

Rant over.


Thanks for that @petty. I was actually thinking about writing something about js-waku scalability, as I have been explaining the same thing to a few people. This made me realize I need to be more transparent about what is going on at the moment.

Allow me to hijack your discussion and provide some information on js-waku’s scalability status and goals.

Scalability Metrics for the Browser

Browser Waku nodes can be qualified as endpoint nodes:

  • A browser node needs to connect to a service node (nwaku, go-waku, NodeJS js-waku).
  • While it is possible to establish browser-to-browser connections, it is only possible to do so via an existing connection to a service node (signalling server or Waku network).
  • Only secure websocket is currently supported in js-waku, meaning that the remote service node needs to provide an SSL connection, which includes a certificate that can be accepted by the user’s browser. This is unlikely to be the case for Waku nodes running on desktop or mobile.

Hence, the first scarce resource for browser nodes is the number of available connections:

  1. Browser nodes consume connections from Waku service nodes.
  2. Browser nodes do not provide connections to other Waku nodes.

(2) is currently true but could change in the future, as one browser node could establish connections to several browsers using a single connection to a Waku service node. However, it is unclear how reliable and practical this would be.

In terms of nwaku (and go-waku)'s connection capabilities:

  • nwaku nodes run by Status are set up to accept a maximum of 150 connections.
  • We do not know how many connections a nwaku/go-waku node could accept given specific hardware parameters (e.g. Raspberry Pi, X memory, Y CPU).

Some assumptions

  • The recommended mesh size for Waku Relay is 6, meaning that ideally, a Waku Relay node should have at least 6 connections to other Waku Relay nodes to form a healthy gossipsub mesh.
  • For nodes that do not use Relay but instead LightPush and Filter[1], we haven’t set recommendations on the number of LightPush/Filter nodes one should connect to. Most likely it should be 2 or higher to ensure reliability and not rely on a single remote node.
  • For Store, likely to be the same as LightPush/Filter.
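As a sketch of the redundancy reasoning above (connect to 2 or more LightPush nodes so a single unreliable service node cannot silently drop a publish), the following hypothetical helper publishes via several service nodes and succeeds if any one of them acknowledges. `send_via` is a stand-in for a real LightPush request, not an actual js-waku or nwaku API:

```python
# Sketch of the ">= 2 service nodes" recommendation: publish via several
# LightPush peers so one flaky remote node cannot silently drop the message.
# `send_via` is a hypothetical stand-in for a real LightPush request.

def publish_with_redundancy(message, peers, send_via, min_peers=2):
    """Attempt to publish via several service nodes; succeed if any one acks."""
    if len(peers) < min_peers:
        raise ValueError(f"need at least {min_peers} LightPush peers, got {len(peers)}")
    acks = 0
    for peer in peers:
        try:
            if send_via(peer, message):
                acks += 1
        except ConnectionError:
            continue  # one unreachable service node should not fail the publish
    return acks > 0

# Usage with stubbed peers: peer-a drops the message, peer-b acks it.
ok = publish_with_redundancy(
    b"hello", ["peer-a", "peer-b"],
    send_via=lambda peer, msg: peer == "peer-b")
print(ok)  # True: the publish survives peer-a being unresponsive
```

The same reasoning applies to Filter and Store: subscribing to, or querying, at least two service nodes hedges against a single remote node being down or misbehaving.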

The current state of the art

Discovery

  • js-waku’s default bootstrap method connects to a single remote peer from a set of nodes hosted by Status.
  • js-waku’s bootstrap API enables developers to pass a function, list of peers or ENR tree to retrieve nodes for bootstrapping.
  • An ENR tree is used for DNS Discovery as defined in EIP-1459. It enables the developer to set a domain name in their dApp and encode a list of nodes under that domain name, which can later be edited or extended without changing the deployed code.
  • Discv5 is being implemented in nwaku to enable a node to find other nodes outside the DNS Discovery protocol: Ambient Discovery Roadmap (incl Disc v5) · Issue #879 · status-im/nwaku · GitHub
  • Discv5 is not adapted to a browser context as it takes time and has high connection churn.

Note that DNS Discovery is not currently used by default in js-waku and we haven’t yet provided guides or tutorials for it because of one blocking item: native websocket support in nwaku.

Native Websocket support in nwaku

  • Currently, only secure websocket transport is supported by js-waku in the browser (future work includes WebRTC and WebTransport)
  • Currently, js-waku browser nodes connect to Status nodes via websockify. Websockify wraps nwaku’s TCP connection with TLS + websocket (ie, secure websocket).
  • Native secure websocket support was recently introduced in nwaku, thanks to nim-websocket.
  • To build an ENR tree, one needs the ENR of each node. This is provided by nwaku via JSON-RPC or in the logs. This does not work if nwaku is not aware of the port and domain name to use in the ENR, i.e., if an external piece of software, such as websockify, is used to provide the secure websocket transport.

Limitations

The current limitations are as follows:

  • Secure websocket is not yet stable in nwaku.
  • It is not easy to set up an enrtree that includes wss when running nwaku nodes with websockify.
  • js-waku only connects to 1 remote node because of the limited number of nodes run by Status and the limited number of connections provided by these nodes.
  • Beyond bootstrapping, there is no peer discovery method available in js-waku.

Another way to see it is that today, if one wants to deploy a service node that would be used by browser nodes, they:

  1. Need to deploy nwaku
  2. Need to get a domain name, point it to nwaku
  3. Get a valid SSL cert (letsencrypt works)
  4. Possibly deploy websockify due to low reliability of nim-websocket
  5. Register their node somewhere so that js-waku discovers it:
    i. Either hard code the node details in the app’s code base (if they actually are running an app and not just being an operator)
    ii. Or include in a domain name used for DNS Discovery

What we are doing about it

Short term we are aiming to get rid of (4) and (5).

Websocket

A lot of effort has been made by the nwaku and nim-websocket maintainers to stabilize the code. We expect drastic improvements with the next nwaku release (v0.10, ETA this month, June '22).

DNS Discovery

The websocket stabilization will enable js-waku to offer DNS Discovery as a default bootstrap method: Use DNS Discovery by default · Issue #517 · status-im/js-waku · GitHub. With this, we will also increase the number of Status-operated bootstrap nodes used by js-waku (Waku Connect + Vac Prod), allowing an increase in the default number of connections for js-waku.

We will also publish guides on how to set up your own DNS Discovery, enabling platforms to deploy their own nodes. This is currently possible but cumbersome.
Note that domain names used for EIP-1459 can point to each other. For example, a project can run their own nodes and also point to another project’s or Status’ DNS, so that a js-waku node can access a greater number of bootstrap nodes.

Connection benchmark

Once nwaku’s websocket implementation is stable, we can start running benchmarks to learn how many wss connections a single nwaku node can support, enabling the browser:nwaku node ratio to increase.

Other Discovery

We are designing a peer exchange protocol: New RFC: 34/WAKU2-PEER-EXCHANGE · Issue #495 · vacp2p/rfc · GitHub, which would enable browser nodes to leverage discv5. The idea would be for a browser node to request peers from a remote Waku node; the remote node would then return the peers it found during its last discv5 run.

This would enable a browser to connect to nodes beyond the bootstrap ones, increasing the potential pool of nodes and hence the available connections.
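The described flow could be sketched roughly as below. The class and method names are hypothetical illustrations, not the actual 34/WAKU2-PEER-EXCHANGE wire format:

```python
# Hypothetical sketch of the peer-exchange flow described above: a service node
# caches the peers found during its last discv5 run, and a browser node (which
# cannot run discv5 itself) requests a sample of them. Names are illustrative.
import random

class PeerExchangeService:
    def __init__(self):
        self.discv5_cache = []  # refreshed by the node's periodic discv5 runs

    def on_discv5_run(self, found_peers):
        self.discv5_cache = list(found_peers)

    def get_peers(self, num_peers):
        """Answer a requesting node with a random sample of cached peers."""
        return random.sample(self.discv5_cache,
                             min(num_peers, len(self.discv5_cache)))

# A browser node asks a service node for 2 peers beyond its bootstrap set.
service = PeerExchangeService()
service.on_discv5_run(["enr:-node1", "enr:-node2", "enr:-node3", "enr:-node4"])
extra_peers = service.get_peers(2)
print(len(extra_peers))  # 2
```

Serving cached discv5 results (rather than running a discovery walk per request) keeps the request cheap for the service node and fast for the browser, at the cost of slightly stale peer lists.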

Conclusion

We made Waku adaptive so that it can run anywhere, from cloud to mobile to browser. Any benchmark or scalability consideration needs to consider:

  • Which protocols are we measuring?
  • On which platforms are we measuring these protocols?
  • What topology are we targeting? Does this topology match the Waku Network in practice, or just a subset of it?

Contrary to other p2p networks, the Waku Network includes nodes of various resources, including the browser, which can make the topology more “heterogeneous” than other p2p networks such as Ethereum (1) or Bitcoin.

[1]: More work is needed on Filter; it currently only works js-waku<>go-waku (when js-waku is in the browser), and it is also not (yet) possible to disable Waku Relay in js-waku.


Thanks, @petty and @froy!

I agree that kicking off proper scalability studies for Waku v2 should be a high priority.

A couple of comments and a proposal:

Adaptive nodes and scalability

Our concept of “adaptive nodes” running different protocols indeed muddies the question of scalability. The Waku Network, by definition, is heterogeneous, as @froy highlighted above. I also agree with him that the essential scalability (and reliability) question boils down to how many nodes “provide” connections vs nodes that simply “consume” connections.

Nodes would be only consumers of connections if:

  • they’re not running relay protocol (e.g. nodes running only filter, lightpush and store)
  • they’re not discoverable
  • they are running in an environment that somehow restricts incoming connections (e.g. behind a strict NAT, in a browser, etc.)

I believe across the three main Waku clients there are already several parallel efforts to increase the number of nodes providing connections instead of simply consuming connections, including:

  • a general purpose peer-exchange protocol for discovery
  • discv5 for Status Desktop clients (already enabled)
  • more advanced NAT traversal strategies
  • etc.

Clients that can’t establish a satisfactory number of relay connections may choose to “fallback” to filter and lightpush.

relay as main driver for scalability

In my opinion, the most important question we want to answer (especially with an eye on Communities) is how a well-connected relay network scales. This is because:

  • All other protocols depend firstly on the health of the relay mesh and then on the performance/resources of the service nodes. I’d argue, for example, that the scalability of request-response protocols such as store, filter and lightpush depends primarily on the resource availability of their selected service nodes. Since these protocols are “centralised”, it’s also easier for us to devise our own scalability tests without needing a distributed setup as for relay.
  • We aim to enable and encourage as many nodes as possible to provide relay connections as summarised above. Having a good understanding of how such a relay network will perform will also inform how we prioritise these “run your nodes” initiatives, such as better discovery, NAT traversal, etc.

In other words, perhaps we can assume a certain level of homogeneity as a starting point for experiments: relay nodes that are discoverable and within an environment where they can easily connect to other peers and accept incoming connections.

Proposal for first experiment

My rough proposal for a first experiment in collaboration with Kurtosis:

Setup

  • Run a relay network with x amount of nodes using discv5 to establish a well-connected mesh (afaik this approximates a typical Communities setup on Desktop clients).
  • Publish (from a random subset of nodes) messages of size s at a rate r

Measure

Varying x, s and r, measure:

  1. message delivery reliability
  2. message delivery latency

To measure reliability we can add a simple seq counter to the message payload. Latency can already be measured using the sender timestamp field.
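A minimal sketch of that analysis, assuming each publisher embeds a monotonic seq counter and a sender timestamp in the payload (the timestamps below are illustrative; real Waku messages carry a sender `timestamp` field):

```python
# Sketch of the measurement approach above: a seq counter in the payload gives
# delivery reliability (missing seqs = losses); the sender timestamp gives
# latency. Timestamps here are illustrative seconds, not real Waku timestamps.

def analyze(received, expected_count):
    """received: list of (seq, sent_at, recv_at) tuples for one publisher."""
    seqs = {seq for seq, _, _ in received}
    reliability = len(seqs) / expected_count          # fraction of seqs delivered
    latencies = [recv - sent for _, sent, recv in received]
    avg_latency = sum(latencies) / len(latencies) if latencies else None
    return reliability, avg_latency

# 4 of 5 published messages arrive at this subscriber; seq 2 was lost.
received = [(0, 0.00, 0.12), (1, 1.00, 1.15), (3, 3.00, 3.09), (4, 4.00, 4.20)]
reliability, latency = analyze(received, expected_count=5)
print(reliability, round(latency, 3))  # 0.8 0.14
```

In the real experiment this would run per subscriber node, with x, s and r swept across runs, so reliability and latency can be plotted against network size and message load.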

Answer

This will help us answer the questions:

  • at exactly what messaging load performance, latency and reliability start to degrade or fall below an acceptable threshold
  • at what mesh sizes performance, latency and reliability start to degrade or fall below an acceptable threshold

Other metrics (such as node resource consumption) can form part of the experiment. Ideally, the experiment should be repeated separately for a homogeneous network of nwaku nodes and go-waku nodes to highlight any differences between the clients (and possible areas to improve).


@haelius is making a good point, let’s elaborate what the relay connection metric means for the browser.

Let’s say we have done our benchmarks and we can recommend that a nwaku node running in a droplet can take 10,000 network connections.
Then yes, it could support around 10,000 browser connections (well, fewer, because it would still need to connect to other service nodes).
However, nwaku is configured with 12 mesh peers for Relay, meaning that the node will maintain a gossipsub mesh of 12 peers. So the other ~9,988 peers will not be part of the mesh and will not receive messages from nwaku over Relay.

Which means that the second metric we need to look at is the number of peers in gossipsub mesh.

We will need to understand how big a mesh can be and still be efficient. Possibly the result will be that if a peer cannot provide connections then it should not rely on Relay to receive messages.

@haelius what are your thoughts on browser client using Relay if they are not able to provide network connections?

@froy, apologies for the many edits on this reply, but don’t want to overcomplicate the picture.
“Providing” connections does not have to be the only requirement for running relay. Nodes can be a healthy part of the relay mesh even if they initiate all their connections themselves. In other words, browser clients can definitely run relay as long as they’re able to establish enough connections for a healthy mesh (thereby they’ll still be contributing connections/routing paths to the network). For this they’ll likely need some discovery mechanism, such as the proposed general peer exchange.
Being well-connected is more difficult if you cannot accept incoming connections or aren’t discoverable, but is not impossible. That said, our total scalability would indeed depend on nodes being discoverable and accepting incoming connections, as these browser nodes would need enough discovered nodes to configure a healthy mesh.

Thanks @haelius. Makes sense.

Continuing the discussion only considering Waku Relay.

While browser clients cannot provide connections to the network, they should still be able to form a Waku Relay mesh with other (service node) peers, as long as they can connect to at least 4 peers: the recommended lower bound for the outbound degree (D_low).

For this to happen, first we need browser clients to find peers beyond the bootstrap peers. Indeed, as previously stated in the thread, we are working on more dynamic discovery methods for the browser.

Based on these recommendations, let’s review the numbers needed to form healthy meshes.

I am not sure the best way to do these calculations, feedback is welcome.

Assumptions:

  • In this context, “connect” means establishing a network connection and being included in each other’s gossipsub mesh.
  • 1 browser node should connect to 4 service nodes (D_low)
  • 1 service node should connect to 4 service nodes so that the mesh is somewhat reliable (D_low).
  • 1 node should not connect to more than 12 nodes (D_high)

So if we have 5 service nodes in a mesh, they can each establish connections to the 4 other service nodes.

There are 8 “slots” per service node available in their mesh (12 - 4 = 8).
Hence, a total of 5 * 8 = 40 remaining slots.
Because a browser node takes 4 slots, 40 / 4 = 10 browser nodes can connect to those 5 service nodes before the meshes become saturated.

From this I deduce a ratio of 5 service nodes for 10 browser nodes: 1:2.

This ratio does not seem achievable in reality. I would hope that one service node can serve on the scale of a thousand browser nodes (I got this figure out of my hat).

One way to increase the ratio would be to increase D_high. However, the impact of such change on mesh performance would need to be studied.
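The slot arithmetic above can be captured in a small function, which makes it easy to see how raising D_high (or adding service nodes) moves the ratio. It keeps the same simplifying assumption that every connection counted is a gossipsub mesh slot:

```python
# The slot arithmetic above as a function, under the same simplifying
# assumption: every connection counted here occupies a gossipsub mesh slot.

def browser_capacity(service_nodes, d_low=4, d_high=12):
    """How many browser nodes (each taking d_low slots) a service mesh can absorb."""
    slots_per_service = d_high - d_low      # slots left after service<->service links
    total_slots = service_nodes * slots_per_service
    return total_slots // d_low             # each browser consumes d_low slots

print(browser_capacity(5))             # 10 -> the 1:2 ratio derived above
print(browser_capacity(5, d_high=24))  # 25 -> raising D_high helps, at unknown cost
```

This also makes the scaling shape visible: capacity grows linearly in both the number of service nodes and (D_high - D_low), so only raising D_high substantially (with unstudied effects on mesh performance) changes the unfavourable ratio.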

Reliability for Browser nodes

Please note that due to the situation described above, by default, js-waku only connects to one remote node. This is lower than D_low, which means that one is likely to see reliability issues when using Waku Relay with default parameters.
As we deliver the features discussed in this thread, we will be able to increase the default number of connections js-waku makes so it can reach D_low (4) and even D (6).
