Waku v2 scalability studies

Introduction

Waku is being relied upon more and more every day, and efforts are now well underway to extend the use cases it serves.

In line with that, it stands to reason that as an organization that pushes for the adoption of a protocol, we should be well aware of the boundaries within which it operates, and the various ways and reasons it degrades or fails. For the lazy, here are the bullet points listed at the bottom w.r.t. “future work”:

  • What proportion of Waku v2’s bandwidth usage is used to propagate payload versus bandwidth spent on control messaging to maintain the mesh?
  • To what extent is message latency (time until a message is delivered to its destination) affected by network size and message rate?
  • How reliable is message delivery in Waku v2 for different network sizes and message rates?
  • What are the resource usage profiles of other Waku v2 protocols (e.g. 12/WAKU2-FILTER and 19/WAKU2-LIGHTPUSH)?
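To make the first question concrete, here is a back-of-envelope sketch of how the payload vs. control split could be modeled for a single gossipsub node. All parameters (mesh degree, heartbeat interval, IHAVE gossip window) are illustrative assumptions, not measured Waku values:

```python
# Back-of-envelope model of payload vs. control bandwidth for one gossipsub node.
# All parameters are illustrative assumptions, not measured Waku values.

def bandwidth_split(msg_rate, msg_size, mesh_degree=6, gossip_degree=6,
                    heartbeat_interval=1.0, msg_id_size=32, gossip_window=3):
    """Estimate bytes/sec spent on payload vs. control traffic.

    Payload: each message is forwarded to every mesh peer.
    Control: each heartbeat, IHAVE gossip advertises the message IDs seen in the
    last `gossip_window` heartbeats to `gossip_degree` non-mesh peers
    (GRAFT/PRUNE traffic is ignored for simplicity).
    """
    payload_bps = msg_rate * msg_size * mesh_degree
    ids_per_ihave = msg_rate * heartbeat_interval * gossip_window
    control_bps = (gossip_degree * ids_per_ihave * msg_id_size) / heartbeat_interval
    total = payload_bps + control_bps
    return payload_bps, control_bps, payload_bps / total

payload, control, share = bandwidth_split(msg_rate=10, msg_size=1024)
print(payload, control)  # payload dominates here because messages are large
```

Even this toy model suggests the answer depends heavily on message size and rate: for large, frequent messages payload dominates, while for tiny or rare messages the fixed control overhead takes over, which is exactly what the experiments should quantify.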

We have done work in this area before, and outlined potential future work in this article, but there remains a significant amount of additional confidence to be gained around the use and scale of Waku v2.

This post outlines some of the reasoning behind why I think this should be prioritized, and an initial proposal for how we can start doing ongoing testing of the distributed technology we (Vac + Status + Logos) build and maintain.

Previous and Similar Work

There is currently quite a bit of research and effort being put into understanding the scale of the underlying gossip layer (gossipsub) that Waku is built upon, by organizations like the Filecoin Foundation/Protocol Labs and the Ethereum Foundation. We can of course piggy-back off this work to help gain clarity around the fundamental limitations of what we build upon, but it only goes so far in giving insight into the scale of the abstractions and features we build on top.

Justification of Prioritization

Status Requirements

Arguably the largest current consumer of Waku as infrastructure is the Status clients and, very soon, the Communities product offering within them. During conversations with them, they emphasized the desire to better understand the following scenarios:

  • What are the current walls Status will hit as a function of user growth?
  • What are the main contributing factors to these walls, e.g.
    • number of total communities vs number of total users
    • specific features’ contributions to total message throughput
    • specific graph structure of the underlying p2p network and gossip parameters
    • payload size
    • required level of availability
    • required ratio of available network services vs clients that only consume
  • How can the application fail safely?
  • What are the key indicators of failure?
    • is it gradual or immediate?
  • Is a scaling wall global to the network, or can it be isolated to an individual community’s growth?

I’ll let the Status client teams add/edit any of that. More generally, Status would like to push the Communities feature as hard as possible while maintaining a threshold quality of service. As I understand it today, almost none of those thresholds are well understood.

Additional Waku Outreach

Recently, we have been spending significant resources to push Waku as infrastructure into different areas of the ecosystem, all of which have varied requirements w.r.t. expected quality. In other words, how they expect to scale out and strain Waku will be different.

A part of these engagements is the feedback loop they provide throughout their deployments and integrations. While this is incredibly valuable and integral to the success of Waku development, it is purely reactive and, in some cases, may hinder the actual production deployment of Waku into these programs because they hit a blocker from some current limitation of Waku that was not understood up front.

Summing up

Having a more fundamental understanding of Waku’s scalability will allow us the following:

  • Better identify current ideal project integrations and the limits of their usage throughout the ecosystem.
  • Help prioritize future research and development based on current bottlenecks of scale weighed against our desired protocol offerings.
  • Build data-driven confidence around the robustness of the current protocols (Maybe an additional tier of RFC specifications can be added that points to a level of battle-testing a given protocol spec has gone through, e.g. “Hardened”)
  • Cultivate a culture of rigor within the broader ecosystem around protocol development
  • Help users of Waku better understand the expectations and limits of an underlying protocol, which facilitates their overall product development and troubleshooting (is this my fault or a known limitation of what I build on?)
  • other stuff that could be named but this is getting long

An Initial Proposal

To be clear, I (program owner of Logos) am fully planning to build in-house resources for “distributed systems testing” to continuously be doing this for the products we offer under the collective. A JD for the initial hire along these lines should be out shortly (If you’re part of the community and this excites you, contact me). I expect that to eventually grow into a team. I am a firm believer that we should work to understand and gain as much confidence as we can around the tech we build and push out, both theoretically and experimentally (I am a physicist at heart still). It is my expectation that Vac will be a strong partner and contributor to this work as a research entity.

In the name of efficiency, I think that working with a platform like Kurtosis would not only allow us to get started quickly, but also develop organizational-wide competence in a platform that can be used in a myriad of other ways, namely:

  • shipping testnets and aiding developers
  • simulating complex distributed network operations
  • applying traditional standards testing
    • It is in their planned roadmap to issue a suite of tests that mimic Jepsen analysis.
  • easing the process of getting users bootstrapped and configured appropriately
  • hardening CI/CD and integration testing

I have talked with them quite a bit already, and they are excited to work with us. It was my initial plan to use them extensively in the Logos work, but integrating them at this point is not feasible while we are in such deep research. Since we plan to use them, I think it is useful to start gaining internal competence and producing useful results with the other products that are ready to be integrated into their platform, and Waku is the obvious choice here.

I would suggest we develop a few key experiments that either help us harden our understanding of fundamental network scale or validate our security assumptions of what the network provides. My immediate lean is towards understanding how far Waku can scale as a function of users within the Status Communities product (obviously, more details of what that experiment entails are required). We would then work with Kurtosis to get integrated and perform these experiments.

The immediate next step would be scheduling a call with them with a desired experiment to run to map out the steps to get integrated and how we would perform this experiment using them. Their main requirement is that all services needed within an experiment or test are dockerized, which we have (but may need a few updates to get current?).

Discuss

So what I’d like is some thoughts on what you all think about this: what should be prioritized, false assumptions or errors I’ve made, your level of desire to participate, information around this topic that I’ve missed on your side, and whatever else you think should be added.

Rant over.


Thanks for that @petty. I was actually thinking about writing something about js-waku scalability, as I have been explaining the same thing to a few people. This made me realize I need to be more transparent about what is going on at the moment.

Allow me to hijack your discussion and provide some information on js-waku’s scalability status and goals.

Scalability Metrics for the Browser

Browser Waku nodes can be qualified as endpoint nodes:

  • A browser node needs to connect to a service node (nwaku, go-waku, NodeJS js-waku).
  • While it is possible to establish browser-to-browser connections, it is only possible to do so via an existing connection to a service node (signalling server or Waku network).
  • Only secure websocket is currently supported in js-waku, meaning that the remote service node needs to provide an SSL connection, which includes a certificate that can be accepted by the user’s browser. This is unlikely to be the case for Waku nodes running on desktop or mobile.

Hence, the first scarce resource for browser nodes is the number of available connections:

  1. Browser nodes consume connections from Waku service nodes.
  2. Browser nodes do not provide connections to other Waku nodes.

(2) is currently true but could change in the future, as one browser node could establish connections to several browsers using a single connection to a Waku service node. However, it is unclear how reliable and practical this would be.

In terms of nwaku (and go-waku)'s connection capabilities:

  • nwaku nodes run by Status are set up to accept a maximum of 150 connections.
  • We do not know how many connections a nwaku/go-waku node could accept given specific hardware parameters (e.g. Raspberry Pi, X memory, Y CPU).

Some assumptions

  • The recommended mesh size for Waku Relay is 6, meaning that ideally, a Waku Relay node should have at least 6 connections to other Waku Relay nodes to form a healthy gossipsub mesh.
  • For nodes that do not use Relay but instead LightPush and Filter[1], we haven’t set recommendations on the number of LightPush/Filter nodes one should connect to. Most likely it should be 2 or higher to ensure reliability and not rely on a single remote node.
  • For Store, likely to be the same as LightPush/Filter.
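As a sketch of the redundancy reasoning above (connect to 2 or more LightPush nodes so a single unreliable service node cannot silently drop a publish), the following hypothetical helper publishes via several service nodes and succeeds if any one of them acknowledges. `send_via` is a stand-in for a real LightPush request, not an actual js-waku or nwaku API:

```python
# Sketch of the ">= 2 service nodes" recommendation: publish via several
# LightPush peers so one flaky remote node cannot silently drop the message.
# `send_via` is a hypothetical stand-in for a real LightPush request.

def publish_with_redundancy(message, peers, send_via, min_peers=2):
    """Attempt to publish via several service nodes; succeed if any one acks."""
    if len(peers) < min_peers:
        raise ValueError(f"need at least {min_peers} LightPush peers, got {len(peers)}")
    acks = 0
    for peer in peers:
        try:
            if send_via(peer, message):
                acks += 1
        except ConnectionError:
            continue  # one unreachable service node should not fail the publish
    return acks > 0

# Usage with stubbed peers: peer-a drops the message, peer-b acks it.
ok = publish_with_redundancy(
    b"hello", ["peer-a", "peer-b"],
    send_via=lambda peer, msg: peer == "peer-b")
print(ok)  # True: the publish survives peer-a being unresponsive
```

The same reasoning applies to Filter and Store: subscribing to, or querying, at least two service nodes hedges against a single remote node being down or misbehaving.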

The current state of the art

Discovery

  • js-waku’s default bootstrap method connects to a single remote peer from a set of nodes hosted by Status.
  • js-waku’s bootstrap API enables developers to pass a function, list of peers or ENR tree to retrieve nodes for bootstrapping.
  • An ENR tree is used for DNS Discovery as defined in EIP-1459. It enables the developer to set a domain name in their dApp and encode a list of nodes under that domain name, which can later be edited or extended without changing the deployed code.
  • Discv5 is being implemented in nwaku to enable a node to find other nodes outside the DNS Discovery protocol: Ambient Discovery Roadmap (incl Disc v5) · Issue #879 · status-im/nwaku · GitHub
  • Discv5 is not adapted to a browser context as it takes time and has high connection churn.

Note that DNS Discovery is not currently used by default in js-waku and we haven’t yet provided guides or tutorials for it because of one blocking item: native websocket support in nwaku.

Native Websocket support in nwaku

  • Currently, only secure websocket transport is supported by js-waku in the browser (future work includes WebRTC and WebTransport)
  • Currently, js-waku browser nodes connect to Status nodes via websockify. Websockify wraps nwaku’s TCP connection with TLS + websocket (ie, secure websocket).
  • Native secure websocket support was recently introduced in nwaku, thanks to nim-websocket.
  • To build an ENR tree, one needs the ENR of each node. This is provided by nwaku via JSON-RPC or in the logs. This does not work if nwaku is not aware of the port and domain name to use in the ENR, i.e., if an external piece of software, such as websockify, is used to provide the secure websocket transport.

Limitations

The current limitations are as follows:

  • Secure websocket is not yet stable in nwaku.
  • It is not easy to set up an enrtree that includes wss when running nwaku nodes with websockify.
  • js-waku only connects to 1 remote node because of the limited number of nodes run by Status and the limited number of connections provided by these nodes.
  • Beyond bootstrapping, there is no peer discovery method available in js-waku.

Another way to see it is that today, if one wants to deploy a service node that would be used by browser nodes, they:

  1. Need to deploy nwaku
  2. Need to get a domain name, point it to nwaku
  3. Get a valid SSL cert (letsencrypt works)
  4. Possibly deploy websockify due to low reliability of nim-websocket
  5. Register their node somewhere so that js-waku discovers it:
    i. Either hard code the node details in the app’s code base (if they actually are running an app and not just being an operator)
    ii. Or include in a domain name used for DNS Discovery

What we are doing about it

Short term we are aiming to get rid of (4) and (5).

Websocket

A lot of effort has been made by the nwaku and nim-websocket maintainers to stabilize the code. We expect drastic improvements with the next nwaku release (v0.10, ETA this month, June '22).

DNS Discovery

The websocket stabilization will enable js-waku to offer DNS Discovery as a default bootstrap method: Use DNS Discovery by default · Issue #517 · status-im/js-waku · GitHub. With this, we will also increase the number of Status-operated bootstrap nodes used by js-waku (Waku Connect + Vac Prod), allowing an increase in the default number of connections for js-waku.

We will also publish guides on how to set up your own DNS Discovery, enabling platforms to deploy their own nodes. This is currently possible but cumbersome.
Note that domain names used for EIP-1459 can point to each other. For example, a project can run their own nodes and also point to another project’s or Status’ DNS, so that a js-waku node can access a greater number of bootstrap nodes.

Connection benchmark

Once nwaku’s websocket implementation is stable, we can start running benchmarks to learn how many wss connections a single nwaku node can support, enabling the browser:nwaku node ratio to increase.

Other Discovery

We are designing a peer exchange protocol: New RFC: 34/WAKU2-PEER-EXCHANGE · Issue #495 · vacp2p/rfc · GitHub, which would enable browser nodes to leverage discv5. The idea would be for a browser node to request peers from a remote Waku node; the remote node would then return the peers it found during its last discv5 run.

This would enable a browser to connect to nodes beyond the bootstrap ones, increasing the potential pool of nodes and hence the available connections.
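The described flow could be sketched roughly as below. The class and method names are hypothetical illustrations, not the actual 34/WAKU2-PEER-EXCHANGE wire format:

```python
# Hypothetical sketch of the peer-exchange flow described above: a service node
# caches the peers found during its last discv5 run, and a browser node (which
# cannot run discv5 itself) requests a sample of them. Names are illustrative.
import random

class PeerExchangeService:
    def __init__(self):
        self.discv5_cache = []  # refreshed by the node's periodic discv5 runs

    def on_discv5_run(self, found_peers):
        self.discv5_cache = list(found_peers)

    def get_peers(self, num_peers):
        """Answer a requesting node with a random sample of cached peers."""
        return random.sample(self.discv5_cache,
                             min(num_peers, len(self.discv5_cache)))

# A browser node asks a service node for 2 peers beyond its bootstrap set.
service = PeerExchangeService()
service.on_discv5_run(["enr:-node1", "enr:-node2", "enr:-node3", "enr:-node4"])
extra_peers = service.get_peers(2)
print(len(extra_peers))  # 2
```

Serving cached discv5 results (rather than running a discovery walk per request) keeps the request cheap for the service node and fast for the browser, at the cost of slightly stale peer lists.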

Conclusion

We made Waku adaptive so that it can run anywhere, from cloud to mobile to browser. Any benchmark or scalability consideration needs to consider:

  • Which protocols are we measuring?
  • On which platforms are we measuring these protocols?
  • What topology are we targeting? Does this topology match the Waku Network in practice, or just a subset of it?

Contrary to other p2p networks, the Waku Network includes nodes of various resources, including the browser, which can make the topology more “heterogeneous” than other p2p networks such as Ethereum (1) or Bitcoin.

[1]: More work is needed on Filter; it currently only works js-waku<>go-waku (when js-waku is in the browser), and it is also not (yet) possible to disable Waku Relay in js-waku.


Thanks, @petty and @froy!

I agree that kicking off proper scalability studies for Waku v2 should be a high priority.

A couple of comments and a proposal:

Adaptive nodes and scalability

Our concept of “adaptive nodes” running different protocols indeed muddies the question of scalability. The Waku Network, by definition, is heterogeneous, as @froy highlighted above. I also agree with him that the essential scalability (and reliability) question boils down to how many nodes “provide” connections vs nodes that simply “consume” connections.

Nodes would be only consumers of connections if:

  • they’re not running relay protocol (e.g. nodes running only filter, lightpush and store)
  • they’re not discoverable
  • they are running in an environment that somehow restricts incoming connections (e.g. behind a strict NAT, in a browser, etc.)

I believe across the three main Waku clients there are already several parallel efforts to increase the number of nodes providing connections instead of simply consuming connections, including:

  • a general purpose peer-exchange protocol for discovery
  • discv5 for Status Desktop clients (already enabled)
  • more advanced NAT traversal strategies
  • etc.

Clients that can’t establish a satisfactory number of relay connections may choose to “fallback” to filter and lightpush.

relay as main driver for scalability

In my opinion, the most important question we want to answer (especially with an eye on Communities) is how a well-connected relay network scales. This is because:

  • All other protocols depend firstly on the health of the relay mesh and then on the performance/resources of the service nodes. I’d argue, for example, that the scalability of request-response protocols such as store, filter and lightpush depends primarily on the resource availability of their selected service nodes. Since these protocols are “centralised”, it’s also easier for us to devise our own scalability tests without needing a distributed setup as for relay.
  • We aim to enable and encourage as many nodes as possible to provide relay connections as summarised above. Having a good understanding of how such a relay network will perform will also inform how we prioritise these “run your nodes” initiatives, such as better discovery, NAT traversal, etc.

In other words, perhaps we can assume a certain level of homogeneity as a starting point for experiments: relay nodes that are discoverable and within an environment where they can easily connect to other peers and accept incoming connections.

Proposal for first experiment

My rough proposal for a first experiment in collaboration with Kurtosis:

Setup

  • Run a relay network with x amount of nodes using discv5 to establish a well-connected mesh (afaik this approximates a typical Communities setup on Desktop clients).
  • Publish (from a random subset of nodes) messages of size s at a rate r

Measure

Varying x, s and r, measure:

  1. message delivery reliability
  2. message delivery latency

To measure reliability we can add a simple seq counter to the message payload. Latency can already be measured using the sender timestamp field.
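A minimal sketch of that analysis, assuming each publisher embeds a monotonic seq counter and a sender timestamp in the payload (the timestamps below are illustrative; real Waku messages carry a sender `timestamp` field):

```python
# Sketch of the measurement approach above: a seq counter in the payload gives
# delivery reliability (missing seqs = losses); the sender timestamp gives
# latency. Timestamps here are illustrative seconds, not real Waku timestamps.

def analyze(received, expected_count):
    """received: list of (seq, sent_at, recv_at) tuples for one publisher."""
    seqs = {seq for seq, _, _ in received}
    reliability = len(seqs) / expected_count          # fraction of seqs delivered
    latencies = [recv - sent for _, sent, recv in received]
    avg_latency = sum(latencies) / len(latencies) if latencies else None
    return reliability, avg_latency

# 4 of 5 published messages arrive at this subscriber; seq 2 was lost.
received = [(0, 0.00, 0.12), (1, 1.00, 1.15), (3, 3.00, 3.09), (4, 4.00, 4.20)]
reliability, latency = analyze(received, expected_count=5)
print(reliability, round(latency, 3))  # 0.8 0.14
```

In the real experiment this would run per subscriber node, with x, s and r swept across runs, so reliability and latency can be plotted against network size and message load.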

Answer

This will help us answer the questions:

  • at exactly what messaging load performance, latency and reliability start to degrade or fall below an acceptable threshold
  • at what mesh sizes performance, latency and reliability start to degrade or fall below an acceptable threshold

Other metrics (such as node resource consumption) can form part of the experiment. Ideally, the experiment should be repeated separately for a homogeneous network of nwaku nodes and go-waku nodes to highlight any differences between the clients (and possible areas to improve).


@haelius is making a good point, let’s elaborate what the relay connection metric means for the browser.

Let’s say we have done our benchmarks and we can recommend that a nwaku node running in a droplet can take 10,000 network connections.
Then yes, it could support around 10,000 browser connections (well, fewer, because it would still need to connect to other service nodes).
However, nwaku is configured with 12 mesh peers for Relay, meaning that the node will maintain a gossipsub mesh of 12 peers. So the other ~9,988 peers will not be part of the mesh and will not receive messages from nwaku over Relay.

Which means that the second metric we need to look at is the number of peers in gossipsub mesh.

We will need to understand how big a mesh can be and still be efficient. Possibly the result will be that if a peer cannot provide connections then it should not rely on Relay to receive messages.

@haelius what are your thoughts on browser client using Relay if they are not able to provide network connections?

@froy, apologies for the many edits on this reply, but don’t want to overcomplicate the picture.
“Providing” connections does not have to be the only requirement for running relay. Nodes can be a healthy part of the relay mesh even if they initiate all their connections themselves. In other words, browser clients can definitely run relay as long as they’re able to establish enough connections for a healthy mesh (thereby they’ll still be contributing connections/routing paths to the network). For this they’ll likely need some discovery mechanism, such as the proposed general peer exchange.
Being well-connected is more difficult if you cannot accept incoming connections or aren’t discoverable, but is not impossible. That said, our total scalability would indeed depend on nodes being discoverable and accepting incoming connections, as these browser nodes would need enough discovered nodes to configure a healthy mesh.

Thanks @haelius. Makes sense.

Continuing the discussion only considering Waku Relay.

While browser clients cannot provide connections to the network, they should still be able to form a Waku Relay mesh with other (service node) peers, as long as they can connect to at least 4 peers: the recommended lower bound for the outbound degree (D_low).

For this to happen, first we need browser clients to find peers beyond the bootstrap peers. Indeed, as previously stated in the thread, we are working on more dynamic discovery methods for the browser.

Based on these recommendations, let’s review the numbers needed to form healthy meshes.

I am not sure the best way to do these calculations, feedback is welcome.

Assumptions:

  • In this context, “connect” means establishing a network connection and being included in each other’s gossipsub mesh.
  • 1 browser node should connect to 4 service nodes (D_low)
  • 1 service node should connect to 4 service nodes so that the mesh is somewhat reliable (D_low).
  • 1 node should not connect to more than 12 nodes (D_high)

So if we have 5 service nodes in a mesh, they can each establish connections to the 4 other service nodes.

There are 8 “slots” per service node available in their mesh (12 - 4 = 8).
Hence, a total of 5 * 8 = 40 remaining slots.
Because a browser node takes 4 slots, 40 / 4 = 10 browser nodes can connect to those 5 service nodes before the meshes become saturated.

From this I deduce a ratio of 5 service nodes for 10 browser nodes: 1:2.

This ratio does not seem achievable in reality. I would hope that one service node can serve on the scale of a thousand browser nodes (I got this figure out of my hat).

One way to increase the ratio would be to increase D_high. However, the impact of such change on mesh performance would need to be studied.
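The slot arithmetic above can be captured in a small function, which makes it easy to see how raising D_high (or adding service nodes) moves the ratio. It keeps the same simplifying assumption that every connection counted is a gossipsub mesh slot:

```python
# The slot arithmetic above as a function, under the same simplifying
# assumption: every connection counted here occupies a gossipsub mesh slot.

def browser_capacity(service_nodes, d_low=4, d_high=12):
    """How many browser nodes (each taking d_low slots) a service mesh can absorb."""
    slots_per_service = d_high - d_low      # slots left after service<->service links
    total_slots = service_nodes * slots_per_service
    return total_slots // d_low             # each browser consumes d_low slots

print(browser_capacity(5))             # 10 -> the 1:2 ratio derived above
print(browser_capacity(5, d_high=24))  # 25 -> raising D_high helps, at unknown cost
```

This also makes the scaling shape visible: capacity grows linearly in both the number of service nodes and (D_high - D_low), so only raising D_high substantially (with unstudied effects on mesh performance) changes the unfavourable ratio.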

Reliability for Browser nodes

Please note that due to the situation described above, by default, js-waku only connects to one remote node. This is lower than D_low, which means that one is likely to see reliability issues when using Waku Relay with default parameters.
As we deliver the features discussed in this thread, we will be able to increase the default number of connections js-waku makes so it can reach D_low (4) and even D (6).
