Towards a working Waku v2 infrastructure

haelius · September 26, 2022, 11:44am

(with a special focus on Status integration)

This post highlights some of the main challenges in establishing a Waku v2 infrastructure that works for multiple platforms, but with a special focus on Status products due to our close collaboration. Status is also the likeliest application to dogfood Waku v2 at a scale large enough to answer some of the open questions identified in this post.

The list below is not (close to) exhaustive and only considers infrastructure-related work. General maintenance and usability of the different Waku v2 clients (such as fixing bugs and memory leaks, improving APIs, etc.) is an ongoing effort, but is not covered here.

NOTE: I apologise for the length of this post. For the TL;DR, skip to the summary at the end.

The basics

Waku Relay: the foundation of any Waku v2 network

A strong relayer network is the most basic underpinning of any Waku v2 network. All other Waku services build on the assumption that the underlying message routing works well and is scalable.

To get to this point, some crucial considerations include:

each peer in the network must be able to establish and maintain at least six good connections to other peers
a peer would not reliably route messages to more than ~6 peers - this includes “core”/bootstrap/fleet nodes (this aspect of Waku relay is fundamental to GossipSub’s scalability and is one of the most significant impacting factors on how Waku v2 infrastructure must be established).
a good connection therefore does not just refer to the longevity and quality of the connection itself, but also to the diversity of the connections. To emphasise the same point: if more than 6 peers depend on connectivity to the same bootstrap fleet or any centralised node, message routing will fail.
properly configured, randomised discovery mechanism(s) are therefore necessary to allow peers to connect to each other with enough connection diversity to allow proper message propagation through the network
all discovery mechanisms are affected by networking conditions of both individual peers and the infrastructure as a whole: e.g. some peers may be discoverable, but unable to accept incoming connections and are therefore useless to other peers; others may not be easily discoverable e.g. due to discv5 NAT limitations, but could establish good p2p connections, etc.
the discovery mechanism must be coupled with a connection management scheme that can properly prioritise “better” connections, prioritise incoming/outgoing connections as needed and that generally improves diversity and reliability. This becomes a very subtle exercise when multiple clients with different restrictions participate in the same network.
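
The effect of connection diversity described above can be illustrated with a toy flood simulation. This is a minimal sketch, not GossipSub itself: it assumes every peer forwards a message to at most `D = 6` of its connections, and the names `propagate`, `star` and `spread` are made up for this illustration.

```python
from collections import deque

D = 6  # assumed cap on how many connections a peer forwards to


def propagate(neighbours, source, degree=D):
    """Return the set of peers reached when every peer forwards a message
    to at most `degree` of its connections."""
    reached = {source}
    queue = deque([source])
    while queue:
        peer = queue.popleft()
        for nxt in neighbours[peer][:degree]:
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


def star(leaves):
    """`leaves` peers whose only connection is one bootstrap node (peer 0)."""
    neighbours = {0: list(range(1, leaves + 1))}
    for leaf in range(1, leaves + 1):
        neighbours[leaf] = [0]
    return neighbours


def spread(peers):
    """`peers` peers, each connected to six distinct other peers.
    A fixed pattern stands in for what randomised discovery aims to achieve."""
    offsets = (1, -1, 2, -2, 3, -3)
    return {i: [(i + k) % peers for k in offsets] for i in range(peers)}


if __name__ == "__main__":
    print(len(propagate(star(50), source=1)), "of 51 peers reached (star)")
    print(len(propagate(spread(51), source=1)), "of 51 peers reached (spread)")
```

In the star topology the bootstrap node only forwards to six of its fifty connections, so most peers never see the message. With the same number of peers and six diverse connections each, the message reaches everyone.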

Work done or in progress

As the health of the Waku Relay network is such a foundational concern, much of our effort has been and is in this domain.

Examples of more recent work items:

Discovery mechanisms: We’ve researched and developed several discovery mechanisms which can be used in combination/separately in different environments. Notably DNS discovery is useful for bootstrapping connections and Discovery v5 for continuous discovery. This involved significant research and modelling to select and implement discovery mechanisms most appropriate to the principled, privacy-preserving p2p infrastructure we’re creating. GossipSub Peer Exchange, for example, was implemented as an option, but is disabled by default due to privacy/security concerns. Waku Peer Exchange, a secure discovery mechanism more appropriate for resource restricted environments, has been specified and is in an advanced stage of development. Other discovery mechanisms are also on the roadmap to provide more options embedded in infrastructure (e.g. rendezvous) each with their own set of advantages and tradeoffs.
Multiple transports: Including websockets (done), WebRTC, WebTransport (on roadmap). Needed in some environments, e.g. browsers, for connectivity. Also required for non-relay protocols.
Running relay from a browser: Since browser to browser connections are typically difficult/slow to establish, this has become a focus area in itself with various solutions (including more appropriate transports) being investigated. To run Waku relay (successfully) from a browser at scale, it is essential that browser clients can discover and connect to each other.
Connection management: Basic connection management is in place, including keeping idle connections alive, reestablishing lost connections under some conditions, etc.
Network testing/benchmarking: Local simulations were done to test some theoretical performance assumptions of Waku relay (and GossipSub) in practice with positive results. Collaboration with Kurtosis to create more advanced network simulations and benchmarking is scheduled to start soon.

Open questions

This is not an exhaustive list (at all):

How well does discv5 work at scale? E.g. can peers establish “good” connections fast enough compared to how long they’re online and expect to start seeing messages?
How well does the recommended discovery mechanism (Waku discv5 & peer exchange) work on different platforms and network environments? In the existing roadmap we have identified NAT traversal techniques to incorporate in Waku. Prioritisation depends on gauging the need from different platforms/clients trying to run Waku v2 in different environments.
How should peer scoring work for Waku v2? Basic GossipSub peer scoring is in place, but many subtleties in Waku v2 could (will) necessitate more sophisticated peer scoring mechanisms. What we implement here depends again largely on practical use cases from different platforms/clients. Some examples could include low scores for filter nodes that omit messages, store nodes with slow response times, etc.
Does connection management work at scale? Existing connection management is very rudimentary. The larger question of connection management, prioritising connections, reestablishing lost connections, etc. can be very complex. Although network testing/simulation may provide some answers here, only dogfooding in a “real” environment would provide the conditions needed to confidently answer this question.
Is a single pubsub topic enough? This relates to scalability. Topic sharding has been roadmapped as an option, but prioritisation depends on a combination of at-scale dogfooding/network testing.
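
To make the peer scoring examples above more concrete (filter nodes that omit messages, store nodes with slow response times), here is a minimal sketch of a service-level score. It is not GossipSub's scoring function and not an existing Waku API: the `ServiceStats` fields, the equal weights and the `slow_ms` threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    """Hypothetical per-peer observations collected by a client."""
    filter_expected: int      # messages the filter node should have pushed
    filter_delivered: int     # messages it actually pushed
    store_response_ms: float  # average store query response time


def service_score(stats: ServiceStats, slow_ms: float = 2000.0) -> float:
    """Return a score in [0, 1]; higher is better. Weights are arbitrary."""
    if stats.filter_expected > 0:
        delivery = min(1.0, stats.filter_delivered / stats.filter_expected)
    else:
        delivery = 1.0  # nothing expected, so nothing was omitted
    # Linear penalty: 0 ms scores 1.0, `slow_ms` or slower scores 0.0.
    speed = max(0.0, 1.0 - stats.store_response_ms / slow_ms)
    return 0.5 * delivery + 0.5 * speed


if __name__ == "__main__":
    reliable = ServiceStats(filter_expected=100, filter_delivered=100, store_response_ms=200.0)
    lossy = ServiceStats(filter_expected=100, filter_delivered=60, store_response_ms=200.0)
    print(service_score(reliable), service_score(lossy))
```

What the right inputs and weights are is exactly the kind of question that needs feedback from real clients.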

Status product concerns

An understanding of the above (including the open questions) is necessary. Much of the future work and priorities of Waku v2 depend on early feedback from Status dogfooding. A well-functioning relay network is not simply something that Status products build upon, but forms an integral component of its p2p products. Waku v2 provides the building blocks to achieve this, but choosing which building blocks to use, configuring them correctly, dogfooding them with Status products’ traffic volume, client constraints/requirements, iterating on development to reach specific requirements, etc. should be a focused effort on the Status product side.

A couple of examples, related to the open questions listed above:

product development should maintain a strong awareness of the p2p character of the network (each peer asking: how many connections can I maintain? how many “good” connections can I maintain? how well is discovery working? do I have contingencies (e.g. fallback to other protocols) when network quality is low?). It’s probably a good idea for much of this information to be visible to advanced product users. A Waku p2p application is never really built agnostically “on top of” the network - the network is an essential part of the application.
much of relay health depends on proper configuration, sensible bootstrapping, infrastructure deployment and monitoring (e.g. to get DNS discovery lists deployed, ensure discv5 is functioning correctly and nodes bootstrap their DHTs as expected, etc.)
closely related, establish ownership of infrastructure, keys to sign DNS bootstrap lists, expanding lists when necessary, monitoring, configuration that affects client performance, etc. Although this is something the Waku v2 team can advise on and help get established, owning the configuration and the subtleties for Status-specific use cases has to reside in-team (it is also not merely a devops role, as it requires insight into the p2p protocols).
what is the latency of the bootstrap process (i.e. time until a node has established stable connections to network)? This involves at least connecting to bootstrap nodes and finding good peers via discv5. Is current latency acceptable for Desktop, Web, Mobile clients?
dogfooding connection management within the context of the variety of Status clients/networking conditions between Web vs Mobile vs Desktop. Should certain connections (e.g. incoming or outgoing) be prioritised?
closely related, what makes a “good peer” from Status POV? Should peer scoring be adapted to support Status-specific requirements?
what impact does it have on Status applications having to share the same pubsub topic (i.e. share network infrastructure) with other applications?
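
If sharing a single pubsub topic turns out to be a bottleneck, one generic way to split traffic is to hash each content topic onto a fixed number of shards. The sketch below only illustrates that idea. The shard naming (`/example-shard/<n>`), the function names and the example content topic are hypothetical and not taken from any Waku specification.

```python
import hashlib


def shard_for(content_topic: str, shard_count: int) -> int:
    """Deterministically map a content topic to a shard index."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(content_topic.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count


def pubsub_topic_for(content_topic: str, shard_count: int) -> str:
    """Build a (hypothetical) pubsub topic name for the chosen shard."""
    return f"/example-shard/{shard_for(content_topic, shard_count)}"


if __name__ == "__main__":
    print(pubsub_topic_for("/my-app/1/chat/proto", shard_count=8))
```

Every peer computes the same shard for the same content topic, so no coordination is needed. The tradeoff is that each shard needs its own healthy relay mesh.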

Other Waku v2 services

Store

Store is used by platforms for historical message storage.

Work already done/in progress

Other than developing and adapting the store protocol itself, similar work was done in go-waku/nwaku to support two main store service implementations, with some early investigations into other potential store technologies for more advanced use cases.

an in-memory store (with disk backup) that is performant, but more suitable for short-term storage due to high cost of memory modules. A previous dogfooding exercise led to major improvements in this store and it is used in some existing environments.
a disk-only, sqlite store with some performance trade-offs for cheaper, longer-lived data storage on disk. This was mainly developed as a feature required by Status (“30 day storage”), but is likely to be useful to other future platforms. Optimisation is WIP. Early-stage dogfooding has started on this implementation to get query performance to where it is acceptable for Status use cases.
a pluggable store interface to allow for future store implementations (an existing conversation on the Store Roadmap considers for example RocksDB, Redis, etc.)
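
To illustrate what a pluggable store interface could look like, here is a minimal sketch with an in-memory and a sqlite backend behind the same interface. It does not mirror the actual go-waku/nwaku code: the class names, the table schema and the simplified `Message` type (with a mandatory timestamp) are assumptions for illustration.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Message:
    content_topic: str
    payload: bytes
    timestamp: int  # simplified: treated as mandatory in this sketch


class MessageStore(ABC):
    """Interface that every store backend implements."""

    @abstractmethod
    def put(self, msg: Message) -> None: ...

    @abstractmethod
    def query(self, content_topic: str, start: int, end: int) -> List[Message]: ...


class MemoryStore(MessageStore):
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.messages: List[Message] = []

    def put(self, msg: Message) -> None:
        self.messages.append(msg)
        if len(self.messages) > self.capacity:
            self.messages.pop(0)  # evict the message that was inserted first

    def query(self, content_topic, start, end):
        hits = [m for m in self.messages
                if m.content_topic == content_topic and start <= m.timestamp <= end]
        return sorted(hits, key=lambda m: m.timestamp)


class SqliteStore(MessageStore):
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS message "
                        "(content_topic TEXT, payload BLOB, timestamp INTEGER)")
        # Index on the columns used by queries.
        self.db.execute("CREATE INDEX IF NOT EXISTS idx_topic_time "
                        "ON message (content_topic, timestamp)")

    def put(self, msg: Message) -> None:
        self.db.execute("INSERT INTO message VALUES (?, ?, ?)",
                        (msg.content_topic, msg.payload, msg.timestamp))
        self.db.commit()

    def query(self, content_topic, start, end):
        rows = self.db.execute(
            "SELECT content_topic, payload, timestamp FROM message "
            "WHERE content_topic = ? AND timestamp BETWEEN ? AND ? "
            "ORDER BY timestamp", (content_topic, start, end))
        return [Message(*row) for row in rows]
```

A node would pick the backend at startup; other technologies could be added as further `MessageStore` subclasses without touching the query-handling code.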

Open questions

what is a sensible store data model? E.g. what should be considered a sensible order to store messages and return them when queried? What message fields are considered a minimum vector to define message uniqueness? Much of this relates to the optionality of the timestamp field - an impractical choice but based on the principle of allowing for better anonymity preservation if a platform so chooses.
how well does the sqlite store perform under different loads and for different query use cases?
what (if any) other store implementations should be prioritised and when?
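
The data model question can be made concrete with a small sketch. It shows one possible answer, not the agreed one: uniqueness is defined by a hash over content topic, payload and the sender timestamp (when present), and ordering falls back to the time the store node received the message when the optional sender timestamp is missing. All names here are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class StoredMessage:
    content_topic: str
    payload: bytes
    sender_time: Optional[int]  # optional: a sender may omit it
    receiver_time: int          # set by the store node on arrival


def message_digest(msg: StoredMessage) -> bytes:
    """Hash of the fields this sketch treats as defining uniqueness."""
    sender = b"" if msg.sender_time is None else str(msg.sender_time).encode()
    h = hashlib.sha256()
    for field in (msg.content_topic.encode("utf-8"), msg.payload, sender):
        h.update(len(field).to_bytes(4, "big"))  # length prefix avoids ambiguity
        h.update(field)
    return h.digest()


def sort_key(msg: StoredMessage):
    """Order by sender time if present, else receiver time; digest breaks ties."""
    when = msg.sender_time if msg.sender_time is not None else msg.receiver_time
    return (when, message_digest(msg))


def dedup_and_order(messages: Iterable[StoredMessage]) -> List[StoredMessage]:
    unique = {}
    for msg in messages:
        unique.setdefault(message_digest(msg), msg)  # keep the first copy seen
    return sorted(unique.values(), key=sort_key)
```

Note the tradeoff this choice makes: two intentionally identical messages without a sender timestamp collapse into one, which is one reason the optionality of the timestamp field complicates the model.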

Status product concerns

Most dogfooding issues (so far) have been related to the performance of the 30-day store. Enhancing query performance has been on the Store Roadmap for some time, but specific tasks are prioritised based on feedback from dogfooding, iteration, measurements, explicit requirements etc. in close to real-world conditions. Much progress has been made here already, but to summarise outstanding questions:

what are the query performance requirements for different clients? If existing implementations fundamentally cannot work at scale, we need to prioritise/fast-track any workarounds, consider other store technologies, etc.
how would clients find good store service nodes and what is the contingency if they can’t connect to these?
should Status configure store nodes that store only Status application messages, or provide a general store service to the network, especially as more and more platforms/applications start sharing network infrastructure? What is the impact of centralising Status storage to our own fleets?

Filter and lightpush

Although these protocols are separate they are often used in conjunction in resource-restricted environments as an alternative to relay. For example, they are core to an initial strategy to address Waku Relay scalability challenges in browsers.

Work already done/in progress

Basic development and implementation of these protocols, fixes to get this working on multiple clients, interclient tests, integration into testnets/toychat, etc.
A roadmap for improvements to the protocol that we believe necessary.

Open questions

both of these protocols imply significant anonymity and privacy tradeoffs. There’s ongoing work to provide different options and mitigations here (each coming again with another set of tradeoffs).
what is the scalability of these protocols? Neither protocol has been integrated into proper dogfooding efforts yet.
how urgent/critical are the proposed improvements on the roadmap? What have we missed in this roadmap?

Status concerns

be aware of the known limitations of these protocols. Early dogfooding is required to derisk and prioritise any critical improvements that become necessary.
be aware of the centralising aspect of these protocols. Filter and lightpush are most appropriate in environments where there is “no other choice” due to environmental constraints. In a principled p2p design the first choice should be to function as a relay peer.
be aware of the anonymity tradeoffs inherent to these protocols and the existing WIP to improve on this.

Bridging Waku v1 <> Waku v2

The bridge is a Status-specific infrastructure component that bridges messages from Waku v1 to Waku v2 and vice versa. It is envisioned to be a temporary element while Status products are migrated from Waku v1 to Waku v2.

Work already done/in progress

We’ve developed a bridge which has been shown to work and can bridge in both directions. A previous dogfooding exercise led to many improvements. The Waku v2 team has helped Status to configure bridges for their own fleets.

Open questions/Status concerns

Since the bridge is a Status-only component, all open questions are also of concern to Status:

who takes ownership of deployed bridge instances? Waku product teams can help with initial configuration, debugging, etc. but in-depth insight into how well the bridge is performing may only materialise on product level. The bridge belongs to Status infrastructure.
the bridge is a single point of failure. Do we need contingencies? One possible mitigation strategy, for example, is to have two bridges deployed, but this has not been tested and may lead to unexpected behaviour. How urgent is a failover strategy?
dogfooding has been limited, e.g. mostly testing Waku v1 traffic being bridged to Waku v2 (with only a few messages being bridged in the opposite direction). What happens if the majority of traffic moves to Waku v2 and bridging is skewed in the opposite direction?
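
One behaviour worth testing in any bidirectional setup (and more so with two bridges) is messages being echoed back or forwarded more than once. The sketch below shows the generic technique of a bounded "seen" cache; it is not the actual bridge implementation. The `Bridge` and `SeenCache` classes, the callbacks and the choice to identify messages by a hash of the payload are assumptions for illustration.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class SeenCache:
    """Bounded cache of recently seen payload hashes."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def first_time(self, payload: bytes) -> bool:
        """Record the payload; return True only if it was not seen recently."""
        key = hashlib.sha256(payload).digest()
        if key in self._seen:
            self._seen.move_to_end(key)  # refresh its position
            return False
        self._seen[key] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # drop the least recently seen entry
        return True


class Bridge:
    """Forwards payloads between two networks, skipping ones already seen."""

    def __init__(self, send_v1: Callable[[bytes], None],
                 send_v2: Callable[[bytes], None], capacity: int = 1000):
        self.send_v1 = send_v1
        self.send_v2 = send_v2
        self.cache = SeenCache(capacity)

    def on_v1_message(self, payload: bytes) -> None:
        if self.cache.first_time(payload):
            self.send_v2(payload)

    def on_v2_message(self, payload: bytes) -> None:
        if self.cache.first_time(payload):
            self.send_v1(payload)
```

Two independent bridge instances would each keep their own cache, so each would still forward the same message once. Whether the receiving networks tolerate that duplication is part of what remains untested. Identifying messages by payload alone also suppresses intentionally repeated payloads, so a real deployment would need a better message identifier.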

Summary

The post above aims to document some of the challenges we have faced and risks we have identified in getting Waku v2 to “work”. We are working with many platforms to help them prioritise those parts of infrastructure development that will be most useful to them, debug what’s not working well enough and decide which Waku building blocks are most appropriate for them. Each decision comes with its own risks and tradeoffs, which require some insight into the workings of this p2p infrastructure.

It should be clear that there’s a very large number of moving parts at play here, with many limitations and open questions already known. We need to understand how to prioritise addressing these to support Status use cases.

From our perspective this implies:

Early integration/dogfooding in various Status products to derisk infrastructure unknowns as much as possible. This should not be (solely) dependent on rolling out new Status features, but can start with the existing products (e.g. mobile app).
Someone(s) dedicated to this integration effort in the Status product teams. The Waku v2 infrastructure team has limited Status core knowledge and this is likely to diminish even more over time as we broaden our focus on infrastructure and generalised messaging.
Clear Waku v2 requirements and timelines as output from Status dogfooding efforts.
Ownership within the Status product teams of Waku as a key part of the offering. This includes understanding the p2p nature of the products, the Waku building blocks chosen by Status, the bootstrapping infrastructure, configuration, etc.

haelius · September 26, 2022, 11:55am

Some immediate omissions in the post above include work related to RLN/spam protection, decentralized captcha, fault-tolerant store and, more generally, the role of the Waku product team in promoting certain principles in implementations (e.g. not having centralizing forces/fleet dependencies, anonymity, etc.)