Waku is being relied upon more and more every day, and efforts are now well underway to extend the use cases it is being used for.
In line with that, it stands to reason that as an organization that pushes for the adoption of a protocol, we should be well aware of the boundaries in which it operates, and the various ways and reasons it degrades or fails. For the lazy, here are the bullet points listed at the bottom under “future work”:
- What proportion of Waku v2’s bandwidth usage is used to propagate payload versus bandwidth spent on control messaging to maintain the mesh?
- To what extent is message latency (time until a message is delivered to its destination) affected by network size and message rate?
- How reliable is message delivery in Waku v2 for different network sizes and message rates?
- What are the resource usage profiles of other Waku v2 protocols (e.g.
We have done work in this area before, and outlined potential future work in this article, but there remains a significant amount of additional confidence to be gained around the use and scale of Waku v2.
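The future-work questions above (payload versus control bandwidth, latency, delivery reliability) all reduce to a few numbers that can be computed from experiment logs. Below is a minimal sketch of that bookkeeping, assuming a test harness that records per-message send and receive timestamps and per-node byte counters. The `MessageRecord` type and all function names are hypothetical and not part of any Waku tooling.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, List


@dataclass
class MessageRecord:
    """One published message as seen by a (hypothetical) experiment harness."""
    sent_at: float                 # publish time, in seconds
    received_at: Dict[str, float]  # node id -> receive time, in seconds


def payload_share(payload_bytes: int, control_bytes: int) -> float:
    """Fraction of total bandwidth spent on payload rather than control messaging."""
    total = payload_bytes + control_bytes
    return payload_bytes / total if total else 0.0


def delivery_rate(records: List[MessageRecord], expected_receivers: int) -> float:
    """Fraction of expected (message, receiver) deliveries that actually happened."""
    expected = len(records) * expected_receivers
    delivered = sum(len(r.received_at) for r in records)
    return delivered / expected if expected else 0.0


def latency_percentiles(records: List[MessageRecord]) -> Dict[str, float]:
    """p50/p95/p99 of per-delivery latency; empty if there are too few samples."""
    latencies = [t - r.sent_at for r in records for t in r.received_at.values()]
    if len(latencies) < 2:  # statistics.quantiles needs at least two data points
        return {}
    cuts = quantiles(latencies, n=100)  # 99 cut points: cuts[k-1] is the k-th percentile
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Running the same three functions over experiments of different network sizes and message rates would give directly comparable curves for each question.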
This post outlines some of the reasoning behind why I think this should be prioritized and an initial proposal for how we start doing ongoing testing of the distributed technology we (Vac + Status + Logos) build and maintain.
There is currently quite a bit of research and effort being put into understanding the scale of the underlying gossip layer (gossipsub) that Waku is built upon by organizations like Filecoin Foundation/Protocol Labs and the Ethereum Foundation. We can of course piggy-back off this work to help gain clarity around the fundamental limitations of what we build upon, but this work only goes so far in giving insight into the scale of the abstractions and features we build on top.
Arguably the largest current consumers of Waku as infrastructure are the Status clients and, very soon, the Communities product offering within them. During conversations with them, they stressed the desire to better understand the following scenarios:
- What are the current walls Status will hit as a function of user growth?
- What are the main contributing factors to these walls, e.g.
  - number of total communities vs number of total users
  - specific features contribution to total message throughput
  - specific graph structure of the underlying p2p network and gossip parameters
  - payload size
  - required level of availability
  - required ratio of available network services vs clients that only consume
- How can the application fail safely?
- What are the key indicators of failure?
  - is it gradual or immediate?
- Is a scaling wall global to the network or can it be mitigated to an individual community’s growth?
I’ll let the Status client teams add/edit any of that. More generally, Status would like to push the Communities feature as hard as possible while maintaining a threshold quality of service. As I understand it today, almost none of those thresholds are well understood.
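To make the idea of a "wall" concrete, here is a deliberately crude back-of-envelope model, not a measurement. It assumes a single topic where every node relays every message to each of its mesh peers, ignores control traffic entirely, and takes `mesh_degree=6` on the assumption that it matches gossipsub's default target degree. Both function names are hypothetical. Real experiments are needed precisely because these assumptions are shaky.

```python
def per_node_egress_bytes_per_sec(
    users: int,
    msgs_per_user_per_sec: float,
    payload_bytes: int,
    mesh_degree: int = 6,  # assumed gossipsub target degree D
) -> float:
    """Crude upper bound on relay egress per node.

    Assumes every node forwards every message to all of its mesh peers
    and that control messaging costs nothing.
    """
    network_msg_rate = users * msgs_per_user_per_sec
    return network_msg_rate * payload_bytes * mesh_degree


def max_users_within_budget(
    egress_budget_bytes_per_sec: float,
    msgs_per_user_per_sec: float,
    payload_bytes: int,
    mesh_degree: int = 6,
) -> int:
    """Largest user count whose modelled per-node egress fits within the budget."""
    per_user_cost = msgs_per_user_per_sec * payload_bytes * mesh_degree
    if per_user_cost <= 0:
        raise ValueError("message rate, payload size and mesh degree must be positive")
    return int(egress_budget_bytes_per_sec // per_user_cost)
```

Even this toy version shows why the factors listed above interact: under these assumptions, halving the payload size or the per-user message rate doubles the user count at which the same bandwidth budget is exhausted.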
Recently, we have been spending significant resources to push Waku as infrastructure into different areas of the ecosystem, all of which have varied requirements w.r.t. expected quality. In other words, how they expect to scale out and strain Waku will be different.
A part of these engagements is the feedback loop they provide throughout their deployments and integrations. While this is incredibly valuable and integral to the growth and success of Waku development, it is purely reactive and, in some cases, may hinder the actual production deployment of Waku into these programs because they hit a blocker from some current limitation of Waku that was not understood up front.
Having a more fundamental understanding of Waku’s scalability will allow us to do the following:
- Better identify current ideal project integrations and the limits of their usage throughout the ecosystem.
- Help prioritize future research and development based on current bottlenecks of scale weighed against our desired protocol offerings.
- Build data-driven confidence around the robustness of the current protocols (Maybe an additional tier of RFC specifications can be added that points to a level of battle-testing a given protocol spec has gone through, e.g. “Hardened”)
- Cultivate a culture of rigor within the broader ecosystem around protocol development
- Help users of Waku better understand the expectations and limits of an underlying protocol, which facilitates their overall product development and troubleshooting (is this my fault or a known limitation of what I build on?)
- other stuff that could be named but this is getting long
To be clear, I (program owner of Logos) am fully planning to build in-house resources for “distributed systems testing” to continuously be doing this for the products we offer under the collective. A JD for the initial hire along these lines should be out shortly (If you’re part of the community and this excites you, contact me). I expect that to eventually grow into a team. I am a firm believer that we should work to understand and gain as much confidence as we can around the tech we build and push out, both theoretically and experimentally (I am a physicist at heart still). It is my expectation that Vac will be a strong partner and contributor to this work as a research entity.
In the name of efficiency, I think that working with a platform like Kurtosis would not only allow us to get started quickly, but also develop organization-wide competence in a platform that can be used in a myriad of other ways, namely:
- Shipping testnets and aiding developers
- Simulating complex distributed network operations
  - They are currently working heavily to assist in simulating the Ethereum merge event
- Applying traditional standards testing
  - It is on their planned roadmap to issue a suite of tests that mimic Jepsen analysis.
- Easing the process of getting users bootstrapped and configured appropriately
- Hardening CI/CD and integration testing
I have talked with them quite a bit already, and they are excited to work with us. It was my initial plan to use them extensively in the Logos work, but integrating them at this point is not feasible while we are in such deep research. Since we plan to use them, I think it is useful to start gaining internal competence and producing useful results with the other products that are ready to be integrated into their platform, and Waku is the obvious choice here.
I would suggest we develop a few key experiments that either help us harden our understanding of fundamental network scale or validate our security assumptions about what the network provides. My immediate lean is towards understanding how far Waku can scale as a function of users within the Status Communities product (obviously, more details of what that experiment entails are required). We then work with Kurtosis to get integrated and perform these experiments.
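As a starting point for pinning down what such an experiment entails, the scenarios can be written down as an explicit parameter sweep. The sketch below is platform-agnostic and does not use any Kurtosis API. The `Scenario` fields and the example values are placeholders I am assuming for illustration, not agreed experiment parameters.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, List


@dataclass(frozen=True)
class Scenario:
    """One point in the experiment grid (all fields are illustrative)."""
    nodes: int           # number of Waku nodes in the simulated network
    msgs_per_sec: float  # aggregate publish rate across the network
    payload_bytes: int   # size of each message payload


def build_matrix(
    node_counts: Iterable[int],
    msg_rates: Iterable[float],
    payload_sizes: Iterable[int],
) -> List[Scenario]:
    """Full cartesian sweep over network size, message rate and payload size."""
    return [
        Scenario(nodes=n, msgs_per_sec=r, payload_bytes=p)
        for n, r, p in product(node_counts, msg_rates, payload_sizes)
    ]


# Placeholder values only; the real ranges are exactly what needs discussing.
matrix = build_matrix([10, 100], [1.0, 10.0], [256, 1024, 4096])
```

Writing the grid down this way makes the cost of the experiment visible up front (here 2 x 2 x 3 = 12 runs) and gives each run a stable identity to attach results to.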
The immediate next step would be to schedule a call with them, with a desired experiment in hand, to map out the steps to get integrated and how we would run that experiment on their platform. Their main requirement is that all services needed within an experiment or test are dockerized, which ours are (though they may need a few updates to get current?).
So what I’d like is some thoughts on what you all think about this, what should be prioritized, false assumptions or errors I’ve made in this, level of desire in participating, information around this topic that I’ve missed on yall’s side, whatever else you think you should add.