Waku v2 discv5 Roadmap Discussion

ksr · January 22, 2022, 4:44am

I am planning to write a GitHub issue describing and tracking milestones of a Waku v2 discv5 roadmap.
This post is for discussing the content of this issue.
As this issue will track specific milestone sub-issues, it could also be set up as an epic (as discussed in the 2022 Q1 strategy meeting).
The next section is a first draft of this issue.

For the current state of Waku v2 peer discovery, see current state of waku v2 peer discovery.

Organizational

Should the waku discv5 roadmap issue be in
- the nim-waku repo and track only nim-waku implementation issues
- vacp2p/research and track all research and implementation issues
- vacp2p/research and only track research issues (no implementation issues)
Should the issue
- list our current view on future stages, or
- should we manage discussion about future stages in the forum post and only update the roadmap issue once we decide on the respective next stage

separate Waku discv5 network

The Roadmap draft below reflects the approach of having a separate Waku discv5 network.

Advantages

query efficiency (strong)

A separate network avoids the needle-in-the-haystack problem and allows for efficient queries.
Having to search for random nodes until finding one that supports Waku does not leverage the DHT structure to its full extent.

This carries even more weight once capability discovery is introduced, because that involves searching for specific nodes. With Waku v2 nodes randomly distributed over the Ethereum discv5 network, the O(log(n)) search time cannot be guaranteed.

(DHTs allow retrieving random samples from a distributed set of nodes.
Being part of the Ethereum discv5 network corresponds to using the DHT as a means for sampling from a large and relatively resilient set of nodes with only a fraction supporting Waku.
On the other hand, having a separate network allows directly sampling from a set of Waku nodes.)

not “polluting” the Ethereum discv5 network (weak)

Waku nodes that do not explicitly want to take part in the Ethereum discv5 network will not waste resources of the Ethereum network.

Disadvantages

loss of anonymity (weak)

This is a weak argument, imho, as the weak anonymity gain comes at the cost of making querying less efficient. Anonymity is only gained by “obscurity” because nodes supporting Waku can still be listed.
Imho, it is better to research methods of protecting anonymity in later stages.

easier to attack with DoS attacks, e.g. eclipse attack

The smaller the DHT, the fewer resources are necessary for mounting a successful DoS/eclipse attack.
We should (theoretically) analyze the query efficiency loss when being part of the Ethereum discv5 network and come back to this option if we cannot find more efficient means of preventing these types of attacks in the future.

Eclipse attacks are made harder by not having a DHT structure managing Waku nodes (DHT discoverability of the Waku capability), and eclipse attacks target this structure.
As said above, when being part of the discv5 network, the DHT structure is only used to sample from a set of random nodes that support Waku with a low probability.
Waku nodes are not actually managed in a DHT structure; they are just scattered within a much larger set of nodes which are in a DHT structure.

Waku v2 Discv5 Roadmap (issue draft)

(All of the following is open for discussion and subject to change. The direct phrasing is just to keep it simpler.)

stage 1 :: Working discv5 based Peer Discovery for Waku v2

The goal of the first stage is a Waku v2 node implementation supporting peer discovery via discv5.
We envision a Waku v2 discv5 DHT separate from the Ethereum discv5 DHT.
The implementation for this stage will be based on a nim-eth discv5 feature branch, which generalizes the discv5 setup by allowing a different protocol-id to be chosen, as suggested by @kdeme in nim-waku issue #770.
Waku v2 will use the protocol ID d5waku.
Having a different protocol-id allows an easy check of all incoming messages; messages with a protocol-id different from d5waku are ignored.
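
The check described above can be sketched as follows. This is a minimal illustration, not the nim-eth implementation: it assumes the discv5 v5.1 wire format, in which the static header starts with a 6-byte protocol-id, and it assumes the header has already been unmasked. The function name and the example headers are hypothetical.

```python
PROTOCOL_ID = b"d5waku"  # protocol ID proposed in this post (same length as b"discv5")


def accepts_packet(static_header: bytes, protocol_id: bytes = PROTOCOL_ID) -> bool:
    """Return True if the (already unmasked) static header starts with our protocol-id."""
    return static_header[: len(protocol_id)] == protocol_id


# Hypothetical example headers, assumed layout:
# protocol-id (6) || version (2) || flag (1) || nonce (12) || authdata-size (2)
waku_header = b"d5waku" + bytes.fromhex("0001") + b"\x00" + bytes(12) + bytes.fromhex("0020")
eth_header = b"discv5" + waku_header[6:]

print(accepts_packet(waku_header))  # header carrying the Waku protocol-id
print(accepts_packet(eth_header))   # header carrying the default protocol-id
```

Packets for which `accepts_packet` returns `False` would simply be dropped before any further processing.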

milestones + related issues and PRs

[ ] nim-eth discv5 feature branch supporting a configurable protocol-id
- nim-waku integration :: #770, which has been addressed in PR #435.
  That PR introduced validating nodes based on ENRs, which is not sufficient to avoid leakage of Waku nodes into other discv5 networks (see the #770 discussion).
[ ] nim-waku implementation using nim-eth discv5 feature branch
[ ] nim-waku local test network successfully running Waku v2 discv5
[ ] …

stage 2 :: Efficiency Considerations

reduce the load on low power devices

Mobile nodes and browsers should use the DHT to query for peers but should not answer queries.
They should indicate this behavior in messages and should not be added to routing tables.
Weak nodes can retrieve bootstrap nodes via DNS or other external sources.
An idea worth considering is stronger nodes offering a service similar to DNS cache resolvers, which perform iterative querying on behalf of the weak node.
In this scenario the mobile node just asks one of the bootstrap nodes retrieved via DNS and directly gets the answer in the following response. This, however, raises privacy considerations.
Mobile nodes might also maintain a simple routing table for caching strong nodes that are not managed in the DNS. Nodes could indicate their interest in offering this service in their responses.
This would require incentivization to be feasible (see future stages).

future stages

The purpose of this section is listing research tasks that (might) have to be addressed to provide a reliable secure peer discovery layer for Waku.
We will decide on the order these should be addressed in after completing the first two stages.
Still, we will keep security and privacy in mind when addressing the first two stages; however, the focus in the first two stages is on getting working results.

incentivization

Nodes that are capable of running the full DHT protocol should be incentivised to do so.

capability discovery

find nodes holding messages of a certain time range

security

defense against eclipse attacks

eclipse attacks + Sybil attack
- can be run by a less powerful attacker controlling a small number of nodes
eclipse attack :: attacker controls p% of the peers in a network
- defense goal :: a retrieved list of randomly selected peers should only contain O(p%) evil nodes

defense against other DoS attacks

privacy

hiding the query target

security analysis and attacker model formalization

Tor model (20% of the nodes are malicious)
AS-level passive attacker
Dolev-Yao model

ksr · January 24, 2022, 10:14am

In today’s meeting, we reached consensus on having a separate Waku v2 discv5 network.
This should be achieved via a separate protocol-id.
This would also leave the option to join the Ethereum discv5 network by going back to protocol-id=discv5 before the release, in case we want that for a nimbus/Waku v2 bundle.
Are there potential problems/issues we missed?

haelius · January 24, 2022, 12:06pm

Thanks for this clear post, @ksr. Great to also have the arguments for/against a separate discv5 network in one place and I agree with the tentative conclusion that a separate network is better.

I think we covered most of the open questions in the PM call, but I’ll add a couple of comments.

Should the waku discv5 roadmap issue be in…

Summary of what was discussed during the call:

vacp2p/research for the larger “roadmap” issue
vacp2p/rfc for the specifics around the protocol change/introduction of new protocol id
nim-waku for any implementations

I think we only need issue(s) for stage 1 at this time - the issues themselves can briefly mention the future steps in the roadmap (or link to an issue that does), just so we don’t lose sight of where (we think) we’re going.

Having a different protocol-id allows an easy check of all incoming messages; messages with a protocol-id different from d5waku are ignored

Any value in still validating nodes before adding to the routing table or would this become completely superfluous? I imagined validation as a type of easy “first trial” before breaking changes in the protocol, but I think this is unnecessary now.
Assuming we can measure how many requests we receive with the “wrong” protocol-id, as an indication of how many waku nodes still made it to the eth discv5 dht?

Great to have our eyes on possible solutions that may work in restricted environments!

ksr · January 24, 2022, 3:44pm

Any value in still validating nodes before adding to the routing table or would this become completely superfluous? I imagined validation as a type of easy “first trial” before breaking changes in the protocol, but I think this is unnecessary now.

Yes, I think we do not need any further validation.
Allowing the protocol-id to be chosen is a minimal change that allows establishing separate discv5-ish discovery networks. It makes “validation” very easy: every message can be checked for the matching protocol-id. Of course, it does not protect against malicious nodes, which we will consider in future stages.
Further, this approach generalizes the nim-eth discv5 implementation, allowing other consumers to establish their own networks, too.
We could still check the ENR value. I will comment on this again once I read the code and implemented a basic version of this. Maybe I am overlooking a detail at the moment.

Assuming we can measure how many requests we receive with the “wrong” protocol-id, as an indication of how many waku nodes still made it to the eth discv5 dht?

Yes. We should definitely log this.
I assume some discv5 implementations do/will not check the protocol-id and thus will accept leaked nodes in their routing tables. I think adding more protocol incompatibilities to avoid that is not worth it.

fryorcraken · January 31, 2022, 6:10am

An idea worth considering is stronger nodes offering a service similar to DNS cache resolvers, which perform iterative querying on behalf of the weak node.
In this scenario the mobile node just asks one of the bootstrap nodes retrieved via DNS and directly gets the answer in the following response. This, however, raises privacy considerations.

Could the peer-exchange protocol be used instead?

Mobile nodes might also maintain a simple routing table for caching strong nodes that are not managed in the DNS. Nodes could indicate their interest in offering this service in their responses.
This would require incentivization to be feasible (see future stages).

This is an option; however, it would go away every time a user wipes their browser local storage.

Nodes that are capable of running the full DHT protocol should be incentivised to do so.

Is there an intrinsic incentivisation available around the fact that a good discv5 participant can be found in the discv5 and hence find “consumers” for the other incentivized protocols (e.g. Waku Store)?

ksr · February 1, 2022, 9:56am

Could the peer-exchange protocol be used instead?

Do you refer to this peer-exchange proposal?
This is a possibility. As far as I understand peer-exchange, it could be orthogonal to the query process I described.
The actual discovery of (new) random nodes can be transparent to the querying light node.
So when a light node queries a full DHT node, the DHT node could use peer-exchange to

send (a subset of) its own peers, or
send a set of nodes randomly selected via DHT queries

The first solution would be faster, but might cause hot spots.
This would allow the light nodes (browser/mobile) to just implement one method of retrieving peers, for instance peer-discovery as you suggested, and the full DHT node could decide how to actually get these peers.
Maybe (in a future iteration), we could allow the light node to send an optional indication on which query method (provide own peers, do random query) is preferred and the DHT node could confirm this in the response.

This is an option; however, it would go away every time a user wipes their browser local storage.

Yes. After a wipe, they would have to use the DNS again to find bootstrap nodes.
I guess, that should be fine. Wdyt?

Is there an intrinsic incentivisation available around the fact that a good discv5 participant can be found in the discv5 and hence find “consumers” for the other incentivized protocols (e.g. Waku Store)?

Very good idea! I will put this into the discv5 research roadmap issue once I write it. Afaik, this should work. We should definitely look into this once we address incentivisation.

ksr · March 1, 2022, 2:53pm

A first basic version of the selectable protocol-id implementation (issue, PR)
has been completed and is ready for beta testing.
We already had a successful interoperability test with go-waku; thanks @rramos :).

This issue comprises comments with important feedback for the Waku discv5 roadmap, which I copy here to move the discussion to a central place:

@arnetheduck:

leakage into the main discv5 network (or other discv5 networks)

what is the downside here?

basically, a core feature of any discovery protocol is its ability to withstand attacks - a large number of nodes serving the data is one of the ways to achieve this - on the other hand, a false positive seems nearly harmless - you open a connection to that node, note that it didn’t support waku after all, and disconnect - it’s “almost” a no-op.

another way to say the same thing: running a waku-specific discovery network as well as publishing on the “main” discovery network as well as running DNS discovery etc is the way to work around as many types of obstacles as possible, in the interest of securing a wide selection of peers no matter the network conditions

@ksr:

leakage into the main discv5 network (or other discv5 networks)

what is the downside here?

basically, a core feature of any discovery protocol is its ability to withstand attacks - a large number of nodes serving the data is one of the ways to achieve this - on the other hand, a false positive seems nearly harmless - you open a connection to that node, note that it didn’t support waku after all, and disconnect - it’s “almost” a no-op.

Thank you for the feedback. I agree @arnetheduck. I assume the overhead associated with leaking would be quite low, especially if we exclude mobile nodes from discv5. The strong argument for having a separate discovery network is query efficiency (see Waku v2 discv5 Roadmap). I should have stated this more clearly in this issue.

Assuming Waku is part of the Ethereum discv5 network:
The fraction of nodes supporting Waku within this network is small, which leads to a needle-in-a-haystack problem.
Each random node set returned from a query contains Waku capable nodes with a certain probability which might be very low.
This problem gets more significant if we want to introduce capability discovery in the future.
Queries are basically executed as random walks, not leveraging the O(log(n)) hops structured overlays offer.
For queries satisfied by a large number of nodes this is OK; but more specific queries would be inefficient.

Filtering Waku nodes via ENR before inserting them into the routing table would still not solve this problem, imho.
This would only help significantly if these Waku supporting nodes were stable. But these stable nodes would be stable in a separate network, too. Assuming a respective churn rate, this would still converge towards random walk efficiency.

With regards to attacks, I agree that being part of the Ethereum discv5 network mitigates eclipse attacks; but at the cost of overlay routing efficiency that structured P2P overlays offer.

another way to say the same thing: running a waku-specific discovery network as well as publishing on the “main” discovery network as well as running DNS discovery etc is the way to work around as many types of obstacles as possible, in the interest of securing a wide selection of peers no matter the network conditions

We could think about using both Waku2 discv5 and Ethereum discv5. Following the adaptive nodes idea, nodes could choose to (1) not take part in discv5 at all, (2) be part in Waku2 discv5, and (3) be part of both Waku2 discv5 and Ethereum discv5 (maintaining two separate routing tables).
In case quicker discovery methods are exhausted, stronger nodes can walk the Ethereum discv5 network and search for Waku2 supporting nodes. If they find Waku capable nodes, they can insert these into the Waku2 discv5 routing table.

If there are no strong objections, we could go for the separate Waku discv5 only option first, and after thorough testing and dogfooding, decide which route to go.

Any opinions on this?

@arnetheduck:

The fraction of nodes supporting Waku within this network is small, which leads to a needle-in-a-haystack problem.

It’s not so much of a needle-in-a-haystack problem, as a needle problem. It’s clearly the case that lookups will feel more snappy in the case where only waku nodes populate the system, but it also renders the setup easier to shut down - even telegram for example supports multiple discovery methods, including via sms.

Hence, it makes sense to make room for multiple discovery strategies until the needle has become a football - when you reach the football size, what should the lookup situation be?

This would only help significantly, if these Waku supporting nodes were stable

Distributed systems typically come with stable bootnodes - one way of getting an initial set of waku nodes more quickly is to tweak these bootnodes to deliver “more” waku nodes than other nodes

But these stable nodes would be stable in a separate network, too

In a separate network, if the stable nodes are taken down, there aren’t many options on the table - if instead the information is also disseminated to a wider network, it becomes more difficult to shut it down, mainly because you can no longer selectively shut down one network and not the other.

This is often the case with communications systems: you want to shut down the chat, but if that means also shutting down ethereum, the economics are different.

Basically, the same mechanism that makes it trivial for waku nodes to run a separate network is the trivial network rule you need to put in your firewall to make it not work.

I would generally consider this an important property to bake into the design early, when building a resilient system, and the discovery process is the first link in the resiliency chain.

@kdeme:
It is obviously better to have and use only one discovery network, but it needs to be usable for Waku node lookups.
I think perhaps one valid critique is that this has not really been assessed in practice? Or has it?
The Ethereum discv5 network consists of +10K nodes. How many Waku2 nodes would be running at start? How long would it take to find them? Is that usable (on mobile)? What when combined with the other discovery methods? I don’t have a good view right now on the state of the Waku2 project in “production”.

It is also important to think well about what brings which security guarantee exactly.

Filtering nodes before adding them to the routing table will drop your “1 network” security completely imo. It would become much easier to eclipse your routing table, especially with a low amount of non malicious Waku nodes.
This is the reason why I have so far preferred the clean separation, at least it is clear then.

Having one discovery network also doesn’t set you free from eclipses on the next layer. When discovery can find very few nodes, you will need to have other measures in place (which you ideally have anyhow). For example, if you don’t have a good incoming connection limit set on libp2p, and outgoing connections are barely happening due to slow discovery, eclipse becomes easier.

Or consider querying a bunch of nodes in the routing table for nodes: while all will/might return nodes, most will get filtered out on the waku ENR field. I think it is clear that this makes the lookup more vulnerable to one malicious node returning a lot of Waku nodes compared to the others. One would typically have to do something like sort all returned nodes based on target distance and keep only the n closest (which is what a lookup normally does).
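
The "sort by target distance and keep the n closest" step can be sketched as below. This is only an illustration of the general Kademlia-style technique, assuming node IDs are plain integers compared by XOR distance; the function name is hypothetical and this is not the nim-eth lookup code.

```python
def closest_nodes(target: int, candidates: list[int], n: int = 16) -> list[int]:
    """Keep the n candidates closest to target by XOR distance (Kademlia metric).

    Duplicates are removed first, so one responder repeating or flooding
    entries cannot claim more than its share of the n result slots.
    """
    unique = set(candidates)
    return sorted(unique, key=lambda node_id: node_id ^ target)[:n]


# Toy example with 4-bit IDs: 0b1001 is at distance 1, 0b1100 at distance 4,
# 0b0000 at distance 8 from the target 0b1000.
print(closest_nodes(0b1000, [0b1001, 0b0000, 0b1100, 0b1001], n=2))
```

Because the result is bounded by distance to the target rather than by who answered, a single node returning many Waku ENRs gains less influence over the final set.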

Anyway, I assumed that the process would be to, for now, use a separated network, as the main network is unusable (however is it really?), to eventually move to the main discovery network when either there are enough nodes, or there is discovery topic registry implemented, or both. However, there is the risk that moving from discovery network at a later stage is something that might not happen, ever.

Distributed systems typically come with stable bootnodes - one way of getting an initial set of waku nodes more quickly is to tweak these bootnodes to deliver “more” waku nodes than other nodes

This could be a good initial help, until there are more nodes (It does feel a bit like a hack though). It adds extra reliance on the bootstrap nodes (centralization), but might be the lesser evil compared to a new network? However, I’m unsure on how to implement it atm, one would have to be careful not to open up a possibility of eclipsing the bootstrap node. (Node filtering should happen on the outgoing data, not the incoming).

Another point that is different from eth2: there is no need to continuously find new nodes (that is, because of the subnet walking in eth2). So once you have a decent set of nodes, traffic on discv5 would be just to maintain the routing table.

@jm-clius:
Another perspective is that the Waku v2 integration effort in Status will soon need a general discovery mechanism that will work across multiple clients, but with a very small number of production nodes at the beginning. The separated network seems to me to provide the most practical first step with the least amount of uncertainty to achieve this, while we investigate how usable discv5 over the main network would be. Agree that this will need active prioritisation from our side to ensure this does happen (or that we at least have experimental backing of our assumptions).

ksr · March 2, 2022, 5:16pm

Here is a rough estimation on the overhead introduced by the needle-in-a-haystack problem.
(This problem occurs if we use the existing Ethereum discv5 network.
Imo, a hybrid approach would be the best. I will describe two possibilities in a follow-up post.)

Given the following parameters:

$n$ number of total nodes participating in discv5
$p$ percentage of nodes supporting Waku
$W$ the event of having at least one Waku node in a random sample
$k$ the size of a random sample (default = 16)
$\alpha$ the number of parallel queries started
$b$ bits per hop
$q$ the number of queries

A query takes $\log_{2^b} n$ hops to retrieve a random sample of nodes.

$P(W) = 1 - (1-p/100)^k$ is the probability of having at least one Waku node in the sample.

$P(W^q) = 1 - (1-p/100)^{kq}$ is the probability of having at least one Waku node in the union of $q$ samples.

Expressing this in terms of $q$, we can write:
$P(W^q) = 1 - (1-p/100)^{kq} \iff q = \log_{(1-p/100)^k}(1-P(W^q))$
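
Spelling out the intermediate steps (this only rearranges the formula above; switching to the natural logarithm is a convenience chosen here, since it is easier to evaluate than a logarithm with base $(1-p/100)^k$):

$$
\begin{aligned}
(1-p/100)^{kq} &= 1 - P(W^q) \\
kq \cdot \ln(1-p/100) &= \ln(1-P(W^q)) \\
q &= \frac{\ln(1-P(W^q))}{k \cdot \ln(1-p/100)}
\end{aligned}
$$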

How many hops to get a Waku node with 90% probability

Example 1

Assuming $n=200000$ and $p=10\%$, which corresponds to 20000 Waku nodes.

$0.9 = 1 - (1-10/100)^{16q} \Rightarrow q \approx 1.37$

Which means, 2 queries would be enough to get a Waku node with high probability.
However, when deploying Waku discv5, the number of waku nodes will be way smaller.

Example 2

Assuming $n=200000$ and $p=0.1\%$, which corresponds to 200 Waku nodes.

$0.9 = 1 - (1-0.1/100)^{16q} \Rightarrow q \approx 144$

With 200 Waku nodes, we would need roughly 150 queries, which leads to $\approx 150 \cdot 18 = 2700$ overlay hops (with $\log_2 200000 \approx 18$ hops per query).
Choosing $b=3$ would reduce the number to $\approx 150 \cdot 6 = 900$; even when choosing $\alpha = 10$ we would have to wait at least 90 RTTs.
Note that this is just for retrieving a single Waku node; ideally, we want at least 3 Waku nodes for bootstrapping Waku relay.

Estimate in Discv5 docu

The discv5 docu roughly estimates $p=1\%$ to be the threshold for acceptably efficient random walk discovery. This is in line with this post.

$0.9 = 1 - (1-1/100)^{16q} \Rightarrow q \approx 14$
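
The numbers in these examples can be reproduced with a few lines of Python. This is a sketch of the estimate from this post under its own assumptions (independent uniform samples of size $k$, $\log_{2^b} n$ hops per query); the function names are made up for illustration.

```python
import math


def queries_needed(p_percent: float, k: int = 16, target: float = 0.9) -> float:
    """Number of random samples q such that P(W^q) reaches the target probability."""
    miss = 1.0 - p_percent / 100.0  # probability that a single sampled node is not a Waku node
    return math.log(1.0 - target) / (k * math.log(miss))


def hops_per_query(n: int, b: int = 1) -> float:
    """Overlay hops per query, log_{2^b}(n)."""
    return math.log(n, 2 ** b)


n = 200_000
for p in (10, 1, 0.1):
    q = queries_needed(p)
    total_hops = math.ceil(q) * math.ceil(hops_per_query(n))
    print(f"p={p}%: q={q:.2f}, about {total_hops} hops with b=1")
```

Rounding $q$ and the hop count up before multiplying gives slightly different totals than the rounded figures used in the text (for example 144 instead of 150 queries for $p=0.1\%$), but the order of magnitude is the same.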

Plot

x-axis: $p$
y-axis: $q$

$q = f(p) = \log_{(1-p/100)^k}(1-P(W^q))$ for $P(W^q) = 0.9$

[Plot: plot_query_efficiency_01]

Takeaway

The number of necessary queries is dependent on the percentage $p$ of Waku nodes.
The number of hops per query is logarithmically dependent on $n$.
Random walk searching is very inefficient for a small $p$.
However, random walks are more resilient against attacks (see comments above), which is why I will suggest a hybrid solution in a follow up post.

Feedback?

(math mode seems not to work in the forum)
Wdyt? Should I go into more detail?
I could present the math in a more rigorous way if desired.

ksr · March 2, 2022, 8:16pm

I opened a discv5 issue on the discovery efficiency vs resilience tradeoff.

fryorcraken · March 4, 2022, 12:52am

Do you refer to this peer-exchange proposal?
This is a possibility. As far as I understand peer-exchange, it could be orthogonal to the query process I described.
The actual discovery of (new) random nodes can be transparent to the querying light node.
So when a light node queries a full DHT node, the DHT node could use peer-exchange to

send (a subset of) its own peers, or

send a set of nodes randomly selected via DHT queries

The first solution would be faster, but might cause hot spots.
This would allow the light nodes (browser/mobile) to just implement one method of retrieving peers, for instance peer-discovery as you suggested, and the full DHT node could decide how to actually get these peers.
Maybe (in a future iteration), we could allow the light node to send an optional indication on which query method (provide own peers, do random query) is preferred and the DHT node could confirm this in the response.

Ah thank you, I was only aware of peer exchange on gossipsub prune and had in mind to do something similar that works outside the gossipsub/waku relay context.

I think Simple, minimal peer exchange protocol (PEX) is a good start and we can extend later to add a logic that piggy backs from discv5 events.

I don’t think the lookup operation is relevant at this stage for Waku.

PEX proposes that a stream be opened where, upon connection, each peer sends its own advertisement record.

Once this is done, each peer could then send its own connected peers (or a subset of them) (1).

Then, whenever the service node discovers a node via discv5, it could forward this node's details via PEX (2).

So in short, (1) + (2) of your proposal.

We could start this way, which means that the light peer relies on its service node to perform discv5 at some point so that the light node can get added peers (2).

In the future, if need be, we could then extend the protocol similar to the Lookup operation: a light client could ask for more peers, which in turn could trigger a discv5 search by the service node.
This lookup operation can be useful if a client is looking for a specific capability (e.g. Waku Store, Waku Filter, Waku Light push) or even transport (e.g. secure websocket, webrtc direct).
Hence, we may want to be able to specify capabilities as part of the lookup operation.

We also need to be wary of possible DDoS attacks if too many light clients trigger too many discv5 searches at the same time.

Yes. After a wipe, they would have to use the DNS again to find bootstrap nodes.
I guess, that should be fine. Wdyt?

Yes, I think it’s fine; it just means we need to optimize for quick discovery of new nodes from a light client pov.

Meaning that “(1) send a subset of its own peers” is critical so that as soon as PEX is setup, the light client can get new peers.
ie, the light client does not need to wait for a discv5 search to terminate before it can get peers.
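
To make the (1) + (2) flow concrete, here is a rough sketch of how a service node might behave. Everything in it is hypothetical: the class, the callback-based "stream", and the `max_initial` parameter are invented for illustration and do not correspond to an existing PEX or Waku API.

```python
import random
from typing import Callable, List, Set


class ServiceNode:
    """Hypothetical service node serving light clients over a PEX-like stream."""

    def __init__(self, peers: Set[str], max_initial: int = 5) -> None:
        self.peers = set(peers)          # peers this node currently knows
        self.max_initial = max_initial   # size of the subset sent on connect
        self.clients: List[Callable[[List[str]], None]] = []

    def on_light_client_connect(self, send: Callable[[List[str]], None]) -> None:
        # (1) immediately send a random subset of our own peers,
        # so the light client does not wait for a discv5 search
        count = min(self.max_initial, len(self.peers))
        send(random.sample(sorted(self.peers), count))
        self.clients.append(send)

    def on_discv5_discovered(self, node: str) -> None:
        # (2) forward each newly discovered node to all connected light clients
        if node in self.peers:
            return  # already known, nothing new to forward
        self.peers.add(node)
        for send in self.clients:
            send([node])


received: List[List[str]] = []
service = ServiceNode({"peer-a", "peer-b", "peer-c"}, max_initial=2)
service.on_light_client_connect(received.append)  # triggers (1)
service.on_discv5_discovered("peer-d")            # triggers (2)
```

Sending a random subset in (1) rather than always the same peers is one simple way to reduce the hot-spot risk mentioned earlier in the thread.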