IFT Research Call, January 29th 2025 - Testing Distributed Systems

Minutes

YouTube recording

This is a transcript of the discussion after the call. Feel free to continue the discussion in this topic.

  • Good results:

Q: How do we agree on what “good” looks like for our measurements? What’s your proposal on that? Do you see it as being the project that’s going to use it, or do you see the research team publishing it? Can you tell me about that?

More precisely, do you see the Shadow project as being representative? Is it a situation where it represents concepts such as delays, dropout of relays, and peer connectivity, with knobs and buttons to increase the badness or the goodness of the network, so you can simulate different levels of quality and then consistently show that back to the person who is actually testing?

A: It’s complicated because, in the end, I think it should be a mix between the team that develops the software and, in our case, DST; we can also provide feedback. For example, in the case of Waku, they have been actively interested in the bandwidth usage of discovery v5 and had no idea we had worked on that beforehand. Then, from our side, we discovered issues that we reported, for example the message latency that was mentioned: it’s something that we noticed over the course of the weeks, so we started investigating it by ourselves.

And regarding Shadow, it depends on what you want to do with it, because of the problem with CPU: Shadow assumes that you have infinite CPU. So if, for some reason, your problem is that something takes too much time to compute, you will not be able to see it in Shadow. It depends on the intention of the analysis you want to do. If it’s about bandwidth consumption, Shadow may be more suitable for that. If you want to analyze why the network is misbehaving and why some messages are being dropped, then I would not go with Shadow, just in case, because you cannot be sure whether the infinite-CPU assumption is hiding the cause. From my perspective that is a huge drawback. But again, it depends on what you want to do.

Q: So what you’re saying is that for particular use cases we need to select the target tool set, and you can help us do that. And if there are certain elements of stress that I want to put on my simulation, different tool sets will do that, but you could technically agree those stress tests and simulations in advance of the project going down a certain direction. It could effectively work a bit like a wind tunnel: you can use predictions to tell you what could work, as opposed to going down the road, implementing something, and finding out it doesn’t work. Is that correct?

A: I would say so, yes.

  • Network model:

Q: So my question is kind of technical. You mentioned that you’re using this basically to simulate the network for thousands of nodes, but how is the network itself modeled? What model is usually chosen for the network?

A: It depends on whether you are talking about Kubernetes or about Shadow, because if we are talking about Kubernetes, which is what the results I presented come from, the network is not modeled; it is basically a real network.

Q: It is a real network. And the properties of this real network are unknown or?

A: Yes, you can set up those properties. You can add the latency you want, if you want latency. You can set up packet drops if you want. You can set up almost everything.
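For illustration only (this is not the DST tooling itself): on a real Linux/Kubernetes network, properties like extra latency and packet loss are commonly imposed with the tc/netem queueing discipline. The sketch below shows that mechanism; the helper function names are hypothetical, and it requires root privileges on the node or pod.

```python
import subprocess

def degrade_interface(iface: str, delay_ms: int, loss_pct: float) -> None:
    """Add artificial latency and packet loss to a network interface
    using the Linux tc/netem qdisc (requires root privileges)."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", iface, "root", "netem",
         "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
        check=True,
    )

def restore_interface(iface: str) -> None:
    """Remove the netem qdisc, restoring normal interface behaviour."""
    subprocess.run(["tc", "qdisc", "del", "dev", iface, "root"], check=True)

if __name__ == "__main__":
    # Example: 50 ms of extra latency and 1% packet loss on eth0.
    degrade_interface("eth0", delay_ms=50, loss_pct=1.0)
```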

Q: Maybe let me ask the other way around. If a node sends a message to the network, in a broadcast scenario, how long does it take until all the nodes actually see the message? For that one needs to choose a model of the network: we cannot use an actual part of the internet, so we would model it as a random network or something like this. And if it’s a random network, presumably you choose a random graph. Which models are usually used for this? What types of graphs are used to model it?

A: If you want to analyze the time a message takes to be broadcast to all the nodes, you don’t really know at first the graph you are working on, because that depends on how nodes establish their connections. Actually, this is why I am interested in building this tool to extract the node topology. In the example of Waku, I do not really know: what I do is set up a couple of nodes that act as bootstrap nodes, then all the nodes connect to those bootstrap nodes and the network forms. So I am not aware whether I have isolated clusters, or whether the network structure looks more like an Erdős–Rényi network, a Barabási–Albert network, or something else. I have no information about the graph, so it depends.
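As an illustration of what such an extracted topology could be checked for, here is a minimal sketch using Python and networkx. It assumes the extraction tool dumps one peer-connection edge per line into a hypothetical edges.txt; it flags isolated clusters and gives a rough degree-distribution fingerprint (Erdős–Rényi graphs have a tightly concentrated degree distribution, Barabási–Albert graphs a heavy tail of hubs).

```python
import networkx as nx

# Load the extracted peer-connection topology as an undirected graph.
# "edges.txt" (one "peer_a peer_b" pair per line) is a hypothetical
# output format for the topology-extraction tool mentioned above.
G = nx.read_edgelist("edges.txt")

# Check for isolated clusters: more than one connected component means
# some nodes can never receive a broadcast message.
components = list(nx.connected_components(G))
print(f"{len(components)} connected component(s)")

# Rough structural fingerprint via the degree distribution.
degrees = [d for _, d in G.degree()]
print("min/mean/max degree:",
      min(degrees), sum(degrees) / len(degrees), max(degrees))
```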

A: With scripts you could also fix a certain topology; for the libp2p tests, if you connect the peers in a specific way, you can also enforce a specific topology. To maybe resolve the confusion: what Alberto meant here is that for the Waku tests the topology mainly grows organically; it is what would actually happen in the wild, and it’s very interesting to look at the topologies that form. We could also fix topologies, and with Shadow you can fix topologies. Maybe Farooq can say something about the Shadow topologies we are using.

A: Just to add: the topology basically comes from the routing protocol you are using, for example gossipsub; it could be a random graph, a single overlay, or multiple overlays. As a test scenario, what we generally do is inject enough peers into the network first, and then distribute latencies and bandwidths to create a reasonable number of groups that can model whatever kind of network we want to test on. That can be done both in Shadow and in Kubernetes. For Shadow we have, for example, a script where we can define different kinds of latencies, bandwidths, and similar scenario parameters, and I think similar scripts are also available in DST; they distribute the available bandwidth and latency to the peers to segregate them into groups.
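A minimal sketch of the group-based distribution described above; the group names and the latency/bandwidth values are purely illustrative, not the actual DST or Shadow scripts.

```python
import random

PEERS = [f"peer-{i}" for i in range(1000)]

# Hypothetical latency/bandwidth groups, loosely in the spirit of
# published inter-region internet statistics (values illustrative only).
GROUPS = [
    {"name": "same-dc",          "latency_ms": 2,   "bandwidth_mbit": 1000},
    {"name": "same-continent",   "latency_ms": 40,  "bandwidth_mbit": 100},
    {"name": "intercontinental", "latency_ms": 150, "bandwidth_mbit": 50},
]

# Assign each peer to a group; the resulting table can then be fed to
# whatever mechanism enforces the shaping (tc/netem, simulator config).
assignment = {peer: random.choice(GROUPS) for peer in PEERS}
for peer, grp in list(assignment.items())[:3]:
    print(peer, grp["name"], grp["latency_ms"], "ms")
```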

A: But one point I definitely want to add: one thing that really needs to be looked into is how many clusters there are. If, as in the slide Alberto showed where certain clusters are forming, that becomes the biggest nightmare for any routing protocol: when only a few peers connect a large number of groups, those peers are responsible for relaying on behalf of all the other peers. So yes, that kind of testing becomes very essential, and again it can only be informed by the assumptions of the test scenario, i.e. the exact use case where the application is to run: what should the number of peers, or the expected number of peers, be? What should the internal latencies be? Much of this data can also be taken from generally published internet statistics, such as the common latencies within and between countries, or the largest latencies to far regions. Putting in those latencies so that the simulation actually reflects the real network definitely makes it a better model.
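For illustration, the “few peers connecting many groups” situation can be detected on an extracted topology with standard graph measures; a sketch assuming the same hypothetical edges.txt as above:

```python
import networkx as nx

G = nx.read_edgelist("edges.txt")  # same hypothetical extraction output

# Articulation points: peers whose failure disconnects the graph.
# These are exactly the few peers that bridge a large number of groups
# and relay traffic on behalf of everyone else.
cut_peers = list(nx.articulation_points(G))

# Betweenness centrality highlights peers sitting on many shortest
# paths, i.e. likely relays even when the graph stays connected.
central = sorted(nx.betweenness_centrality(G).items(),
                 key=lambda kv: kv[1], reverse=True)[:10]

print("critical bridging peers:", cut_peers)
print("top relay candidates:", central)
```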

A: If you want to test a fixed, specific topology, you can do that too in both. For Kubernetes you just have to script it a bit more manually, and for Shadow you can do fixed topologies.
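A minimal sketch of scripting such a fixed topology: Shadow can load its network graph from a GML file, so one option is to generate the graph with networkx and export it. The edge attributes Shadow expects (latency, packet loss, bandwidth) should be checked against the docs for the Shadow version in use; "latency" below is only a placeholder.

```python
import networkx as nx

# Build a fixed topology, e.g. a small Erdős–Rényi random graph,
# and export it as GML for the simulator to consume.
G = nx.gnp_random_graph(100, 0.05, seed=42)
for u, v in G.edges():
    G.edges[u, v]["latency"] = "50 ms"  # placeholder edge attribute

nx.write_gml(G, "topology.gml")
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges written")
```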

A: Maybe just adding one more thing regarding the topologies: with gossipsub, you have peer connections that form one topology, then you have the full-message mesh, which is another topology, and you have the gossip topology. So there are layers of topology, and it also depends on which topology we talk about. For the gossipsub tests we need to fix just the peer-connection topology, but the mesh topology builds itself. For the Waku tests, even the peer-connection topology gets built organically, because nodes detect each other using discv5 and Waku discovery methods, and we want to evaluate that part as well.
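For illustration, the layering can be checked on extracted data. The sketch below assumes two hypothetical edge-list dumps, one for the peer-connection graph and one for the gossipsub full-message mesh, and verifies that the mesh is a sparser subgraph of the connection layer that still reaches everyone.

```python
import networkx as nx

# Hypothetical inputs extracted from node logs: the full
# peer-connection graph and the gossipsub mesh overlay.
connections = nx.read_edgelist("peer_connections.txt")
mesh = nx.read_edgelist("mesh_edges.txt")

# Sanity check: every mesh link must be an existing peer connection,
# since the mesh is built on top of the connection layer.
assert all(connections.has_edge(u, v) for u, v in mesh.edges())

# The mesh should be much sparser but still connected, so that
# full messages can reach every node.
print("connection edges:", connections.number_of_edges(),
      "| mesh edges:", mesh.number_of_edges(),
      "| mesh connected:", nx.is_connected(mesh))
```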