A couple of initiatives fall under the same topic: using specific API technologies across the application stack as a common language.
More specifically: Prometheus, the REST API and the FFI API.
Telemetry
During 2024, we invested effort in reviving the Status telemetry service to learn more about how Status uses Waku. The goal is to improve the reliability, scalability and performance of the Status application and protocols.
One of the learnings is that, in a complex decentralized system, inferring message reliability from user reports is hard.
For example, measuring reliability this way:
- Alice sends message, reports to telemetry service she sent the message
- Bob receives Alice’s message, reports to telemetry service he received it,
- Carol does not receive Alice’s message, reports to telemetry service she did not receive it (?!),
- We can state we have 50% reliability (no we cannot).
You can infer from step 3 that this is a complex problem to solve: Carol cannot report a message she never learned about. Hence, all the new telemetry we added is much simpler and more streamlined:
- Alice reports how many peers she is connected to;
- Alice reports dial failures and reasons when connecting to peers;
- Alice reports messages she discovered via a store query;
- etc.
As we are tackling bandwidth measurements via telemetry, we can look at the number and size of messages sent over a period of time, to better understand how much traffic a single app instance produces.
Simpler metrics allow us to handle them with Prometheus monitoring tooling. This lets us look at metrics from an instance running locally, e.g. by feeding metrics generated by Status Desktop to a local Grafana instance.
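To make "simpler metrics" concrete, here is a minimal sketch of how the reports listed above could be rendered in the Prometheus text exposition format. It uses only the Python standard library. The metric names (status_waku_connected_peers and so on) and the reason label are hypothetical placeholders, not the names used by Status or nwaku.

```python
from collections import Counter


def render_metrics(connected_peers, dial_failures, store_discovered, bytes_sent):
    """Render a few simple metrics in the Prometheus text exposition format.

    All metric and label names are hypothetical placeholders.
    """
    lines = [
        "# TYPE status_waku_connected_peers gauge",
        f"status_waku_connected_peers {connected_peers}",
        "# TYPE status_waku_dial_failures_total counter",
    ]
    # One sample per failure reason, sorted for stable output.
    for reason, count in sorted(dial_failures.items()):
        lines.append(f'status_waku_dial_failures_total{{reason="{reason}"}} {count}')
    lines += [
        "# TYPE status_waku_store_discovered_messages_total counter",
        f"status_waku_store_discovered_messages_total {store_discovered}",
        "# TYPE status_waku_sent_bytes_total counter",
        f"status_waku_sent_bytes_total {bytes_sent}",
    ]
    return "\n".join(lines) + "\n"


# Made-up sample values, for illustration only.
failures = Counter(["timeout", "timeout", "connection_refused"])
text = render_metrics(connected_peers=12, dial_failures=failures,
                      store_discovered=3, bytes_sent=48213)
print(text)
```

Serving text like this over HTTP is all a local Prometheus server needs in order to scrape the app, and Grafana can then chart it.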
Telemetry at all layers
Prometheus is already heavily used in nwaku to help us monitor the nodes we run. This means all those metrics become accessible when swapping go-waku for nwaku in the Status app (go-waku also has some metrics).
As we have matured those metrics while monitoring nwaku node fleets, we can get more out of that effort by using them to improve the Status app.
Furthermore, by integrating Prometheus at all layers of the stack, we can enable further monitoring and investigation of higher-level behaviour (chat protocol), such as counting messages sent by a given feature.
Such an approach can go beyond analysing software behaviour for improvement purposes. Having Prometheus metrics emitted from the Status Communities protocol could give community owners insight into their own community by running a pre-defined Grafana instance locally:
- Rate of members joining
- Activities per channel
- Most active members
- etc
Similar to what Discord metrics offer to large servers, but fully local and sovereign.
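As a rough illustration of the insights listed above, the sketch below aggregates a handful of community events into a join rate, per-channel activity and the most active member. The event tuple layout and all values are made up for the example. In a real setup the protocol would more likely emit labelled Prometheus counters and leave this arithmetic to Grafana queries.

```python
from collections import Counter

# Hypothetical community events: (unix_timestamp, kind, channel, member).
events = [
    (1000, "join", None, "alice"),
    (1010, "message", "general", "alice"),
    (1020, "message", "general", "bob"),
    (1030, "join", None, "carol"),
    (1040, "message", "dev", "alice"),
]


def community_stats(events, window_start, window_end):
    """Aggregate events in [window_start, window_end) into simple counters."""
    in_window = [e for e in events if window_start <= e[0] < window_end]
    joins = sum(1 for _, kind, _, _ in in_window if kind == "join")
    per_channel = Counter(ch for _, kind, ch, _ in in_window if kind == "message")
    per_member = Counter(m for _, kind, _, m in in_window if kind == "message")
    return {
        "joins_per_second": joins / (window_end - window_start),
        "messages_per_channel": dict(per_channel),
        "most_active": per_member.most_common(1),
    }


stats = community_stats(events, 1000, 1100)
print(stats)
```

Everything here is computed from data the community owner's own node already sees, which is what keeps the approach local.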
Aggregation
We are also considering aggregating metrics from users’ instances into a common instance, to replace the custom telemetry server (fully opt-in).
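A minimal sketch of what such aggregation amounts to, assuming each opted-in instance reports a snapshot of its counters. The instance IDs and metric names are hypothetical. With an actual Prometheus server this would more likely be done at query time over an instance label than in application code.

```python
from collections import defaultdict


def aggregate_counters(snapshots):
    """Sum counter samples reported by several opted-in instances.

    `snapshots` maps an instance ID to {metric name: counter value}.
    """
    totals = defaultdict(float)
    for samples in snapshots.values():
        for name, value in samples.items():
            totals[name] += value
    return dict(totals)


# Hypothetical instance IDs and metric names.
snapshots = {
    "desktop-a": {"sent_messages_total": 40, "sent_bytes_total": 9000},
    "mobile-b": {"sent_messages_total": 10, "sent_bytes_total": 2500},
}
totals = aggregate_counters(snapshots)
print(totals)
```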
Again, any metric we add to measure reliability or performance can easily become available to any user, whether IFT CC or not, for their own local instance; potentially even for an instance running on a mobile device on the same local network.
This becomes particularly interesting once Waku incentivization matures, giving users easy insight into the performance of their node.
With Grafana, this can be done without modifying the App UI. Anyone can create a new dashboard, outside the app release cycle.
Next steps
The Waku team is shifting its focus to the performance of Status chat protocols, to ensure scalability and apply solutions such as RLN.
For the new measurements that need to be done, opting for a Prometheus-first approach will allow us to measure and iterate faster, and is also more future-proof: any metric added may be useful in various contexts, which is not the case for the telemetry server metrics.
REST API
While onboarding new node operators isn’t a priority, we do provide some light, best-effort support to community node operators. This is primarily to help us with dogfooding and experimenting with new Waku features, as well as to have some infrastructure for developers building on Waku.
From those efforts, there is a clear need for a multi-tool CLI [1] [2].
Ideally, a Waku CLI that feeds from the nwaku REST API would help investigate issues by providing:
- current connectivity health state,
- RLN state: membership registered or not,
- specific log entries,
- general node information.
It could also trigger specific actions: connect to a given store peer, get the node’s current ENR, list discovered ENRs, etc.
Most of those actions are not a good fit for Prometheus. When they are, the output may still be better formatted on the command line than in a Grafana screenshot.
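The read-only part of such a CLI can be very small. The sketch below maps subcommands to REST paths and prints the JSON response, using only the Python standard library. The tool name waku-cli is hypothetical. The three paths and the default port 8645 are believed to match the nwaku REST API, but treat them as assumptions and check them against the API reference for the version you run.

```python
import argparse
import json
import urllib.request

# Subcommand -> REST path. Treat these paths as assumptions to verify
# against the nwaku REST API reference for the version you run.
ROUTES = {
    "health": "/health",
    "info": "/debug/v1/info",
    "peers": "/admin/v1/peers",
}


def build_url(base_url, command):
    """Map a CLI subcommand to the full REST URL it should query."""
    return base_url.rstrip("/") + ROUTES[command]


def fetch(url, timeout=5.0):
    """GET the URL and return the decoded body (parsed as JSON when possible)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body  # some endpoints may answer with plain text


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="waku-cli")  # hypothetical tool name
    parser.add_argument("--node", default="http://127.0.0.1:8645",
                        help="base URL of the node's REST API (assumed default)")
    parser.add_argument("command", choices=sorted(ROUTES))
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point: query the node and pretty-print the response."""
    args = parse_args(argv)
    print(json.dumps(fetch(build_url(args.node, args.command)), indent=2))


# No network access needed to see how a subcommand resolves:
print(build_url("http://127.0.0.1:8645/", "info"))
```

Calling main(["info"]) against a running node would print its general information. Adding a new read-only command is one more entry in ROUTES, which is what keeps such a tool outside the app release cycle.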
Looking beyond node operators, we can see how such a tool would be useful to investigate Waku in the context of Status. Look at the node management tab of the Status Desktop app (or even the “peer count” screen of Status Mobile).
To update those tabs and screens, a Status release has to happen, along with a full-stack change (from exposing an API to updating the Qt and React UI).
A CLI that can consume the REST API of the Waku node running in the Status app would permit further investigation and debugging with minimal effort.
This is especially true when adding new protocols to the Status stack. For example:
- SDS (e2e reliability): dump the current incoming buffer [3] to see if a message has been received and, if so, which part of its causal history is missing.
- RLN: dump the current state of the membership as seen onchain (registered, expired, etc.), coupled with Prometheus stats on the number of messages blocked from sending due to the rate limit.
- message <id>: dump information about a given message: when received, whether discarded, etc.
- Trigger a store request with specific details: to understand if a message is not seen or is not properly processed by the app.
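The SDS and message <id> examples above can be sketched as a single lookup. This assumes a simplified, hypothetical view of the SDS state: a set of delivered message IDs plus an incoming buffer whose entries list the message IDs they causally depend on. The real data structures are defined by the SDS specification [3] and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class BufferedMessage:
    """Simplified, hypothetical view of a message held in the incoming buffer."""
    message_id: str
    causal_history: list = field(default_factory=list)  # IDs this message depends on


def dump_message(message_id, incoming_buffer, delivered_ids):
    """Report whether a message was received and which dependencies are missing."""
    if message_id in delivered_ids:
        return {"id": message_id, "state": "delivered", "missing": []}
    for msg in incoming_buffer:
        if msg.message_id == message_id:
            # "Missing" here means: referenced in the causal history but not yet delivered.
            missing = [dep for dep in msg.causal_history if dep not in delivered_ids]
            return {"id": message_id, "state": "buffered", "missing": missing}
    return {"id": message_id, "state": "unknown", "missing": []}


delivered = {"m1", "m2"}
buffer = [BufferedMessage("m5", causal_history=["m2", "m3", "m4"])]
print(dump_message("m5", buffer, delivered))
```

A "buffered" answer with a non-empty missing list points at the exact messages a follow-up store request could target.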
Some of those actions may be worth migrating to the GUI for all users. Most won’t be, but they will be useful to IFT CC when investigating specific problems.
REST API and FFI
From the examples listed, we can see that the API is likely to be useful from within the app (FFI/Golang) as well. This leads to a point that would need its own article: the merge of the REST API and FFI.
To enable such a powerful CLI feeding from the REST API, the REST API needs to be easy and cost-effective to maintain. The easiest way to achieve this is to ensure that the REST API is a thin layer over the FFI API, so that a new method available on the FFI API (sds_get_incoming_buffer) is easily and automatically exposed on the REST API.
There are a number of technical challenges, e.g. how to make the APIs of various libraries (SDS, MVDS, nwaku, chat) available on one REST endpoint.
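One possible shape for such a thin layer, sketched in Python for brevity: functions are registered once under a library namespace, and a generic dispatcher derives the route from the library and function name, so no REST code is written per method. The registry, the /<library>/<function> route scheme and the stand-in function bodies are all assumptions made for illustration. Only the name sds_get_incoming_buffer comes from the text above.

```python
import json

# Registry of FFI-style functions, keyed by (library, function name).
FFI_REGISTRY = {}


def ffi_export(library):
    """Decorator that registers a function as part of a library's FFI surface."""
    def register(func):
        FFI_REGISTRY[(library, func.__name__)] = func
        return func
    return register


# Stand-ins for real FFI bindings; bodies and return values are made up.
@ffi_export("sds")
def sds_get_incoming_buffer(channel_id):
    return [{"channel": channel_id, "message_id": "m5"}]


@ffi_export("waku")
def waku_get_peer_count():
    return 12


def handle_request(path, params=None):
    """Dispatch '/<library>/<function>' to the registered function.

    Returns (HTTP status code, JSON body). Registering a function is
    enough to make it reachable; no per-method REST code is needed.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 2 or tuple(parts) not in FFI_REGISTRY:
        return 404, json.dumps({"error": "unknown method"})
    result = FFI_REGISTRY[tuple(parts)](**(params or {}))
    return 200, json.dumps(result)


print(handle_request("/sds/sds_get_incoming_buffer", {"channel_id": "general"}))
print(handle_request("/waku/waku_get_peer_count"))
```

Namespacing routes by library is one way to put several libraries behind a single endpoint. Argument validation, errors raised by the underlying call and asynchronous FFI callbacks are left out of this sketch.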
References
- [1] nwaku-compose/setup_wizard.sh at master · waku-org/nwaku-compose · GitHub
- [2] feat: interactive command-line for running node by darshankabariya · Pull Request #136 · waku-org/nwaku-compose · GitHub
- [3] docs: add SDS protocol for scalable e2e reliability by jm-clius · Pull Request #108 · vacp2p/rfc-index · GitHub