Prometheus, REST and FFI: Using APIs as common language

There are a couple of initiatives that fall under the same topic: using specific API technologies across the application stack as a common language.

More specifically: Prometheus, the REST API and the FFI API.

Telemetry

During 2024, we invested effort in reviving the Status telemetry service to learn more about how Status uses Waku, the goal being to improve the reliability, scalability and performance of the Status application and protocols.

One of the learnings is that in a complex decentralized system, attempting to infer message reliability from user reports is hard.
For example, measuring reliability this way:

  1. Alice sends a message and reports to the telemetry service that she sent it,
  2. Bob receives Alice's message and reports to the telemetry service that he received it,
  3. Carol does not receive Alice's message and reports to the telemetry service that she did not receive it (?!),
  4. We can state we have 50% reliability (no we cannot).

You can infer from step 3 that this is a complex problem to solve. Hence, all the new telemetry we added was much simpler and more streamlined:

  • Alice reports how many peers she is connected to;
  • Alice reports dial failures and reasons when connecting to peers;
  • Alice reports messages she discovered via a store query;
  • etc

As we are tackling bandwidth measurements via telemetry, we can look at the number and size of messages sent during a period of time, to better understand how much traffic is produced by a single app instance.

Simpler metrics allow us to use the Prometheus monitoring solution to handle them. This can enable us to look at metrics from an instance running locally, e.g. by feeding metrics generated by Status Desktop to a local Grafana instance.
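
As a rough illustration of what feeding locally generated metrics to Grafana could look like on the Go side, here is a minimal sketch using the prometheus/client_golang library; the metric name, gauge value and port are placeholders, not the actual status-go implementation.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// connectedPeers is an illustrative gauge; the real metric names used by
// status-go / nwaku may differ.
var connectedPeers = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "waku_connected_peers",
	Help: "Number of peers the local Waku node is currently connected to.",
})

func main() {
	prometheus.MustRegister(connectedPeers)

	// The connectivity code would keep the gauge up to date, e.g.:
	connectedPeers.Set(12)

	// Expose /metrics locally so a Prometheus + Grafana setup can scrape it.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("127.0.0.1:2112", nil))
}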

Telemetry at all layers

Prometheus is already heavily used in nwaku to help us monitor the nodes we are running, which means we can access all those metrics when swapping go-waku for nwaku in the Status app (go-waku also has some metrics).
As we have matured those metrics by using them to monitor nwaku node fleets, we can further reap the rewards of this effort by using them to improve the Status app.

Furthermore, by integrating Prometheus at all layers of the stack, we can enable further monitoring and investigation of higher-level behaviour (chat protocol), such as counting messages sent by a given feature, etc.
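
As a minimal sketch, counting messages sent by a given feature could boil down to a single labelled counter at the chat-protocol layer; the metric and function names below are hypothetical, not existing status-go code.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// messagesSent counts outgoing messages per higher-level feature
// (1:1 chat, community channel, etc.). Names are illustrative.
var messagesSent = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "status_messages_sent_total",
		Help: "Messages sent by the chat protocol, per feature.",
	},
	[]string{"feature"},
)

func init() {
	prometheus.MustRegister(messagesSent)
}

// RecordMessageSent would be called from each feature's send path.
func RecordMessageSent(feature string) {
	messagesSent.WithLabelValues(feature).Inc()
}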

Such an approach can be taken beyond analysing software behaviour for improvement purposes. Having Prometheus events emitted from the Status Communities protocol could enable a community owner to gain insight into their own community by running a pre-defined Grafana instance locally:

  • Rate of members joining
  • Activities per channel
  • Most active members
  • etc

Similar to what Discord metrics offer to large servers, but fully local and sovereign.

Aggregation

We are also considering aggregating metrics from users’ instances into a common instance, to replace the custom telemetry server (fully opt-in).

Again, any metrics we may add to measure reliability or performance can easily become available to any user, whether an IFT CC or not, for their own local instance; potentially even an instance running on a mobile device on the same local network.

This becomes particularly interesting once Waku incentivization matures, giving users easy insight into the performance of their node.

With Grafana, this can be done without modifying the App UI. Anyone can create a new dashboard, outside the app release cycle.

Next steps

The Waku team is shifting its focus to the performance of the Status chat protocols, to ensure scalability and apply solutions such as RLN.
For the new measurements that need to be done, opting for a Prometheus-first approach will allow us to measure and iterate faster, and gives us a future-proof solution, as any metric added may be useful in various contexts, which is not the case for the telemetry server metrics.

REST API

While it isn’t a priority to onboard new node operators, we do provide some light, best-effort support to community node operators. This is primarily to help us with dogfooding and experimenting with new Waku features, as well as to have some infra for developers building on Waku.

From those efforts, there is a clear need for a multi-tool CLI [1] [2].
Ideally, a Waku CLI that feeds from the nwaku REST API would help investigate issues by providing:

  • current connectivity health state,
  • RLN state: membership registered or not,
  • specific log entries,
  • general node information.

As well as triggering specific actions: connect to a given store peer, get the node's current ENR, provide a list of discovered ENRs, etc.
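
To make this concrete, a minimal sketch of such a CLI in Go is shown below; it assumes the nwaku REST server listens on its default port and exposes paths such as /health and /debug/v1/info, which should be verified against the current OpenAPI spec before relying on them.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// base is the address of the local nwaku REST server (illustrative default).
const base = "http://127.0.0.1:8645"

// get fetches one REST path and prints the raw response body.
func get(path string) error {
	resp, err := http.Get(base + path)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	fmt.Printf("%s (%s):\n%s\n", path, resp.Status, body)
	return nil
}

func main() {
	// Paths assumed to exist in the nwaku REST API; check the spec.
	for _, path := range []string{"/health", "/debug/v1/info"} {
		if err := get(path); err != nil {
			fmt.Fprintln(os.Stderr, "request failed:", err)
		}
	}
}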

Most of those actions are not appropriate for Prometheus. When they are, the output may be better formatted on the command line than in a Grafana screenshot.

Looking beyond the scope of node operators, we can see how such a tool would be useful for investigating Waku in the context of Status. Look at the node management tab of the Status Desktop app (or even the “peer count” screen of Status Mobile).

To update those tabs/screens, a Status release has to happen, and a full-stack change (from exposing an API to updating the Qt and React UI) needs to be made.

A CLI that can consume the REST API of the Waku node running in the Status app would permit further investigation and debugging with minimal effort.

This is especially true when adding new protocols to the Status stack. For example:

  • SDS (e2e reliability): dump the current [3] incoming buffer to see if a message has been received and, if so, what the missing causal history is.
  • RLN: dump the current state of the membership as seen onchain (registered, expired, etc.), coupled with Prometheus stats on the number of messages blocked from sending due to rate limits.
  • message <id>: dump information about a given message: when it was received, whether it was discarded, etc.
  • Trigger a store request with specific details: to understand whether a message was not seen or was not properly processed by the app.

Some of those actions may be worth migrating to the GUI for all users. Most won't be, but they will be useful to IFT CCs when investigating specific problems.

REST API and FFI

From the examples listed, we can see that the API is likely to be useful from within the app (FFI/Golang) as well. This leads to a point that would need its own article: the merging of the REST API and FFI.

To enable such a powerful CLI feeding from the REST API, the REST API needs to be easy and cost-effective to maintain. The easiest way to do so is to ensure that the REST API is a thin layer over the FFI API, so that a new method available on the FFI API (sds_get_incoming_buffer) is easily and automatically exposed on the REST API.
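
One hypothetical way to make the REST layer that thin is to register FFI methods in a table and expose each entry automatically as a route; the sketch below is illustrative only, and none of these names exist in libwaku today.

package restffi

import (
	"io"
	"net/http"
)

// FFIMethod is the common JSON-in/JSON-out shape assumed for FFI calls.
type FFIMethod func(requestJSON string) (responseJSON string, err error)

// registry maps an FFI method name to its implementation. Adding an entry
// here is all that is needed to expose it over REST as /ffi/<name>.
var registry = map[string]FFIMethod{
	// Hypothetical example; sds_get_incoming_buffer does not exist yet.
	"sds_get_incoming_buffer": func(req string) (string, error) {
		return `{"messages":[]}`, nil
	},
}

// Handler exposes every registered FFI method as a POST route.
func Handler() http.Handler {
	mux := http.NewServeMux()
	for name, method := range registry {
		m := method
		mux.HandleFunc("/ffi/"+name, func(w http.ResponseWriter, r *http.Request) {
			body, _ := io.ReadAll(r.Body)
			resp, err := m(string(body))
			if err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			w.Header().Set("Content-Type", "application/json")
			io.WriteString(w, resp)
		})
	}
	return mux
}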

There are a number of technical challenges, e.g. how to make the APIs from various libraries (SDS, MVDS, nwaku, chat) available on one REST endpoint.

References



:100: on using Prometheus as it allows combining metrics from various libraries (from libp2p to nwaku to Status…) and exposing them in one place, which can then be scraped and displayed locally or pushed to some remote aggregation server.

For the REST API - assuming our OpenAPI spec is up to date, we could just try to generate the client using some OpenAPI tool to get a basic CLI? A move towards using FFI as a base for the REST API probably makes sense mid/long term, though.


Examples

Some existing examples that are worth looking into:

  • bitcoin-cli/bitcoind, lncli/lnd: the daemon runs and exposes a JSON RPC API, and a CLI is available to consume the API.
  • rad (Radicle Seeder's Guide): one binary that does it all, including controlling the daemon.

At this point in time, I would not go into having the CLI start/stop waku because I expect waku to either:

  • run in a status app
  • be deployed via docker (service node)

Do we do it?

Prometheus

In terms of preferring Prometheus metrics in status-go, at least for chat protocol:

I believe it is clearly a high-value/low-effort item for local investigation. It is something we need to do to understand chat protocol message rates and apply rate limits for scalability.

The value for aggregation may be questioned. I think we'll need to look at the specific outputs needed and assess the value of fully moving away from the telemetry service for those items.
It may also be interesting to do the aggregation over Waku (apps send metrics over Waku, a common Grafana instance collects them over Waku) so that it is more aligned with our principles.

REST API + CLI

This seems to be a higher-effort item. There are a few tasks involved:

REST API subsystem routing: the REST API server router should be able to easily delegate specific routes to Waku subsystems.

For example, a POST request on the /sync/foo/bar route should be redirected to the Waku Sync (RBSR) subsystem, which can handle the /foo/bar part (the top router should not need to know about /foo/bar).
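
A sketch of that delegation shape, using Go's standard router for illustration only (nwaku's actual REST server is written in Nim, and the handler below is hypothetical):

package main

import (
	"fmt"
	"log"
	"net/http"
)

// syncHandler stands in for the Waku Sync (RBSR) subsystem. It only sees
// the path relative to its mount point, e.g. /foo/bar.
func syncHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "sync subsystem handling %s %s\n", r.Method, r.URL.Path)
}

func main() {
	top := http.NewServeMux()

	// The top router only knows that /sync/ belongs to the sync subsystem;
	// StripPrefix removes /sync so the subsystem can route /foo/bar itself.
	top.Handle("/sync/", http.StripPrefix("/sync", http.HandlerFunc(syncHandler)))

	log.Fatal(http.ListenAndServe("127.0.0.1:8645", top))
}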

When taking into account subsystems such as SDS, which will be integrated directly into status-go, some work may be needed first to have libwaku provide SDS as a top layer.

REST API / C-binding alignment: the REST API, FFI API and libwaku API should all be the same API translated into different frameworks, so that when a new API is created in libwaku, the work to expose it in REST and FFI is minimal and systematic.

It means merging the JSON RPC [1] and bindings [2] specs into one, as no discrepancy is expected between those APIs.

This makes it high effort. The question is about value.

The immediate value would be an added tool for investigating Status app related issues, both on mobile and desktop (assuming both apps can expose a REST API; local LAN for mobile).

Is there specific information that may be interesting to access when an issue is reported? Especially internally?

  • An IFT CC or Status QA complains about message X missing: could a CLI command that checks the state of message X across all subsystems help investigate? This could include a translation of the message envelope id (chat) to the Waku message id (routing).
  • Could some waku-cli command be added after a Status e2e run to dump information useful for understanding run failures? e.g. ids of messages “seen” at the relay/filter/store layer, a dump of message id to envelope id for all messages for further investigation?

I do not have the necessary hands-on experience to assess the value.

So, to the team: do you see immediate value in such a CLI? If so, what specific functionalities would be most useful?

If uncertain, we could start with a “dirty” PoC that enables said functionalities. Once value is confirmed, we can then schedule the proper work listed above.

References

Metrics over Waku would be ideal.

Having REST, CLI and C bindings all provide access to a common core API would be fantastic. The only problem is that, depending on what information we want, adding an nwaku sub-system that interacts with others like Relay, Archive, etc. is not that easy. We may need to re-architect nwaku somewhat. This is something I have wanted to do for a long time but had no reason to until now.

What I have in mind would be message passing à la Go between nwaku sub-systems.

Although, I agree that the first step should be assessing the value of APIs like this.


At this point in time, I would say that if there is a piece of information that is cross-subsystem, then the CLI would do that part.

Let’s say we want to know all we can about message id 123.

The CLI would be the one to do the following REST API calls:

  • SDS: do you know about 123? What's its causal history? Is it in the causal history of other messages?
  • relay: did you receive or send message 123? If so, when?
  • store: did you proceed with store confirmation for 123, or did you retrieve it from a store node? If so, which store node?
  • etc.

Then the CLI would be the one to synthesize the information.

Meaning that on daemon (wakunode2) side, the subsystems API can remain separate.
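
In code, the synthesis could look like the sketch below; all of the per-subsystem routes are hypothetical placeholders, as such endpoints do not exist yet.

package main

import (
	"fmt"
	"io"
	"net/http"
)

const node = "http://127.0.0.1:8645" // local node REST address (illustrative)

// inspectMessage queries each subsystem about one message id and
// aggregates the answers. The routes below are hypothetical.
func inspectMessage(id string) {
	queries := map[string]string{
		"SDS":   "/sds/v1/message/" + id,
		"relay": "/relay/v1/message/" + id,
		"store": "/store/v1/message/" + id,
	}
	for subsystem, path := range queries {
		resp, err := http.Get(node + path)
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", subsystem, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: %s\n", subsystem, body)
	}
}

func main() {
	inspectMessage("123")
}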

Do note that p2p reliability is a subsystem in itself that consumes other subsystems, and it should provide its own higher-level API, e.g.:

  • yes, I used 2 light push service nodes to send message 1234
  • yes, I received a store confirmation, etc.

But this would not include SDS, so the CLI would still need to do several REST API calls.

I disagree here. For a clean separation, any subsystem that computes stuff should be in nwaku, not outside. REST and CLI should be dumb interfaces to subsystems.

I would not implement a new CLI app and I would not start a REST service in Status apps. We need to reduce the # of components and not add more :slight_smile:

I suggest having a stats API that can be consumed by status-go, through libwaku.
That stats API would return a JSON string with all the desired stats and, given that it is a JSON string, it can be easily extended without changing the stats API signature.

The Status Desktop app already has a mechanism to manage the node. We just need to extend this mechanism to bring the operations we need.

We can even create a super generic API that just accepts a JSON string and returns a JSON string, with libwaku handling the actions underneath according to the input. That wouldn't require additional changes in the Status apps.
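
A rough sketch of that generic JSON-in/JSON-out call, seen from the status-go side; WakuCall and the "method" field are made up for illustration, not an existing libwaku API.

package wakustats

import "encoding/json"

// request is an illustrative envelope; the real contract would be defined
// by libwaku.
type request struct {
	Method string          `json:"method"`
	Params json.RawMessage `json:"params,omitempty"`
}

// WakuCall is a stand-in for a single generic libwaku entry point that
// accepts a JSON string and returns a JSON string.
func WakuCall(requestJSON string) (string, error) {
	// In reality this would cross the FFI boundary into libwaku.
	return `{"error":"not implemented"}`, nil
}

// GetStats shows how status-go could ask for stats without any new FFI
// signatures being added.
func GetStats() (string, error) {
	req, err := json.Marshal(request{Method: "get_stats"})
	if err != nil {
		return "", err
	}
	return WakuCall(string(req))
}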

Summarizing, I think we need a clearer need before creating a separate CLI app. We already have tools that work for an experienced developer; I think we just need to extend them.

Nevertheless, if everyone agrees to create such a new CLI app, I will be happy to help with its implementation ofc :slight_smile:

Regarding the debugging CLI/REST API, the idea sounds cool and is definitely an improvement - however, my personal opinion is to develop it once we find that we are missing such a tool to solve a concrete problem we experience. In other words, don't add complexity while the current tools suffice; once we find that we are missing such a tool for an existing problem, then design a generalized solution (which would likely look similar to the CLI/REST API being proposed).

But without a specific case to solve, I personally wouldn’t invest resources on it


Apologies to the reader, as I took this discussion topic internally for planning; there was no good reason to do so.

The tentative conclusion is as follows:

Metrics

Yes, it would be good to have connectivity statistics available in the Status app, to better understand the connectivity state of the Waku node for debugging purposes. The Prometheus metrics seem well suited for that, as a lot of the metrics we are looking for (number of peers discovered, number of peers connected over time, etc.) already exist in Prometheus.

However, we still need to determine if it makes sense to expose Prometheus in the Status app. I would suggest having an advanced-option toggle to enable Prometheus, or local metrics, so it does not affect performance for the normal user.

In terms of consuming the metrics, the first step would be to have a Docker image that runs Grafana, pointed by default at the Status Prometheus endpoint (some manual config may be needed).

This would allow CCs to investigate and debug easily.

It may make sense at a later stage to integrate Grafana, or some other way to display Prometheus metrics, directly in the app. Especially when needed as a product: similar to Discord community metrics in terms of community trends (most active users and channels, periods of activity, when most users joined, etc.).

Log tracing

Most of the other information we are interested in should be accessible by parsing logs, whether it's matching a Waku message id with a Status envelope id, understanding if a message was received via filter, relay or store, etc.

Hence, the second proposal would be to build such a tool, one that can consume logs from the Status app (both app and Waku/geth logs) and Kibana (Status fleet) to help trace a message, and display the information in an easily consumable manner.
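
As a sketch of that log-tracing idea: the log line format and field names below are invented for illustration, and the real tool would have to match the actual status-go and nwaku log formats.

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// pairRe extracts hypothetical "envelopeID=... messageID=..." pairs from
// log lines, building an envelope-id -> Waku-message-id map. Real
// status-go / nwaku logs will need their own patterns.
var pairRe = regexp.MustCompile(`envelopeID=(\S+).*messageID=(\S+)`)

func main() {
	mapping := map[string]string{}
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		if m := pairRe.FindStringSubmatch(scanner.Text()); m != nil {
			mapping[m[1]] = m[2]
		}
	}
	// Usage: pipe app or fleet logs into the tool, e.g. `cat status.log | logtrace`.
	for envelope, message := range mapping {
		fmt.Printf("envelope %s -> waku message %s\n", envelope, message)
	}
}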

Waku CLI

With the 2 deliverables above, we should be able to make investigation much easier. Hence, at this point in time we do not see the benefit in building a Waku CLI that consumes nwaku’s REST API.

We can review this idea at a later stage, if clear value arises.

The discussion discarded the need for a waku-cli (REST API client) to assist with Status app investigation, in favour of metrics and a log parser.

However, I maintain that all Waku APIs should be aligned as described in Prometheus, REST and FFI: Using APIs as common language

More concretely:

  1. Define a C API for a specific action, as done in [1], for example:
extern int waku_publish(char* messageJSON);

This method:

a. Publishes a message over relay or light push, depending on the mode of the node (edge vs relay)
b. May use several light push service nodes if in edge mode
c. Emits several events such as (skipping how this is handled; it could be done at node level or function level, too detailed for this illustration):
i. error
ii. light push ack received
iii. store confirmation received (if enabled at node level)

  2. Have a method in libwaku (Nim) that implements and exposes this C API

  3. Have a method in the Golang bindings, which would be:

func publish(instance *WakuInstance, msg *pb.WakuMessage) error
  4. Have a method in js-waku, which would be:
class WakuNode {
  public async publish(msg: WakuMessage): Promise<void> {
    // ...
  }
}
  5. Have a REST API route, which would be:
POST /waku/publish
{
  "msg": { ... } // messageJSON
}

Meaning that no matter the language being used, the Waku API remains consistent, with a goal to keep extensibility cost as low as possible.

The need is clear regarding Golang and C API alignment: the aim is to have a clear boundary between Status and Waku (Golang) code, and to ensure the Waku API is as simple as possible. Similarly for Rust.

Regarding js-waku and C, the benefit would be developer experience and future-proofing: if the need arises, a full-stack JS SDK that is nwaku-based in NodeJS/React Native and js-waku-based in the browser would be facilitated by an existing alignment of the APIs.

In any case, we know that we have been guilty of providing complex APIs with too many protocol-level details [2]. Spec'ing the API is a way to ensure we are deliberate about our choices.

Need for REST API

Long term

As native applications integrate Waku and users end up running several embedded Waku nodes on their devices, we expect a need to run a single Waku node that can be used by several applications.
In this context, a REST API aligned with the C API would make sense, allowing several applications to share one Waku node and save bandwidth.

Short/mid term

Short and mid term, the need is not as evident. One argument would be that a REST API is easier to use than a C one; hence, developers that do not (yet) have access to a Waku SDK in their language may prefer to use REST.
However, it is a weak argument, as most languages allow for the use of C libraries.

Conclusion

At this point in time, I do not see a need to tidy up the REST API. However, if we were to do so, aligning the REST API with the common Waku API seems an obvious choice.

References

Running shared nodes across different applications looks like a developer feature; for application users, it adds the extra step of downloading and running a Waku node first. It would only be feasible if we had a convenient and generic UI for it.

The greatest strength I see in a REST or WebSocket API is that such an interface has a clear boundary between client and service node. This separation of concerns is meaningful because normal users don't know how to run a service node efficiently, or don't have the resources to do it; advanced/dev users can easily set up service nodes for normal users through an incentivized and discoverable network, or a federated network via social contacts.

More details on Notion.

Hi! Regarding “Log Tracing”, I have a tool that I think could be easily adapted for both the local and Kibana cases. Should we set up a conversation for this? Or where do you want to talk about it?

Prometheus metrics are already collected and managed in all the layers of the stack. This toggle would hence only enable or disable the /metrics endpoint, and from that perspective, I'd say it would be best to just have it enabled by default:)

The REST API currently seems to be used/useful only for debugging purposes - especially because nodes usually run via Docker and REST is the only easily accessible interface - so adding to the debugging features of REST would make sense to me.

I strongly disagree with this statement:) This is a common pattern - look at Docker or Ollama - both are local client-server applications where clients (CLI or apps) interact with a single server via UNIX socket or REST to make management easier and avoid resource overconsumption (e.g. storage for images/models). I believe we (both Waku and Codex) will have to take a similar approach, where eventually the SDK is a thin client talking to a Waku / Codex node locally via REST or other means.

Are there concerns around performance impact?

I’d say disabled by default for now, especially on mobile. Play with it and review at a later stage.

My point was more around whether we align the REST API with the Messaging API, and whether there is a need for it.

My conclusion is yes, because we want DST and QA to do Waku testing at the messaging API level by default, but not because we want to push the REST API to the market per se. Re debugging, it seems that metrics + a log parser would be the highest-value tools first.

I think it's too early to state what the best scenario will be. I could see a scenario where a Waku app such as Status provides a Waku node for other apps… It also depends on how much fragmentation we have re clusters.

Yeah, there is no point in having it enabled on mobile; I am only talking about desktop. I don't see any performance risk - Prometheus does not do anything (AFAIU) until the endpoint is queried, at which point it generates the content. So I would not be worried about performance on desktop even if it is enabled.

Agree. As I mentioned in Discord, the current REST implementation is also hard to use when there are multiple apps trying to use it (which would make sense for a daemon with an exposed REST API…), so alignment with the Messaging API, with multiple clients using it at the same time, would make sense to me.

Sure, but the communication would still probably happen over the local network (REST or socket) to keep the resource consumption low. Anyway, we can park it for later:)
