Archiving Waku message on Codex

fryorcraken · November 25, 2024, 1:43am

Throwing some ideas out there. Trying to build a mental model of how we can handle devices that were offline “for too long”. Keen to get some opinions. Experimenting and dogfooding will likely drive the final design.

Scenario

Alice was offline for a while. She comes back online and starts retrieving the latest messages from a store node.

Thanks to SDS [1], she realizes she is missing message 12345 from SDS’ causal history.
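
A minimal sketch of how such a gap could be detected, assuming each message carries its own id plus the ids it causally depends on. The `SdsMessage` shape and the `find_missing` helper are hypothetical names for illustration, not the SDS wire format from [1].

```python
from dataclasses import dataclass, field


@dataclass
class SdsMessage:
    # Hypothetical shape: the real SDS message format is defined in the spec [1].
    message_id: str
    causal_history: list[str] = field(default_factory=list)


def find_missing(received: list[SdsMessage]) -> set[str]:
    """Return ids referenced in some causal history but never received."""
    known = {m.message_id for m in received}
    referenced = {dep for m in received for dep in m.causal_history}
    return referenced - known


if __name__ == "__main__":
    log = [
        SdsMessage("12344"),
        SdsMessage("12346", causal_history=["12344", "12345"]),
    ]
    print(find_missing(log))  # {'12345'}
```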

Retrieving from store

Alice tries to retrieve message 12345 from store nodes, as per SDS.

How many store nodes should she try before giving up?

Can she infer the timestamp of 12345?

by looking at timestamp of other messages in the same SDS channel?
by having message ordering embedded in the message id?

If she can infer the timestamp, and if store nodes expose their oldest message/time range, she can save queries: connect to store nodes, ask for their time range, and only query a node that is likely to have the message.
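
A rough sketch of that selection step, assuming store nodes advertise the time range they retain (not something the Store protocol is confirmed to expose here). `StoreNode`, `candidate_nodes` and the `max_tries` cap are hypothetical; the cap is one arbitrary answer to "how many store nodes before giving up".

```python
from dataclasses import dataclass


@dataclass
class StoreNode:
    # Hypothetical: assumes a store node advertises the time range it retains.
    peer_id: str
    oldest_ts: int  # Unix seconds of the oldest retained message
    newest_ts: int  # Unix seconds of the newest retained message


def candidate_nodes(nodes: list[StoreNode], window_start: int, window_end: int,
                    max_tries: int = 3) -> list[StoreNode]:
    """Return at most max_tries nodes whose range overlaps the inferred window."""
    overlapping = [
        n for n in nodes
        if n.oldest_ts <= window_end and n.newest_ts >= window_start
    ]
    # Try the nodes that reach furthest back first.
    overlapping.sort(key=lambda n: n.oldest_ts)
    return overlapping[:max_tries]


if __name__ == "__main__":
    nodes = [
        StoreNode("node-a", oldest_ts=1000, newest_ts=5000),
        StoreNode("node-b", oldest_ts=3000, newest_ts=5000),
        StoreNode("node-c", oldest_ts=100, newest_ts=5000),
    ]
    # Alice infers that the missing message was sent between t=500 and t=900.
    print([n.peer_id for n in candidate_nodes(nodes, 500, 900)])  # ['node-c']
```

The window itself would come from the timestamps of neighbouring messages in the same SDS channel, as suggested above.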

Solution 1: ask for re-broadcast

Alice cannot find the message, she asks in the group for a re-broadcast:

Original sender may be the only one that tries to re-broadcast
OR some logic for all group members to wait a random time; if that timer fires and no one has re-broadcast yet, proceed with the re-broadcast
re-broadcast means high bandwidth usage
what if Alice missed thousands of messages? rate limit still applies
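
To make the random-wait idea concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the member names, the `propagation_s` parameter (how long a re-broadcast takes to reach the rest of the group) and both helper functions. It shows that members whose timers fire before the first re-broadcast reaches them will also send, which is where the extra bandwidth comes from.

```python
import random


def draw_delays(members: list[str], max_delay_s: float,
                rng: random.Random) -> dict[str, float]:
    """Each member independently picks a random wait before re-broadcasting."""
    return {m: rng.uniform(0, max_delay_s) for m in members}


def rebroadcasters(delays: dict[str, float], propagation_s: float) -> list[str]:
    """Members whose timer fires before the first re-broadcast reaches them.

    Everyone else sees the re-broadcast in time and cancels their own timer.
    """
    first = min(delays.values())
    return sorted(m for m, d in delays.items() if d <= first + propagation_s)


if __name__ == "__main__":
    delays = {"bob": 1.0, "carol": 1.25, "dave": 4.0}
    print(rebroadcasters(delays, propagation_s=0.5))    # ['bob', 'carol']
    print(rebroadcasters(delays, propagation_s=0.125))  # ['bob']
```

A wider delay range reduces duplicate re-broadcasts but makes Alice wait longer.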

My conclusion:

has potential
but high impact on bandwidth makes it unlikely to be viable
if Alice was offline long enough to miss a message that is no longer in store, how likely is it that she missed lots of messages?
Still not clear how a rebroadcast would work as it would change the Waku message id, which I believe is used in the SDS causal history.

Solution 2: Alice checks on Codex (app logic)

Alice cannot find the message, was it stored on Codex?

Some assumptions:

Messages are bundled on Codex after a while, similarly to how it is done for Status communities and BitTorrent
Retrieval from BitTorrent is not driven by known missing messages (SDS) but by regular checks of the archive afaik, meaning high latency.
There is some way for Alice to find the right bundle
Alice needs to be able to directly access Codex network (not currently possible from the web, is it from mobile?)
If Alice pays for download (I think not implemented now but future plan) she needs to be certain her message is in there

Solution 3: Store node checks on Codex

There is a generalized protocol for messages to be archived on Codex
A store node can check if a message is on codex
Who pays for the generalized archiving → could be handled by app owner, store node just knows how to read codex archive
who pays for download from codex - probably needs to be tied in with Waku incentivization
Solves problem of accessibility from the web
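
A minimal sketch of the fallback described in this solution, under heavy assumptions: `ArchivingStoreNode`, `query_by_hash` and the `fetch_from_codex` callable are hypothetical stand-ins, and nothing here reflects an existing Waku Store or Codex API.

```python
from typing import Callable, Optional


class ArchivingStoreNode:
    """Hypothetical store node that falls back to a Codex archive on a miss."""

    def __init__(self, fetch_from_codex: Callable[[str], Optional[bytes]]):
        self.local: dict[str, bytes] = {}
        # Stand-in for whatever client the node would use to read the archive.
        self.fetch_from_codex = fetch_from_codex

    def query_by_hash(self, message_hash: str) -> Optional[bytes]:
        if message_hash in self.local:
            return self.local[message_hash]
        # Miss: the message may have aged out of the store, so try the archive.
        return self.fetch_from_codex(message_hash)


if __name__ == "__main__":
    archive = {"hash-old": b"archived payload"}  # pretend Codex content
    node = ArchivingStoreNode(archive.get)
    node.local["hash-new"] = b"recent payload"
    print(node.query_by_hash("hash-new"))  # b'recent payload'
    print(node.query_by_hash("hash-old"))  # b'archived payload'
    print(node.query_by_hash("hash-???"))  # None
```

From the client's point of view the query looks the same either way, which is why this variant also works from the web.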

Further thoughts

Keen to understand how feasible 2 & 3 seem from the Codex and Waku PoV.

References

[1] Scalable Data Sync specs: “docs: add SDS protocol for scalable e2e reliability” by jm-clius, Pull Request #108, vacp2p/rfc-index (GitHub)

vpavlin · November 25, 2024, 5:56am

This is a reasonable summary. I don’t think we can avoid 2 and/or 3, as 1 brings quite a few issues: bandwidth, the original sender being gone from the group, no one else willing/able to rebroadcast…

I think both direct (a user from the SDS channel uploads the log of messages) and indirect (store nodes bundle and archive messages) are viable and complementary.

In the case of Qaku.app, someone has to be the authority to upload to Codex (the QA creator). With SDS, anyone would potentially be able to upload as long as there is a way to prove the log contains all the messages. This also allows app devs to handle regular archiving and charge for it via some kind of subscription.

Relying on store nodes checking Codex is interesting as well, although to me it feels like that could/should be a separate protocol, to keep Store interactions simple and fast. An additional protocol focused on archival and fetching of SDS channels would make more sense to me.

I believe both (web and mobile access) are planned, mobile being a bit closer since it is mainly about compiling for ARM and providing bindings to a language like Go (both of which are planned AFAIK, but I am not sure about timeframes).

IMO the download payments would either be handled by having a locally running Codex node with a token-loaded wallet attached, or by an app providing some kind of cache to fetch the app-related content from the Codex network. It is then up to the user how much decentralization and privacy they need/want. But there are probably other approaches to be investigated.

fryorcraken · November 26, 2024, 12:23am

An interesting note is that, when using BitTorrent, WebTorrent can be used to retrieve data directly from the browser. If the data is “pinned” on Codex by the admin via a web interface, that would be ideal. But if the admin has to run a node/app for now, and users are able to retrieve archived data via the web, it can be a good mid-way compromise.

SionoiS · December 5, 2024, 2:13pm

I will mention that Waku Sync could be used with slight modification (to storage).

Similar performance profile to Solution 1.

fryorcraken · April 10, 2025, 6:19am

Following up on “No store in Messaging API - #13 by haelius”

tl;dr: thanks to SDS, we should be able to limit the usage of store time range queries to only the recent past. So a client can grab the tail of an SDS log, and then proceed with hash queries to get the messages listed in the causal history.

This allows us to assume that old messages are identified by their Waku message id.
It then becomes possible for an archive message to be transmitted regularly. Very similar to what @vpavlin did for Qaku. The archive message could contain a list of:

Codex cid
Bloom filter of Waku message ids contained in the bundle

This way, users can tell whether they are likely to find their missing messages in the Codex archive at the given cid.
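
A minimal sketch of such an archive message and the check a user would run against it. The `BloomFilter` here is a toy implementation built on SHA-256, and `ArchiveAnnouncement`, `bundles_to_check`, the filter parameters and the cid strings are all hypothetical; a real design would need to agree on filter size, hash functions and encoding.

```python
import hashlib
from dataclasses import dataclass


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


@dataclass
class ArchiveAnnouncement:
    # Hypothetical message shape, following the two fields listed above.
    codex_cid: str
    message_ids: BloomFilter


def bundles_to_check(announcements: list[ArchiveAnnouncement],
                     missing_id: str) -> list[str]:
    """Return the cids whose filter says the missing id is probably inside."""
    return [a.codex_cid for a in announcements if missing_id in a.message_ids]


if __name__ == "__main__":
    bundle = BloomFilter()
    for msg_id in ("12344", "12345"):
        bundle.add(msg_id)
    announcements = [
        ArchiveAnnouncement("cid-a", bundle),         # placeholder cid
        ArchiveAnnouncement("cid-b", BloomFilter()),  # empty bundle
    ]
    print(bundles_to_check(announcements, "12345"))  # ['cid-a']
```

A match only means "probably in this bundle": a Bloom filter can return false positives, so a user paying for downloads still takes a small risk, while a miss is definitive.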

@vpavlin wdyt?

vpavlin · April 14, 2025, 7:04am

Yeah, I think this sounds good. I did not really think about adding the msg ids to the snapshot in Qaku, since the Q&A log is small and you are basically always pulling the whole Q&A from the snapshot. But for longer logs, or those that are produced and persist across a long period of time and hence might need to be split into multiple CIDs/datasets, this sounds like a sensible approach.