Improving Content Topics: A Coordinate-Based Approach for Better Privacy, Distribution and Dev Ex

Current State and Problems

The current content topic format in Waku follows this structure as defined in 23/WAKU2-TOPICS and RELAY-SHARDING:

/{generation}/{application-name}/{version-of-the-application}/{content-topic-name}/{encoding}

For example: /0/myapp/1/mytopic/cbor

Whilst this format serves basic functionality, it introduces several significant challenges:

1. Complexity for Developers
Developers must navigate autosharding decisions early in development:

  • Should they use a fixed app name (concentrating all users in one shard)?
  • Should they use dynamic app names for multi-shard scaling?

This creates an unnecessary barrier to entry and potential lock-in scenarios with difficult upgrade paths.

2. Privacy Concerns
The application name appears in clear text within content topics, completely removing plausible deniability. Any observer can definitively identify which application a given IP address is using based on filter subscriptions, store queries, or light push messages.

3. Uneven Traffic Distribution
When applications use static app-name values, they may create uneven traffic distribution in a shared network, with some shards becoming heavily loaded whilst others remain underutilised.

Proposed Solution: Coordinate-Based Content Topics

I propose a new approach based on application “coordinates” that addresses all these challenges simultaneously, with a simplified content topic format.

New Content Topic Format

/1/<topic>

Where:

  • 1 is the autosharding version (generation)
  • <topic> is the numeric coordinate generated for a user

For example: /1/33024

Core Concept

  1. Application Coordinate: Each application generates a unique coordinate within a defined space (e.g., 0-65535)
  2. Distance Parameter: Applications define a statistical distance within which their content topics are generated
  3. Content Topic Generation: Individual content topics are generated randomly within the distance range of the application coordinate
  4. Coordinate-Based Sharding: Sharding is based on the content topic coordinate, meaning that topics close together are more likely to be in the same shard, simplifying the autosharding approach
  5. Shared Topic Space: Multiple applications may statistically generate identical content topics, enabling plausible deniability

Example Scenario

Consider a network with 8 shards (content topic space 0-65535, each shard covering ~8192 topics):

  • “CoolGame” has coordinate 32768 with distance 4096
  • “SecureChat” has coordinate 40960 with distance 4096

Alice (CoolGame user): Gets content topic 36864 → /1/36864 (distance 4096 from coordinate)
Bob (SecureChat user): Gets content topic 36864 → /1/36864 (distance 4096 from coordinate)

Despite using different applications, Alice and Bob share the same content topic, providing perfect plausible deniability.
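
A quick way to sanity-check this overlap in code (a sketch using the numbers above; the range check treats coordinate ± distance as inclusive):

// Ranges implied by coordinate ± distance
const coolGame   = { min: 32768 - 4096, max: 32768 + 4096 }; // 28672..36864
const secureChat = { min: 40960 - 4096, max: 40960 + 4096 }; // 36864..45056

const sharedTopic = 36864;
console.log(sharedTopic >= coolGame.min && sharedTopic <= coolGame.max);     // true
console.log(sharedTopic >= secureChat.min && sharedTopic <= secureChat.max); // true

// With 8 shards over a 0-65535 space, each shard covers 8192 topics
console.log(Math.floor(sharedTopic / 8192)); // 4 -> Alice and Bob also share shard 4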

Benefits

Simplified Developer Experience

  • Generate a random coordinate once during project setup (tooling provided)
  • Choose a distance parameter (sensible defaults available)
  • Generate user content topics automatically (utilities provided)

Enhanced Privacy

  • Multiple applications can share content topics
  • No clear text application identifiers
  • True plausible deniability for users

Improved Traffic Distribution

  • Statistical distribution prevents hot spots
  • Configurable distance allows scaling control
  • Better load balancing across shards

Optimised Shard Usage

  • Applications can guarantee a maximum number of shards used (based on distance vs shard size)
  • Predictable scaling characteristics

Proposed Algorithms

1. Coordinate Generation

function generateAppCoordinate(spaceSize = 65536) {
    return Math.floor(Math.random() * spaceSize);
}

2. Content Topic Generation

function generateContentTopic(appCoordinate, distance, spaceSize = 65536) {
    // Calculate start of available range (may be negative before wrapping)
    const rangeStart = appCoordinate - distance;

    // Generate a coordinate within the range, wrapped into [0, spaceSize)
    const topic = rangeStart + Math.floor(Math.random() * (2 * distance));
    return ((topic % spaceSize) + spaceSize) % spaceSize;
}

3. Shard Mapping

function contentTopicToShard(contentTopic, totalShards, spaceSize = 65536) {
    const shardSize = Math.floor(spaceSize / totalShards);
    return Math.floor(contentTopic / shardSize);
}

// Example: content topic 36864 with 8 shards
// shardSize = 65536 / 8 = 8192
// shard = Math.floor(36864 / 8192) = 4
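
Putting the three functions together, a developer's flow could look roughly like this (a sketch only; names and defaults follow the functions above, not a committed API):

// One-time application setup (stored in app config)
const appCoordinate = generateAppCoordinate(); // e.g. 32768
const distance = 4096;

// Per-conversation / per-context content topic
const topic = generateContentTopic(appCoordinate, distance); // e.g. 36864
const contentTopic = `/1/${topic}`;

// Routing decision on an 8-shard network
const shard = contentTopicToShard(topic, 8); // e.g. 4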

Scaling Scenarios

This coordinate-based approach handles network scaling elegantly:

Scenario 1: Network-Wide Shard Increase (8 → 16 shards)

When the network decides to increase shards from 8 to 16 due to overall traffic growth:

  • Previous: Each shard covered ~8192 topics (65536/8)
  • New: Each shard covers ~4096 topics (65536/16)
  • Impact: Content topic /1/36864 moves from shard 4 to shard 9
  • Migration: All applications automatically benefit from reduced per-shard traffic
  • Developer Action: If the number of shards is defined in a unique source of truth, such as a smart contract, with time-based information (e.g., from block N of the chain), then applications can automatically upgrade with minimum disruption
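
For illustration, the shard move above can be verified with the contentTopicToShard function defined earlier:

contentTopicToShard(36864, 8);  // shardSize 8192 -> shard 4
contentTopicToShard(36864, 16); // shardSize 4096 -> shard 9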

Applications with a large distance parameter (see Scenario 2) could have their shard range span reduced (flooring at 2).

Scenario 2: Application-Specific Scaling

When an application experiences high user traffic but network shards remain constant (8 shards):

Option A: Increase Distance Parameter

  • Current: distance 4096 (max 2 shards)
  • New: distance 8192 (max 3 shards)
  • Effect: Spreads users across more shards, reducing per-shard load
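
One way to make the distance-to-shard relationship explicit is a small helper along these lines (a sketch; the +1 accounts for a range that straddles a shard boundary):

function maxShardSpan(distance, totalShards, spaceSize = 65536) {
    const shardSize = Math.floor(spaceSize / totalShards);
    // A range of 2 * distance topics can straddle at most this many shards
    return Math.floor((2 * distance) / shardSize) + 1;
}

maxShardSpan(4096, 8); // 2
maxShardSpan(8192, 8); // 3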

Option B: Multiple Coordinate Ranges

  • Deploy additional coordinate ranges for the same application
  • Each range operates independently with its own distance parameter
  • Users are assigned to different ranges during onboarding

Developer Considerations:

  • Monitor per-shard traffic for your application’s coordinate range
  • Adjust distance parameters based on actual usage patterns
  • Consider user experience impact of spreading across more shards

Implementation Considerations

Backward Compatibility: This could be introduced as generation 1 (/1/<topic>) whilst maintaining support for the current generation 0 format.

Tooling Requirements:

  • Coordinate generation utilities
  • Content topic generation libraries
  • Shard calculation helpers
  • Migration tools for existing applications

Configuration Options:

  • Default distance parameters for different application types
  • Space size configuration for different network scales
  • Shard count adaptation algorithms

Request for Feedback

This proposal represents a significant improvement to Waku’s content topic system, addressing developer experience, privacy, and network performance simultaneously.

Key questions for discussion:

  1. What are your thoughts on this proposal and the simplification of the content topic?
  2. Do you see potential issues with this model in terms of new app scalability, especially regarding Status/Chat SDK, compared to the old model?
  3. Do you agree with the overall simplification, reducing code and dev-ex complexity?
  4. The proposed algorithms are extremely simple; are there different or specific algorithms you would like to see employed instead?
  5. Do you see any reason why we should not proceed with this change?

I do like the simplification and added privacy of this!

On the other hand I am probably missing something - for the simplest case where 2 people want to communicate - and they already are “connected” - i.e. know each other’s identifiers, how would they deterministically derive a content topic to publish and subscribe to?

Or for another case - e.g. Qaku - right now I set the content topic based on generated QA ID, so this would flip it - I would set the QA ID based on generated content topic?


Response: Improving Developer Experience with Conversation IDs

After reviewing the coordinate-based proposal, I’ve identified a developer experience issue that needs addressing.

The Problem with Random Generation

The current proposal generates random content topics within an application’s coordinate range. However, this creates a burden for developers:

  1. Storage Requirement: Developers must generate the content topic first, then save it for future reference
  2. State Management: Applications need to track which content topic corresponds to which user session, conversation, or context
  3. SDK Limitations: The Waku SDK cannot handle this storage automatically, since developers need content topics for future reference to specific application events

Proposed Solution: Conversation ID Abstraction

A better developer experience would be to entirely abstract the content topic and let developers generate and handle their own identifiers. Let’s call this a conversation ID.

Developer Workflow

  1. Developer generates or chooses a conversation ID (string) - could be a username, chat room name, session ID, etc.
  2. Developer passes this conversation ID to the Waku API
  3. A content topic is deterministically generated from the conversation ID for this given context (coordinate, distance)

Deterministic Content Topic Generation Algorithm

The conversationIdToContentTopic function would be internal/private to the Waku SDK. When developers use waku.subscribe, they can pass their conversation ID, app coordinate and distance, and internally, the SDK will convert those to a content topic.

// Internal SDK function (not exposed to developers)
// Assumes a sha256 helper returning raw bytes, e.g. in Node.js:
// const sha256 = (s) => require("node:crypto").createHash("sha256").update(s).digest();
function conversationIdToContentTopic(conversationId, appCoordinate, distance, spaceSize = 65536) {
    // Hash the conversation ID to get a consistent value
    const hash = sha256(conversationId);

    // Extract a 32-bit big-endian value from the first 4 bytes of the hash
    const dataView = new DataView(hash.buffer, hash.byteOffset, 4);
    const hashValue = dataView.getUint32(0, false);

    // Map the hash to the available range around the app coordinate
    const rangeSize = 2 * distance;
    const offset = hashValue % rangeSize;
    const rangeStart = appCoordinate - distance;

    // Calculate the content topic, wrapping into [0, spaceSize)
    const contentTopic = rangeStart + offset;
    return ((contentTopic % spaceSize) + spaceSize) % spaceSize;
}

Example Usage

// Developer code - no need to handle content topics directly
const myAppCoordinate = 32768;
const distance = 4096;

// Subscribe using conversation IDs - SDK handles conversion internally
await waku.subscribe({
  conversationId: "user:alice:chat",
  appCoordinate: myAppCoordinate,
  distance: distance
});

await waku.subscribe({
  conversationId: "game:session:xyz123", 
  appCoordinate: myAppCoordinate,
  distance: distance
});

// Same conversation ID parameters always map to the same content topic internally

Benefits

Simplified Developer Experience

  • No need to store randomly generated content topics
  • Developers use identifiers meaningful to their application
  • Same conversation ID always maps to the same content topic

Maintained Privacy

  • Hash function ensures conversation IDs aren’t reversible from content topics
  • Multiple applications can still generate the same content topic (if their hash+coordinate+distance combinations align)
  • No clear text application or conversation identifiers in the network

Preserved Distribution Properties

  • SHA-256 provides good distribution across the coordinate range
  • Distance parameter still controls shard spreading
  • Applications maintain predictable maximum shard usage

Better SDK Integration

  • SDK can handle the conversion transparently
  • Developers work with familiar string identifiers
  • Backwards compatible with existing applications

Migration Path

This approach could be implemented as an additional API method:

// Current approach (still supported)
const encoder = createEncoder({ contentTopic: "/1/33024" });
await waku.subscribe({ contentTopic: "/1/33024" });

// New approach - SDK handles conversion internally
await waku.subscribe({ 
  conversationId: "user:alice:chat",
  appCoordinate: 32768,
  distance: 4096 
});

This maintains the benefits of the coordinate-based system whilst significantly improving the developer experience by eliminating the need to manage randomly generated content topics.


Lol. Yes, I thought the same, see my reply above I was preparing while you wrote yours :slight_smile:

Yeah, that sounds better:)

I guess this would also mean we (and the devs) would not be able to really track bandwidth consumed by an app in the network (which also makes sense from the privacy perspective, just pointing it out).

IIUC this could still result in a single user of a single app needing to subscribe to multiple shards, as given the distance, appCoordinate and a bunch of conversationIds, the topics could land in various shards. I guess that is not a problem per se (quite the contrary from the privacy perspective), just a consideration regarding bandwidth and the need to maintain connections to various service nodes based on shards - as always, it adds complexity while improving privacy and bandwidth distribution in the network.

But in general yes, I like this, I think it could work

In general the design is actually much simpler than before.

We can’t stop an app straddling two shards (or at least, I can’t find a way, but maybe there is some algorithm that could achieve it).

But that’s just 2 shards. So we would just need to make the assumption that one relay instance would be on 2 shards.

One more potential design is decoupling content topic and auto sharding.

If we propose a model where we do not expose the content topic to the developer, then it may allow for flexibility and control over sharding, where the developer may opt for a number of target shards for their users - for example, specifying the usage of 3 shards on a network of 32 shards.

Instead of using the distance to statically spread over 3 shards, the app coordinate and conversation id can be combined to deterministically select a shard and to generate a content topic, in two separate operations.
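
A rough sketch of what those two operations could look like (function and parameter names here are purely illustrative, using Node.js crypto for hashing):

const { createHash } = require("node:crypto");

function digestOf(label, appCoordinate, conversationId) {
    return createHash("sha256").update(`${label}:${appCoordinate}:${conversationId}`).digest();
}

// Operation 1: deterministically pick one of the app's target shards (e.g. 3 of 32)
function selectShard(appCoordinate, conversationId, targetShards) {
    return digestOf("shard", appCoordinate, conversationId)[0] % targetShards;
}

// Operation 2: independently derive a content topic, used only for filtering
function deriveTopic(appCoordinate, conversationId, spaceSize = 65536) {
    return digestOf("topic", appCoordinate, conversationId).readUInt16BE(0) % spaceSize;
}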

Right, interesting idea, but I’m wondering if we’re not mixing the content topic (designed for filter use cases, not routing) into the routing layer.

I kind of regret how we approached autosharding. It was never meant to force us to start thinking of content topics as an extension of routing. It was simply a convenience function where we told applications, “hey, if you cannot or do not want to make decisions about routing, we can use your filter design to come up with a relatively sensible (if naive) default routing strategy”. This is mostly useful for smaller apps on Waku. Larger, longer-lived apps should probably come up with their own, considered sharding strategy.

Thoughts on the problem statement:

Autosharding exists mostly for those who do not want to make routing decisions. Usually the default should be for a simple app to use a single shard. Autosharding provides an easy way to do this, or to explicitly design namespaces that can afford to be in a separate shard. Apps that can afford to use more than a single shard should be able to be explicit about this - I either want to use 1 shard or 2 shards. Apps shouldn’t “probabilistically” be assigned a number of shards for their app. The difference between 1 and 2 shards for routing is massive.

This is a known issue with content topics, but is generally considered an acceptable tradeoff for apps to improve the filtering for their clients.
Content topics right now can be K-anonymised by just suggesting that apps use:

/0/<random number from 0-7>/1/<random number from 0 to 1023>

We can even provide the utilities for apps to have these random numbers autogenerated from app-legible output (like your conversation ID suggestion). But content topics are already useable in this way. We should indeed improve our documentation and suggested defaults here, clearly explaining the tradeoffs.
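
For example, such a utility could look something like this (a sketch in Node.js; the bucket sizes simply mirror the format above and nothing here is a committed API):

const { createHash } = require("node:crypto");

// Map an app-level conversation ID into the k-anonymised /0/<0-7>/1/<0-1023> namespace
function kAnonContentTopic(conversationId) {
    const digest = createHash("sha256").update(conversationId).digest();
    const first = digest[0] % 8;                  // 0-7
    const second = digest.readUInt16BE(1) % 1024; // 0-1023
    return `/0/${first}/1/${second}`;
}

// Same conversation ID always maps to the same content topic
kAnonContentTopic("user:alice:chat");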

I don’t really see how an app coordinate will improve this. We divide apps more or less evenly between shards, the same way we would with an app coordinate and content topics close in distance. The reason for uneven traffic would be because a specific app on a shard has unusually more traffic than those on other shards. This would still be the case if this app used a coordinate system.


And on the content:

  1. Coordinate-Based Sharding: Sharding is based on the content topic coordinate, meaning that topics close together are more likely to be in the same shard, simplifying the autosharding approach

Again, this uses content topics as routing extension, not tags for filtering. An app would generally not want its content topics to “probabilistically” be routed in 1 or 2 shards - the difference between these two scenarios is huge. An app with very little traffic but a large number of filter use cases (legitimately so) would have a much higher probability of being routed on several shards than a high-rate app with only one or two content topics. This implies again that filter design will now be coupled with routing, which we don’t want.

  1. Shared Topic Space: Multiple applications may statistically generate identical content topics, enabling plausible deniability

This is bad if unintentional and already possible if intentional. The whole point of content topics is to design exactly how much traffic your filter clients would need to subscribe to for a functioning app. We do not want clients to unintentionally now always have to query pages of a portion of all network traffic if they wanted to e.g. query for their past 5 messages, which could have been on a unique, short-lived content topic. If apps want k-anonymity (quite reasonably), we can do a simple modulo-hash of their topics (or conversation IDs as you suggest) into a k-anonymity set. Or they can simply reuse content topics for several conversations as Status already does. The design already allows this, but I agree that we can provide some tooling by default here.

To me this reads: apps can guarantee maximum routing shards by artificially limiting their filter use cases because of this routing constraint. :smiley:

Last thing I would mention is that dynamic content topics and shards (i.e. assigned content topics and shards changing over time due to e.g. network scaling) is something we’ve tried to avoid with autosharding design. Once an app has an assigned shard it should remain on that shard indefinitely (even if new network shards are released), until it explicitly makes the decision to migrate some or all of its traffic to the new shard(s) - which is why we include a version and implicit generation number in content topics. This is because of the huge cost involved with migrating an entire routing infrastructure (nodes, store history, etc.) to a new shard. We shouldn’t automatically break e.g. app-specific stores by just releasing new network shards. Similarly, if the release of new network shards or content topics for an app now also break the installed filters (because all content topics suddenly change), migration will come at a large cost, especially for historical data and existing clients.


I realised that my ramble above may need a TL;DR:

  1. We should remember that content topics are programmable anonymity trade-offs to allow for filtering and should not be designed for routing. They can already be made k-anonymous in the current format.
  2. I think we can certainly provide utilities to allow k-anonymous content topics by converting app-level “conversation IDs” to limited-namespace content topics, knowing that this is not appropriate for all filter use cases. We can indeed do better here and provide some privacy-by-default utilities.

I spent a good deal of time thinking about all these problems too.

AFAIK nothing can manage shard traffic; it's too dynamic.

Having a higher layer deal with shard changes is a clusterfuck in waiting.

Also, what Hanno said! :smiley:

P.S.
The only alternative is to not use a PubSub system…
cough cough waku sync cough cough :slight_smile:


No clear text application or conversation identifiers in the network

Love the focus on privacy here; however, I don’t think this moves the needle. Static hashing of an identifier is equivalent to masking. Any adversary could unmask the coordinate in the future. As the output value is stable, once it's unmasked it provides no privacy protections during operation until it is rotated. At best this is obscurity, helpful but doesn’t represent a meaningful change given the finite set of applications.


Mapping the internal identifiers evenly across an app's shards doesn't really allow for application-specific context around access patterns. Application developers know which data is likely to be accessed together, and previously could manually assign it to shards to optimize for this. While the vast majority of developers will not need this, as written the approach above removes the option for those who do.

Content_topic strategy is a major tool for app developers to manage breaking changes within their apps and protocols.

It’s only transparent assuming clients are all using the same version.

Changing any portion of the conversation generation code would require a major version bump to Waku, as clients operating with different generation code would not be able to interop. As @haelius mentioned, dynamic shard mapping creates expensive client-side upgrades for app developers.


Today I Learned, thanks team.


My latest work (Introduce routing info concept and fix nwaku master tests by fryorcraken · Pull Request #2471 · waku-org/js-waku) indeed shows that using the content topic for both filter and routing introduces a duality to this artefact, which is a smell, as it is best for a given concept to do one thing well, and one thing only.

I understand the need for the developer to have some control, but I am wary of it. It is always best to assume the developer wants to acquire as little knowledge as possible about the inner workings of Waku.

Another good indicator is whether we are able to provide good defaults to the developer, so that they can code quickly.
Followed by good APIs, so they can scale easily.

In the debate above, it seems that we have two types of developers:

  1. scale later
  2. scale now

For a “scale later” (1) developer, having some tooling where they can randomly select and hardcode a shard is enough.
In this instance, autosharding is overkill.

For a “scale now” (2) developer, which I assume the app/chat team is, I do not believe it is clear enough how they should proceed, as they have the choice to manipulate either the app-name or app-version fields to spread their users across several shards.
Which one should they change? How should they do it?

And if you are not ready to wager on this matter, then it seems that autosharding was premature? (which I guess it was, as our hand was forced there).

I am most curious about a path forward where we can either simplify or improve the dev ex around sharding.