Improving Content Topics: A Coordinate-Based Approach for Better Privacy, Distribution and Dev Ex

Current State and Problems

The current content topic format in Waku follows this structure as defined in 23/WAKU2-TOPICS and RELAY-SHARDING:

/{generation}/{application-name}/{version-of-the-application}/{content-topic-name}/{encoding}

For example: /0/myapp/1/mytopic/cbor

Whilst this format serves basic functionality, it introduces several significant challenges:

1. Complexity for Developers
Developers must navigate autosharding decisions early in development:

  • Should they use a fixed app name (concentrating all users in one shard)?
  • Should they use dynamic app names for multi-shard scaling?

This creates an unnecessary barrier to entry and potential lock-in scenarios with difficult upgrade paths.

2. Privacy Concerns
The application name appears in clear text within content topics, completely removing plausible deniability. Any observer can definitively identify which application a given IP address is using based on filter subscriptions, store queries, or light push messages.

3. Uneven Traffic Distribution
When applications use static app-name values, they may create uneven traffic distribution in a shared network, with some shards becoming heavily loaded whilst others remain underutilised.

Proposed Solution: Coordinate-Based Content Topics

I propose a new approach based on application “coordinates” that addresses all these challenges simultaneously, with a simplified content topic format.

New Content Topic Format

/1/<topic>

Where:

  • 1 is the autosharding version (generation)
  • <topic> is the numeric coordinate generated for a user

For example: /1/33024

Core Concept

  1. Application Coordinate: Each application generates a unique coordinate within a defined space (e.g., 0-65535)
  2. Distance Parameter: Applications define a statistical distance within which their content topics are generated
  3. Content Topic Generation: Individual content topics are generated randomly within the distance range of the application coordinate
  4. Coordinate-Based Sharding: Sharding is based on the content topic coordinate, meaning that topics close together are more likely to be in the same shard, simplifying the autosharding approach
  5. Shared Topic Space: Multiple applications may statistically generate identical content topics, enabling plausible deniability

Example Scenario

Consider a network with 8 shards (content topic space 0-65535, each shard covering ~8192 topics):

  • “CoolGame” has coordinate 32768 with distance 4096
  • “SecureChat” has coordinate 40960 with distance 4096

Alice (CoolGame user): Gets content topic 36864 → /1/36864 (distance 4096 from coordinate)
Bob (SecureChat user): Gets content topic 36864 → /1/36864 (distance 4096 from coordinate)

Despite using different applications, Alice and Bob share the same content topic, providing perfect plausible deniability.
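
A quick way to sanity-check this overlap in code (a sketch using the numbers above; the range check treats coordinate ± distance as inclusive):

// Ranges implied by coordinate ± distance
const coolGame   = { min: 32768 - 4096, max: 32768 + 4096 }; // 28672..36864
const secureChat = { min: 40960 - 4096, max: 40960 + 4096 }; // 36864..45056

const sharedTopic = 36864;
console.log(sharedTopic >= coolGame.min && sharedTopic <= coolGame.max);     // true
console.log(sharedTopic >= secureChat.min && sharedTopic <= secureChat.max); // true

// With 8 shards over a 0-65535 space, each shard covers 8192 topics
console.log(Math.floor(sharedTopic / 8192)); // 4 -> Alice and Bob also share shard 4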

Benefits

Simplified Developer Experience

  • Generate a random coordinate once during project setup (tooling provided)
  • Choose a distance parameter (sensible defaults available)
  • Generate user content topics automatically (utilities provided)

Enhanced Privacy

  • Multiple applications can share content topics
  • No clear text application identifiers
  • True plausible deniability for users

Improved Traffic Distribution

  • Statistical distribution prevents hot spots
  • Configurable distance allows scaling control
  • Better load balancing across shards

Optimised Shard Usage

  • Applications can guarantee a maximum number of shards used (based on distance vs shard size)
  • Predictable scaling characteristics

Proposed Algorithms

1. Coordinate Generation

function generateAppCoordinate(spaceSize = 65536) {
    return Math.floor(Math.random() * spaceSize);
}

2. Content Topic Generation

function generateContentTopic(appCoordinate, distance, spaceSize = 65536) {
    // Calculate start of available range (may be negative before wrapping)
    const rangeStart = appCoordinate - distance;

    // Generate a coordinate within the range, wrapped into [0, spaceSize)
    const topic = rangeStart + Math.floor(Math.random() * (2 * distance));
    return ((topic % spaceSize) + spaceSize) % spaceSize;
}

3. Shard Mapping

function contentTopicToShard(contentTopic, totalShards, spaceSize = 65536) {
    const shardSize = Math.floor(spaceSize / totalShards);
    return Math.floor(contentTopic / shardSize);
}

// Example: content topic 36864 with 8 shards
// shardSize = 65536 / 8 = 8192
// shard = Math.floor(36864 / 8192) = 4
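
Putting the three functions together, a developer's flow could look roughly like this (a sketch only; names and defaults follow the functions above, not a committed API):

// One-time application setup (stored in app config)
const appCoordinate = generateAppCoordinate(); // e.g. 32768
const distance = 4096;

// Per-conversation / per-context content topic
const topic = generateContentTopic(appCoordinate, distance); // e.g. 36864
const contentTopic = `/1/${topic}`;

// Routing decision on an 8-shard network
const shard = contentTopicToShard(topic, 8); // e.g. 4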

Scaling Scenarios

This coordinate-based approach handles network scaling elegantly:

Scenario 1: Network-Wide Shard Increase (8 → 16 shards)

When the network decides to increase shards from 8 to 16 due to overall traffic growth:

  • Previous: Each shard covered ~8192 topics (65536/8)
  • New: Each shard covers ~4096 topics (65536/16)
  • Impact: Content topic /1/36864 moves from shard 4 to shard 9
  • Migration: All applications automatically benefit from reduced per-shard traffic
  • Developer Action: If the number of shards is defined in a unique source of truth, such as a smart contract, with time-based information (e.g., from block N of the chain), then applications can automatically upgrade with minimum disruption
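
For illustration, the shard move above can be verified with the contentTopicToShard function defined earlier:

contentTopicToShard(36864, 8);  // shardSize 8192 -> shard 4
contentTopicToShard(36864, 16); // shardSize 4096 -> shard 9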

Applications with a large distance parameter (see Scenario 2) could have their shard range span reduced (flooring at 2).

Scenario 2: Application-Specific Scaling

When an application experiences high user traffic but network shards remain constant (8 shards):

Option A: Increase Distance Parameter

  • Current: distance 4096 (max 2 shards)
  • New: distance 8192 (max 3 shards)
  • Effect: Spreads users across more shards, reducing per-shard load
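
One way to make the distance-to-shard relationship explicit is a small helper along these lines (a sketch; the +1 accounts for a range that straddles a shard boundary):

function maxShardSpan(distance, totalShards, spaceSize = 65536) {
    const shardSize = Math.floor(spaceSize / totalShards);
    // A range of 2 * distance topics can straddle at most this many shards
    return Math.floor((2 * distance) / shardSize) + 1;
}

maxShardSpan(4096, 8); // 2
maxShardSpan(8192, 8); // 3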

Option B: Multiple Coordinate Ranges

  • Deploy additional coordinate ranges for the same application
  • Each range operates independently with its own distance parameter
  • Users are assigned to different ranges during onboarding

Developer Considerations:

  • Monitor per-shard traffic for your application’s coordinate range
  • Adjust distance parameters based on actual usage patterns
  • Consider user experience impact of spreading across more shards

Implementation Considerations

Backward Compatibility: This could be introduced as generation 1 (/1/<topic>) whilst maintaining support for the current generation 0 format.

Tooling Requirements:

  • Coordinate generation utilities
  • Content topic generation libraries
  • Shard calculation helpers
  • Migration tools for existing applications

Configuration Options:

  • Default distance parameters for different application types
  • Space size configuration for different network scales
  • Shard count adaptation algorithms

Request for Feedback

This proposal represents a significant improvement to Waku’s content topic system, addressing developer experience, privacy, and network performance simultaneously.

Key questions for discussion:

  1. What are your thoughts on this proposal and the simplification of the content topic?
  2. Do you see potential issues with this model in terms of new app scalability, especially regarding Status/Chat SDK, compared to the old model?
  3. Do you agree with the overall simplification, reducing code and dev-ex complexity?
  4. The proposed algorithms are extremely simple; are there different or specific algorithms you would like to see employed instead?
  5. Do you see any reason why we should not proceed with this change?

I do like the simplification and added privacy of this!

On the other hand I am probably missing something - for the simplest case where 2 people want to communicate - and they already are “connected” - i.e. know each other’s identifiers, how would they deterministically derive a content topic to publish and subscribe to?

Or for another case - e.g. Qaku - right now I set the content topic based on generated QA ID, so this would flip it - I would set the QA ID based on generated content topic?


Response: Improving Developer Experience with Conversation IDs

After reviewing the coordinate-based proposal, I’ve identified a developer experience issue that needs addressing.

The Problem with Random Generation

The current proposal generates random content topics within an application’s coordinate range. However, this creates a burden for developers:

  1. Storage Requirement: Developers must generate the content topic first, then save it for future reference
  2. State Management: Applications need to track which content topic corresponds to which user session, conversation, or context
  3. SDK Limitations: The Waku SDK cannot handle this storage automatically, since developers need content topics for future reference to specific application events

Proposed Solution: Conversation ID Abstraction

A better developer experience would be to entirely abstract the content topic and let developers generate and handle their own identifiers. Let’s call this a conversation ID.

Developer Workflow

  1. Developer generates or chooses a conversation ID (string) - could be a username, chat room name, session ID, etc.
  2. Developer passes this conversation ID to the Waku API
  3. A content topic is deterministically generated from the conversation ID for this given context (coordinate, distance)

Deterministic Content Topic Generation Algorithm

The conversationIdToContentTopic function would be internal/private to the Waku SDK. When developers use waku.subscribe, they can pass their conversation ID, app coordinate and distance, and internally, the SDK will convert those to a content topic.

// Internal SDK function (not exposed to developers)
// Assumes a sha256 helper returning raw bytes, e.g. in Node.js:
// const sha256 = (s) => require("node:crypto").createHash("sha256").update(s).digest();
function conversationIdToContentTopic(conversationId, appCoordinate, distance, spaceSize = 65536) {
    // Hash the conversation ID to get a consistent value
    const hash = sha256(conversationId);

    // Extract a 32-bit big-endian value from the first 4 bytes of the hash
    const dataView = new DataView(hash.buffer, hash.byteOffset, 4);
    const hashValue = dataView.getUint32(0, false);

    // Map the hash to the available range around the app coordinate
    const rangeSize = 2 * distance;
    const offset = hashValue % rangeSize;
    const rangeStart = appCoordinate - distance;

    // Calculate the content topic, wrapping into [0, spaceSize)
    const contentTopic = rangeStart + offset;
    return ((contentTopic % spaceSize) + spaceSize) % spaceSize;
}

Example Usage

// Developer code - no need to handle content topics directly
const myAppCoordinate = 32768;
const distance = 4096;

// Subscribe using conversation IDs - SDK handles conversion internally
await waku.subscribe({
  conversationId: "user:alice:chat",
  appCoordinate: myAppCoordinate,
  distance: distance
});

await waku.subscribe({
  conversationId: "game:session:xyz123", 
  appCoordinate: myAppCoordinate,
  distance: distance
});

// Same conversation ID parameters always map to the same content topic internally

Benefits

Simplified Developer Experience

  • No need to store randomly generated content topics
  • Developers use identifiers meaningful to their application
  • Same conversation ID always maps to the same content topic

Maintained Privacy

  • Hash function ensures conversation IDs aren’t reversible from content topics
  • Multiple applications can still generate the same content topic (if their hash+coordinate+distance combinations align)
  • No clear text application or conversation identifiers in the network

Preserved Distribution Properties

  • SHA-256 provides good distribution across the coordinate range
  • Distance parameter still controls shard spreading
  • Applications maintain predictable maximum shard usage

Better SDK Integration

  • SDK can handle the conversion transparently
  • Developers work with familiar string identifiers
  • Backwards compatible with existing applications

Migration Path

This approach could be implemented as an additional API method:

// Current approach (still supported)
const encoder = createEncoder({ contentTopic: "/1/33024" });
await waku.subscribe({ contentTopic: "/1/33024" });

// New approach - SDK handles conversion internally
await waku.subscribe({ 
  conversationId: "user:alice:chat",
  appCoordinate: 32768,
  distance: 4096 
});

This maintains the benefits of the coordinate-based system whilst significantly improving the developer experience by eliminating the need to manage randomly generated content topics.


Lol. Yes, I thought the same, see my reply above I was preparing while you wrote yours :slight_smile:

Yeah, that sounds better:)

I guess this would also mean we (and the devs) would not be able to really track bandwidth consumed by an app in the network (which also makes sense from the privacy perspective, just pointing it out).

IIUC this could still result in a single user of a single app needing to subscribe to multiple shards, as given the distance, appCoordinate and a bunch of conversationIds, the topics could land in various shards. I guess that is not a problem per se (quite the contrary from the privacy perspective), just a consideration regarding bandwidth and the need to maintain connections to various service nodes based on shards - as always, it adds complexity while improving privacy and bandwidth distribution in the network.

But in general yes, I like this, I think it could work

In general the design is actually much simpler than before.

We can’t stop an app straddling two shards (or at least, I can’t find a way, but maybe there is some algorithm that could achieve it).

But that’s just 2 shards. So we would just need to make the assumption that one relay instance would be on 2 shards.

One more potential design is decoupling content topic and auto sharding.

If we propose a model where we do not expose the content topic to the developer, then it may allow for flexibility and control over sharding, where the developer may opt for a number of target shards for their users - for example, specifying the usage of 3 shards on a network of 32 shards.

Instead of using the distance to statically spread over 3 shards, the app coordinate and conversation id can be combined to deterministically select a shard and to generate a content topic, in two separate operations.
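
A rough sketch of what those two operations could look like (function and parameter names here are purely illustrative, using Node.js crypto for hashing):

const { createHash } = require("node:crypto");

function digestOf(label, appCoordinate, conversationId) {
    return createHash("sha256").update(`${label}:${appCoordinate}:${conversationId}`).digest();
}

// Operation 1: deterministically pick one of the app's target shards (e.g. 3 of 32)
function selectShard(appCoordinate, conversationId, targetShards) {
    return digestOf("shard", appCoordinate, conversationId)[0] % targetShards;
}

// Operation 2: independently derive a content topic, used only for filtering
function deriveTopic(appCoordinate, conversationId, spaceSize = 65536) {
    return digestOf("topic", appCoordinate, conversationId).readUInt16BE(0) % spaceSize;
}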

Right, interesting idea, but I’m wondering if we’re not mixing the content topic (designed for filter use cases, not routing) into the routing layer.

I kind of regret how we approached autosharding. It was never meant to force us to start thinking of content topics as an extension of routing. It was simply a convenience function where we told applications, “hey, if you cannot or do not want to make decisions about routing, we can use your filter design to come up with a relatively sensible (if naive) default routing strategy”. This is mostly useful for smaller apps on Waku. Larger, longer-lived apps should probably come up with their own, considered sharding strategy.

Thoughts on the problem statement:

Autosharding exists mostly for those who do not want to make routing decisions. Usually the default should be for a simple app to use a single shard. Autosharding provides an easy way to do this, or to explicitly design namespaces that can afford to be in a separate shard. Apps that can afford to use more than a single shard should be able to be explicit about this - I either want to use 1 shard or 2 shards. Apps shouldn’t “probabilistically” be assigned a number of shards for their app. The difference between 1 and 2 shards for routing is massive.

This is a known issue with content topics, but is generally considered an acceptable tradeoff for apps to improve the filtering for their clients.
Content topics right now can be K-anonymised by just suggesting that apps use:

/0/<random number from 0-7>/1/<random number from 0 to 1023>

We can even provide the utilities for apps to have these random numbers autogenerated from app-legible output (like your conversation ID suggestion). But content topics are already useable in this way. We should indeed improve our documentation and suggested defaults here, clearly explaining the tradeoffs.
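
For example, such a utility could look something like this (a sketch in Node.js; the bucket sizes simply mirror the format above and nothing here is a committed API):

const { createHash } = require("node:crypto");

// Map an app-level conversation ID into the k-anonymised /0/<0-7>/1/<0-1023> namespace
function kAnonContentTopic(conversationId) {
    const digest = createHash("sha256").update(conversationId).digest();
    const first = digest[0] % 8;                  // 0-7
    const second = digest.readUInt16BE(1) % 1024; // 0-1023
    return `/0/${first}/1/${second}`;
}

// Same conversation ID always maps to the same content topic
kAnonContentTopic("user:alice:chat");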

I don’t really see how an app coordinate will improve this. We divide apps more or less evenly between shards, the same way we would with an app coordinate and content topics close in distance. The reason for uneven traffic would be because a specific app on a shard has unusually more traffic than those on other shards. This would still be the case if this app used a coordinate system.


And on the content:

  1. Coordinate-Based Sharding: Sharding is based on the content topic coordinate, meaning that topics close together are more likely to be in the same shard, simplifying the autosharding approach

Again, this uses content topics as routing extension, not tags for filtering. An app would generally not want its content topics to “probabilistically” be routed in 1 or 2 shards - the difference between these two scenarios is huge. An app with very little traffic but a large number of filter use cases (legitimately so) would have a much higher probability of being routed on several shards than a high-rate app with only one or two content topics. This implies again that filter design will now be coupled with routing, which we don’t want.

  1. Shared Topic Space: Multiple applications may statistically generate identical content topics, enabling plausible deniability

This is bad if unintentional and already possible if intentional. The whole point of content topics is to design exactly how much traffic your filter clients would need to subscribe to for a functioning app. We do not want clients to unintentionally now always have to query pages of a portion of all network traffic if they wanted to e.g. query for their past 5 messages, which could have been on a unique, short-lived content topic. If apps want k-anonymity (quite reasonably), we can do a simple modulo-hash of their topics (or conversation IDs as you suggest) into a k-anonymity set. Or they can simply reuse content topics for several conversations as Status already does. The design already allows this, but I agree that we can provide some tooling by default here.

To me this reads: apps can guarantee maximum routing shards by artificially limiting their filter use cases because of this routing constraint. :smiley:

Last thing I would mention is that dynamic content topics and shards (i.e. assigned content topics and shards changing over time due to e.g. network scaling) is something we’ve tried to avoid with autosharding design. Once an app has an assigned shard it should remain on that shard indefinitely (even if new network shards are released), until it explicitly makes the decision to migrate some or all of its traffic to the new shard(s) - which is why we include a version and implicit generation number in content topics. This is because of the huge cost involved with migrating an entire routing infrastructure (nodes, store history, etc.) to a new shard. We shouldn’t automatically break e.g. app-specific stores by just releasing new network shards. Similarly, if the release of new network shards or content topics for an app now also break the installed filters (because all content topics suddenly change), migration will come at a large cost, especially for historical data and existing clients.


I realised that my ramble above may need a TL;DR:

  1. We should remember that content topics are programmable anonymity trade-offs to allow for filtering and should not be designed for routing. They can already be made k-anonymous in the current format.
  2. I think we can certainly provide utilities to allow k-anonymous content topics by converting app-level “conversation IDs” to limited-namespace content topics, knowing that this is not appropriate for all filter use cases. We can indeed do better here and provide some privacy-by-default utilities.

I spent a good deal of time thinking about all these problems too.

AFAIK nothing can manage shard traffic; it's too dynamic.

Having a higher layer deal with shard changes is a clusterfuck in waiting.

Also, what Hanno said! :smiley:

P.S.
The only alternative is to not use a PubSub system…
cough cough waku sync cough cough :slight_smile:


No clear text application or conversation identifiers in the network

Love the focus on privacy here; however, I don’t think this moves the needle. Static hashing of an identifier is equivalent to masking. Any adversary could unmask the coordinate in the future. As the output value is stable, once it's unmasked it provides no privacy protections during operation until it is rotated. At best this is obscurity, helpful but doesn’t represent a meaningful change given the finite set of applications.


Mapping the internal identifiers evenly across an app's shards doesn't really allow for application-specific context around access patterns. Application developers know which data is likely to be accessed together, and previously could manually assign it to shards to optimize for this. While the vast majority of developers will not need this, as written the approach above removes the option for those who do.

Content_topic strategy is a major tool for app developers to manage breaking changes within their apps and protocols.

It’s only transparent assuming clients are all using the same version.

Changing any portion of the conversation generation code would require a major version bump to Waku, as clients operating with different generation code would not be able to interop. As @haelius mentioned, dynamic shard mapping creates expensive client-side upgrades for app developers.


Today I Learned, thanks team.


My latest work (Introduce routing info concept and fix nwaku master tests by fryorcraken · Pull Request #2471 · waku-org/js-waku) indeed shows that using the content topic for both filter and routing introduces a duality to this artefact, which is a smell, as it is best for a given concept to do one thing well, and one thing only.

I understand the need for the developer to have some control, but I am wary of it. It is always best to assume the developer wants to acquire as little knowledge as possible about the inner workings of Waku.

Another good indicator is whether we are able to provide good defaults to the developer, so that they can code quickly.
Followed by good APIs, so they can scale easily.

In the debate above, it seems that we have two types of developers:

  1. scale later
  2. scale now

For a “scale later” (1) developer, having some tooling where they can randomly select and hardcode a shard is enough.
In this instance, autosharding is overkill.

For a “scale now” (2) developer, which I assume the app/chat team is, I do not believe it is clear enough how they should proceed, as they have the choice to manipulate either the app-name or app-version fields to spread their users across several shards.
Which one should they change? How should they do it?

And if you are not ready to wager on this matter, then it seems that autosharding was premature? (which I guess it was, as our hand was forced there).

I am most curious about a path forward where we can either simplify or improve the dev ex around sharding.