Breaking changes and roll out strategies

fryorcraken · September 24, 2024, 6:19am

A series of breaking changes for chat protocols is expected to come from current studies.

I suggest a template strategy to adopt, to ensure that expectations are set and problems avoided.

While the focus of this post is the handling of breaking changes for the Status app, this can also be used as a guideline for any application using Waku.

Generic roll-out strategy

I usually use these steps when applying breaking changes in any type of application where software is rolled out to the user (i.e., not a website).
The breaking change is a move from an old protocol to a new protocol:

Read and write are generic terms, and could be replaced with subscribe and send in the context of Waku and Chat.

Step	App Version	Read	Write
Initial	`N`	Old protocol	Old protocol
Preparation	`N+1`	Old and new protocol	Old protocol
Switch	`N+2`	Old and new protocol	New protocol
Clean-up	`N+3`	New protocol	New protocol

The steps above are related to the code. In terms of data, a migration step may be needed before clean-up to convert data in the system to the new protocol format.

+1, +2, etc in the table may not refer to the exact app version bump, see next section.
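
To make the table concrete, here is a minimal Python sketch of the four steps and of the rule that a receiver can only understand a sender whose write protocol it reads. The names `Step` and `can_receive` are made up for illustration and are not part of any Waku or Status code base.

```python
from enum import Enum

OLD, NEW = "old", "new"


class Step(Enum):
    # Each value is (protocols read, protocol written), mirroring the table.
    INITIAL = (frozenset({OLD}), OLD)
    PREPARATION = (frozenset({OLD, NEW}), OLD)
    SWITCH = (frozenset({OLD, NEW}), NEW)
    CLEAN_UP = (frozenset({NEW}), NEW)

    @property
    def reads(self):
        return self.value[0]

    @property
    def writes(self):
        return self.value[1]


def can_receive(receiver: Step, sender: Step) -> bool:
    """True if the receiver reads the protocol the sender writes."""
    return sender.writes in receiver.reads


# A Preparation (N+1) app understands a Switch (N+2) app,
# but an Initial (N) app does not.
print(can_receive(Step.PREPARATION, Step.SWITCH))
print(can_receive(Step.INITIAL, Step.SWITCH))
```

This is why the preparation step has to reach users before the switch step does: only versions that already read the new protocol keep working once others start writing it.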

Prerequisite

First we need to define the support period for an application. This is usually defined as a duration (e.g. 1 year) or a number of releases (e.g. 3 major releases).

Timeline

N is the last version to only read from the old protocol. Only when N’s support is ended can we proceed with the switch step.

N+1 is the last version to write in the old protocol. Only when N+1’s support is ended can we proceed with the clean-up step.

For example, if support is set to 1 year from release date, and releases happen every month:

Time from `N`	Step	App Version	Read	Write
0	Initial	`N`	Old protocol	Old protocol
1m	Preparation	`N+1`	Old and new protocol	Old protocol
1y	Switch	`N+2`	Old and new protocol	New protocol
1y1m	Clean-up	`N+3`	New protocol	New protocol
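
The arithmetic behind this timeline can be sketched as follows. This is a minimal illustration, assuming time is counted in whole months since the release of `N`; the function name and the dictionary keys are hypothetical.

```python
def earliest_step_months(release_month: dict, support_months: int) -> dict:
    """Earliest month (counted from N's release) at which each step may ship.

    release_month maps "N" and "N+1" to the month each version was released.
    """
    return {
        # Switch must wait until support for N has ended.
        "switch": release_month["N"] + support_months,
        # Clean-up must wait until support for N+1 has ended.
        "clean-up": release_month["N+1"] + support_months,
    }


# Monthly releases and 1 year of support, as in the example above.
schedule = earliest_step_months({"N": 0, "N+1": 1}, support_months=12)
print(schedule)  # {'switch': 12, 'clean-up': 13}
```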

Status Communities

Status Communities are mostly discrete (apart from the request-to-join flow), which enables a faster roll-out strategy:

Switch can be enabled for newly created communities. The gap between preparation and switch can be reduced or eliminated. However, only users with the latest app version can join the community.
Enabling the new protocol at community creation can be an opt-in switch at first, and enabled by default later (the default setting is a product decision).
Migration and clean-up steps can be scheduled based on product needs by answering the question: what is the value of migrating pre-existing communities to the new protocol?
A property on the community information can be used to signal the usage of either protocol and enable correct handling.

Step	App Version	Community creation
Initial	`N`	Old protocol by default
Preparation + Switch	`N+1`	New protocol opt-in switch
Crystallize (optional)	`N+2`	New protocol by default
Migrate (optional)	`N+2`	Existing communities migrate from old to new
Clean-up	`N+3`	Remove old protocol support from code

The restrictions are:

N+2 should not be done until N is EOL (end-of-life)
N+3 should not be done until N+1 is EOL
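
The community property mentioned above could be handled along these lines. This is only a sketch: the `protocol` field, the `CommunityDescription` class and the returned strings are hypothetical and do not reflect the actual Status community description format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommunityDescription:
    name: str
    protocol: str  # hypothetical flag: "old" or "new"


def join_decision(community: CommunityDescription, supported: set) -> str:
    """Decide how an app version should handle a community."""
    if community.protocol in supported:
        return "join"
    # The app does not implement the protocol this community uses.
    return "update-required"


legacy_app = {"old"}          # version N
prepared_app = {"old", "new"}  # version N+1 and later

new_community = CommunityDescription("waku-devs", protocol="new")
print(join_decision(new_community, legacy_app))    # update-required
print(join_decision(new_community, prepared_app))  # join
```

An explicit flag lets an older app fail with a clear "please update" message instead of silently missing messages.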

Examples

Some examples of expected (not committed) breaking changes are:

Communities created on common shard (app key DoS protection) to communities created on dedicated shard (pre-shared key DoS protection)
New content topic strategy
Using e2e reliability protocol
All messages in one shard to messages segregated based on retention needs

One-to-one and private group chats

_For the sake of brevity, we will use “one-to-one chats” to refer to both one-to-one direct messages and private groups._

One-to-one chats do not offer the same flexibility as communities in terms of breaking changes. Yet, there are some options to dogfood and rollout breaking changes.

Experimentation and dogfooding

When rolling out a new breaking change, a feature switch can enable the change in the app, with the caveat of no backward compatibility. This can help with dogfooding and experimenting with the change, without proceeding with the migration yet.

In this scenario, users that enable the change are aware of the lack of backward compatibility and may switch back and forth.

This should be treated in a similar manner to changing the target fleet, where the app must be restarted.

Once the feature is stabilized then the generic roll-out strategy can be applied:

read: old and new protocol vs bridging

Depending on the change, this may mean subscribing to both old and new content topics, or to different shards.

For example, assume the change is migrating one-to-one chats from a single shard to a range of shards. In this case, the application would subscribe to both the single shard and the new shard range for a period of time.

Another strategy previously implemented is bridging: Status runs software to route/convert messages from the old protocol to the new protocol.

Bridging comes with a number of caveats that are often directly related to the breaking change itself:

scalability: if the new protocol is brought in to increase scalability, bridging messages from new to old is likely to aggravate the scalability issues of the old protocol.
security: if the new protocol brings new security features (e.g. RLN), then it may not be possible to bridge. Indeed, messages going from a non-RLN relay network to an RLN relay network would need RLN proofs. If a bridge attaches proofs for free, then abuse may come from the non-RLN network, negating the intended effect of the breaking change.

This is why I would recommend against a bridging strategy.

Data Migration

Data migration usually consists of running a database script to update the data format.
In the case of Chat, it could mean copying messages stored with a specific shard or content topic value to another shard or content topic.

The migration strategy is similar to bridging, in that converting/copying data from one network to another may negate the intended effects.

It also assumes that all store nodes can be migrated, which becomes less likely as the network becomes more decentralised.

Hence, I would also recommend against deploying migration strategies.

Proposed Next steps

Waku team to draft roll out proposal for non-backward compatible protocols. e2e reliability protocol for communities is likely the first candidate
Status team to define support term for Status apps in terms of versions or time scale
Waku chat and Status teams to review how a community could be flagged with a specific feature, to help manage incompatibility. E.g. attribute in community invitation link + description message
Waku chat and Status teams to consider UX around incompatible communities; a generic solution is likely to be enough regardless of the migration, e.g. “This community has been created with a new version of the app, please update to join”.

prem · September 24, 2024, 8:43am

As mentioned here we can also have a smaller migration approach depending on the type of change and how it is being planned e.g: content-topic change for communities.

samuel · October 16, 2024, 7:45pm

I can give details on the process we used in the past for managing breaking changes in the protocol.

Basing breaking changes on versions and timeframes gave undesirable results. A key point that needs to be considered is that we don’t have control over which version of the app our users run. For example, some users do not install the app via an app store, so they are responsible for managing their own version upgrades.

I really like the principle of your timeframe based proposal.

fryorcraken:

For example, if support is set to 1 year from release date, and releases happen every month:

Time from N	Step	App Version	Read	Write
0	Initial	N	Old protocol	Old protocol
1m	Preparation	N+1	Old and new protocol	Old protocol
1y	Switch	N+2	Old and new protocol	New protocol
1y1m	Clean-up	N+3	New protocol	New protocol

What we found to have the most success was to also monitor Waku Telemetry to see what percentage of users were using an incompatible deprecated version.

If we combine the proposed Time from N condition with an additional condition, namely that the percentage of users on App Version <= N+1 is less than X% (where X is the share of the user base we accept leaving on an unsupported version), we can ensure that only an acceptable share of users loses compatibility.

This requires us to define that percentage of the user base. In the past we’ve used 5% as an acceptable cut-off value.
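
As a rough illustration of the combined condition (time AND telemetry share), here is a minimal Python sketch. The function names, the shape of the telemetry data (a map from version number to user count) and the integer version numbers are assumptions for illustration, not an existing Waku Telemetry API. The 5% default is the cut-off mentioned above.

```python
def share_at_or_below(version_counts: dict, cutoff: int) -> float:
    """Fraction of users running a version <= cutoff."""
    total = sum(version_counts.values())
    if total == 0:
        return 0.0
    old = sum(count for version, count in version_counts.items()
              if version <= cutoff)
    return old / total


def may_drop_support(months_elapsed: float, support_months: float,
                     deprecated_share: float, max_share: float = 0.05) -> bool:
    """Both conditions must hold: enough time has passed AND few users remain."""
    return months_elapsed >= support_months and deprecated_share < max_share


# Hypothetical telemetry snapshot: version -> number of active users.
counts = {32: 3, 33: 1, 34: 96}
share = share_at_or_below(counts, cutoff=33)
print(share)
print(may_drop_support(12, 12, share))
```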

fryorcraken · October 17, 2024, 3:39am

Thanks for this insight.

We’ll have to find a balance between being able to upgrade the app and reducing impact on users. This is not purely a technical problem either, and it has to be coupled with communication to the users and community.

samuel · October 17, 2024, 9:05am

I agree. I see in your example that Time from N for N+2 is 1y. I presume you picked that length of time because it is sufficiently long for enough of the user base to upgrade to at least N+1, assuming no other insight into the user base version distribution.

By using the version metrics as an input, coupled with active community communications, we could speed up the turnaround of transitions between breaking changes, thus making Time_from_N variable. We could use the formula:

if X% < userbase_percent_N1 OR (Time_from_N >= 1y AND X% < userbase_percent_N1)

Patryk · December 20, 2024, 1:29pm

The discussion about breaking changes and rollout strategies has also taken place under status-im/status-go Pull Request #5993, “feat_: use content-topic override for all community filters” by chaitanyaprem.

TLDR:

Gradual rollouts with usage thresholds (e.g., 90% adoption) alone can lead to undefined timelines, complicating development and planning.
To address this, a time-based guideline is proposed: backward compatibility guarantees expire after 6 months.
A version gap threshold (e.g., two intermediary releases) is proposed as a fallback if usage thresholds cannot be reliably gathered, ensuring safe migration before enforcing breaking changes. However, this adds complexity and may not always be necessary.

A formal representation of the above is provided below:

Step	App Version	Read	Write
Initial	N	old protocol	old protocol
Preparation	N+1	old and new protocol	old protocol
Switch	N+x	old and new protocol	new protocol
Clean-up	N+y	new protocol	new protocol

where y>x.

Switch to N+x when:
time(N+x)-time(N) >= A months OR
usage of N+1 >= B% OR
(x-1) >= C (version gap threshold)

Switch to N+y when:
time(N+y)-time(N+1) >= A months OR
usage of N+x >= B% OR
(y-x) >= C (version gap threshold)

Proposed parameters: A=6 months, B=90%, C=2 versions gap threshold.
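
The OR-condition above can be sketched in Python as follows. This is an illustrative helper, not code from status-go; the function and parameter names are hypothetical, and usage is passed as `None` when it cannot be gathered.

```python
from typing import Optional


def may_advance(months_elapsed: float, version_gap: int,
                usage_share: Optional[float] = None,
                a_months: float = 6, b_share: float = 0.90,
                c_gap: int = 2) -> bool:
    """True if any one of the three conditions is met.

    For Switch:   months_elapsed = time(N+x) - time(N),
                  usage_share = usage of N+1, version_gap = x - 1.
    For Clean-up: months_elapsed = time(N+y) - time(N+1),
                  usage_share = usage of N+x, version_gap = y - x.
    """
    if months_elapsed >= a_months:
        return True
    if usage_share is not None and usage_share >= b_share:
        return True
    return version_gap >= c_gap


# Three months in, no usage data, two releases since Preparation:
print(may_advance(months_elapsed=3, version_gap=2))
# Two months in, one release gap, only half the users upgraded:
print(may_advance(months_elapsed=2, version_gap=1, usage_share=0.5))
```

Because the conditions are joined with OR, a missing usage metric never blocks a step; the time or version-gap condition can still be met on its own.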

fryorcraken · December 29, 2024, 10:34pm

Another issue with this parameter: who is going to track the data and feed it back to the Waku teams to confirm that we can proceed with the next step?

It adds another layer of coordination, while we are still learning to coordinate on the releases themselves.

I would suggest sticking to “6 months” for now and seeing how it goes.

Cc @pablo

Patryk · January 7, 2025, 8:35am

The condition is “OR”: if the data cannot be tracked or the cost is too high, we will default to “6 months” regardless.

volo · January 9, 2025, 3:39am

Hey @fryorcraken,

How often should we expect breaking changes in the Status app or chat protocols? If we stick to a monthly release cadence for the Status app, should we expect breaking changes in each release?

fryorcraken · January 9, 2025, 3:44am

No, I would not expect breaking changes every release.

I tried to list upcoming changes above. However, it needs to be updated:

SDS/e2e reliability for Status Communities is unlikely to be a breaking change afaik cc @shash256 @haelius
Next breaking change is ongoing content topic update for Status Communities
Then we want to change the shards being used by Status Communities

These are short term/2025 H1 changes.

Next would be for 1:1 chats:

potentially changing shards/applying auto-sharding
RLN
Changing encryption scheme

Another type of breaking change would apply to the installs of one user across their devices:

pair/sync protocol
user settings backup

volo · January 9, 2025, 4:38pm

got it! thanks for explaining this! it is clear now.

fryorcraken · February 5, 2025, 9:32am

@Patryk

With a release every month, a 2-version gap would be 3 months, right?

32 Jan
33 Feb
34 Mar

34 - 32 >= 2
deprecate Jan’s release in March.

fryorcraken · February 7, 2025, 1:01am

Forcing Users to Upgrade

There is one more topic we haven’t addressed in the domain of breaking changes, which is “forcing users to upgrade”.

We are building FOSS peer-to-peer protocols and apps.

In this context, a breaking change is not about ensuring users “run the latest version”. It is about stating that version A does not work with version B.

For Chat protocols, it most likely means that if Alice uses version A and Bob uses version B, then Alice and Bob may not be able to chat.

Which is why we are having this whole thread to agree on managing breaking changes.

However, at the Waku layer, it should be easier to manage because we rely on gossipsub, which is self-healing, and we should design and build our light protocols in a way that is also self-healing.

So while Alice’s node and Bob’s node may not be able to connect directly to each other, the heterogeneity of the network should be such that they can still send messages to and receive messages from each other.

The obstacles to this statement are any centralized infrastructure for Waku. Right now, we have two of those:

Store
Bootstrap

We are planning work to experiment with decentralised store, but we will likely need to upgrade/replace our peer discovery protocol.

We aren’t looking at decentralizing bootstrap just yet.

So in summary, if Alice runs v1, Carol v2 and Bob v3.
v1 and v3 are not compatible.
But v2 contains bridging strategies, messy code, to be compatible with both v1 and v3.
Then Alice could connect to Carol, and Bob to Carol, meaning Alice and Bob can still exchange messages.

And as long as Alice is not the last one to run v1, then it will work.

And v3 can be “clean” without backward compatibility hacks.
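
The Alice/Carol/Bob scenario is a reachability question on a graph of compatible peers. Here is a toy Python sketch under strong simplifying assumptions: versions are integers, two nodes interoperate only if their versions differ by at most one, every pair of compatible nodes is connected, and messages are relayed hop by hop. None of this models real gossipsub behaviour; all names are hypothetical.

```python
from collections import deque


def compatible(a: int, b: int) -> bool:
    # Assumption: only adjacent versions can talk to each other.
    return abs(a - b) <= 1


def can_exchange(versions: dict, src: str, dst: str) -> bool:
    """Breadth-first search over the graph of compatible peers."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for peer, version in versions.items():
            if peer not in seen and compatible(versions[node], version):
                seen.add(peer)
                queue.append(peer)
    return False


# With Carol on v2 relaying, Alice (v1) and Bob (v3) can exchange messages.
print(can_exchange({"alice": 1, "carol": 2, "bob": 3}, "alice", "bob"))
# Without any v2 node, they cannot.
print(can_exchange({"alice": 1, "bob": 3}, "alice", "bob"))
```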

Meaning it is now a social problem: if Alice and her friends continue to run v1 forever, then it should still work for them.

The fact that they can’t do it yet shows that we need to work harder on decentralizing Waku and Status.

Finally, attempting to introduce any “force update mechanism” would inherently introduce a centralized point of authority that would determine “what the correct version is”. This is against our principles and FOSS principles.

Let’s review some examples to illustrate this point.

Breaking change in nwaku filter service

This PR in nwaku changes filter behaviour and breaks compatibility with go-waku (as a client).

Does it matter from a Status app POV? It should not.

Status Desktop in relay mode provides filter services (go-waku).
Status Mobile uses filter as a client (go-waku).
Status Desktop in light mode uses filter as a client (go-waku).
boot*status.prod fleet nodes provide filter services (nwaku)

So, deploying this breaking change in nwaku means that Status apps in light mode (all mobile, some desktop), cannot rely anymore on the fleet for filter, and can only rely on other desktop nodes.

The question then is: “how critical are the filter services provided by the status fleet?”

Looking at this Grafana dashboard, it looks like the answer is “not much”.
It would be good to have a working metric that returns the number of active subscriptions to confirm.

The real lesson here is that we cannot release Status Desktop (nwaku) before we release Status Mobile (nwaku), as doing so will deplete the number of filter servers that work with the go-waku client.

Store v2 to v3, store queries capped to 24 hours

The challenge with store breaking changes is that store is centralized. So indeed, we need to approach it as a classic client-server scenario.

In this case, it means giving users enough opportunity to upgrade to a newer version before we pull the rug from under their feet.

For Store v2, we can see that there is no traffic. This is not surprising, as store v3 was integrated in one of the very early public beta versions of the Status app (2.31). Hence, removal of store v2 can happen now.

Let’s summarize the impact of query capping:

store queries are now capped to 24 hours on the client side (Status app), by chunking any query longer than 24 hours into several 24-hour queries.
store queries are (soon) capped to 24 hours on the server side (nwaku), meaning an error will be returned if a client sends a query spanning more than 24 hours.

Queries were up to 30 days, especially for community.
Queries are made since the last successful store query. So if user does not open the app for a week, then a one week query would be triggered.
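
The client-side chunking described above can be sketched like this. It is a minimal Python illustration, not the status-go implementation; the function name is hypothetical, and treating each window as a half-open `[start, end)` interval is an assumption.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(hours=24)


def chunk_query(start: datetime, end: datetime,
                max_window: timedelta = MAX_WINDOW) -> list:
    """Split [start, end) into consecutive windows of at most max_window."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + max_window, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


# A user who was offline for a week triggers seven 24-hour queries
# instead of a single one-week query.
last_success = datetime(2025, 2, 1, 9, 0)
now = last_success + timedelta(days=7)
windows = chunk_query(last_success, now)
print(len(windows))  # 7
```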

Hence, if we deploy this on nwaku, users of 2.31 and older will not be able to retrieve past messages.

The reason for this change is that long queries strongly impact store performance for all users. And because store is currently centralized, it’s a much needed improvement.

We can then apply what was discussed in this thread. Because of the performance impact of this change, I suggest:

2.32.0-beta - queries are truncated in the app.
2.33.0 (14 Feb): do nothing
2.34.0 (March? not yet scheduled): Once successfully released, deploy change in nwaku fleet. Meaning users have to upgrade to 2.32.0 or above.

Why aren’t users upgrading

A point that we did not address in the discussion so far is “why would a user not upgrade?”.

Beyond not taking the time to do it, or not having proper marketplace app auto-update, the main reason for a user to not upgrade from version 1 to version 2 is that version 2 breaks something they like.

This is why it is better to leave a buffer of a few versions for breaking changes, so that a user is not forced to upgrade to version 2 and can wait for version 3, which hopefully fixes their problem (which may be unrelated to the breaking change).

Patryk · February 7, 2025, 5:56pm

Initial Release (N=32) - January

Preparation (N+1=33) - February (compatible with N=32)
Intermediary (N+2=34) - March (compatible with N=32)

Switch (N+3=35) - April (incompatible with N=32, compatible with N=33 & N=34)
Intermediary (N+4=36) - May (incompatible with N=32, compatible with N=33 & N=34)

Clean-up (N+5=37) - June (incompatible with N=33 & N=34, compatible with N=35 & N=36)

Since (x-1) >= 2 and x=3 (April), deprecate January’s release in April.
Since (y-x) >= 2 and x=3, y=5 (June), deprecate March’s release in June.

Another way to think about this is that, at any point in time, the current release remains compatible with the two previous releases.
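
The version numbers in this example follow directly from the version gap threshold. A minimal sketch, assuming the smallest `x` and `y` that satisfy `(x-1) >= C` and `(y-x) >= C` are chosen (the function name is hypothetical):

```python
def schedule(n: int, gap: int = 2) -> dict:
    """Release numbers for each step, given version gap threshold C = gap."""
    x = gap + 1  # smallest x with (x - 1) >= gap
    y = x + gap  # smallest y with (y - x) >= gap
    return {
        "initial": n,
        "preparation": n + 1,
        "switch": n + x,
        "clean-up": n + y,
    }


print(schedule(32))
# {'initial': 32, 'preparation': 33, 'switch': 35, 'clean-up': 37}
```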

fryorcraken · February 11, 2025, 4:28am

Breaking changes brought to Status app by both Chat protocol and Waku protocol modification will be tracked here.