Breaking changes and roll out strategies

A series of breaking changes for chat protocols is expected to come out of current studies.

I suggest a template strategy to adopt, to ensure that expectations are set and problems avoided.

While the focus of this post is the handling of breaking changes for the Status app, this can also be used as a guideline for any application using Waku.

Generic roll-out strategy

I usually use these steps when applying breaking changes in any type of application where software is rolled out to the user (i.e. not a website).
The breaking change is a move from an old protocol to a new protocol:

Read and write are generic terms, and could be replaced with subscribe and send in the context of Waku and Chat.

| Step | App Version | Read | Write |
|---|---|---|---|
| Initial | N | Old protocol | Old protocol |
| Preparation | N+1 | Old and new protocol | Old protocol |
| Switch | N+2 | Old and new protocol | New protocol |
| Clean-up | N+3 | New protocol | New protocol |

The steps above are related to the code. In terms of data, a migration step may be needed before clean-up to convert data in the system to the new protocol format.

+1, +2, etc. in the table do not necessarily refer to consecutive app version bumps; see the Timeline section.
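The table can be encoded directly to check which steps can talk to each other. This is only a sketch: the `Protocol`, `Phase` and `can_communicate` names are hypothetical and not part of Waku or Status, and it assumes a receiver understands a sender exactly when it reads the protocol the sender writes.

```python
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    OLD = "old"
    NEW = "new"


@dataclass(frozen=True)
class Phase:
    name: str
    version_offset: int  # offset from N, as in the table
    read: frozenset      # protocols the app reads (subscribes to)
    write: Protocol      # protocol the app writes (sends with)


# One entry per row of the roll-out table.
PHASES = [
    Phase("Initial", 0, frozenset({Protocol.OLD}), Protocol.OLD),
    Phase("Preparation", 1, frozenset({Protocol.OLD, Protocol.NEW}), Protocol.OLD),
    Phase("Switch", 2, frozenset({Protocol.OLD, Protocol.NEW}), Protocol.NEW),
    Phase("Clean-up", 3, frozenset({Protocol.NEW}), Protocol.NEW),
]


def can_communicate(sender: Phase, receiver: Phase) -> bool:
    """Assumption: a message is understood if the receiver reads the sender's write protocol."""
    return sender.write in receiver.read


def incompatible_pairs(phases):
    """Return (sender, receiver) step names where messages would be lost."""
    return [
        (s.name, r.name)
        for s in phases
        for r in phases
        if not can_communicate(s, r)
    ]
```

Under that assumption, `incompatible_pairs(PHASES)` reports four pairs: Initial and Preparation senders cannot reach a Clean-up receiver, and Switch and Clean-up senders cannot reach an Initial receiver. Adjacent steps always remain compatible.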

Prerequisite

First we need to define the support period for an application. This is usually defined in terms of time (e.g. 1 year) or releases (e.g. 3 major releases).

Timeline

N is the last version to only read from the old protocol. Only when N's support has ended can we proceed with the switch step.

N+1 is the last version to write in the old protocol. Only when N+1's support has ended can we proceed with the clean-up step.

For example, if support is set to 1 year from the release date, and releases happen every month:

| Time from N | Step | App Version | Read | Write |
|---|---|---|---|---|
| 0 | Initial | N | Old protocol | Old protocol |
| 1m | Preparation | N+1 | Old and new protocol | Old protocol |
| 1y | Switch | N+2 | Old and new protocol | New protocol |
| 1y1m | Clean-up | N+3 | New protocol | New protocol |
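The dates in this example follow mechanically from the two rules above. A minimal sketch, assuming support is counted in months from each version's release date and that every step ships in its own release; the function name and parameters are hypothetical:

```python
def rollout_timeline(support_months: int, release_interval_months: int) -> dict:
    """Earliest release time, in months after N, for each roll-out step."""
    initial = 0
    preparation = initial + release_interval_months
    # Switch (N+2) must wait until N is out of support.
    switch = max(preparation + release_interval_months, initial + support_months)
    # Clean-up (N+3) must wait until N+1 is out of support.
    cleanup = max(switch + release_interval_months, preparation + support_months)
    return {
        "Initial": initial,
        "Preparation": preparation,
        "Switch": switch,
        "Clean-up": cleanup,
    }
```

With `rollout_timeline(12, 1)` this gives 0, 1, 12 and 13 months, matching the table. When the support period is shorter than the release interval, the release cadence becomes the limiting factor instead.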

Status Communities

Status Communities are mostly discrete, i.e. self-contained (apart from the request-to-join flow), which enables a faster roll-out strategy:

  • The switch can be enabled for newly created communities. The gap between preparation and switch can be reduced or eliminated. However, only users with the latest app version can join such a community.
  • Enabling the new protocol at community creation can be an opt-in switch at first, and enabled by default later (the default setting is a product decision).
  • The migration and clean-up steps can be scheduled based on product needs by answering the question: what is the value of migrating pre-existing communities to the new protocol?
  • A property on the community information can be used to signal which protocol is in use and enable correct handling.

| Step | App Version | Community creation |
|---|---|---|
| Initial | N | Old protocol by default |
| Preparation + Switch | N+1 | New protocol opt-in switch |
| Crystallize (optional) | N+2 | New protocol by default |
| Migrate (optional) | N+2 | Existing communities migrate from old to new |
| Clean-up | N+3 | Remove old protocol support from code |

The restrictions are:

  • N+2 should not be done until N is EOL (end-of-life)
  • N+3 should not be done until N+1 is EOL
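To illustrate the idea of a property on the community information, here is a minimal sketch of how a client could decide whether it can join. Everything in it is an assumption for illustration: `CommunityInfo`, `protocol_version`, the version constants and `join_decision` are hypothetical names, not existing Status or Waku fields.

```python
from dataclasses import dataclass
from typing import Set, Tuple

# Hypothetical protocol identifiers, for illustration only.
OLD_PROTOCOL = 1
NEW_PROTOCOL = 2

UPDATE_MESSAGE = (
    "This community has been created with a new version of the app, "
    "please update to join"
)


@dataclass(frozen=True)
class CommunityInfo:
    name: str
    # Assumption: communities without the property are treated as old protocol.
    protocol_version: int = OLD_PROTOCOL


def join_decision(community: CommunityInfo, supported: Set[int]) -> Tuple[bool, str]:
    """Return (can_join, message_for_user) for this app version."""
    if community.protocol_version in supported:
        return True, ""
    # The app does not support the protocol this community was created with.
    return False, UPDATE_MESSAGE
```

An app at version N would call `join_decision(community, {OLD_PROTOCOL})`, while N+1 and N+2 would pass both identifiers and N+3 only `NEW_PROTOCOL`.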

Examples

Some examples of expected (not committed) breaking changes are:

  • Communities created on a common shard (app key DoS protection) to communities created on a dedicated shard (pre-shared key DoS protection)
  • New content topic strategy
  • Using e2e reliability protocol
  • All messages in one shard to messages segregated based on retention needs

One-to-one and private group chats

_For the sake of brevity, we will use "one-to-one chats" to refer to both one-to-one direct messages and private groups._

One-to-one chats do not offer the same flexibility as communities in terms of breaking changes. Yet, there are some options to dogfood and roll out breaking changes.

Experimentation and dogfooding

When rolling out a new breaking change, a feature switch can enable the change in the app, with the caveat that there is no backward compatibility. This helps with dogfooding and experimenting with the change, without proceeding with the migration yet.

In this scenario, users who enable the change are aware of the lack of backward compatibility and may switch back and forth.

This should be treated in a similar manner to changing the target fleet, where the app must be restarted.

Once the feature has stabilized, the generic roll-out strategy can be applied:

Read: old and new protocol vs bridging

Depending on the change, this may mean subscribing to both old and new content topics, or to different shards.

For example, assume the change is migrating one-to-one chats from a single shard to a range of shards. In this case, the application would subscribe to both the single shard and the new shard range for a period of time.

Another strategy previously implemented is bridging: Status runs software that routes/converts messages from the old protocol to the new protocol.

Bridging comes with a number of caveats that are often directly related to the breaking change itself:

  • scalability: if the new protocol is brought in to increase scalability, then bridging messages from new to old is likely to aggravate the scalability issues of the old protocol.
  • security: if the new protocol brings new security features (e.g. RLN), then it may not be possible to bridge. Indeed, messages going from a non-RLN relay network to an RLN relay network would need RLN proofs. If a bridge attaches proofs to all messages for free, then abuse may come from the non-RLN network, negating the intended effect of the breaking change.

This is why I would recommend against a bridging strategy.

Data Migration

Data migration usually consists of running a database script to update the data format.
In the case of Chat, it could mean copying messages stored with a specific shard or content topic value to another shard or content topic.

A migration strategy is similar to bridging, in that converting/copying data from one network to another may negate the intended effects.

It also assumes that all store nodes can be migrated, which becomes less likely as the network becomes more decentralised.

Hence, I would also recommend against deploying migration strategies.

Proposed Next steps

  1. Waku team to draft roll out proposal for non-backward compatible protocols. e2e reliability protocol for communities is likely the first candidate
  2. Status team to define support term for Status apps in terms of versions or time scale
  3. Waku chat and Status teams to review how a community could be flagged with a specific feature, to help manage incompatibility. E.g. attribute in community invitation link + description message
  4. Waku chat and Status teams to consider UX around incompatible communities. A generic solution is likely to be enough regardless of the migration, e.g. “This community has been created with a new version of the app, please update to join”.

As mentioned here, we can also have a smaller migration approach depending on the type of change and how it is being planned, e.g. a content-topic change for communities.

I can give details on the process we used in the past for managing breaking changes in the protocol.

Basing breaking changes on versions and timeframes alone gave undesirable results. A key point that needs to be considered is that we don't have control over which version of the app our users run. For example, some users do not install the app via an app store, so they are responsible for managing their own version upgrades.

I really like the principle of your timeframe based proposal.

What we found most successful was to also monitor Waku Telemetry to see what percentage of users were on an incompatible, deprecated version.

If we combine the proposed Time from N condition with a second condition (AND the percentage of users on App Version <= N+1 is less than X%, where X is the share of the user base we accept leaving on an unsupported version), we can ensure that only an acceptable share of users loses the expected compatibility.

This requires that we define that percentage of the user base; in the past we've used 5% as an acceptable cut-off value.
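A minimal sketch of that combined condition, assuming the share of users on deprecated versions is available from telemetry as a percentage. The function and parameter names are hypothetical; only the 5% default comes from past practice.

```python
def can_proceed(
    months_since_release: float,
    support_months: float,
    percent_on_old_versions: float,
    cutoff_percent: float = 5.0,
) -> bool:
    """True when both the time condition and the user-share condition hold."""
    # Time from N condition: the support period of the old version has elapsed.
    support_ended = months_since_release >= support_months
    # Telemetry condition: fewer than X% of users remain on deprecated versions.
    few_users_left = percent_on_old_versions < cutoff_percent
    return support_ended and few_users_left
```

For instance, `can_proceed(12, 12, 4.0)` holds, while `can_proceed(12, 12, 5.0)` does not, because the share must be strictly below the cut-off.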


Thanks for this insight.

We'll have to find a balance between being able to upgrade the app and reducing the impact on users. This is not purely a technical problem either, and it has to be coupled with communication to the users and the community.


I agree. I see in your example that Time from N for N+2 is 1y. I presume you picked that length of time because it is sufficiently long for enough of the user base to upgrade to at least N+1, assuming no other insight into the user base version distribution.

By using the version metrics as an input, coupled with active community communications, we could speed up the turnaround of transitions between breaking changes, thus making Time_from_N variable. We could use the formula:

if X% < userbase_percent_N1 OR (Time_from_N >= 1y AND X% < userbase_percent_N1)
