End-to-end reliability for scalable distributed logs

Well, long store retention would still be necessary (or at least helpful) for time-based queries that catch up to real time after offline periods. That remains by far the most efficient way to recover missed messages: detect any offline period and perform a single time-based query for the gap. The reliability mechanism above can, however, be used to eventually reach consistency by iteratively ensuring that all “dependencies” (i.e. the full causal history) of received messages have been met. As long as you receive any message within a causality chain, you can walk back through history, causal dependency after causal dependency, until you reach the historical point to which you have previously synced. This may require an iterative query process that is both slow and expensive, so it should preferably be used only for recent history and for cases where we didn’t detect an offline period. Both paths are sketched below.
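
To make the two recovery paths concrete, here is a minimal TypeScript sketch. All of the names (`Message`, `Store`, `queryRange`, `fetchById`, `deps`) are hypothetical stand-ins for whatever the actual log implementation exposes, and this is a sketch of the approach rather than a definitive implementation:

```ts
// Hypothetical message/store shapes; `deps` holds the IDs of the
// messages that a given message causally depends on.
interface Message {
  id: string;
  timestamp: number;
  deps: string[];
}

interface Store {
  has(id: string): Promise<boolean>;          // present in the local log?
  put(msg: Message): Promise<void>;           // persist into the local log
  queryRange(from: number, to: number): Promise<Message[]>; // time-based query
  fetchById(id: string): Promise<Message | undefined>;      // point lookup
}

// Preferred path: an offline period was detected, so recover the whole
// gap with a single time-based query against the long-retention store.
async function recoverGap(store: Store, from: number, to: number): Promise<void> {
  for (const msg of await store.queryRange(from, to)) {
    await store.put(msg);
  }
}

// Fallback path: given any received message, walk its causal history
// backwards, dependency by dependency, until every chain bottoms out
// in history we have already synced. One round trip per missing
// dependency, which is why this is slow and expensive.
async function backfillCausalHistory(store: Store, received: Message): Promise<void> {
  const pending = [...received.deps];
  while (pending.length > 0) {
    const id = pending.pop()!;
    if (await store.has(id)) continue; // already synced past this point
    const dep = await store.fetchById(id);
    if (dep === undefined) continue;   // irretrievable; leave the gap for now
    await store.put(dep);
    pending.push(...dep.deps);         // keep walking back in history
  }
}
```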

UPDATE: I do think it may be worth investigating a background process that walks all previously unchecked messages in the local log, back to the “beginning of history”, and verifies that their causal dependencies have been met. Large pages of messages received via time-based queries may or may not come with complete causal histories. We could treat these chunks of historical log as “unchecked” and only mark them “complete” once a parallel process has verified their causal dependencies.
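
A rough sketch of what that parallel checker could look like, again with hypothetical names (`Chunk`, `listUncheckedChunks`, `fetchAndStore`, `markComplete`); the key idea is that a chunk only graduates from “unchecked” to “complete” once every causal dependency of every message in it is present locally:

```ts
// Hypothetical chunk bookkeeping: pages fetched via time-based queries
// land in the local log as "unchecked" chunks until verified.
interface LoggedMessage {
  id: string;
  deps: string[]; // IDs of causal dependencies
}

interface Chunk {
  id: string;
  messages: LoggedMessage[];
  status: "unchecked" | "complete";
}

interface LocalLog {
  listUncheckedChunks(): Promise<Chunk[]>;
  hasLocally(id: string): Promise<boolean>;
  fetchAndStore(id: string): Promise<LoggedMessage | undefined>; // fetch remotely, persist locally
  markComplete(chunkId: string): Promise<void>;
}

// Parallel checker: walk each unchecked chunk's causal dependencies,
// fetching any that are missing; a chunk is only marked "complete"
// once its full causal history is present locally.
async function verifyUncheckedChunks(log: LocalLog): Promise<void> {
  for (const chunk of await log.listUncheckedChunks()) {
    let complete = true;
    const pending = chunk.messages.flatMap((m) => m.deps);
    while (pending.length > 0) {
      const id = pending.pop()!;
      if (await log.hasLocally(id)) continue;
      const dep = await log.fetchAndStore(id);
      if (dep === undefined) {
        complete = false; // unmet dependency: stay "unchecked" for a later pass
        break;
      }
      pending.push(...dep.deps);
    }
    if (complete) await log.markComplete(chunk.id);
  }
}
```

Leaving a chunk “unchecked” on failure (rather than erroring out) means the checker can simply be re-run later, making progress each pass as more of the causal history becomes retrievable.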