Carnot questions

Hi, I was directed here for more questions about Carnot.

So far it seems plausible. At first I didn’t think it was log n and thought it was n log n on a per-node basis, but on further inspection I see why it’s log n. Now I’m looking at the edge cases. I typically start with the CFT scenario before looking at the BFT cases, and my first question there is:

What happens when a top level parent’s entire committee takes a big old liveness dump?

Let’s say it can’t meet quorum. The entire half-branch of the tree is dropped and the network would halt. You can’t even do a view change without a GST rule-based reshuffle, and even then the remaining network would be unlikely to have 51%, let alone the 2/3rds required for a decision. I notice that under VII.-E (Safety failure analysis) there’s a description of why this would be unlikely to cause a safety issue, and I agree, but it doesn’t address a network halt scenario, does it? Are you able to do view changes without consensus? I’m very unclear about timeouts, plz halp!

If this is answered in the paper, apologies, sometimes it’s difficult to keep everything straight after first reading.

Next question is about decision latency, aka time-to-finality. I propose a decision (typically a block in a chain, but in honesty it could be literally anything, as this is unmarried to any particular data structure), and each committee from the leaf onward proposes a decision (either colored or accept/reject). The parents then vote if the QC reached quorum, etc. Each “round” takes up to some “round timeout”, so there will be at most log(n) timeouts for each decision; if the timeout is, say, 1 second, and you have 1024 nodes, you’d need up to 10 seconds for each block to come to a decision. Not bad, better than Eth2 by far. But not great either if you’re looking for responsiveness. Maybe you’re not. I know Phil Daian isn’t a fan of low-latency protocols. Either way, in a globally distributed production environment this will depart from “web page” speed as the network grows, no? Is that not a design goal?
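To make my worst-case arithmetic concrete, here’s a quick sketch (this is my own reading: one round timeout per level of a binary tree, and `worst_case_latency` is a name I made up, not anything from the paper):

```python
import math

def worst_case_latency(num_nodes: int, round_timeout_s: float) -> float:
    """Worst-case decision latency if every level of a binary committee
    tree burns its full round timeout before moving on."""
    levels = math.ceil(math.log2(num_nodes))  # depth of the tree
    return levels * round_timeout_s

print(worst_case_latency(1024, 1.0))  # 1024 nodes -> 10 levels -> 10.0 seconds
```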


If we focus solely on Crash Fault Tolerance (excluding Byzantine faults), we can lower the quorum threshold in top-level committees to 51%.

For Byzantine Fault Tolerance, we require the participation of more than two-thirds of active nodes in the top committees. They are crucial for creating a certificate of attestation for a block or generating a timeout certificate. In essence, we rely on a supermajority of nodes in the network to maintain the protocol’s progress. Your question, if I understand correctly, pertains to the CAP theorem (The CAP Theorem and why State Machine Replication for Two Servers and One Crash Failure is Impossible in Partial Synchrony), which posits that a system cannot simultaneously possess all three properties: Consistency (safety), Availability (liveness), and Partition Tolerance.

Carnot achieves two of these properties, specifically Consistency and Partition Tolerance. This means that if the network experiences a partition where no single partition holds the majority of nodes, the protocol ensures safety but won’t make progress until the partition is resolved. In contrast, some protocols, like Bitcoin and Ethereum’s inactivity leak mechanism, aim to achieve availability even in the presence of partition. However, these protocols cannot guarantee consistency or safety during partitions. In certain cases, if an adversary gains control over the underlying network, they can compromise the protocol’s safety, which is a more severe violation than a loss of liveness.

Nonetheless, we are exploring the idea of incorporating an additional layer, similar to Ethereum’s inactivity leak mechanism, to enhance Carnot’s availability in the presence of a partition.

When we say “responsive,” we mean that the protocol transitions from one round to another based on specific events rather than fixed time intervals. This implies that the protocol doesn’t rely on block time, slot time, etc., but rather makes progress when a quorum of messages is processed. So, if it takes 10 seconds in one case, the protocol remains responsive. Dependence on slot time can be problematic because if an application considers a block committed based on time, there is always a risk of tampering. This opens up vulnerabilities when an adversary controls the network or time servers. In contrast, with a responsive protocol, an application can simply assume that a transaction is committed once it receives the necessary certificates.
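To illustrate the distinction (this is my own toy sketch, not Carnot’s actual code; all names here are hypothetical): a responsive protocol advances on a quorum event, a slot-based one advances on the clock.

```python
def advance_round(votes: list[str], quorum: int, slot_elapsed: bool) -> bool:
    # Responsive: progress as soon as a quorum of votes is processed,
    # regardless of how much slot time has elapsed (slot_elapsed is ignored).
    return len(votes) >= quorum

def advance_round_slotted(votes: list[str], quorum: int, slot_elapsed: bool) -> bool:
    # Slot-based: progress only when the fixed slot time expires,
    # even if the quorum arrived long before.
    return slot_elapsed

# With quorum reached early, only the responsive variant moves on immediately:
votes = ["v1", "v2", "v3"]
print(advance_round(votes, quorum=3, slot_elapsed=False))          # True
print(advance_round_slotted(votes, quorum=3, slot_elapsed=False))  # False
```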

Furthermore, each tree node represents a committee. Therefore, if there are 1024 nodes in a tree, the total number of nodes/validators can be calculated as committee_size * 1024 nodes.
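For example (illustrative numbers, not from the paper):

```python
tree_nodes = 1024        # committees (tree nodes) in the overlay
committee_size = 300     # validators per committee (illustrative)
total_validators = committee_size * tree_nodes
print(total_validators)  # 307200 validators in the whole network
```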

Additionally, there will be other chains besides the main chain that are specialized for low-latency applications. More details about this will be available in the white paper. For instance, in these chains, Carnot can operate in a single-committee setting.


Basically, I was saying that if it’s not at least CFT, it’s not BFT. So start there. I framed it in terms of FLP, not CAP.

Carnot achieves two of these properties, specifically Consistency and Partition Tolerance. This means that if the network experiences a partition where no single partition holds the majority of validators, the protocol ensures safety but won’t make progress until the partition is resolved.

And yea, exactly. However, I was also examining the threat surface for a liveness event. It doesn’t have to be a network partition, just a liveness failure… a committee is on AWS and that particular region crashes… OR the validators just get bored with the incentives and go down… (po-tay-to, po-tah-to).

Safety in partition is clear to me, as I stated here:

there’s a description of why this would be unlikely to cause a safety issue, and I agree,

BUT I was very unclear here:

but it doesn’t address a network halt scenario, does it?

I mean the liveness problem, which you acknowledged in your response when you said:

but won’t make progress until the partition is resolved.

The question I have is one of threat surface. Can a root committee (see picture) halt the network by going down or conducting a withholding attack?

Screenshot 2023-09-06 160341

The pink X goes down or is controlled by a malicious competitor who conducts a withholding attack. Low probability on paper, where 1 actor gets 1 vote, but actually HIGH probability in practice, as multiple validators are controlled by single actors.
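To put a number on “low probability on paper”: under the on-paper model where each validator fails independently, the chance a committee misses its 2/3 quorum is a binomial tail. This is a sketch under exactly that independence assumption, which is what breaks when one operator controls many validators (`p_committee_stalls` is my own name):

```python
import math
from math import comb

def p_committee_stalls(committee_size: int, p_down: float, quorum_frac: float = 2 / 3) -> float:
    """Probability a committee cannot reach quorum, modeling each
    validator as independently down with probability p_down."""
    needed = math.ceil(quorum_frac * committee_size)  # votes required for a QC
    max_down = committee_size - needed                # failures the committee tolerates
    # The committee stalls when MORE than max_down validators are down.
    return sum(
        comb(committee_size, k) * p_down**k * (1 - p_down) ** (committee_size - k)
        for k in range(max_down + 1, committee_size + 1)
    )

# On paper (independent 10% downtime), a 21-validator committee rarely stalls;
# correlated failures (one operator, one AWS region) break this model entirely.
print(p_committee_stalls(21, 0.1))
```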

If you cannot make progress, you must fall back to social consensus or have an in-protocol jumpstart via a synchronous epoch… perhaps a reshuffle-and-pray system (not ideal, could cause extended outages).

The surface area of this threat is quite high. The most obvious case is one committee halting near the root, but there are other cases where multiple committees halt, preventing the 2/3rds QC quorum from being reached.

we are exploring the idea of incorporating an additional layer

Do what Snowman++ does and use repeated round voting as a fallback… just use linear-chain Avalanche after a timeout. If it fails then, you have a serious problem and require social consensus. Alternatively, you can select a priority queue of fallback leaders and tally all certificates manually. High decision latency and it scales poorly, but it could work to get your view change. Many protocols have higher decision latency for view changes, so it’s not unheard of…
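A rough sketch of the fallback-leader-queue idea (entirely my own framing; all names are made up). The point is that the queue is derived deterministically from the view number, so every validator computes the same ordering locally without needing a round of consensus to agree on it:

```python
from hashlib import sha256

def fallback_leaders(validators: list[str], view: int, count: int = 3) -> list[str]:
    """Deterministic priority queue of fallback leaders for a view.
    Every node derives the same ordering from the same inputs."""
    def priority(v: str) -> str:
        # Hash (view, validator) so the ordering reshuffles each view.
        return sha256(f"{view}:{v}".encode()).hexdigest()
    return sorted(validators, key=priority)[:count]

# All nodes derive the same queue for view 7; on timeout, try leaders in order.
print(fallback_leaders(["a", "b", "c", "d", "e"], view=7))
```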

Furthermore, each tree node represents a committee. Therefore, if there are 1024 nodes in a tree, the total number of nodes/validators can be calculated as committee_size * 1024 nodes.

This is a miscommunication. I’m not speaking about tree nodes, I’m speaking about network nodes. Let me use the term validator from now on to remove the confusion.

Let me reframe my question understanding that I meant validators.

The parents then vote if the QC reached quorum, etc. Each “round” takes up to some “round timeout”, so there will be at most log(n) timeouts for each decision; if the timeout is, say, 1 second, and you have 1024 validators, you’d need 10 seconds for each block to come to a decision.

Keep in mind that’s a maximum and I chose 1 second arbitrarily.

Is this a correct analysis?

The paper delves into the BFT model, in which all votes carry equal weight. Currently, the integration of the PoS mechanism with BFT has not been realized. We are planning to address this integration in a forthcoming paper. However, we are contemplating a scenario where each staker’s influence in the network, represented by the number of validators they control, is proportionate to their stake (as it will also help with privacy). In this case, while validators are randomly assigned, a bad staker’s high stake gets diluted.
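For example (my own illustration of the stake-proportional idea; all numbers and names are made up):

```python
def validators_for_staker(stake: float, total_stake: float, total_validators: int) -> int:
    """A staker controlling x% of the stake runs roughly x% of validators,
    so random committee assignment dilutes any single staker's weight."""
    return round(stake / total_stake * total_validators)

print(validators_for_staker(50.0, 1000.0, 3000))  # 5% of stake -> 150 validators
```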

For 1024 validators the committee overlay shouldn’t be more than 2 levels deep. For example, with a committee size of 300+, we can have 3 committees. Also, the actual round latency is generally much lower than the maximum timeout period. For a block to come to a decision, the protocol processes votes, not the timeout.
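Working through those numbers (my own arithmetic, assuming each committee splits into two child committees; both function names are mine):

```python
import math

def num_committees(total_validators: int, committee_size: int) -> int:
    return math.ceil(total_validators / committee_size)

def overlay_depth(total_validators: int, committee_size: int, branching: int = 2) -> int:
    """Levels in the committee tree (root = level 1), assuming each
    committee has `branching` child committees."""
    committees = num_committees(total_validators, committee_size)
    depth, capacity = 1, 1
    while capacity < committees:
        capacity += branching**depth
        depth += 1
    return depth

print(num_committees(1024, 342))  # 3 committees of ~342 validators
print(overlay_depth(1024, 342))   # 2 levels: one root committee + two children
```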

CFT is essential for BFT, so it doesn’t matter whether I’m asking about uptime or withholding. If it isn’t safe under the CFT assumptions, it can never be BFT. That’s what I was saying.

300+ seems very excessive, but it would balance the surface area of attack way more than what I thought the committee size would be (either 13 or 21).

And yea, I think I was saying something wrong with the 1024… you’re right, I meant committees; I got lost in my own damn question. Sorry about that. 1024 committees is absurd, so that’s not a big deal after all.