Introduction

On October 9th, 2023, zkSecurity was tasked to audit Aleo’s consensus for use in the Aleo blockchain. Two consultants worked over the next 3 weeks to review Aleo’s codebase for security issues.

The code was found to be thoroughly documented and of high quality. In addition, the team acted in a highly cooperative way and was key in helping us find a number of the issues in this report.

On February 19th, 2024, a week-long review was performed in order to audit the fixes to the first phase of this audit. The fixes were found to be satisfactory.

Below, we go through the scope and a few general recommendations. The rest of this report includes an overview of the protocol as well as the findings.

Scope

zkSecurity used the testnet3-audit-zks branch (commit 9234cbee73666c7e6d00dfbdbcb947a244a43818) of the snarkOS repository.

Included in scope was:

  • Aleo’s Bullshark (the partially synchronous version) and Narwhal implementation for resharing and coming to consensus on a global order of transactions.
  • Aleo’s high-level consensus logic in snarkOS (within some folders of snarkOS/node)
  • Aleo’s ledger service, that has dependencies living in the snarkVM repository.

Note that some logic (like the puzzle mechanism) was out of scope, as it was planned for replacement.

General Recommendations

In addition to addressing the findings listed in this report, we offer the following strategic recommendations to Aleo:

Specify the protocol implemented. Bullshark, as it was introduced in two white papers, is loosely specified. Furthermore, the implementation of Aleo represents an important departure from the protocol described in the papers: the commit rule (discussed in finding Commit Flow Can Lead To Safety Violation), the garbage collection (discussed in finding Garbage Collection Can Block Commits From Happening), and the dynamic committee feature (discussed in finding Dynamic Committee Feature is Not Safe), all deviate from the whitepaper version of Bullshark. This makes it hard to understand the protocol, and to reason about its security. We strongly recommend that Aleo writes a specification of the protocol implemented.

Create test vectors. It might benefit Aleo to implement an embedded-DSL, as well as test vectors, to test a suite of interesting scenarios. For example, a number of scenarios like the one described in Commit Flow Can Lead To Safety Violation could be useful to heuristically check that properties from the Bullshark protocol are well-understood and preserved in the implementation.

Bullshark in Aleo

In this section we give an overview of the consensus protocol implemented by Aleo, namely Bullshark.

The Bullshark consensus protocol is the evolution of a long lineage of consensus protocol, perhaps starting with the very first practical one (PBFT). The idea remains more or less the same in most (all?) of these protocols: the set of participant is fixed, and they advance together in rounds:

  1. In the first round, a “leader” broadcasts a proposal.
  2. In the second round, participants vote on the proposal.
  3. In the third round, participants vote on votes.

If you see enough participants who have voted on a proposal, and enough participants have seen that (the “vote on votes” part), then we can “commit” to a proposal. In blockchain lingo, committing a proposal would be to finalize and apply the transactions it contains to the blockchain’s state.

This basic 3-step committing process was never really improved. Instead, most major optimizations have been obtained by using pipelining techniques. For example, The Hotstuff protocol introduced pipelining of this flow by having a leader propose in every round. More recently, Bullshark introduced another type of pipelining by having every participant propose in every round. (Another recent protocol Shoal introduces both pipelining techniques in a single protocol.)

More concretely, a round of bullshark between 3f+1 participants (of which only f can be malicious) goes like this:

  1. At the start of a round, every participant broadcasts a new batch of transactions (which can be thought of as a block) to every other participant. (See at the end of the list how the batch was created.)
  2. If you receive a batch from someone, make sure that it’s the only batch you’ve seen from them in this round, store it, and sign it.
  3. Once you collect 2f+1 signatures on your batch, this produces a certificate for your batch, which you can broadcast to everyone.
  4. Once you observe 2f+1 certificates in this round, produce a batch of transactions for the next round in which you also include the 2f+1 observed certificates. This should look like a directed acyclic graph (or DAG) of batches.
  5. Go back to step 1.

To explain concrete examples in the rest of this report, we use the following types of diagrams:

bullshark

This still doesn’t let us know how participants agree on what gets committed. The “commit rule” of Bullshark is quite simple: in every even round (e.g. round 2, 4, etc.) a leader is chosen (for example, via a round-robin election), and their DAG (their batch and all the batches they link to transitively) gets committed if there’s f+1 certificates referring to it.

A leader’s batch is called an anchor, and it is possible that an anchor does not get committed in an even round. But if an anchor gets committed by some honest nodes, it will always be committed later on by all other honest nodes. This is because a committed anchor will always be included in the DAG of another anchor.

This property is called the quorum intersection property and can be proven quite easily. If an anchor is committed in an even round i, it is because it was referred to by f+1 certificates in round i+1. Let’s call this set A. Any block (including anchor block) in round i+2 will have to include 2f+1 certificates from round i+1. let’s call that set B.

We need to show that AB, or in other words that the intersection contains at least one certificate, or in other words that when a block is committed then there’ll always be a path leading to that block in any block from the next even round. This is not too hard to show: the multiset |AB|=(f+1)+(2f+1)=3f+2 which is 1 vertex more than the total number of vertices in a round and thus has some multiplicities.

As leaders and their anchors can be slow to be proposed and certified, timeouts are added to ensure that participants will wait a certain amount of time for the leader to show up before moving to the next round. Concretely, in even rounds, participants will wait to see an anchor to certify, and will wait for a certified anchor to include in their next proposal. In odd rounds, participants will wait to see votes for an anchor to certify, and will wait for certified votes for an anchor to include in their next proposal.

One last (important) subtlety is that as described, the protocol allows for a node’s storage to grow ad-infinitum if they do not observe any commit. To prevent this to happen, a garbage collection protocol is added in order to prune older parts of the DAG as it grows. As such, when a commit happens, the DAG will be traversed up to some “garbage collection frontier”, and everything on the left of that frontier will be discarded.