Introduction
On April 15, 2024, zkSecurity was engaged by the Ethereum Foundation to audit the zk-email project. The specific code to review was shared via GitHub as public repositories. The audit lasted 2 weeks with 3 consultants.
Scope
The scope of the audit included the following components:
zk-email-verify. The core circom library for email verification.
- https://github.com/zkemail/zk-email-verify
- The code was audited at commit
f2fb77c
. - The audit covers all circom templates in the
packages/circuits
directory, including templates exposed as utilities and not directly used in the mainEmailVerifier
template. The TS helper code inpackages/helpers
is covered as well. - The Solidity code in
packages/contracts
is excluded from the audit.
zk-regex. The regex-to-circom compiler, written in Rust.
- https://github.com/zkemail/zk-regex
- The code was audited at commit
9962a11
. - The audit covers the core compiler in
packages/compiler
, the regex JSON inputs inpackages/circom/circuits/common
and the interaction between compiled regex circuits and zk-email circuits.
Summary and general comments
A number of issues were found in both zk-email-verify and zk-regex. You can find the full list near the end of this report.
All concrete high- and medium-severity issues were addressed by the zk-email team. For more information, see the “Client Response” section at the end of each finding. The commits at which all fixes are applied and reviewed by us are:
- 95cd90 for zk-email-verify
- 5396ec for zk-regex
We believe that in addition to these fixes, substantial changes are needed to ensure the security and robustness of the zk-regex library. We provided a finding with general recommendations for zk-regex, and we also recommend to re-audit this part of the project after all issues are addressed.
For existing users of zk-email, we want to emphasize that some of our findings pertain to using zk-regex or specific templates from zk-email-verify as a library. That means, they don’t necessarily imply a critical vulnerability for end-to-end users of the EmailVerifier
template. Specifically:
- Our finding on the complement regex does not affect
EmailVerifier
, as it doesn’t use the complement regex operator. - Our finding on SHA256 does not affect
EmailVerifier
in a critical way, because exploiting it using malicious inputs is made infeasible by other constraints on those inputs inEmailVerifier
. - Our recommendation to document implicit assumptions does not affect
EmailVerifier
and is targeting usage of zk-email-verify as a library more generally.
Since the version of zk-regex that we audited is a complete rewrite of the one used in older versions of EmailVerifier
, we can’t comment on the security of previous versions of EmailVerifier
more generally. We recommend to evaluate whether the finding on start of string (^
) affects your use case.
A note on Circom compilation: During the audit, there was a concern that when compiling Circom with the --O2
flag, some linear constraints are entirely removed from the constraint system. The zk-email team asked us to evaluate the security impact of this optimization, and whether --O0
should be used instead for safety. In our assessment, --O2
can safely be used, as it does not impact zk-email negatively in any way.
In general, while the effect of removing certain statements from the constraint system is surprising, we believe --O2
shouldn’t have a security impact for any project, since the optimization never changes the mathematical statement of the generated zk proof.
Overview of zk-regex
This section provides a simplified overview of the zk-regex compiler.
zk-regex is a tool that enables developers to create circuits that prove and verify matches of strings against a regex pattern and optionally reveal parts of the matching string. The zk-regex compiler takes a regex pattern as input and synthesizes a Circom template. Users can choose to decompose the regex into multiple parts and select specific parts to reveal. The compiler has two modes: the first, called raw, takes a regex and produces a Circom circuit; the second, called decomposed, takes a JSON with a decomposed regex and generates the appropriate Circom circuit. Hereafter, we will focus on the decomposed mode as it is commonly used in zk-email and is the most stable way to interact with zk-regex.
High-level Overview of zk-regex
The figure below describes the high-level interactions for using zk-regex. First, we define a regex for which we want to create a proof, then we use zk-regex to create the respective Circom code. In another Circom template, we can use our regex template as a component to feed data and get an output that represents whether the input satisfies the regex. Note that we can also produce a reveal array in the regex template to process a particular part of the input string that matches a specific regex. Afterward, we can use the circuit like any Circom circuit (e.g., perform preprocessing and create a JavaScript prover and a Solidity verifier). This allows us to perform regex checks on-chain using off-chain computations, without revealing the input string or only revealing part of it.
zk-regex overview.
Consider the following regex example:
m[01]+-[ab]+;
To use zk-regex to obtain a Circom circuit that creates and verifies a proof for inputs that match this regex, we can use the following JSON:
{
"parts": [
{
"regex_def": "m[01]+-[ab]+;",
"is_public": false
}
]
}
To produce the Circom circuit, we run the following command:
zk-regex decomposed -d ./simple_regex_decomposed.json -c ./simple_regex.circom -t SimpleRegex -g true
This results in a Circom template with the following structure. Note that we must define the number of bytes our input message should be at compile time. This limits how large the message can be.
template SimpleRegex(msg_bytes) {
signal input msg[msg_bytes];
signal output out;
...
}
To call that template as a component in another template, we could use the following code:
signal simpleRegexMatch <== SimpleRegex(maxLength)(msg);
simpleRegexMatch === 1;
This will result in accepting the proof if the msg
matches the regex; otherwise, the constraint simpleRegexMatch === 1
will fail.
Now let’s consider the case where we want to reveal the part between -
and ;
. In this case, we would use the following JSON:
{
"parts": [
{
"regex_def": "m[01]+-",
"is_public": false
},
{
"regex_def": "[ab]+",
"is_public": true
},
{
"regex_def": ";",
"is_public": false
}
]
}
This will result in a Circom template with the following structure:
template SimpleRegex(msg_bytes) {
signal input msg[msg_bytes];
signal output out;
...
signal output reveal0;
...
}
That can be used as follows:
signal (simpleRegexMatch, simpleReveal[maxLength]) <== SimpleRegex(maxLength)(msg);
simpleRegexMatch === 1;
The reveal part is an array that is full of zeros, and in the positions of the revealing part, it contains the input that matches that part of the regex. For example, for the input m01-aab;
and length of the message set to 10
, the result of the reveal part would be aab
. Note that in practice the template takes as input UTF-8 encoded messages in decimal format. Hence, the actual input array would be ["109", "48", "49", "45", "97", "97", "98", "59", "0", "0"]
(you can use this tool for the conversion), and the reveal array would be ["0", "0", "0", "0", "97", "97", "98", "0", "0", "0"]
.
Synthesizing Circom Circuits for Regex
zk-regex operates on top of Deterministic Finite Automatons (DFAs), proving that a string satisfies a DFA and optionally revealing parts of the transition steps involved.
Regex to DFA
To understand how that works, we should go a step backward. First of all, a Deterministic Finite Automaton (DFA) is a finite-state machine that accepts or rejects a given string of symbols by running through a state sequence uniquely determined by the string. A regular expression can be converted to a DFA using the following process:
- Parse the Regular Expression: Initially, the regex string is parsed into a data structure, such as a parse tree or an abstract syntax tree (AST). This structure represents the hierarchical organization of the regex, including its characters, operators, and sub-expressions.
- Convert the Parse Tree to an NFA: Using the parse tree, a Non-deterministic Finite Automaton (NFA) is constructed. An NFA is a finite-state machine that may include several possible transitions for a single state and a given input symbol.
- Convert the NFA to a DFA: The NFA is then transformed into a DFA, which ensures a single, deterministic transition for each state and input symbol combination. This process can also include minimizing the DFA to reduce its complexity without changing the language it recognizes.
These steps are foundational in compiling regex into a format that can be processed efficiently. More information on this process can be found in various online resources.
DFA Definition
A deterministic finite automaton is defined as a 5-tuple, (, , , , ), where:
- is a finite set of states,
- is a finite set of input symbols (the alphabet),
- is the transition function ,
- is the initial state, and
- is the set of accept states, .
Example DFA Transition Matrix
Furthermore, typically, we have a transitions matrix for a DFA. Let’s consider our example from above, i.e., m[01]+-[ab]+;
.
DFA of m[01]+-[ab]+; (produced using https://zkregex.com/min_dfa).
The transition matrix will be:
DFA State | Type | m |
[0-1] |
- |
[a-b] |
; |
---|---|---|---|---|---|---|
0 | 1 | |||||
1 | 2 | |||||
2 | 2 | 3 | ||||
3 | 4 | |||||
4 | 4 | 5 | ||||
5 | Accept |
For a string (or part of the input string) to match the regex, it must match the DFA, or part of it should satisfy it.
Synthesizing Circom Code and Applying Constraints
Now, we can proceed to explain how zk-regex goes from the decomposed JSON input to the circom circuit and how it works. Initially, we will use an example where we have a single part and don’t reveal anything. After that, we will see what happens when we have multiple parts and how we reveal some parts of the matched string.
The first step in the compiler, once all the inputs are read and parsed, is to convert the regex to a DFA. This is done using DFAGraphInfo::regex_and_dfa
. If we have multiple regexes, their DFAs are computed separately and merged. Then, we compute which substrings should be revealed. The result is a structure called RegexAndDFA
, which contains regex_str, dfa_val, substrs_defs
. Let’s first understand what happens when we have a single regex.
Let’s consider the case where we have the following input.
{
"parts": [
{
"regex_def": "m[01]+-[ab]+;",
"is_public": false
}
]
}
The resulting RegexAndDFA
will be:
RegexAndDFA { regex_str: "m[01]+-[ab];", dfa_val: DFAGraph { states: [DFAState { type: "", state: 0, edges: {1: {109}} }, DFAState { type: "", state: 1, edges: {2: {48, 49}} }, DFAState { type: "", state: 2, edges: {2: {48, 49}, 3: {45}} }, DFAState { type: "", state: 3, edges: {4: {97, 98}} }, DFAState { type: "", state: 4, edges: {5: {59}} }, DFAState { type: "accept", state: 5, edges: {} }] }, substrs_defs: SubstrsDefs { substr_defs_array: [], substr_endpoints_array: None } }
It transforms it by using the rust DFA:Builder
after wrapping the regex into another regex with a specific anchored start ^
and end $
, i.e.,^Input_Regex$
.
The output of the DFA: Builder
is the following:
D 000000:
Q 000001:
*000002:
000003: m => 8
000004: - => 5, 0-1 => 4
000005: a-b => 6
000006: ; => 7
000007: EOI => 2
000008: 0-1 => 4
This is transformed into a graph, and it is sorted. Note that *000002
is the accepting state, and EOI
is the end of the input. The parsing and conversion are happening in the function parse_dfa_output
. Then, the graph is post-processed by dfa_to_graph
to get the decimal representation of the UTF-8 characters, which is the alphabet of the provided regex. This function does the following:
- Define a custom key mapping for
"\\n", "\\r", "\\t", "\\v", "\\f", "\\0"
to their UTF-8-encoded decimals. - It specially handles the space (
) to transform it into a format that it can handle and translate to a UTF-8 encoding correctly.
- It handles ranges (e.g.,
0-9
), by generating a set of all UTF-8 encoded values within that range. - It also specially handles the custom key mappings
- Finally, it also specially handles hexadecimal representations (
\xNN
); it parses the string after\x
as a hexadecimal number. Note thatDFA:Builder
translates non-ASCII characters to their hexadecimal UTF-8 encodings.
After having the RegexAndDFA
, then it uses the RegexAndDFA:gen_circom
to generate the Circom circuit, which in turn uses the gen_circom_allstr
function to produce the Circom code for the DFA.
The gen_circom_allstr
is a complicated function, and we believe it should be further documented and decoupled while also being extensively tested to reveal any subtle bugs. In a nutshell, it works as follows.
fn gen_circom_allstr(dfa_graph: &DFAGraph, template_name: &str, regex_str: &str) -> String
dfa_graph
— a collection of states and, for each state, transitions to other states labeled by letters.
template_name
— the name of the generated Circom template.
regex_str
— the regex string implemented by the generated Circom template.
Important variables:
n: usize
— number of states in the DFA.mut rev_graph: BTreeMap<usize, BTreeMap<usize, Vec<u8>>>
— reverse graph of the DFA, where the key is the state index and the value is a map of the states, that have transitions to the key state, and the letters of the transitions.accept_node: usize
— the index of the accepting state of the DFA.mut lines: Vec<String>
— the core generated Circom template code, each line in a separate element.mut declarations: Vec<String>
— the initial declarations in the Circom code: pragma, import, template declaration, input and output signals, components defined at the top of the template.mut init_code: Vec<String>
— the initialization code forstates[0][0..n-1]
.mut accept_lines: Vec<String>
— the code for the computing the output from the accepting state.
High level flow:
- Populate the reverse graph and
to_init_graph
based on the DFA. Also, identify the accepting states. - Check if there is only one accepting state, if not, panic.
- Initialize counters for different types of checks.
- Initialize data structures for different types of checks.
- Generate the Circom code for each state in the DFA.
- Generate the Circom code for the final state.
- Combine all the generated Circom code and return.
The main signals and components used to determine if we have a match are:
signal in[num_bytes];
: This signal represents the user input prepended by256
.states[num_bytes+1][n];
: This signal represents the transitions of states based on the input. Every time we go from a valid state to an invalid one, we transition back to the initial state.states_tmp[num_bytes+1][n];
: This is a helper signal to handle complicated conditions.from_zero_enabled[num_bytes+1];
: This signal checks if we are at the zero state.state_changed[num_bytes];
: This signal tracks the steps we took to change the state successfully (i.e., progress). The comparison components used to check if an input matches a character, a set of characters, or a range. These components are:IsEqual
,LessEqThan
,MultiOR
, etc.final_state_result
: This component is used to constrainout
to1
if we reach the acceptance state at some point.is_consecutive[msg_bytes+1][n-1];
: Keeps track of whether the matches are consecutive across the input, ensuring that patterns requiring consecutive elements (like a*) are correctly validated.
The steps in the Circom circuit are:
- It has a main loop that goes over the inputs and mainly sets and constrains
states
andstate_changed
. - It sets and constrains
final_state_result
, which sets and constrainsout
. - It computes and appropriately constrains the
is_consecutive
signal.
For simplicity, we are going to describe how that works for the regex m[ab];
and the input:
["109", "0", "109", "98", "59"]
So, our DFA is the following.
Visualisation for DFA of m[ab]+; (produced using https://zkregex.com/min_dfa).
The initialization Circom code would be:
signal input msg[msg_bytes];
signal output out;
var num_bytes = msg_bytes+1;
signal in[num_bytes];
in[0]<==255;
for (var i = 0; i < msg_bytes; i++) {
in[i+1] <== msg[i];
}
component eq[4][num_bytes];
component and[3][num_bytes];
component multi_or[1][num_bytes];
signal states[num_bytes+1][4];
signal states_tmp[num_bytes+1][4];
signal from_zero_enabled[num_bytes+1];
from_zero_enabled[num_bytes] <== 0;
component state_changed[num_bytes];
for (var i = 1; i < 4; i++) {
states[0][i] <== 0;
}
Note that the in
has one more initial byte that is set to the invalid decimal 255
. states
is two-dimensional. The first dimension is set to num_bytes+1
because we want to have a dummy initial array in the beginning to process the initial byte, and it is used to keep the state for each byte. The second dimension represents the states of the DFA. The other components are used to make comparisons and evaluations.
Next, we have the main loop, which is implemented as follows.
for (var i = 0; i < num_bytes; i++) {
state_changed[i] = MultiOR(3);
states[i][0] <== 1;
eq[0][i] = IsEqual();
eq[0][i].in[0] <== in[i];
eq[0][i].in[1] <== 109;
and[0][i] = AND();
and[0][i].a <== states[i][0];
and[0][i].b <== eq[0][i].out;
states_tmp[i+1][1] <== 0;
eq[1][i] = IsEqual();
eq[1][i].in[0] <== in[i];
eq[1][i].in[1] <== 97;
eq[2][i] = IsEqual();
eq[2][i].in[0] <== in[i];
eq[2][i].in[1] <== 98;
and[1][i] = AND();
and[1][i].a <== states[i][1];
multi_or[0][i] = MultiOR(2);
multi_or[0][i].in[0] <== eq[1][i].out;
multi_or[0][i].in[1] <== eq[2][i].out;
and[1][i].b <== multi_or[0][i].out;
states[i+1][2] <== and[1][i].out;
eq[3][i] = IsEqual();
eq[3][i].in[0] <== in[i];
eq[3][i].in[1] <== 59;
and[2][i] = AND();
and[2][i].a <== states[i][2];
and[2][i].b <== eq[3][i].out;
states[i+1][3] <== and[2][i].out;
from_zero_enabled[i] <== MultiNOR(3)([states_tmp[i+1][1], states[i+1][2], states[i+1][3]]);
states[i+1][1] <== MultiOR(2)([states_tmp[i+1][1], from_zero_enabled[i] * and[0][i].out]);
state_changed[i].in[0] <== states[i+1][1];
state_changed[i].in[1] <== states[i+1][2];
state_changed[i].in[2] <== states[i+1][3];
}
So, let’s see how the transition signal states
gets filled and how we constrain it to the correct value.
The initial values for states
are:
state | 0 |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | 0 |
Where state
column represent the DFA states, and 0
the current state when we processing in[0]
(255
).
The first byte we process is 255
and the resulting state will be the following. We also keep track of the state_changed
(sc
) variable. Note, that we set the state from 1 to 3 for the next input.
state | 0 | 1 |
---|---|---|
0 | 1 | |
1 | 0 | 0 |
2 | 0 | 0 |
3 | 0 | 0 |
sc | 0 |
We set all three states to 0
because we are not making the correct transition from the first state (0
) to state’ 1’, i.e., transitioning to 109
to go to state 1
. Next, we will be in state 0
and need m
(109
) to transition. Here, we should also note that we always set state 0
to 1
, but we handle that later with the from_zero_enabled
component in case we are not actually in state 0
. Our next input is 109
. The resulting table will be:
state | 0 | 1 | 2 |
---|---|---|---|
0 | 1 | 1 | |
1 | 0 | 0 | 1 |
2 | 0 | 0 | 0 |
3 | 0 | 0 | 0 |
sc | 0 | 1 |
Note that we transition to state 1
because and[0][i]
is 1
(meaning that state[i][0] is 1
and current input is 109
), and from_zero_enabled
is 1
.
Next, being in state 1
and processing the next input, we want to check if we can transition to state 2
, which means that we need input to be 97
or 98
. As it is not (0
) we change all the states for the next input to 0
which means we return to state 0
. Following the same logic, we eventually will have the following state transition table.
State transitions table.
The following code verifies that we have at least a match:
component final_state_result = MultiOR(num_bytes+1);
for (var i = 0; i <= num_bytes; i++) {
final_state_result.in[i] <== states[i][3];
}
out <== final_state_result.out;
This is because we check if we reach the state 3
, which is the accepted state, at least once. The final part of the code is the following:
signal is_consecutive[msg_bytes+1][3];
is_consecutive[msg_bytes][2] <== 1;
for (var i = 0; i < msg_bytes; i++) {
is_consecutive[msg_bytes-1-i][0] <== states[num_bytes-i][3] * (1 - is_consecutive[msg_bytes-i][2]) + is_consecutive[msg_bytes-i][2];
is_consecutive[msg_bytes-1-i][1] <== state_changed[msg_bytes-i].out * is_consecutive[msg_bytes-1-i][0];
is_consecutive[msg_bytes-1-i][2] <== ORAnd()([(1 - from_zero_enabled[msg_bytes-i+1]), states[num_bytes-i][3], is_consecutive[msg_bytes-1-i][1]]);
}
Which is a reverse loop to keep track of the state transitions using consecutive correct transitions.
Merging DFAs
Remember that we can split our regex into multiple parts. This functionality works by parsing each part in a separate DFA and then merging the DFAs by adding an anode from the first’s accepted state to the second’s first, etc.
Revealing sub-parts
When using the decomposed JSON and select to reveal a part of a regex, the regex_and_dfa
code will identify from an accepted state which transitions lead to it and will add it to the substr_defs_array
as a set of edges for the DFA of each public regex part. Then, in the Circom code, some additional code will be produced per public substring to be revealed that will be an output signal array of the size of the input message and will be filled with 0
’s and the input value for the part of the regex it matches. For example, for the following JSON.
{
"parts": [
{
"regex_def": "m",
"is_public": false
},
{
"regex_def": "[ab]",
"is_public": true
},
{
"regex_def": ";",
"is_public": false
}
]
}
The following Circom code will be produced:
// substrings calculated: [{(1, 2)}]
signal is_substr0[msg_bytes];
signal is_reveal0[msg_bytes];
signal output reveal0[msg_bytes];
for (var i = 0; i < msg_bytes; i++) {
// the 0-th substring transitions: [(1, 2)]
is_substr0[i] <== MultiOR(1)([states[i+1][1] * states[i+2][2]]);
is_reveal0[i] <== is_substr0[i] * is_consecutive[i][2];
reveal0[i] <== in[i+1] * is_reveal0[i];
}
That checks if the proper state transitions are enabled and if they are part of a consecutive string. Then, it reveals the input; otherwise, it sets the value to 0
. Note that this code has some issues, as demonstrated in finding “In accepting input, reveal array reveals more values than the matching ones”. Also, the user should be careful if the 0
value is supposed to be part of the revealed match.
UTF-8 encoding
zk-regex works on any UTF-8 encoded Unicode character in its decimal representation. This means that .
should match any valid UTF-8 encoded beyond the line terminator. More specifically, every character that is described using the following format and represented in its decimal form is accepted (note that it should broken into bytes), and a single character could be broken into multiple bytes, resulting in multiple consecutive nodes in the DFA. Still, there needs to be more specific documentation on what is accepted and what is not. It is up to the user to define a verifier by either using a very strict regex that will be sure to accept only the strings that he wants or defining and checking how a message should look.
UTF-8 conversion table (ref: https://en.wikipedia.org/wiki/UTF-8).
Attack Methodologies
This section describes the various attack methodologies employed during the audit of the zk-regex compiler.
Underconstrained Produced Circom Code
Upon understanding the algorithm used to transform the regex into a Circom template, we manually checked the produced examples to identify potential underconstrained vulnerabilities in the produced code. Our focus was not on the logic of the translation but rather on the application of constraints in the circuit.
Incorrect Compilation and Handling of Regexes
We examined some common patterns (also used in the examples) to check if the produced code behaved as expected. This led to the discovery of high-severity bugs, such as findings “Complement regex could lead to soundness issues” and “Start (‘^’) and end (‘$’) of line match does not work”.
Unexpected Leakage of Private Information
We also evaluated whether there were instances where the reveal arrays disclosed more inputs than those matched, or revealed parts of the input when there was no match. This investigation led to findings “The reveal array leaks inputs in a non-match” and “In accepting input, reveal array reveals more values than the matching ones”.
Unexpected Behaviors of the Compiler
Finally, we tested whether the compiler could effectively process well-defined user inputs. This revealed findings, such as “Regex using | leads to a compiler crash”.