<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://localhost:4000/feed.xml" rel="self" type="application/atom+xml" /><link href="http://localhost:4000/" rel="alternate" type="text/html" /><updated>2025-05-31T12:08:32+01:00</updated><id>http://localhost:4000/feed.xml</id><title type="html">J What’s going on?</title><subtitle>A personal space where I share things that happened, about to happen,  or may never happen; things that I think are interesting and would  like other people to know or comment on; things that I don&apos;t understand and hoping that other people might have answers.</subtitle><author><name>Jess</name></author><entry><title type="html">Recursive Curse</title><link href="http://localhost:4000/2025/05/31/recursive-curse-part2.html" rel="alternate" type="text/html" title="Recursive Curse" /><published>2025-05-31T01:00:00+01:00</published><updated>2025-05-31T01:00:00+01:00</updated><id>http://localhost:4000/2025/05/31/recursive-curse-part2</id><content type="html" xml:base="http://localhost:4000/2025/05/31/recursive-curse-part2.html"><![CDATA[<p>This is a test</p>]]></content><author><name>Jess</name></author><summary type="html"><![CDATA[This is a test]]></summary></entry><entry><title type="html">Look mum, it’s moving!</title><link href="http://localhost:4000/2025/05/30/look-mum-it-s-moving.html" rel="alternate" type="text/html" title="Look mum, it’s moving!" /><published>2025-05-30T01:00:00+01:00</published><updated>2025-05-30T01:00:00+01:00</updated><id>http://localhost:4000/2025/05/30/look-mum-it-s-moving</id><content type="html" xml:base="http://localhost:4000/2025/05/30/look-mum-it-s-moving.html"><![CDATA[<p>I don’t have art talent. When my daughter was very small we once had this conversation:</p>

<p>me: you will find your talent, you do have it you just don’t know yet.<br />
tiny person: what is your talent mummy?<br />
me: I don’t think I have any…I am still trying to find it!<br />
tiny person: your talent is.. looking after me! You are doing it very well mummy.</p>

<p>(I do miss the time when she always finished her sentence with “mummy”).</p>

<p>I wish I had art talent. I enjoy art projects but I could never just create something from nothing that looks pretty. I guess that’s why doing computer science suits me more. Computer programs could be described as art projects: producing a working piece of software involves not only putting instructions together; how they are put together matters too.</p>

<p>My undergraduate final year project on the computer science side was to implement a “Key Frame Animation Tool”. It was written in C, using the OpenGL and XForms libraries (you get the idea from the featured picture at the top). Fast forward to today: animation software tools are much, much more sophisticated. So I felt pretty excited when I went to an RDP (Research Development Programme) workshop on “Animate your research” last week to learn how to create animation based on our research.</p>

<p>Three key things I learnt from the workshop: (1) an animation can still be very impactful even if the drawings look bad to the untrained eye; (2) when the term “cryptography” means nothing to the people I am talking to, using it to explain my research so that they can visualise it simply doesn’t work; (3) the people at the workshop preferred to hear about the positive impact of doing something, rather than the negative impact of not doing it.</p>

<p>I felt rather intimidated when I was given some blank paper and a pencil and asked to draw three key themes of my research. I didn’t like anything I put on those papers. Towards the end of the workshop though, I felt very motivated to try and see if I could create an animation! So I did a silly animation with sound effects added:</p>

<video width="500" height="300" controls="">
    <source src="/assets/img/animation_zk.mp4" type="video/mp4" />
</video>

<p>Unfortunately, I didn’t have enough time during the workshop to complete the drawings for the animation based on zero-knowledge proofs for data exchange in a coffee supply chain. So I wrote the idea down briefly; hopefully one day I can create it!</p>

<ol>
  <li>Scene showing a series of coffee shops, a person goes into one with a recognisable label (or QR code?)</li>
  <li>The person comes out from the coffee shop with a coffee cup in their hand, smiling</li>
  <li>Zoom in the hand and then the coffee cup, and then the coffee</li>
  <li>“Go back in time” to show how the coffee was made</li>
  <li>Coffee -&gt; coffee beans added to machine -&gt; coffee beans in bags delivered to the shop -&gt; coffee beans selected based on verified certificates -&gt; coffee beans bagged in a factory, with certification process going on -&gt; coffee beans delivered to the factory by different distributors -&gt; Coffee beans distributors obtain certification -&gt; farmers sell coffee beans to distributors with certificates showing that they didn’t use deforested lands and that the beans were grown legally.</li>
</ol>

<p>Obviously, this sequence is overly simplified. However, during the workshop I found that as soon as I went into any details, people didn’t seem to be interested. I can see that this animation can be a nice way to open a technical presentation. Now I just need to start creating some bad drawings…</p>]]></content><author><name>Jess</name></author><category term="animation" /><category term="zk" /><summary type="html"><![CDATA[I don’t have art talent. When my daughter was very small we once had this conversation:]]></summary></entry><entry><title type="html">Recursive Curse</title><link href="http://localhost:4000/2025/05/30/recursive-curse.html" rel="alternate" type="text/html" title="Recursive Curse" /><published>2025-05-30T01:00:00+01:00</published><updated>2025-05-30T01:00:00+01:00</updated><id>http://localhost:4000/2025/05/30/recursive-curse</id><content type="html" xml:base="http://localhost:4000/2025/05/30/recursive-curse.html"><![CDATA[<p>The prototype I am working on at the moment is related to the first cloud computing use case mentioned in my previous post <a href="https://blogs.jadecoral.me/2025/05/15/zero-trust-always-verify.html">Zero trust, always verify</a>. The prototype consists of five actors:</p>

<ul>
  <li>Data centre operator</li>
  <li>Data centre customer</li>
  <li>Electricity supplier</li>
  <li>Smart meter manufacturer</li>
  <li>Trusted certificate authority</li>
</ul>

<p>In this use case we assume that for regulatory or business reputation purposes, a data centre customer wants to publish their carbon emissions data that includes all three scopes of emissions. Therefore they need to know the carbon emissions figures from their cloud providers, and they want to be able to verify the figures. The data centre operator, therefore, acts as the prover in this scenario, as they have all the data to produce the carbon emissions report, but they don’t want to reveal all the related business-sensitive information in the process of doing so.</p>

<p>I have written a circuit that can generate a proof using the private data input by the data centre operator. The proof can be serialised and sent to their customers, who can run the verification in a separate process using public data and the proof. This proof actually consists of multiple sub-proofs: not only do they want to prove that a customer’s emissions were calculated correctly based on their usage, but they also need to prove that the smart meter readings and the customers’ shares can be trusted. So I have also written a circuit that verifies all the signatures in the smart meter and carbon intensity chains, and another circuit that verifies that all customer shares add up to 100% of the total carbon emissions.</p>

<p>The challenge I am facing at the moment is scalability.</p>

<p>Take the “customer shares add up to 100%” scenario, for example. It is possible for a data centre to have over a million customers. In my prototype, I assume that customer records (each containing a customer ID and their share of the total emissions) are encrypted and put on a Merkle tree by the prover (i.e. the data centre operator). The initial idea for generating a proof for the root of the tree was first to generate a proof for each leaf, and then recursively generate a proof for each node at each level up to the root. The final proof, the root proof, should have a public output value of 100%. Running the circuits on my laptop takes ~10-14s for the base proofs (i.e. for each leaf), and a few seconds more for each recursive proof. Let’s say 12s for a base proof and 16s for a recursive proof.</p>

<p>For a million customers it would take ~12,582,912s, i.e. close to 146 days, to finish all the base proofs. The Merkle tree has 21 levels in total, so the number of nodes above the leaves would be (2^20)-1 = 1,048,575, and it would take ~16,777,200s (~194 days) to do all the recursive proofs. So in total, it would take almost a year running non-stop to complete all the proof generation!</p>
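<p>As a sanity check, the estimate can be reproduced in a few lines of plain TypeScript (using the measured 12s/16s per-proof timings from my laptop):</p>

```typescript
// Rough proving-time estimate for a Merkle tree covering one million customers,
// padded to the next power of two (2^20 leaves, 21 levels).
const leaves = 2 ** 20;               // 1,048,576 base proofs
const innerNodes = leaves - 1;        // 1,048,575 recursive proofs
const baseSeconds = 12;               // measured per-leaf proving time
const recursiveSeconds = 16;          // measured per-node proving time

const baseTotal = leaves * baseSeconds;               // 12,582,912 s
const recursiveTotal = innerNodes * recursiveSeconds; // 16,777,200 s
const totalDays = (baseTotal + recursiveTotal) / 86_400; // ≈ 339.8 days

console.log(baseTotal, recursiveTotal, totalDays.toFixed(1));
```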

<p>I am now trying a different approach. In theory, each customer only needs one proof, the root proof, to do the verification. Therefore, instead of using recursive proofs to produce the final sum, I could build the Merkle tree using the sums along with the hashes. I tested building such a tree and it took only 1 hour and 35 minutes for a million customer records. The trick now is to generate a cryptographically provable witness for the proof. The o1js framework I am using has example code that I can build on; I haven’t got it fully working yet but it’s looking very promising! Perhaps in a future blog I could write up the cryptographic properties of the circuits I have built.</p>
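<p>To illustrate the idea, here is a minimal sketch of such a sum-augmented Merkle tree in plain TypeScript. This is not the o1js circuit: SHA-256 stands in for the circuit-friendly Poseidon hash, and the node layout is illustrative:</p>

```typescript
import { createHash } from "crypto";

// Each node carries a hash AND the sum of the shares beneath it, so the root
// commits to both the customer records and their total.
type SumNode = { hash: string; sum: number };

const h = (s: string) => createHash("sha256").update(s).digest("hex");

function leaf(customerId: string, share: number): SumNode {
  return { hash: h(`${customerId}:${share}`), sum: share };
}

function parent(l: SumNode, r: SumNode): SumNode {
  // The parent hashes both children's hashes and sums, so a prover cannot
  // change a sum without changing the root hash.
  return { hash: h(`${l.hash}:${l.sum}:${r.hash}:${r.sum}`), sum: l.sum + r.sum };
}

function buildRoot(level: SumNode[]): SumNode {
  while (level.length > 1) {
    const next: SumNode[] = [];
    for (let i = 0; i + 1 < level.length; i += 2) {
      next.push(parent(level[i], level[i + 1]));
    }
    // An odd node is carried up unchanged, so its sum is not double-counted.
    if (level.length % 2 === 1) next.push(level[level.length - 1]);
    level = next;
  }
  return level[0];
}
```

<p>Because each parent hashes its children’s sums as well as their hashes, the root commits to the total; a verifier who trusts the root only needs to check that its sum field equals 100%.</p>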

<p>The scalability challenge for the carbon intensity, meter readings, and signature chains is another story for another day, but it’s equally interesting!</p>]]></content><author><name>Jess</name></author><category term="recursive SNARK" /><category term="performance" /><category term="scalability" /><summary type="html"><![CDATA[The prototype I am working on at the moment is related to the first cloud computing use case mentioned in my previous post Zero trust, always verify. The prototype consists of five actors:]]></summary></entry><entry><title type="html">Zero trust, always verify [1]</title><link href="http://localhost:4000/2025/05/15/zero-trust-always-verify.html" rel="alternate" type="text/html" title="Zero trust, always verify [1]" /><published>2025-05-15T01:00:00+01:00</published><updated>2025-05-15T01:00:00+01:00</updated><id>http://localhost:4000/2025/05/15/zero-trust-always-verify</id><content type="html" xml:base="http://localhost:4000/2025/05/15/zero-trust-always-verify.html"><![CDATA[<p>Trust - a simple word and yet such a complicated concept. To determine if someone or something can be trusted, the process tends to involve some evaluation based on a combination of human traits: knowledge, judgement, ethics and morals, to name a few. I cannot do the concept of trust justice by trying to explain it here (it took a PhD to formalise trust [2]!). If it is a computer system that is doing the evaluation, then not only does it need to be given the information, but also the rules on how the decision should be made. What if the information required by the rules is not all available? Then the system’s behaviour is determined by the designer/programmer of the system. 
A trivial example: a system presents an interface for a user to enter their username and password -&gt; the user inputs some text as the username and some text as the password -&gt; the entered username and/or password does not match what the system expects -&gt; the system rejects the access request.</p>

<p>There are many scenarios where someone or a system wants to be trusted (e.g. to gain access) but cannot reveal all the information required, for example due to privacy concerns. Having the ability to prove to other parties that you can be trusted, without telling them any of the secret information needed for the evaluation process, would be very useful. Imagine you receive a call from an unknown number; the person on the line claims to have important information about your bank account, but they need to verify that you are who they want to speak to first. Neither of the parties in this scenario can blindly trust the other. However, if the identities can be verified using cryptographic evidence, i.e. you give the caller some cryptographic data from which they can tell whether you are telling the truth, and vice versa, then no confidential information is shared in the conversation.</p>

<p>On the other hand, the ability to verify whether the information you are getting is accurate and can be trusted is also very powerful. Companies have strong incentives to hide or even lie about certain information disclosed to the public [3,4,5], so if the information is important then it is crucial that it can be verified. Traditional systems depend very much on manual processes to do the verification, e.g. the UK voting system. Voting in the UK only happens once in a while; the same manual process cannot work if it is applied to a system with a much shorter turnaround time requirement.</p>

<p>The above can be applied to carbon emissions reporting. Firstly, carbon emissions data are very important for tackling climate change. Carbon emissions are a measure of greenhouse gases released into the atmosphere (expressed in terms of carbon dioxide equivalent, CO<sub>2</sub>e), resulting from burning fossil fuels for generating power, heating, cooling, manufacturing goods and foods, and transportation [6]. Without data we cannot know the state, and without knowing the state we cannot track changes or progress. Secondly, carbon emissions accounting often involves supply chains. It is challenging to get accurate data from company to company for the same reasons mentioned earlier. There are emerging standards for exchanging emissions data between companies. For instance, WBCSD [7] is leading the effort and has produced a set of standards for emissions data exchange [8]. However, the data exchange methodology does not currently involve cryptographic verification. So to achieve trustworthy carbon emissions reporting, we need a way to verify the claims without revealing any business-sensitive data at the same time.</p>

<p>This “zero trust, always verify, private data protected” goal can be achieved by applying zero-knowledge proofs.</p>

<h2 id="use-case-1">USE CASE 1</h2>
<p>My first paper on this topic, accepted at the LOCO 2024 workshop [9], introduces the concept of applying zero-knowledge proofs (ZKPs) to achieve verifiable carbon emissions claims without compromising business-sensitive data in a cloud computing scenario. The ZKP is constructed as follows:</p>

<table>
  <thead>
    <tr>
      <th>Actors</th>
      <th>Roles</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Prover</td>
      <td>Data centre operator. They give their customers the carbon emissions data based on their usage.</td>
    </tr>
    <tr>
      <td>Verifier</td>
      <td>Customer of the data centre, a company that uses the data centre’s hosting service for their online business. They need to produce their sustainability report [10], which includes their scope 3 carbon emissions. Hence, they need to make sure that the data they receive are accurate.</td>
    </tr>
    <tr>
      <td>Electricity supplier</td>
      <td>Supplies electricity to the data centre. They provide the carbon intensity figures for the data centre to do their carbon emissions accounting. The figures are signed by the electricity supplier.</td>
    </tr>
    <tr>
      <td>Smart meter manufacturer</td>
      <td>Makes smart meters that are used by the data centre to measure their electricity consumption. They sign the smart meters’ public keys.</td>
    </tr>
    <tr>
      <td>Trusted certificate authority (CA)</td>
      <td>They are trusted third party authorities who provide signed certificates for the public keys from the smart meter manufacturer and electricity supplier.</td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<table>
  <thead>
    <tr>
      <th>Commitment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Carbon emissions accounting that produces the emissions claim for the customer</td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<table>
  <thead>
    <tr>
      <th>Data</th>
      <th>Public or Private witness</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Carbon emissions claim for the customer</td>
      <td>Public</td>
    </tr>
    <tr>
      <td>CA’s public keys</td>
      <td>Public</td>
    </tr>
    <tr>
      <td>Carbon intensity</td>
      <td>Private</td>
    </tr>
    <tr>
      <td>Electricity consumption</td>
      <td>Private</td>
    </tr>
    <tr>
      <td>Customer’s share of usage</td>
      <td>Private</td>
    </tr>
    <tr>
      <td>Digital signatures for the smart meter reading, smart meter’s public key, manufacturer’s public key, carbon intensity and the electricity supplier’s public key</td>
      <td>Private</td>
    </tr>
  </tbody>
</table>
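<p>To make the witness structure concrete, here is a hypothetical TypeScript sketch of how the private data feed into the committed computation (the names and the simple intensity × consumption × share formula are illustrative, not taken from the paper):</p>

```typescript
// Hypothetical data model for the use-case-1 witness: the verifier sees only
// the resulting claim (public); intensity, consumption and share stay private.
interface PrivateWitness {
  carbonIntensityGPerKWh: number;  // signed by the electricity supplier
  meterReadingKWh: number;         // signed by the smart meter
  customerSharePercent: number;    // customer's share of total usage, 0..100
}

// The committed computation: emissions claim = intensity × consumption × share.
function emissionsClaimGrams(w: PrivateWitness): number {
  return (w.carbonIntensityGPerKWh * w.meterReadingKWh * w.customerSharePercent) / 100;
}

// In the real circuit this arithmetic (plus all the signature checks) is
// proven in zero knowledge; only the claim is a public output.
const claim = emissionsClaimGrams({
  carbonIntensityGPerKWh: 200,
  meterReadingKWh: 10_000,
  customerSharePercent: 5,
});
console.log(claim); // 100000 g, i.e. 100 kg CO2e
```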

<h2 id="use-case-2-an-extension-to-use-case-1">USE CASE 2 (an extension to use case 1)</h2>

<p>Considering the cloud computing scenario above, we can imagine that data centre operators buy both carbon-emitting energy and clean energy from their suppliers. This means that the electricity consumed at the data centre has different carbon intensity factors, depending on the type of generation source. We can also imagine that pricing could be set differently based on power consumption, and customers could choose to pay more to be carbon-free for their services. Whilst it is not possible to directly measure the amount of carbon-free energy being used by individual customers, we can apply the Greenhouse Gas Protocol’s “Completeness Principle”: the amounts of energy attributed to all the customers must add up to the total amount of energy that contributes to the carbon emissions at the data centre (internal use can be counted as a non-paying customer). For example, if the data centre bought 50% carbon-emitting energy and 50% renewable, and if one customer, consuming 1% of the total power consumption, has signed up for 100% carbon-free energy, then there should be 50% carbon-emitting energy and 49% renewable left for the rest of the customers.</p>

<p>The chain (much simplified with details omitted) looks something like this:</p>

<p><img src="/assets/img/renewable_energy_scenario.png" alt="" width="700" /></p>

<p>Let X kWh be the power generated from the carbon-emitting source, and Y kWh be the power generated from the carbon-free source. a1, a2, a3 and a4 are the amounts of energy attributed to each customer from the carbon-emitting source (from which their emissions are calculated using that source’s carbon intensity), and b1, b2, b3 and b4 are the amounts attributed from the carbon-free source. We want to prove that a1 + a2 + a3 + a4 = X kWh and b1 + b2 + b3 + b4 = Y kWh, and that a1 + a2 + a3 + a4 + b1 + b2 + b3 + b4 = X + Y kWh, without revealing any of the input numbers. This is only an illustration to explain the use case; in real life there could be over a million customers! Therefore a human auditor cannot practically check this. However, a human auditor could play the role of verifier and make use of the ZKP system.</p>
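<p>The completeness check itself is simple arithmetic; a small illustrative TypeScript sketch (the names and numbers are made up, following the 50/50 example above):</p>

```typescript
// Completeness check for the two-source scenario: the per-customer energy
// attributions must add up to the energy bought from each source.
function attributionsComplete(
  a: number[], // per-customer kWh attributed to the carbon-emitting source
  b: number[], // per-customer kWh attributed to the carbon-free source
  x: number,   // kWh bought from the carbon-emitting source
  y: number,   // kWh bought from the carbon-free source
): boolean {
  const sum = (xs: number[]) => xs.reduce((s, v) => s + v, 0);
  return sum(a) === x && sum(b) === y && sum(a) + sum(b) === x + y;
}

// 100 kWh total, bought 50/50; customer 0 consumes 1 kWh, fully carbon-free.
const a = [0, 20, 15, 15]; // sums to 50
const b = [1, 20, 15, 14]; // sums to 50
console.log(attributionsComplete(a, b, 50, 50)); // true
```

<p>In the ZKP version, the same equalities are enforced as circuit constraints over private inputs, so the verifier learns only that they hold, not the individual attributions.</p>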

<p>The ZKP for this scenario can be constructed based on the following:</p>

<table>
  <thead>
    <tr>
      <th>Actors</th>
      <th>Roles</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Prover</td>
      <td>In this use case we are only considering the data centre operator as the prover. To extend the use case further, we could also generate a proof at the electricity supplier’s level.</td>
    </tr>
    <tr>
      <td>Verifier</td>
      <td>Customer of the data centre; they want to verify that the carbon emissions data from the data centre are accurate. In the extended use case mentioned above, the proof produced by the prover would also include a verified proof provided by the electricity supplier on carbon intensity and energy source.</td>
    </tr>
    <tr>
      <td>Electricity supplier</td>
      <td>Supplies electricity to the data centre. They provide the carbon intensity figures for the data centre to do their carbon emissions accounting, the figures are signed by the electricity supplier. The intensity factors could be different depending on the generator source.</td>
    </tr>
    <tr>
      <td>Smart meter manufacturer</td>
      <td>Makes smart meters that are used by the data centre to measure their electricity consumption. They sign the smart meters’ public keys.</td>
    </tr>
    <tr>
      <td>Trusted certificate authority (CA)</td>
      <td>They are trusted third party authorities who provide signed certificates for the public keys from the smart meter manufacturer and electricity supplier.</td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<table>
  <thead>
    <tr>
      <th>Commitment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Carbon emissions accounting that produces the emissions claim for the customer</td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<table>
  <thead>
    <tr>
      <th>Data</th>
      <th>Public or Private witness</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Carbon emissions claim for the customer</td>
      <td>Public</td>
    </tr>
    <tr>
      <td>CA’s public keys</td>
      <td>Public</td>
    </tr>
    <tr>
      <td>Carbon intensity</td>
      <td>Private</td>
    </tr>
    <tr>
      <td>Electricity consumption</td>
      <td>Private</td>
    </tr>
    <tr>
      <td>Customer’s share of usage</td>
      <td>Private</td>
    </tr>
    <tr>
      <td>Customer’s contracted portion of renewable energy</td>
      <td>Private</td>
    </tr>
    <tr>
      <td>Digital signatures for the smart meter reading, smart meter’s public key, manufacturer’s public key, carbon intensity and the electricity supplier’s public key</td>
      <td>Private</td>
    </tr>
  </tbody>
</table>

<p>The prototypes for these two use cases are a work in progress; currently I am testing out different techniques and frameworks that can achieve the same ZKPs but have different properties. Once I have completed the proof of concept on these two use cases, I could apply a similar technique to other commodities such as coffee beans. I will continue to share this research journey in the next blog(s)!</p>

<p><span style="color:grey; font-family:Georgia; font-size:0.9em;">[1] Russian proverb “Trust but verify”, https://en.wikipedia.org/wiki/Trust,_but_verify</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[2] S. P. Marsh. 1994. Formalizing Trust as a Computational Concept. Ph.D. Dissertation. University of Stirling.</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[3] Volkswagen emissions scandal: https://www.epa.gov/vw/learn-about-volkswagen-violations</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[4] Ikea logging protected forests: https://earth.org/ikea-implicated-in-logging-protected-siberian-forests/</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[5] What is greenwashing: https://www.un.org/en/climatechange/science/climate-issues/greenwashing</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[6] Causes of Climate Change, the United Nations, https://www.un.org/en/climatechange/science/causes-effects-climate-change</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[7] The World Business Council for Sustainable Development, WBCSD https://www.wbcsd.org/</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[8] Partnership for Carbon Transparency, PACT: https://www.carbon-transparency.org/</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[9] Man, J., Jaffer, S., Ferris, P., Kleppmann, M. and Madhavapeddy, A., Emission Impossible: privacy-preserving carbon emissions claims.</span><br />
<span style="color:grey; font-family:Georgia; font-size:0.9em;">[10] EU CSRD: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32022L2464</span></p>]]></content><author><name>Jess</name></author><category term="trust" /><category term="zero trust" /><category term="verifiable data" /><category term="zero-knowledge proofs" /><category term="climate change" /><category term="greenhouse gas emissions" /><category term="carbon emissions" /><summary type="html"><![CDATA[Trust - a simple word but yet such a complicated concept. To determine if someone or something can be trusted, the process tends to involve some evaluation based on a combination of human traits: knowledge, judgement, ethics, morals, to name a few. I cannot do it justice to even try to explain the concept of trust (it took a PhD to formalise trust [2]!). If it is a computer system that is doing the evaluation, then not only does it need to be given the information but also the rules on how that decision should be made. What if the information required by the rules is not all available? The system’s behaviour will be determined by the designer/programmer of the system. 
A trivial example: a system presents an interface for a user to enter their username and password -&gt; the user inputs some text as the username and some text as the password -&gt; the entered username and/or password does not match with what the system expects -&gt; the system rejects access request.]]></summary></entry><entry><title type="html">Fun with recursion</title><link href="http://localhost:4000/2025/04/26/fun-with-recursion.html" rel="alternate" type="text/html" title="Fun with recursion" /><published>2025-04-26T01:00:00+01:00</published><updated>2025-04-26T01:00:00+01:00</updated><id>http://localhost:4000/2025/04/26/fun-with-recursion</id><content type="html" xml:base="http://localhost:4000/2025/04/26/fun-with-recursion.html"><![CDATA[<p>Following on from the <a href="https://blogs.jadecoral.me/2025/04/19/first-blog.html">previous blog post</a> regarding the prototype I built to generate carbon emissions proofs, I found out that the maximum number of bits that can be packed into a field element is 126 in Circom. Therefore, if we want 128-bit security strength as mentioned previously, we need to handle a ciphertext modulus of 6144 bits: for <a href="https://en.wikipedia.org/wiki/Paillier_cryptosystem">Paillier</a>, the ciphertext modulus is the square of the key modulus n (the product of two prime numbers), and n needs to be 3072 bits to achieve 128-bit security strength. If a field element can only hold up to 126 bits, we will need ceil(3072/126) = 25 elements in the field-element array to represent a key size of at least 3072 bits.</p>
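<p>The limb packing itself is straightforward; here is an illustrative sketch in TypeScript using BigInt (the 126-bit limb width mirrors the Circom field-element limit mentioned above):</p>

```typescript
// Split a big integer into little-endian limbs of 126 bits each, so a
// 3072-bit Paillier key fits into ceil(3072/126) = 25 field elements.
const BITS_PER_LIMB = 126n;
const LIMB_MASK = (1n << BITS_PER_LIMB) - 1n;

function toLimbs(x: bigint, count: number): bigint[] {
  const limbs: bigint[] = [];
  for (let i = 0; i < count; i++) {
    limbs.push(x & LIMB_MASK); // take the low 126 bits
    x >>= BITS_PER_LIMB;       // shift the rest down
  }
  return limbs;
}

function fromLimbs(limbs: bigint[]): bigint {
  // Recombine from the most significant limb down.
  return limbs.reduceRight((acc, limb) => (acc << BITS_PER_LIMB) | limb, 0n);
}

const limbsNeeded = Math.ceil(3072 / 126); // 25
```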

<p>It is also not a straightforward task to write the Paillier encryption in circuits. Instead of performing a basic exponentiation (the random number r needs to be raised to the power of the key, n) by calling something like r**n and letting the compiler/runtime engine deal with the rest, the circuit needs to include r**n as part of the proof and hence reduce it to a “Rank-1 Constraint Satisfaction” (R1CS) system (there are other interpretations of what the ‘S’ stands for, e.g. System, Satisfiability). In R1CS the algebraic circuits are expressed as a set of vectors and matrices, which in turn are converted to a set of polynomials to be used for the rest of the zkSNARK pipeline. So how do you express r**n as an algebraic circuit in the first place?</p>

<p>At first I tried the naive approach and simply created a loop (Circom supports loops) for r to multiply itself n times. This turned out to have very bad performance. Then, with my supervisor Martin’s help, I was able to apply the <a href="https://en.wikipedia.org/wiki/Exponentiation_by_squaring">Square and Multiply</a> method in a Circom circuit, which makes it far more performant. The circuit looks like this:</p>

<p><img src="/assets/img/exp_circuit.png" alt="" /></p>
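<p>Outside a circuit, the Square and Multiply idea is easy to state (a plain TypeScript BigInt sketch; the circuit version differs in that it runs a fixed number of iterations over the exponent’s bits and replaces the branch with constraint-friendly selection):</p>

```typescript
// Square-and-multiply: compute r^n mod m in O(log n) multiplications
// instead of n-1 naive ones.
function modPow(r: bigint, n: bigint, m: bigint): bigint {
  let result = 1n;
  let base = r % m;
  while (n > 0n) {
    if (n & 1n) result = (result * base) % m; // multiply when the exponent bit is set
    base = (base * base) % m;                 // square at every step
    n >>= 1n;                                 // move to the next bit
  }
  return result;
}
```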

<p>However, it is still too big (in terms of the number of constraints). The carbon emissions prototype circuits with the Paillier encryption added were compiled, and the Circom compiler reported ~142 million constraints (as shown below). The trusted setup required to kick off the zkSNARK system, which uses the Groth16 protocol, will therefore have to support up to 2^28 constraints, which is the maximum <a href="https://github.com/iden3/snarkjs">snarkjs</a> can currently support. The high number of constraints causes the Powers of Tau ceremony for the trusted setup to take a very long time (days!). In fact, I could not even complete the experiment with a key size bigger than 1000 bits on my laptop, as it doesn’t have enough memory to carry out the trusted setup and proof generation.</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="go">non-linear constraints: 142769486
linear constraints: 0
public inputs: 28
private inputs: 65
public outputs: 1
wires: 141986048
labels: 149556141
Written successfully: ./emissions_proof.r1cs
Written successfully: ./emissions_proof.sym
thread 'main' panicked at code_producers/src/wasm_elements/wasm_code_generator.rs:9:5:
the size of memory needs addresses beyond 32 bits long. This circuit cannot be run on WebAssembly
</span></code></pre></div></div>

<p>So, the experiment with Circom didn’t feel satisfactory because of the Paillier encryption. Taking a step back, the Paillier encryption was added to prove that the provided customer share was correctly encrypted, so that outside of the zkSNARK we can verify, using the Paillier cryptosystem, that all customer shares reported by the data centre operator add up to 100% of the total power usage. If we can find a way to prove that within a SNARK proof, without having to input the data of every customer at once (a data centre could potentially have thousands or even millions of customers!), then we won’t need to apply the Paillier cryptosystem at all.</p>

<p>One way to do it is through recursive SNARKs [1]. The input to each proof can be limited to one customer share at a time, and we add the share to the previous customer’s running total recursively. With enough recursive steps to go through the whole customer base, the final proof output should be 100%!</p>
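<p>Stripped of the cryptography, the recursion is just a fold: each step below stands in for one recursive proof whose public input is the previous total (an illustrative sketch, with shares in basis points to keep the arithmetic exact):</p>

```typescript
// Each "step" takes the previous accumulated total (the prior proof's public
// output) and one private customer share, and outputs the new total.
function step(previousTotal: number, share: number): number {
  return previousTotal + share;
}

// Shares in basis points (10000 = 100%).
const shares = [2500, 2500, 3000, 2000];
const finalTotal = shares.reduce(step, 0);
console.log(finalTotal === 10_000); // the final proof's public output must be 100%
```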

<p>Circom does not support recursion, so for the past couple of weeks I have been experimenting with two different methods. One is to use a framework called <a href="https://docs.minaprotocol.com/zkapps/o1js">o1js</a>, a TypeScript library provided as part of the <a href="https://docs.minaprotocol.com/">Mina blockchain protocol</a>, created and maintained by <a href="https://www.o1labs.org/">O(1)Labs</a> and the <a href="https://www.minafoundation.com/">Mina Foundation</a>. The other is to use a zkVM, e.g. <a href="https://risczero.com/">RiscZero</a>, <a href="https://docs.succinct.xyz/">SP1</a> or Jolt (Arun, A., Setty, S. and Thaler, J., 2024, April. Jolt: SNARKs for virtual machines via lookups. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (pp. 3-33). Cham: Springer Nature Switzerland). A zkVM provides a virtual environment that generates zk-proofs, abstracting away the complexity of circuit logic and providing a more developer-friendly language (e.g. Rust) to write circuits in. Within the last five or six years there has been a lot of effort to improve the performance of zkVMs.</p>

<p>It would be very interesting to see the results from these two methods!</p>

<p>I tried the o1js framework first. Compared with my experiment with Circom, the first impression was already a win. I am able to express my circuits within a few lines, and the readily available modules are sufficient for me to write the same emissions proof prototype. With the support of recursion, I am now able to do the customer share additions one by one in each proof, and then verify that the final output from the recursive proof is indeed 100.</p>

<p>Right now I am trying to learn about RISC Zero and SP1. So far RISC Zero’s protocol and framework make more sense to me; SP1 has abstracted the zero-knowledge proving part so much that it is quite difficult to express my intention using their framework. It is very much designed for writing proofs that can be deployed to smart contracts.</p>

<p>In terms of performance, my initial observation (without measurements) is that they take substantially more compute power and time to generate a proof. Verification is still very fast and the proofs remain small.</p>

<p>I will write in more detail about these experiments and their results in future blogs. For the next blog though, I think I will go back to the problems I am trying to solve and explore the use cases further!</p>

<p>[1] Bitansky, N., Canetti, R., Chiesa, A. and Tromer, E., 2013, June. Recursive composition and bootstrapping for SNARKs and proof-carrying data. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (pp. 111-120).</p>]]></content><author><name>Jess</name></author><category term="recursive SNARKs" /><category term="zkVM" /><summary type="html"><![CDATA[Going round and round...]]></summary></entry><entry><title type="html">Fun with Yeo Tokens</title><link href="http://localhost:4000/2025/04/26/fun-with-yeotokens.html" rel="alternate" type="text/html" title="Fun with Yeo Tokens" /><published>2025-04-26T01:00:00+01:00</published><updated>2025-04-26T01:00:00+01:00</updated><id>http://localhost:4000/2025/04/26/fun-with-yeotokens</id><content type="html" xml:base="http://localhost:4000/2025/04/26/fun-with-yeotokens.html"><![CDATA[<p>For many years I have bought milk, butter and other dairy products from a brand called Yeo Valley Organic (Disclaimer: it is purely personal taste that I buy their products. I have no association with the company other than being one of their customers. There are many other brands available; readers, please choose based on your own preferences.). The company offers “tokens” on its products for customers to collect. The collected tokens can be stored in the customers’ accounts via the company website, by entering the corresponding code printed on the products. The stored tokens can be spent in exchange for whatever is offered on the website. Even though I don’t use the tokens for anything, I do want to store them. However, I don’t always enter the codes straight away every time I buy or finish a product. In fact, I almost never do. Instead, I cut out the codes and put them in a box, thinking that one day I will enter them.</p>

<p>Today was one of those days. I decided to “bank” a few tokens by submitting some of the codes. I have accumulated so many that the box I use has become too full! However, some of the cut-out codes have stuck together, and because they were stuck together for so long, the printed codes have merged and faded! An example is shown in the picture above.</p>

<p>So instead of throwing them away (which I totally could have done!), I tried to solve the puzzle. I used the Magnifier app on my iPad to get a closer look and played with contrast and filters on the image.</p>

<p>Imagine my excitement when I finally cracked it and got the codes accepted!</p>

<p>Sometimes tiny wins do make the day.</p>]]></content><author><name>Jess</name></author><category term="puzzles" /><category term="image processing" /><summary type="html"><![CDATA[For many years I have bought milk, butter and other dairy products from a brand called Yeo Valley Organic (Disclaimer: it is purely personal taste that I buy their products. I have no association with the company other than being one of their customers. There are many other brands available; readers, please choose based on your own preferences.). The company offers “tokens” on its products for customers to collect. The collected tokens can be stored in the customers’ accounts via the company website, by entering the corresponding code printed on the products. The stored tokens can be spent in exchange for whatever is offered on the website. Even though I don’t use the tokens for anything, I do want to store them. However, I don’t always enter the codes straight away every time I buy or finish a product. In fact, I almost never do. Instead, I cut out the codes and put them in a box, thinking that one day I will enter them.]]></summary></entry><entry><title type="html">My ZKP Experiment</title><link href="http://localhost:4000/2025/04/19/first-blog.html" rel="alternate" type="text/html" title="My ZKP Experiment" /><published>2025-04-19T01:00:00+01:00</published><updated>2025-04-19T01:00:00+01:00</updated><id>http://localhost:4000/2025/04/19/first-blog</id><content type="html" xml:base="http://localhost:4000/2025/04/19/first-blog.html"><![CDATA[<p>This week I had two (unrelated) meetings with people who work with zk-SNARKs: the first time I have talked to people who actually work with zero-knowledge proofs (zkp) as part of their jobs! I had mixed feelings after the meetings.
On the one hand it was so exciting to talk to people who work with zkp in real life, and so interesting to hear about their applications; on the other, it is a bit intimidating just how much there is to learn in this field, and not all of it is useful. New frameworks, languages and zkVMs have popped up within the last five or six years, created mainly to address two issues: (a) time-consuming proof computation and (b) user-unfriendly, complex proof logic. The underlying maths and cryptography used for zero-knowledge proofs are pretty stable. The problem is that, with languages abstracted further and further away from the proof logic and with the priority on speed, the cost has shifted onto security and privacy-protection properties. This <a href="https://vac.dev/rlog/zkVM-explorations/">article</a> gives a very high-level but direct comparison of existing zkp languages/zkVMs based on their “zk’ness”, which could be useful if you don’t know where to start and privacy is important in your use case.</p>

<p>My first prototype using zkp to tackle carbon emissions claims was written in <a href="https://docs.circom.io/">Circom</a>. The prototype was built for a use case in which a customer of a cloud provider wants to know the carbon emissions attributable to their usage. The customer’s business runs on servers hosted by their cloud provider, and they want to know their <a href="https://ghgprotocol.org/sites/default/files/2022-12/FAQ.pdf">Scope 3 emissions</a>. Existing systems and methodologies for carbon emissions reporting rely on customers either trusting the data from their providers unconditionally or recruiting independent third-party auditors to verify the data. With zkp, customers can automate the verification as frequently as needed. The providers do not need to reveal confidential inputs that go into the emissions accounting; for example, they might not want to reveal their business volume by giving away the total power consumption at any one of their data centres, nor would they want to reveal data related to their electricity suppliers.</p>

<p>There is one tricky bit in this prototype: how can we ensure that the customer’s share of the power consumption is accurate? We could apply the “Completeness Principle” of the <a href="https://ghgprotocol.org/sites/default/files/standards/ghg-protocol-revised.pdf">GreenHouse Gas Protocol</a>, where all sources of emissions have to be accounted for. So we can assume that the divided power consumption must add up to 100% of the total power consumption. Therefore we could make it a requirement that providers also publish a transparency log with encrypted customer data; then we can use homomorphic cryptography to prove that all customer shares, in percent, add up to 100. Moreover, if the data in the log is arranged in a Merkle tree, customers can also verify that they are indeed part of this customer base. Unfortunately this is not bulletproof: providers can still cheat by adding fake customers to the log. I will provide more information about this problem in future blogs.</p>
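<p>To make the Merkle-tree part concrete, here is a minimal Python sketch (my own toy construction, not the prototype’s actual code; the leaf contents are made-up placeholders) of how a customer could check an inclusion proof against the log’s published root:</p>

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 as the tree's hash function."""
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Build a Merkle root over the given leaves."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:            # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def inclusion_proof(leaves, index):
    """Collect (sibling hash, am-I-the-right-child) pairs from leaf to root."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_inclusion(leaf, proof, root):
    """Recompute the root from a leaf and its sibling path."""
    node = h(leaf)
    for sibling, is_right in proof:
        node = h(sibling + node) if is_right else h(node + sibling)
    return node == root

# A customer checks they are part of the published log:
leaves = [b"customer-a:enc-share", b"customer-b:enc-share",
          b"customer-c:enc-share", b"customer-d:enc-share"]
root = merkle_root(leaves)
proof = inclusion_proof(leaves, 2)
print(verify_inclusion(b"customer-c:enc-share", proof, root))  # → True
```

<p>The proof a customer needs grows only with the logarithm of the log size, so even a provider with millions of customers can serve compact inclusion proofs.</p>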

<p>Now back to the proof that all customer shares add up to 100. I can use the Paillier cryptosystem [1] for this. Given that each customer share is encrypted using Paillier, we can then do homomorphic addition to prove that the shares add up to 100 without learning any individual share, hence protecting the private data. This can be done outside of the zkSNARK, but we still need to check that the encrypted share used in the carbon emissions calculation is the right one!</p>
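<p>Here is a toy Python sketch of the homomorphic addition, with insecurely small primes chosen purely for illustration. Multiplying Paillier ciphertexts modulo n² yields an encryption of the sum of the plaintexts, so whoever holds the decryption key (the verifying party in this sketch; how that key is managed is outside its scope) can check that the encrypted shares sum to 100 without seeing any individual share:</p>

```python
import math, random

# Toy Paillier with insecurely small primes -- for illustration only.
p, q = 17, 19
n = p * q                        # public modulus (3072 bits in practice)
n2 = n * n                       # ciphertexts live modulo n^2
g = n + 1                        # standard choice of generator
lam = math.lcm(p - 1, q - 1)     # private key

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # precomputed decryption helper

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:       # r must be coprime to n
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Each customer's share (in percent) is encrypted separately...
shares = [40, 35, 25]
ciphertexts = [encrypt(s) for s in shares]

# ...and homomorphic addition is just multiplying ciphertexts mod n^2.
product = 1
for c in ciphertexts:
    product = (product * c) % n2

print(decrypt(product))  # → 100: the shares sum correctly
```

<p>Note that decrypting the product reveals only the sum, never the individual shares, which is exactly the property we need here.</p>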

<p>To achieve that I added a circuit that performs Paillier encryption on the (private) customer share. In this circuit the encrypted customer share from the transparency log is checked against the customer share encrypted inside the circuit. As it turns out, this encryption is pretty computationally expensive! Paillier ciphertexts are computed modulo the square of n, where n is the product of two prime numbers, and to achieve high security (at least 128-bit <a href="https://www.keylength.com/en/4/">security strength</a>) we need n to be 3072 bits. That is a big number, and it therefore needs to be divided into field elements for the arithmetic operations. The bigger the modulus, the higher the number of constraints generated by the circuit. My laptop cannot complete a run with a modulus size bigger than 200 bits.</p>
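<p>The splitting itself is straightforward outside a circuit. A small Python sketch (my own illustration, not the circuit code) of chopping a big integer into fixed-width limbs and reconstructing it, which is essentially what the circuit has to emulate with field elements:</p>

```python
def to_limbs(x: int, limb_bits: int):
    """Split a big integer into little-endian limbs of limb_bits each."""
    mask = (1 << limb_bits) - 1
    limbs = []
    while x > 0:
        limbs.append(x & mask)
        x >>= limb_bits
    return limbs or [0]

def from_limbs(limbs, limb_bits: int) -> int:
    """Recombine limbs into the original integer."""
    return sum(limb << (i * limb_bits) for i, limb in enumerate(limbs))

# Stand-in value with 3072 bits, like a full-size Paillier modulus:
modulus = 2**3072 - 1157
limbs = to_limbs(modulus, 64)
assert from_limbs(limbs, 64) == modulus
print(len(limbs))  # → 48, i.e. ceil(3072 / 64)
```

<p>Inside the circuit every arithmetic operation on the big number becomes many constrained operations over these limbs, which is where the constraint blow-up comes from.</p>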

<p>I did some benchmarking and plotted the results to find the relationship between the field element sizes and the number of constraints generated. The results show that for the same key size, the more bits packed into each field element, the fewer constraints are generated:</p>

<p><img src="/assets/img/bits_vs_elts.png" alt="Number of constraints generated by various key sizes, broken down into field elements of various bit sizes" /></p>

<p>So how do we improve on this? What is the maximum number of bits that can be packed into a single field element? Tune in to the next blog post!</p>

<p>[1] Paillier, P., 1999, April. Public-key cryptosystems based on composite degree residuosity classes. In International conference on the theory and applications of cryptographic techniques (pp. 223-238). Berlin, Heidelberg: Springer Berlin Heidelberg.</p>]]></content><author><name>Jess</name></author><category term="zero-knowledge proofs" /><category term="zk-SNARKs" /><category term="carbon emissions reporting" /><summary type="html"><![CDATA[Yay! My very first blog post. It is a good feeling to have a personal space to share my work and some thoughts.]]></summary></entry></feed>