Efficient Side-Channel Resilient Post-Quantum Root-of-Trust Design

May 19, 2026

As hinted at above, securing cryptographic algorithms against side-channel attacks usually comes at a significant performance cost. Decomposing sensitive variables into multiple independent shares requires that the functions operating on these variables be decomposed accordingly.

Such subfunctions are more complex than the unshared function they derive from, and they come with specific requirements on the composition of the underlying circuits in terms of gates and randomness. As a result, decomposing even the simplest functions, such as a 2-input AND gate, can result in a circuit that’s 10X to 20X larger and requires multiple cycles to compute its output.
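
To make the cost concrete, here is a minimal sketch of a first-order (two-share) masked AND in the style of the well-known ISW multiplication gadget. It is an illustration only and not the circuit used in OpenTitan; the names `share_bit`, `unshare_bit`, and `masked_and` are hypothetical. A Python model like this can only show functional correctness, not the leakage behavior of real hardware.

```python
import secrets

def share_bit(x):
    """Split bit x into two Boolean shares (s0, s1) with s0 ^ s1 == x."""
    r = secrets.randbits(1)
    return (x ^ r, r)

def unshare_bit(s):
    """Recombine two Boolean shares into the original bit."""
    return s[0] ^ s[1]

def masked_and(a, b):
    """Two-share masked AND (ISW-style sketch, not a verified design).

    The result shares satisfy c0 ^ c1 == (a0 ^ a1) & (b0 ^ b1),
    and the unshared inputs are never computed along the way.
    """
    a0, a1 = a
    b0, b1 = b
    r = secrets.randbits(1)  # fresh randomness for every AND
    c0 = (a0 & b0) ^ r
    # The bracketing fixes the evaluation order so that r is mixed in
    # before the two cross terms are combined.
    c1 = (a1 & b1) ^ ((r ^ (a0 & b1)) ^ (a1 & b0))
    return (c0, c1)
```

A single AND turns into four ANDs, four XORs, and one fresh random bit, and the order of operations matters. Higher masking orders and hardware-specific effects such as glitches push the cost up further.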

This overhead is further amplified if one chooses to implement these shared functions purely in software, where the penalty in terms of code size and running time can be prohibitive, especially on resource-constrained devices.

To reduce the overhead of a shared/masked PQC implementation, we identified the most salient functions that form the basis of shared lattice-based cryptography and offloaded their computation to a set of dedicated accelerators within the OpenTitan Big Number Accelerator (OTBN).

That set contains a shared 32-bit adder and two converters: arithmetic-to-Boolean (A2B) and Boolean-to-arithmetic (B2A). All three circuits are vectorized and can operate on multiple 32-bit words in parallel to amortize their multicycle nature. A secure shared adder is in fact the fundamental building block of the A2B and B2A converters: there are multiple well-established techniques for bootstrapping these converters securely from a single secure adder.
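
As a rough illustration of how both converters can be bootstrapped from one adder, the sketch below models a two-share masked 32-bit ripple-carry adder and derives A2B and B2A from it. This is one textbook-style construction, not necessarily the one implemented in the OTBN accelerators, and every name in it (`masked_add32`, `a2b`, `b2a`, and the helpers) is hypothetical. It demonstrates functional correctness only; a real design needs additional care, such as share refreshing and leakage evaluation.

```python
import secrets

MASK32 = 0xFFFFFFFF

def bool_share(x):
    """Boolean sharing: s0 ^ s1 == x."""
    r = secrets.randbits(32)
    return ((x ^ r) & MASK32, r)

def bool_unshare(s):
    return s[0] ^ s[1]

def arith_share(x):
    """Arithmetic sharing: (s0 + s1) mod 2^32 == x."""
    r = secrets.randbits(32)
    return ((x - r) & MASK32, r)

def arith_unshare(s):
    return (s[0] + s[1]) & MASK32

def masked_xor32(a, b):
    # XOR is linear, so it is applied to each share independently.
    return (a[0] ^ b[0], a[1] ^ b[1])

def masked_and32(a, b):
    # Bitwise two-share masked AND with 32 fresh random bits.
    r = secrets.randbits(32)
    c0 = (a[0] & b[0]) ^ r
    c1 = (a[1] & b[1]) ^ ((r ^ (a[0] & b[1])) ^ (a[1] & b[0]))
    return (c0, c1)

def masked_add32(x, y):
    """Boolean-shared (x + y) mod 2^32 from Boolean-shared x and y."""
    p = masked_xor32(x, y)   # propagate bits
    g = masked_and32(x, y)   # generate bits
    c = (0, 0)               # shared carry word, carry into bit 0 is 0
    for _ in range(31):
        # c_next = (g ^ (p & c)) << 1, computed on shares
        t = masked_xor32(g, masked_and32(p, c))
        c = ((t[0] << 1) & MASK32, (t[1] << 1) & MASK32)
    return masked_xor32(p, c)

def a2b(a):
    """Arithmetic-to-Boolean: Boolean-share each arithmetic share, then add."""
    return masked_add32(bool_share(a[0]), bool_share(a[1]))

def b2a(b):
    """Boolean-to-arithmetic: pick a random a0, compute x - a0 under masking."""
    a0 = secrets.randbits(32)
    z = masked_add32(b, bool_share((-a0) & MASK32))
    # x - a0 is uniformly random because a0 is, so it can serve as a share.
    a1 = z[0] ^ z[1]
    return (a0, a1)
```

The 31 sequential carry iterations, each containing a masked AND, show why such an adder is inherently multicycle and why processing several 32-bit words in parallel helps amortize that latency.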

This architectural choice reflects a strategic balance between performance and flexibility:

Hardware for the known: We have dedicated hardware to handle mask conversion — an operation that’s both computationally costly and theoretically well-understood.
Software for the evolving: By keeping the high-level SCA hardening of ML-DSA in software, we retain the flexibility to adapt to new research. Since side-channel protection for lattice-based schemes is a relatively nascent field, this allows us to update our countermeasures without requiring a full silicon redesign.

Adding these accelerators to OpenTitan reflects a deliberate tradeoff. By increasing the circuit footprint by a modest amount (these three mask-conversion accelerators are small compared to the overall size of the OpenTitan SoC), we’re able, according to preliminary measurements, to keep the performance overhead of a fully masked ML-DSA implementation within the 2X to 4X range. This makes it feasible to use the algorithm in performance-critical applications such as secure boot.

Moreover, the accelerators allow us to significantly reduce the code size of our hardened PQC implementations, which are now only marginally larger than their unhardened counterparts.