Low-Stack HAETAE for Memory-Constrained Microcontrollers

Published in *eprint*, 2026

We present a low-stack implementation of the module-lattice signature scheme HAETAE, targeting microcontrollers with 8–16 kilobytes of available SRAM. On such devices, peak stack usage is often the main constraint, and HAETAE’s hyperball-based sampler, large transient polynomial vectors, and variable-length signature payloads (hint and high-bits arrays) pose a particular challenge.

To address this, we introduce (i) rejection-aware pass decomposition, (ii) component-level early rejection, and (iii) reverse-order streaming entropy coding using range Asymmetric Numeral Systems (rANS).

Combined with streamed matrix generation, a two-pass hyperball sampler with a streaming Gaussian backend, and row-streamed verification, these techniques reduce peak stack usage for signing from 71–141 kilobytes in the reference implementation to 5.8–6.0 kilobytes, for key generation to 4.7–5.7 kilobytes, and for verification to 4.7–4.8 kilobytes across all three security levels.

Our pure C implementation covers all three security levels (HAETAE-2, HAETAE-3, and HAETAE-5), whose optimization paths differ because of the public-key domain (d > 0 vs. d = 0) and the rejection structure. We evaluate our optimizations on a Nucleo-L4R5ZI board and compare them against the reference pqm4 implementation (for HAETAE-2 and HAETAE-3) and against a recently published memory-optimized implementation that targets only HAETAE-5.

We reduce stack usage for HAETAE-2, HAETAE-3, and HAETAE-5 by 75%, 86%, and 8%, respectively, for key generation; by 92%, 95%, and 24% for signature generation; and by 85%, 91%, and 22% for verification. Depending on the parameter set, this costs at most a 1.8× slowdown for key generation and a 3.4× slowdown for signature generation, while verification even becomes up to 18% faster.

Verification at all security levels fits within 8 kilobytes of RAM (signature buffer plus stack) and is 2.34–3.34× faster than the ML-DSA m4fstack implementation at the corresponding security level. We additionally validate portability under RIOT-OS on ARM Cortex-M4 and RISC-V targets.