Hardware-based tick counters for high-precision benchmarks in Rust

One of the things I love about Rust is that it lets me inline low-level assembly snippets and reach the capabilities of the hardware platform directly. These snippets can be conditionally compiled for the target platform and wrapped in high-level functions for easy use. It's great to be close to the metal and get the most out of the hardware.

Rust is fast, but code is only performant when it is written the right way. Benchmarking plays a crucial role in measuring and optimizing that performance.

While the Rust ecosystem offers excellent libraries for sophisticated benchmarks, there are situations during development and testing when I need a more basic tool for benchmarking specific lines or blocks of code. Rust's standard library provides a user-friendly option with std::time::Instant, which is well-suited for this purpose.
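As a rough illustration of this kind of ad-hoc measurement, the sketch below times a single block of code with std::time::Instant. The helper name time_block is hypothetical and exists only for this example.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `work` once and returns its result together with the elapsed time.
/// `time_block` is a hypothetical helper name used only for this sketch.
fn time_block<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = work();
    (result, start.elapsed())
}

fn main() {
    // black_box keeps the optimizer from treating the loop bound as a constant.
    let (sum, elapsed) = time_block(|| (1..=black_box(1_000_000u64)).sum::<u64>());
    println!("sum = {sum}, elapsed = {elapsed:?}");
}
```

A single run like this is noisy, so in practice the block is usually repeated many times and the samples are compared.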

I experimented with std::time::Instant and observed that its precision is approximately 40 nanoseconds on both Intel x86_64 (Intel® Core™ i7) running Linux and AArch64 (Apple M1 Pro) platforms.
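One way to estimate such a figure (not necessarily how the numbers above were obtained) is to call Instant::now() in a tight loop and keep the smallest non-zero gap between consecutive readings. The function name min_instant_step is an assumption made for this sketch, and the result mixes the clock's resolution with the cost of the call itself.

```rust
use std::time::{Duration, Instant};

/// Returns the smallest non-zero gap seen between consecutive
/// `Instant::now()` readings, or `None` if every gap was zero.
/// `min_instant_step` is a hypothetical name used only for this sketch.
fn min_instant_step(samples: usize) -> Option<Duration> {
    let mut best: Option<Duration> = None;
    let mut prev = Instant::now();
    for _ in 0..samples {
        let now = Instant::now();
        let step = now.duration_since(prev);
        // Ignore zero gaps: they only mean the clock has not advanced yet.
        if !step.is_zero() && best.map_or(true, |b| step < b) {
            best = Some(step);
        }
        prev = now;
    }
    best
}

fn main() {
    match min_instant_step(1_000_000) {
        Some(step) => println!("smallest observed step: {step:?}"),
        None => println!("no non-zero step observed"),
    }
}
```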

However, Intel CPUs (since the Pentium) provide access to the hardware time stamp counter through the RDTSC instruction, and using it could, in theory, offer better precision. I was also curious to see what the Apple platform has to offer for this purpose.

I enjoy delving into low-level details and experimenting with this kind of thing.

On the Apple platform, reading the CNTVCT_EL0 counter-timer register gave results consistent with std::time::Instant, at around 40 nanoseconds.
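A minimal sketch of such a read, assuming the operating system allows user-space access to the counter registers, could look like this. It also reads CNTFRQ_EL0, the counter frequency, to derive the duration of one tick. If that frequency is 24 MHz, as is commonly reported for Apple silicon, one tick lasts about 41.7 ns, which would be in line with the figure above. The function name read_counter is hypothetical, and on other architectures the sketch simply returns None.

```rust
/// Reads the virtual counter (CNTVCT_EL0) and its frequency in Hz (CNTFRQ_EL0).
/// `read_counter` is a hypothetical name used only for this sketch.
#[cfg(target_arch = "aarch64")]
fn read_counter() -> Option<(u64, u64)> {
    let ticks: u64;
    let freq_hz: u64;
    // Assumption: the OS permits reading both registers from user space.
    unsafe {
        core::arch::asm!("mrs {t}, cntvct_el0", t = out(reg) ticks,
            options(nomem, nostack, preserves_flags));
        core::arch::asm!("mrs {f}, cntfrq_el0", f = out(reg) freq_hz,
            options(nomem, nostack, preserves_flags));
    }
    Some((ticks, freq_hz))
}

/// Fallback for every other architecture: the register does not exist there.
#[cfg(not(target_arch = "aarch64"))]
fn read_counter() -> Option<(u64, u64)> {
    None
}

fn main() {
    match read_counter() {
        Some((ticks, freq_hz)) => {
            let ns_per_tick = 1e9 / freq_hz as f64;
            println!("ticks = {ticks}, frequency = {freq_hz} Hz, {ns_per_tick:.2} ns per tick");
        }
        None => println!("CNTVCT_EL0 is only available on AArch64"),
    }
}
```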

On the Intel i7 platform running Linux, however, the results were more interesting.

First, I learned that Rust's standard library already includes the _rdtsc() and __rdtscp() wrapper functions (in std::arch) for x86/x86_64 platforms, and they gave approximately the same results as inline assembly.
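For reference, here is a small sketch of how those two intrinsics can bracket a piece of work. It covers only x86_64 for brevity, assumes the CPU supports RDTSCP, and uses a hypothetical function name, tsc_elapsed. On other architectures it runs the work and returns None.

```rust
/// Runs `work` and returns the number of TSC ticks it took (x86_64 only).
/// `tsc_elapsed` is a hypothetical name used only for this sketch.
#[cfg(target_arch = "x86_64")]
fn tsc_elapsed(work: impl FnOnce()) -> Option<u64> {
    use core::arch::x86_64::{__rdtscp, _rdtsc};
    // Receives the IA32_TSC_AUX value written by RDTSCP; unused here.
    let mut aux = 0u32;
    let start = unsafe { _rdtsc() };
    work();
    // Assumption: the CPU supports the RDTSCP instruction.
    let stop = unsafe { __rdtscp(&mut aux) };
    Some(stop.wrapping_sub(start))
}

/// Fallback: still runs the work, but has no tick counter to report.
#[cfg(not(target_arch = "x86_64"))]
fn tsc_elapsed(work: impl FnOnce()) -> Option<u64> {
    work();
    None
}

fn main() {
    let ticks = tsc_elapsed(|| {
        std::hint::black_box((0..1_000u64).sum::<u64>());
    });
    match ticks {
        Some(t) => println!("elapsed: {t} ticks"),
        None => println!("the TSC intrinsics are only available on x86/x86_64"),
    }
}
```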

Furthermore, with inline assembly on the Intel i7 platform (Linux), my tests showed a tick precision of approximately 0.29 nanoseconds. However, I also observed an overhead of around 17 nanoseconds (about 60 ticks) in my benchmarks, probably due to the CPU cycles consumed by the RDTSC instruction itself and the wrapper functions.
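The two figures are consistent with each other: 60 ticks × 0.29 ns per tick ≈ 17.4 ns. The sketch below shows one possible way to read the counter with inline assembly and to estimate the fixed overhead as the smallest delta between two back-to-back reads. It is an illustration under stated assumptions, not the code from the tick_counter crate: the function names are hypothetical, and the 0.29 ns constant is the value quoted above, which depends on the TSC frequency of the specific machine.

```rust
/// Reads the time stamp counter with inline assembly (x86_64 only).
/// All function names in this sketch are hypothetical.
#[cfg(target_arch = "x86_64")]
fn rdtsc_asm() -> Option<u64> {
    let lo: u32;
    let hi: u32;
    // RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX.
    unsafe {
        core::arch::asm!("rdtsc", out("eax") lo, out("edx") hi,
            options(nomem, nostack, preserves_flags));
    }
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Fallback for architectures without RDTSC.
#[cfg(not(target_arch = "x86_64"))]
fn rdtsc_asm() -> Option<u64> {
    None
}

/// Smallest tick delta between two back-to-back reads,
/// used as a rough estimate of the measurement overhead.
fn min_overhead_ticks(samples: usize) -> Option<u64> {
    let mut best: Option<u64> = None;
    for _ in 0..samples {
        let start = rdtsc_asm()?;
        let stop = rdtsc_asm()?;
        let delta = stop.wrapping_sub(start);
        best = Some(best.map_or(delta, |b| b.min(delta)));
    }
    best
}

/// Converts ticks to nanoseconds for a given tick duration.
fn ticks_to_nanos(ticks: u64, ns_per_tick: f64) -> f64 {
    ticks as f64 * ns_per_tick
}

fn main() {
    // Value quoted in the article; it differs from machine to machine.
    const NS_PER_TICK: f64 = 0.29;
    println!("60 ticks = {:.1} ns", ticks_to_nanos(60, NS_PER_TICK));
    match min_overhead_ticks(10_000) {
        Some(t) => println!("overhead: {t} ticks, {:.1} ns", ticks_to_nanos(t, NS_PER_TICK)),
        None => println!("RDTSC is only available on x86/x86_64"),
    }
}
```

Subtracting the estimated overhead from each measurement is one way to correct short timings.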

I also calculated some statistics and standard deviations using both std::time::Instant and the tick_counter crate.
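A minimal sketch of such a calculation is shown below, here applied to samples collected with std::time::Instant. The function name mean_and_std_dev is hypothetical, and the choice of the sample standard deviation (dividing by n - 1) is an assumption of this example, not something stated above.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Returns (mean, sample standard deviation), or `None` when there are
/// fewer than two samples. `mean_and_std_dev` is a hypothetical name.
fn mean_and_std_dev(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Sample variance: squared deviations divided by n - 1.
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, variance.sqrt()))
}

fn main() {
    // Collect elapsed times, in nanoseconds, for a small piece of work.
    let samples: Vec<f64> = (0..1_000)
        .map(|_| {
            let start = Instant::now();
            black_box((0..black_box(100u64)).sum::<u64>());
            start.elapsed().as_nanos() as f64
        })
        .collect();

    if let Some((mean, std_dev)) = mean_and_std_dev(&samples) {
        println!("mean = {mean:.1} ns, std dev = {std_dev:.1} ns");
    }
}
```

The same function works unchanged on tick deltas converted to f64, which makes the two timing sources easy to compare.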

These results are intriguing and suggest that hardware tick counters offer certain benefits in benchmarking accuracy and stability.

Overall, it has been a rewarding journey.

The source code is available on GitHub: https://github.com/sheroz/tick_counter
