<p><strong>Accurate Micro- and Nanosecond Delay in modm</strong></p>
<p>Niklas Hauser, 2021-07-05, http://blog.salkinium.com/modm-delay</p>
<p>Accurately spinning for short and long time durations is an essential part of an embedded application. In the <a href="https://modm.io">modm embedded library</a> we provide blocking delay functions in the resolution of milli-, micro- and even nanoseconds. Let me show you how we used the available hardware to implement a fast, efficient and flexible API that works with thousands of devices all with different clock configurations.</p>
<p>The most prominent uses for blocking delays in modm are during initialization of internal peripherals and external drivers that may require a few micro- to milliseconds to stabilize their hardware, and when bit-banging protocols in software with kHz and MHz baudrates requiring micro- or even nanosecond delay.</p>
<p>The delay functions must be as accurate as possible. In particular they must have the shortest possible overhead and a low error over at least 1s of delay. They must already work before main (during the global constructor calls) and remain accurate if the clock configuration and therefore the CPU frequency dynamically changes. They must also be reentrant so they can be called from inside an interrupt if needed. And lastly they should be compatible with the <code class="language-plaintext highlighter-rouge">std::chrono</code> time units, so that we can pass them literals for ease of use:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="mx">1s</span><span class="p">);</span> <span class="c1">// non-literal version</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="mx">10ms</span><span class="p">);</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_ms</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="mx">100us</span><span class="p">);</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_us</span><span class="p">(</span><span class="mi">100</span><span class="p">);</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="mx">1000ns</span><span class="p">);</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_ns</span><span class="p">(</span><span class="mi">1000</span><span class="p">);</span>
</code></pre></div></div>
<h1 id="computing-cycles">Computing Cycles</h1>
<p>The simplest delay function converts the input time to CPU cycles and then spins in place counting them down. For the conversion we need to know the CPU frequency and have some mechanism for keeping track of elapsed CPU cycles.</p>
<p>For microsecond and longer delays the conversion is simple: <em>1µs = 1MHz<sup>-1</sup></em>, so you can just take the CPU frequency in MHz and multiply it with the input to get the cycles. We store the frequency in a global <code class="language-plaintext highlighter-rouge">uint16_t</code> already scaled down to MHz and initialized with the boot frequency during startup.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// microcontroller boots with a 8MHz clock</span>
<span class="kt">uint16_t</span> <span class="n">fcpu_MHz</span> <span class="o">=</span> <span class="mi">8</span><span class="p">;</span>
<span class="c1">// simple conversion with multiplication</span>
<span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="n">us</span> <span class="o">*</span> <span class="n">fcpu_MHz</span><span class="p">;</span>
</code></pre></div></div>
<p>This works well for frequencies that are whole multiples of 1MHz. However, the STM32L0/L1 microcontrollers, for example, boot at 2.097MHz, which results in a 5% error right after boot. We therefore binary scale the MHz value to achieve a much lower error, which can be done very efficiently with bit shifting:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// multiply MHz with power-of-two 2^5 = 32</span>
<span class="k">constexpr</span> <span class="kt">uint8_t</span> <span class="n">shift</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>
<span class="c1">// 2.097MHz * 32 -> 67 = 2.09375MHz -> ~0.2% error</span>
<span class="n">constinit</span> <span class="kt">uint16_t</span> <span class="n">fcpu_MHz</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">round</span><span class="p">(</span><span class="mf">2.097</span><span class="n">f</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1ul</span> <span class="o"><<</span> <span class="n">shift</span><span class="p">));</span>
<span class="c1">// divide with simple bit shift</span>
<span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="p">(</span><span class="n">us</span> <span class="o">*</span> <span class="n">fcpu_MHz</span><span class="p">)</span> <span class="o">>></span> <span class="n">shift</span><span class="p">;</span>
</code></pre></div></div>
<p>To keep the 32-bit multiplication from overflowing and to maintain at least 1s = 1’000’000µs of delay, we must limit the scaling so that <em>2<sup>32 - shift</sup> / max_fcpu ≥ 1s</em>. A scalar of 32 (shift 5) is only good up to 134MHz, while the fastest STM32H7 running at 480MHz limits the scalar to only 8 (shift 3).</p>
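<p>The overflow limit can be checked with a few lines of host-side C++. This is only a sketch: the helpers <code class="language-plaintext highlighter-rouge">scaled_mhz</code>, <code class="language-plaintext highlighter-rouge">max_shift</code> and <code class="language-plaintext highlighter-rouge">us_to_cycles</code> are hypothetical names made up for illustration and are not part of modm, and it assumes the frequency is given in Hz and rounded to the nearest scaled value.</p>

```cpp
#include <cstdint>

// Hypothetical helper: frequency in Hz -> "MHz * 2^shift", rounded to nearest.
constexpr uint32_t scaled_mhz(uint32_t fcpu_hz, uint8_t shift)
{
    return static_cast<uint32_t>(((uint64_t(fcpu_hz) << shift) + 500'000u) / 1'000'000u);
}

// Hypothetical helper: largest shift for which max_us * scaled frequency
// still fits into 32 bits (capped at 16, an arbitrary limit for this sketch).
constexpr uint8_t max_shift(uint32_t max_fcpu_hz, uint32_t max_us = 1'000'000u)
{
    uint8_t shift = 0;
    while (shift < 16 &&
           uint64_t(scaled_mhz(max_fcpu_hz, uint8_t(shift + 1))) * max_us <= UINT32_MAX)
        ++shift;
    return shift;
}

// Same arithmetic as the conversion above: multiply, then shift back down.
constexpr uint32_t us_to_cycles(uint32_t us, uint32_t fcpu_scaled, uint8_t shift)
{
    return (us * fcpu_scaled) >> shift;
}
```

<p>Under these assumptions <code class="language-plaintext highlighter-rouge">max_shift(134'000'000)</code> evaluates to 5 and <code class="language-plaintext highlighter-rouge">max_shift(480'000'000)</code> to 3, which matches the limits stated above, and a 1s delay at the scaled 2.097MHz boot clock comes out as 2'093'750 cycles.</p>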
<p>For nanosecond delay we need a different algorithm, since the microcontrollers all run below 1GHz, so one CPU cycle is several nanoseconds long. For example, the STM32F7 running at 216MHz will take ~4.6ns per cycle. To get the cycles from a nanosecond input we would need to <em>divide</em>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">ns_per_cycle</span> <span class="o">=</span> <span class="mf">4.6</span><span class="n">f</span><span class="p">;</span>
<span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="n">ns</span> <span class="o">/</span> <span class="n">ns_per_cycle</span><span class="p">;</span>
</code></pre></div></div>
<p>This is obviously way too slow to compute, but we first need to understand how to accurately <em>count</em> cycles to find a better solution to this problem.</p>
<h1 id="counting-cycles">Counting Cycles</h1>
<p>Wouldn’t it be nice if we could just delegate counting cycles to some hardware counter? Well, look no further than the <a href="https://developer.arm.com/documentation/ddi0439/b/Data-Watchpoint-and-Trace-Unit/DWT-functional-description?lang=en">Data Watchpoint and Trace Unit (DWT)</a> and its 32-bit <code class="language-plaintext highlighter-rouge">CYCCNT</code> counter free running at CPU frequency!</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Enable Tracing Debug Unit</span>
<span class="n">CoreDebug</span><span class="o">-></span><span class="n">DEMCR</span> <span class="o">|=</span> <span class="n">CoreDebug_DEMCR_TRCENA_Msk</span><span class="p">;</span>
<span class="c1">// Enable CPU cycle counter</span>
<span class="n">DWT</span><span class="o">-></span><span class="n">CTRL</span> <span class="o">|=</span> <span class="n">DWT_CTRL_CYCCNTENA_Msk</span><span class="p">;</span>
</code></pre></div></div>
<p>By reading <code class="language-plaintext highlighter-rouge">DWT->CYCCNT</code> once at the beginning and then comparing against it constantly in a loop until the number of cycles has passed, we can implement a very simple, yet very accurate delay function:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_us</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">us</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">DWT</span><span class="o">-></span><span class="n">CYCCNT</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="n">fcpu_MHz</span> <span class="o">*</span> <span class="n">us</span> <span class="o">>></span> <span class="n">shift</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">now</span> <span class="o">=</span> <span class="n">DWT</span><span class="o">-></span><span class="n">CYCCNT</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">now</span> <span class="o">-</span> <span class="n">start</span> <span class="o">>=</span> <span class="n">cycles</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Bonus win for this solution: time spent in interrupts during the delay is compensated for, since the hardware counter continues counting throughout.</p>
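<p>The comparison in the loop above also keeps working when the 32-bit counter overflows during the delay, because unsigned subtraction wraps modulo 2<sup>32</sup>. A minimal host-side sketch of that reasoning, with the hypothetical helpers <code class="language-plaintext highlighter-rouge">elapsed_cycles</code> and <code class="language-plaintext highlighter-rouge">delay_done</code> standing in for the two reads of <code class="language-plaintext highlighter-rouge">DWT->CYCCNT</code>:</p>

```cpp
#include <cstdint>

// Cycles elapsed between two reads of a free-running 32-bit counter.
// Hypothetical helper, not part of modm.
constexpr uint32_t elapsed_cycles(uint32_t start, uint32_t now)
{
    return now - start;   // unsigned subtraction wraps modulo 2^32
}

// Same condition as the `if (now - start >= cycles) break;` line.
constexpr bool delay_done(uint32_t start, uint32_t now, uint32_t cycles)
{
    return elapsed_cycles(start, now) >= cycles;
}
```

<p>For example, a delay that starts at <code class="language-plaintext highlighter-rouge">0xFFFFFFF0</code> and is sampled again at <code class="language-plaintext highlighter-rouge">0x00000010</code> has seen <code class="language-plaintext highlighter-rouge">0x20</code> cycles, even though the counter wrapped in between. This only holds as long as the requested delay is shorter than one full counter period.</p>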
<h1 id="counting-loops">Counting Loops</h1>
<p>Unfortunately, the <code class="language-plaintext highlighter-rouge">DWT</code> peripheral is not accessible in all ARMv7-M devices (NRF52 only allows the debugger to access it) and it’s not even implemented on ARM Cortex-M0(+) aka. ARMv6-M devices, so we have to count cycles a different way. We could use the <code class="language-plaintext highlighter-rouge">SysTick->VAL</code>, however it’s just a 24-bit counter, which limits us to ~16.8 million cycles: a ~1s delay at 16MHz or a maximum 35ms delay (!) at 480MHz. In addition, the SysTick is often used for preemptive scheduling (in FreeRTOS) or to create a global clock (for software timers), so we cannot use it as a replacement.</p>
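<p>The numbers above follow directly from the counter width. As a rough sketch (the helper <code class="language-plaintext highlighter-rouge">max_delay_us</code> is a made-up name, and the result is truncated to whole microseconds):</p>

```cpp
#include <cstdint>

// Longest delay in microseconds that a free-running counter of the given
// width can measure before it wraps. Hypothetical helper for illustration.
constexpr uint64_t max_delay_us(unsigned counter_bits, uint32_t fcpu_hz)
{
    return ((uint64_t(1) << counter_bits) * 1'000'000u) / fcpu_hz;
}
```

<p>A 24-bit counter gives 1'048'576µs at 16MHz but only 34'952µs at 480MHz, while a 32-bit counter at 480MHz would still cover about 8.9s. So even without the scheduling conflict, the SysTick is too narrow for a 1s delay on fast devices.</p>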
<p>Instead we will count cycles the old-fashioned way: in a tight assembly loop with a known timing. We use two 16-bit Thumb-2 instructions: <a href="https://www.keil.com/support/man/docs/armasm/armasm_dom1361289908389.htm">subtraction with condition flags update</a> and <a href="https://www.keil.com/support/man/docs/armasm/armasm_dom1361289863797.htm">branch back if positive or zero</a>. They are aligned so they fit into a single 32-bit instruction fetch and fill the pipeline entirely, giving us the maximum performance: 1 cycle for the subtraction and 2 cycles to branch back, so the loop takes 3 cycles in total:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_us</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">us</span><span class="p">)</span> <span class="n">modm_fastcode</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="n">fcpu_MHz</span> <span class="o">*</span> <span class="n">us</span> <span class="o">>></span> <span class="n">shift</span><span class="p">;</span>
<span class="k">asm</span> <span class="k">volatile</span> <span class="p">(</span>
<span class="s">".align 4"</span> <span class="c1">// align for *one* 32-bit instruction fetch</span>
<span class="s">"1: subs %0, %0, #3"</span> <span class="c1">// subtract the loop cycles</span>
<span class="s">"bpl 1b"</span> <span class="c1">// loop while cycles are positive</span>
<span class="o">::</span> <span class="s">"l"</span> <span class="p">(</span><span class="n">cycles</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The instruction fetch timings for executing directly from Flash depend on the CPU speed, the currently configured wait states, the state of the instruction cache (if available and configured) and finally the branch speculation of the cache implementation. We therefore place the entire function into SRAM using the <code class="language-plaintext highlighter-rouge">modm_fastcode</code> attribute, which gives us <em>predictable</em> timings for instruction fetches across all Cortex-M cores, since we’re bypassing the Flash wait states and the (vendor supplied) cache entirely.</p>
<p>Predictable, but not consistent: In my experiments I’ve discovered the loop to take 3 cycles on STM32{F3, G0, G4, L0, L4}, 4 cycles on STM32{L1, F0, F1, F4, F2} and just 1 cycle (!) on STM32F7. The timings depend on the (vendor defined) bus matrix implementation and the system configuration and are mainly about whether the Instruction Bus (I-Code) can access SRAM directly or whether the access is performed by the slower System Bus (S-Bus). The <a href="https://www.st.com/resource/en/reference_manual/dm00031020-stm32f405-415-stm32f407-417-stm32f427-437-and-stm32f429-439-advanced-arm-based-32-bit-mcus-stmicroelectronics.pdf#page=68">STM32F4 reference manual states in section 2.3.1 Embedded SRAM</a>:</p>
<blockquote>
<p>The CPU can access the SRAM1, SRAM2, and SRAM3 through the System Bus or through the I-Code/D-Code buses when boot from SRAM is selected or when physical remap is selected. To get the max performance on SRAM execution, physical remap should be selected (boot or software selection).</p>
</blockquote>
<p>It seems that access through the I-Code takes 2 cycles, but the S-Bus takes 4 cycles, while the Cortex-M7 has a dual-issue pipeline and native instruction cache with native branch prediction, so it’s just… really fast? As confusing as it might be, at least the instruction fetch timing from SRAM is independent of the configured CPU frequency, which allows us to hardcode the loop cycles to subtract as an immediate value encoded in the instruction.</p>
<p>The error is bounded by 3 cycles plus the error of the binary scaling, which together is good enough for our purpose. However, time spent in interrupts is not compensated for, so the real delay may be significantly longer. If an accurate delay is absolutely necessary, it can be wrapped in a <code class="language-plaintext highlighter-rouge">modm::atomic::Lock</code> to disable interrupts during the delay.</p>
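<p>To see where the 3-cycle bound comes from, the loop can be modelled on the host. This is only a sketch under simplifying assumptions: it assumes exactly 3 cycles per iteration, ignores that the final, not-taken branch may be cheaper, and the helpers <code class="language-plaintext highlighter-rouge">loop_iterations</code> and <code class="language-plaintext highlighter-rouge">overshoot</code> are hypothetical names.</p>

```cpp
#include <cstdint>

// Host-side model of the subs/bpl loop: returns how many iterations run.
constexpr uint32_t loop_iterations(int32_t cycles, int32_t cycles_per_loop = 3)
{
    uint32_t iterations = 0;
    do {
        cycles -= cycles_per_loop;   // subs: subtract and update the flags
        ++iterations;
    } while (cycles >= 0);           // bpl: branch back while the result is not negative
    return iterations;
}

// Modelled cycles spent in the loop minus the cycles that were requested.
constexpr int32_t overshoot(int32_t cycles, int32_t cycles_per_loop = 3)
{
    return int32_t(loop_iterations(cycles, cycles_per_loop)) * cycles_per_loop - cycles;
}
```

<p>In this model a request for 9, 10 or 11 cycles always runs 4 iterations, so the loop overshoots by 3, 2 or 1 cycles respectively and never by more than one iteration.</p>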
<h1 id="counting-nanoseconds">Counting Nanoseconds</h1>
<p>To delay for nanoseconds we need to do something a little different, since the naive approach involves division, which would be way too slow. We can, however, approximate this division with a loop of subtractions! So we input the nanoseconds into the <code class="language-plaintext highlighter-rouge">subs bpl</code> loop and subtract the nanoseconds each loop takes. We store this value in SRAM and update it on every clock change:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">constexpr</span> <span class="kt">uint8_t</span> <span class="n">cycles_per_loop</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span> <span class="c1">// 1-4 cycles, depends on device</span>
<span class="c1">// round the nanoseconds to minimize error</span>
<span class="kt">uint16_t</span> <span class="n">ns_per_loop</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">round</span><span class="p">(</span><span class="mf">1e9</span> <span class="o">*</span> <span class="n">cycles_per_loop</span> <span class="o">/</span> <span class="n">fcpu</span><span class="p">);</span>
<span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_ns</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ns</span><span class="p">)</span> <span class="n">modm_fastcode</span>
<span class="p">{</span>
<span class="k">asm</span> <span class="k">volatile</span> <span class="p">(</span>
<span class="s">".align 4"</span> <span class="c1">// align for *one* 32-bit instruction fetch</span>
<span class="s">"1: subs %0, %0, %1"</span> <span class="c1">// subtract the nanoseconds per loop</span>
<span class="s">"bpl 1b"</span> <span class="c1">// loop while nanoseconds are positive</span>
<span class="o">::</span> <span class="s">"l"</span> <span class="p">(</span><span class="n">ns</span><span class="p">),</span> <span class="s">"l"</span> <span class="p">(</span><span class="n">ns_per_loop</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This works, but there is a large overhead until execution arrives at the loop. The reason is that the compiler uses a <a href="https://www.keil.com/support/man/docs/armasm/armasm_dom1361289865686.htm"><code class="language-plaintext highlighter-rouge">bl</code> (branch and link) instruction</a> to jump to an address encoded as an <em>immediate value</em>. This is fast and efficient, but it limits us to a relative address range of ±16MB, and our delay function in SRAM is waaaaay out there (SRAM starts @0x20000000 vs Flash @0x08000000). So the linker has to add a veneer that does nothing but jump further, by loading the target address from a literal directly into the program counter:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> modm::delay_ns(ns);
8000214: f002 fbf4 bl 8002a00 <___ZN4modm8delay_nsEm_veneer>
08002a00 <___ZN4modm8delay_nsEm_veneer>:
8002a00: f85f f000 ldr.w pc, [pc] ; 8002a04
8002a04: 20000189 .word 0x20000189
20000188 <_ZN4modm8delay_nsEm>:
void modm_fastcode modm::delay_ns(uint32_t us)
</code></pre></div></div>
<p>Since Flash access is very slow (up to a dozen wait states for fast devices), vendors supply a cache implementation with a large, but limited buffer size (the STM32F4 cache has 64 cache lines of 128-bit = 1kB!). So the jump to a veneer outside of the 1kB range spends many cycles just waiting on the Flash and this time depends on the current clock configuration. Can we do better? Yes, with inline assembly!</p>
<p>We move the actual implementation to <code class="language-plaintext highlighter-rouge">modm::platform::delay_ns</code> and then use a forced-inline forwarding function that uses the <a href="https://www.keil.com/support/man/docs/armasm/armasm_dom1361289866046.htm"><code class="language-plaintext highlighter-rouge">blx</code> instruction</a> to jump there directly instead of through a veneer:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">modm_always_inline</span>
<span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_ns</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ns</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span>
<span class="s">"mov r0, %0"</span> <span class="c1">// Pass the argument in r0 as per EABI</span>
<span class="s">"blx %1"</span> <span class="c1">// Jump there directly</span>
<span class="o">::</span> <span class="s">"r"</span> <span class="p">(</span><span class="n">ns</span><span class="p">),</span> <span class="s">"l"</span> <span class="p">(</span><span class="n">modm</span><span class="o">::</span><span class="n">platform</span><span class="o">::</span><span class="n">delay_ns</span><span class="p">)</span> <span class="o">:</span> <span class="s">"r0"</span><span class="p">,</span> <span class="s">"r1"</span><span class="p">,</span> <span class="s">"r2"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This reduces the overhead by eliminating the unnecessary jump and loading a literal from Flash that’s stored much closer to the execution site (here it’s just <code class="language-plaintext highlighter-rouge">#148</code> bytes away) and therefore most likely already in the cache:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> modm::delay_ns(ns);
80002c6: 4c25 ldr r4, [pc, #148] ; 800035c
80002ca: 4628 mov r0, r5
80002cc: 47a0 blx r4
800035c: 200001a9 .word 0x200001a9
200001a8 <_ZN4modm8platform8delay_nsEm>:
void modm_fastcode modm::platform::delay_ns(uint32_t us)
</code></pre></div></div>
<p>However, we still need to compensate for this overhead: even if it’s just a few cycles, there should not be an offset in the delay function. To have maximum control we declare the function to be naked and implement the whole function in inline assembly:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">__attribute__</span><span class="p">((</span><span class="kr">naked</span><span class="p">,</span> <span class="n">aligned</span><span class="p">(</span><span class="mi">4</span><span class="p">)))</span> <span class="n">modm_fastcode</span>
<span class="n">modm</span><span class="o">::</span><span class="n">platform</span><span class="o">::</span><span class="n">delay_ns</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ns</span><span class="p">)</span> <span class="c1">// passed in r0</span>
<span class="p">{</span>
<span class="k">asm</span> <span class="k">volatile</span> <span class="p">(</span>
<span class="s">"ldr r2, =ns_per_loop"</span> <span class="c1">// address of ns_per_loop</span>
<span class="s">"ldrh r2, [r2, #0]"</span> <span class="c1">// load the actual 16-bit ns_per_loop value</span>
<span class="s">"lsls r1, r2, #2"</span> <span class="c1">// approximate overhead in ns by shifting</span>
<span class="s">"subs r0, r0, r1"</span> <span class="c1">// subtract the overhead in nanoseconds</span>
<span class="s">"1: subs r0, r0, r2"</span> <span class="c1">// subtract the nanoseconds per loop</span>
<span class="s">"bpl 1b"</span> <span class="c1">// loop while nanoseconds are positive</span>
<span class="s">"bx lr"</span> <span class="c1">// return to execution</span>
<span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The overhead is measured experimentally and expressed in loops, which we can convert to nanoseconds by multiplying by the <code class="language-plaintext highlighter-rouge">ns_per_loop</code> variable. However, the <a href="https://www.keil.com/support/man/docs/armasm/armasm_dom1361289882394.htm"><code class="language-plaintext highlighter-rouge">mul</code> instruction</a> requires passing the operands in registers, which would require an additional <a href="https://www.keil.com/support/man/docs/armasm/armasm_dom1361289878994.htm"><code class="language-plaintext highlighter-rouge">mov</code> instruction</a> to put the value into a register, so instead we use the <a href="https://www.keil.com/support/man/docs/armasm/armasm_dom1361289876185.htm"><code class="language-plaintext highlighter-rouge">lsl</code> instruction</a> to shift the value left with the same effect. This limits the “overhead loop count” to powers of two, which in practice is not an issue.</p>
<p>In the above code we’re using 4 loops as overhead (so about 12-16 cycles at 3-4 cycles per loop), which is equivalent to shifting left by 2, hence the <code class="language-plaintext highlighter-rouge">#2</code> immediate value in the <code class="language-plaintext highlighter-rouge">lsl</code> instruction.</p>
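<p>The arithmetic of the naked function can be replayed on the host to get a feeling for the numbers. This is a model only, not the modm implementation: <code class="language-plaintext highlighter-rouge">ns_per_loop</code> is written here as a hypothetical integer function instead of the SRAM variable, <code class="language-plaintext highlighter-rouge">delay_ns_iterations</code> is a made-up name, and the overhead of 4 loops is taken from the text above.</p>

```cpp
#include <cstdint>

// Integer version of std::round(1e9 * cycles_per_loop / fcpu).
constexpr uint16_t ns_per_loop(uint32_t fcpu_hz, uint8_t cycles_per_loop)
{
    return uint16_t((uint64_t(1'000'000'000u) * cycles_per_loop + fcpu_hz / 2) / fcpu_hz);
}

// Host-side model of the assembly: returns the number of loop iterations.
constexpr uint32_t delay_ns_iterations(uint32_t ns, uint16_t per_loop,
                                       uint8_t overhead_shift = 2)
{
    int32_t remaining = int32_t(ns);
    remaining -= int32_t(per_loop) << overhead_shift;   // lsls + subs: remove overhead
    uint32_t iterations = 0;
    do {
        remaining -= per_loop;   // subs r0, r0, r2
        ++iterations;
    } while (remaining >= 0);    // bpl 1b
    return iterations;
}
```

<p>At 216MHz with 3 cycles per loop this gives 14ns per loop, so a 1000ns request first loses 56ns to the overhead compensation and then spins for 68 iterations in this model. Very short requests fall below the overhead and still execute the loop once.</p>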
<h1 id="counting-cycles-on-avr">Counting Cycles on AVR</h1>
<p>AVRs cannot change their CPU frequency at runtime; instead it is fixed at compile time via the <code class="language-plaintext highlighter-rouge">F_CPU</code> macro, so we don’t have to worry about that. The avr-libc provides implementations of <code class="language-plaintext highlighter-rouge">_delay_ms(double)</code> and <code class="language-plaintext highlighter-rouge">_delay_us(double)</code> in the <a href="https://www.nongnu.org/avr-libc/user-manual/group__util__delay.html"><code class="language-plaintext highlighter-rouge"><util/delay.h></code> header</a>. However, <a href="https://www.nongnu.org/avr-libc/user-manual/delay_8h_source.html">the implementations use floating point math to calculate the delay cycles</a> for runtime arguments. But fear not, for there is a very sternly worded warning against passing a dynamic value to this incredibly powerful foot gun:</p>
<blockquote>
<p>In order for these functions to work as intended, compiler optimizations must be enabled, and the delay time must be an expression that is a known constant at compile-time. If these requirements are not met, the resulting delay will be much longer (and basically unpredictable), and applications that otherwise do not use floating-point calculations will experience severe code bloat by the floating-point library routines linked into the application.</p>
</blockquote>
<p>Of course this is a completely unacceptable implementation, since avr-gcc provides <a href="https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html"><code class="language-plaintext highlighter-rouge">__builtin_constant_p()</code></a> to detect constant arguments and together with <a href="https://gcc.gnu.org/onlinedocs/gcc/AVR-Built-in-Functions.html"><code class="language-plaintext highlighter-rouge">__builtin_avr_delay_cycles(uint32_t)</code></a> can generate very accurate delays down to a single cycle for constant inputs at any clock rate.</p>
<p>For a delay with a runtime argument we can loop over a 1ms or 1us constant delay and compensate for the loop overhead:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">modm_always_inline</span> <span class="c1">// <- must be force inlined to work</span>
<span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_ms</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ms</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">__builtin_constant_p</span><span class="p">(</span><span class="n">ms</span><span class="p">)</span> <span class="o">?</span> <span class="p">({</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="n">ceil</span><span class="p">((</span><span class="n">F_CPU</span> <span class="o">*</span> <span class="kt">double</span><span class="p">(</span><span class="n">ms</span><span class="p">))</span> <span class="o">/</span> <span class="mf">1e3</span><span class="p">);</span>
<span class="n">__builtin_avr_delay_cycles</span><span class="p">(</span><span class="n">cycles</span><span class="p">);</span>
<span class="p">})</span> <span class="o">:</span> <span class="p">({</span>
<span class="k">while</span><span class="p">(</span><span class="n">ms</span><span class="o">--</span><span class="p">)</span> <span class="n">__builtin_avr_delay_cycles</span><span class="p">((</span><span class="n">F_CPU</span> <span class="o">/</span> <span class="mf">1e3</span><span class="p">)</span> <span class="o">-</span> <span class="mi">10</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">}</span>
<span class="n">modm_always_inline</span>
<span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_us</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">us</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">__builtin_constant_p</span><span class="p">(</span><span class="n">us</span><span class="p">)</span> <span class="o">?</span> <span class="p">({</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="n">ceil</span><span class="p">((</span><span class="n">F_CPU</span> <span class="o">*</span> <span class="kt">double</span><span class="p">(</span><span class="n">us</span><span class="p">))</span> <span class="o">/</span> <span class="mf">1e6</span><span class="p">);</span>
<span class="n">__builtin_avr_delay_cycles</span><span class="p">(</span><span class="n">cycles</span><span class="p">);</span>
<span class="p">})</span> <span class="o">:</span> <span class="p">({</span>
<span class="c1">// slightly lower overhead due to 16-bit delay vvv</span>
<span class="k">while</span><span class="p">(</span><span class="n">us</span><span class="o">--</span><span class="p">)</span> <span class="n">__builtin_avr_delay_cycles</span><span class="p">((</span><span class="n">F_CPU</span> <span class="o">/</span> <span class="mf">1e6</span><span class="p">)</span> <span class="o">-</span> <span class="mi">6</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For dynamic nanosecond delay we approximate the division again with a shift, however, this time without multiplication, since that operation is very expensive on AVRs (dozens of cycles). The shift value is computed at compile time by rounding to the nearest power-of-two. The result is passed to the 4-cycle <code class="language-plaintext highlighter-rouge">_delay_loop_2(uint16_t)</code>, which does the actual delay. This solution only yields accurate delays at 16MHz (shift 8), 8MHz (shift 9) and 4MHz (shift 10), and has a significant error plus additional overhead of a few cycles for shifts > 8. It’s also limited to 24-bits of input or about 16ms. It’s not an ideal solution, but all other ideas yielded significantly worse results incl. using the Cortex-M method of subtraction in a loop.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">modm_always_inline</span>
<span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay_ns</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">ns</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">__builtin_constant_p</span><span class="p">(</span><span class="n">ns</span><span class="p">)</span> <span class="o">?</span> <span class="p">({</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="n">ceil</span><span class="p">((</span><span class="n">F_CPU</span> <span class="o">*</span> <span class="kt">double</span><span class="p">(</span><span class="n">ns</span><span class="p">))</span> <span class="o">/</span> <span class="mf">1e9</span><span class="p">);</span>
<span class="n">__builtin_avr_delay_cycles</span><span class="p">(</span><span class="n">cycles</span><span class="p">);</span>
<span class="p">})</span> <span class="o">:</span> <span class="p">({</span>
<span class="k">const</span> <span class="kt">uint16_t</span> <span class="n">loops</span> <span class="o">=</span> <span class="n">ns</span> <span class="o">>></span> <span class="mi">8</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">loops</span><span class="p">)</span> <span class="n">_delay_loop_2</span><span class="p">(</span><span class="n">loops</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>
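<p>The snippet above hard-codes the shift for 16MHz. As a rough sketch of how the shift could be derived from <code class="language-plaintext highlighter-rouge">F_CPU</code> at compile time, the hypothetical helper <code class="language-plaintext highlighter-rouge">nearest_shift()</code> below (not part of modm, and assuming that "nearest" is meant on a linear scale) picks the power of two closest to the nanoseconds per 4-cycle loop:</p>

```cpp
#include <cstdint>

// Hypothetical helper for illustration only, not the modm implementation.
// Returns the shift whose power of two is closest to the duration of one
// delay loop in nanoseconds, so that (ns >> shift) approximates the loop count.
constexpr std::uint8_t nearest_shift(std::uint32_t f_cpu_hz,
                                     std::uint32_t cycles_per_loop = 4)
{
    // exact divisor, e.g. 4 cycles / 16MHz = 250ns per loop
    const std::uint64_t ns_per_loop =
        (1'000'000'000ull * cycles_per_loop) / f_cpu_hz;
    std::uint8_t shift = 0;
    // the midpoint between 2^shift and 2^(shift+1) is 1.5 * 2^shift:
    // move up one power of two while the divisor is at or above it
    while (2ull * ns_per_loop >= (3ull << shift)) ++shift;
    return shift;
}

static_assert(nearest_shift(16'000'000) == 8,  "250ns per loop -> divide by 256");
static_assert(nearest_shift( 8'000'000) == 9,  "500ns per loop -> divide by 512");
static_assert(nearest_shift( 4'000'000) == 10, "1000ns per loop -> divide by 1024");
```

<p>The three checked frequencies are the ones named above, where the divisor happens to sit within a few percent of a power of two. For other clock rates the same helper still returns a shift, but the approximation error grows accordingly.</p>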
<h1 id="using-stdchrono">Using std::chrono</h1>
<p>We want these functions to be compatible with <code class="language-plaintext highlighter-rouge">using namespace std::chrono_literals</code>, so we overload the <code class="language-plaintext highlighter-rouge">modm::delay()</code> function with the appropriate durations. The conversion gets completely inlined and optimized away, so even for dynamic arguments there’s no code generated. A notable exception is the millisecond delay on Cortex-M, which gets converted to microseconds via a fast multiplication.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o"><</span><span class="k">class</span> <span class="nc">Rep</span><span class="p">></span>
<span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o"><</span><span class="n">Rep</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">nano</span><span class="o">></span> <span class="n">ns</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="k">auto</span> <span class="n">ns_</span><span class="p">{</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration_cast</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">nanoseconds</span><span class="o">></span><span class="p">(</span><span class="n">ns</span><span class="p">)};</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay_ns</span><span class="p">(</span><span class="n">ns_</span><span class="p">.</span><span class="n">count</span><span class="p">());</span>
<span class="p">}</span>
<span class="k">template</span><span class="o"><</span><span class="k">class</span> <span class="nc">Rep</span><span class="p">></span>
<span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o"><</span><span class="n">Rep</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">micro</span><span class="o">></span> <span class="n">us</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">const</span> <span class="k">auto</span> <span class="n">us_</span><span class="p">{</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration_cast</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">microseconds</span><span class="o">></span><span class="p">(</span><span class="n">us</span><span class="p">)};</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay_us</span><span class="p">(</span><span class="n">us_</span><span class="p">.</span><span class="n">count</span><span class="p">());</span>
<span class="p">}</span>
<span class="k">template</span><span class="o"><</span><span class="k">class</span> <span class="nc">Rep</span><span class="p">></span>
<span class="kt">void</span> <span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration</span><span class="o"><</span><span class="n">Rep</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">milli</span><span class="o">></span> <span class="n">ms</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// converted to us on Cortex-M, but AVR just forwards to modm::delay_ms</span>
<span class="k">const</span> <span class="k">auto</span> <span class="n">us</span><span class="p">{</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">duration_cast</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="o">::</span><span class="n">microseconds</span><span class="o">></span><span class="p">(</span><span class="n">ms</span><span class="p">)};</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay_us</span><span class="p">(</span><span class="n">us</span><span class="p">.</span><span class="n">count</span><span class="p">());</span>
<span class="p">}</span>
</code></pre></div></div>
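<p>To see the overload selection without any hardware, here is a minimal host-side sketch. The namespace <code class="language-plaintext highlighter-rouge">sketch</code> and the recording stubs are assumptions made for illustration; they only stand in for the real delay functions and follow the Cortex-M variant that converts milliseconds to microseconds:</p>

```cpp
#include <chrono>
#include <cstdint>

namespace sketch
{
// Stand-ins for the hardware delay functions: they only record the request.
std::uint32_t last_ns = 0;
std::uint32_t last_us = 0;
void delay_ns(std::uint32_t ns) { last_ns = ns; }
void delay_us(std::uint32_t us) { last_us = us; }

// One overload per period, the representation type is deduced.
template<class Rep>
void delay(std::chrono::duration<Rep, std::nano> ns)
{
    delay_ns(static_cast<std::uint32_t>(ns.count()));
}
template<class Rep>
void delay(std::chrono::duration<Rep, std::micro> us)
{
    delay_us(static_cast<std::uint32_t>(us.count()));
}
template<class Rep>
void delay(std::chrono::duration<Rep, std::milli> ms)
{
    // milliseconds are forwarded as microseconds (multiplication by 1000)
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(ms);
    delay_us(static_cast<std::uint32_t>(us.count()));
}
} // namespace sketch

// Usage with literals: each call picks the overload matching its unit.
inline void demo()
{
    using namespace std::chrono_literals;
    sketch::delay(1000ns); // records last_ns = 1000
    sketch::delay(10ms);   // records last_us = 10000
}
```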
<h1 id="evaluation">Evaluation</h1>
<p>We can test the performance of our delay functions with <code class="language-plaintext highlighter-rouge">DWT->CYCCNT</code> on ARMv7-M, where the measurement itself has a fixed 4-cycle overhead:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">DWT</span><span class="o">-></span><span class="n">CYCCNT</span><span class="p">;</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="n">time</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">stop</span> <span class="o">=</span> <span class="n">DWT</span><span class="o">-></span><span class="n">CYCCNT</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="p">(</span><span class="n">stop</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">-</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// 4 cycles overhead</span>
</code></pre></div></div>
<p>ARMv6-M has no DWT module, so we use <code class="language-plaintext highlighter-rouge">SysTick->VAL</code> instead. The value counts down (!) and gets reloaded from <code class="language-plaintext highlighter-rouge">SysTick->LOAD</code> on underrun. We need to make sure the underrun does not happen during the measurement, so we reload <code class="language-plaintext highlighter-rouge">SysTick->VAL</code> right before it. The 24-bit value limits our measurement duration to 262ms @ 64MHz (the fastest ARMv6-M tested).</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SysTick</span><span class="o">-></span><span class="n">VAL</span> <span class="o">=</span> <span class="n">SysTick</span><span class="o">-></span><span class="n">LOAD</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">SysTick</span><span class="o">-></span><span class="n">VAL</span><span class="p">;</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="n">time</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">stop</span> <span class="o">=</span> <span class="n">SysTick</span><span class="o">-></span><span class="n">VAL</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="p">(</span><span class="n">start</span> <span class="o">-</span> <span class="n">stop</span><span class="p">)</span> <span class="o">-</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// swapped subtraction!</span>
</code></pre></div></div>
<p>And finally on AVRs we use the 16-bit Timer/Counter 1, which limits the measurement duration (but not the delay functions) to 4ms @16MHz.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="kt">uint16_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">TCNT1</span><span class="p">;</span>
<span class="n">modm</span><span class="o">::</span><span class="n">delay</span><span class="p">(</span><span class="n">time</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">uint16_t</span> <span class="n">stop</span> <span class="o">=</span> <span class="n">TCNT1</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">uint16_t</span> <span class="n">cycles</span> <span class="o">=</span> <span class="p">(</span><span class="n">stop</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span> <span class="o">-</span> <span class="mi">4</span><span class="p">;</span>
</code></pre></div></div>
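<p>The measurement limits quoted above follow directly from the counter width and the counter clock. The helpers below are hypothetical and only sketch that arithmetic, assuming the counters run at CPU frequency without a prescaler:</p>

```cpp
#include <cstdint>

// Hypothetical helper: longest interval in microseconds that a free-running
// counter of the given width can measure before it wraps around.
constexpr std::uint64_t max_window_us(unsigned counter_bits,
                                      std::uint32_t f_counter_hz)
{
    return ((1ull << counter_bits) * 1'000'000ull) / f_counter_hz;
}

static_assert(max_window_us(24, 64'000'000) == 262'144, "SysTick @ 64MHz: ~262ms");
static_assert(max_window_us(16, 16'000'000) == 4'096,   "Timer 1 @ 16MHz: ~4ms");

// For an up-counter a single wrap-around during the measurement is harmless,
// because the unsigned subtraction is performed modulo 2^16.
constexpr std::uint16_t elapsed16(std::uint16_t start, std::uint16_t stop)
{
    return static_cast<std::uint16_t>(stop - start);
}

static_assert(elapsed16(65'530, 10) == 16, "wraps once, still 16 ticks");
```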
<p>In total 20 devices were tested by passing the <code class="language-plaintext highlighter-rouge">modm::delay_ns()</code> function durations from 0ns to 10000ns in 10ns steps. The Cortex-M devices were tested once at boot frequency and then again at their highest frequency.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Device</th>
<th style="text-align: left">Core Type</th>
<th style="text-align: center">Cycles per Loop</th>
<th style="text-align: center">Minimum Cycles at Boot/High Frequency</th>
<th style="text-align: right">Minimum Delay at Boot Frequency</th>
<th style="text-align: right">Minimum Delay at High Frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">ATMEGA2560</td>
<td style="text-align: left">avr8</td>
<td style="text-align: center">4</td>
<td style="text-align: center">16</td>
<td style="text-align: right">1000ns @ 16 MHz</td>
<td style="text-align: right"> </td>
</tr>
<tr>
<td style="text-align: left">SAMD21</td>
<td style="text-align: left">cm0+</td>
<td style="text-align: center">3</td>
<td style="text-align: center">15</td>
<td style="text-align: right"> </td>
<td style="text-align: right">312ns @ 48 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F072</td>
<td style="text-align: left">cm0</td>
<td style="text-align: center">4</td>
<td style="text-align: center">18/19</td>
<td style="text-align: right">1125ns @ 16 MHz</td>
<td style="text-align: right">395ns @ 48 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F091</td>
<td style="text-align: left">cm0</td>
<td style="text-align: center">4</td>
<td style="text-align: center">18/19</td>
<td style="text-align: right">1125ns @ 16 MHz</td>
<td style="text-align: right">395ns @ 48 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F103</td>
<td style="text-align: left">cm3</td>
<td style="text-align: center">4</td>
<td style="text-align: center">16</td>
<td style="text-align: right">2000ns @ 8 MHz</td>
<td style="text-align: right">250ns @ 64 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F303</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">3</td>
<td style="text-align: center">13</td>
<td style="text-align: right">1625ns @ 8 MHz</td>
<td style="text-align: right">203ns @ 64 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F334</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">3</td>
<td style="text-align: center">13</td>
<td style="text-align: right">1625ns @ 8 MHz</td>
<td style="text-align: right">203ns @ 64 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F401</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">4</td>
<td style="text-align: center">16</td>
<td style="text-align: right">1000ns @ 16 MHz</td>
<td style="text-align: right">190ns @ 84 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F411</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">4</td>
<td style="text-align: center">16</td>
<td style="text-align: right">1000ns @ 16 MHz</td>
<td style="text-align: right">166ns @ 96 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F429</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">4</td>
<td style="text-align: center">16</td>
<td style="text-align: right">1000ns @ 16 MHz</td>
<td style="text-align: right">95ns @ 168 MHz*</td>
</tr>
<tr>
<td style="text-align: left">STM32F446</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">4</td>
<td style="text-align: center">16</td>
<td style="text-align: right">1000ns @ 16 MHz</td>
<td style="text-align: right">88ns @ 180 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F469</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">4</td>
<td style="text-align: center">16</td>
<td style="text-align: right">1000ns @ 16 MHz</td>
<td style="text-align: right">88ns @ 180 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32F746</td>
<td style="text-align: left">cm7fd</td>
<td style="text-align: center">4</td>
<td style="text-align: center">17</td>
<td style="text-align: right">1062ns @ 16 MHz</td>
<td style="text-align: right">78ns @ 216 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32G071</td>
<td style="text-align: left">cm0+</td>
<td style="text-align: center">3</td>
<td style="text-align: center">16/18</td>
<td style="text-align: right">1000ns @ 16 MHz</td>
<td style="text-align: right">281ns @ 64 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32G474</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">3</td>
<td style="text-align: center">17/21</td>
<td style="text-align: right">1062ns @ 16 MHz</td>
<td style="text-align: right">123ns @ 170 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32H743</td>
<td style="text-align: left">cm7fd</td>
<td style="text-align: center">4</td>
<td style="text-align: center">19</td>
<td style="text-align: right">296ns @ 64 MHz</td>
<td style="text-align: right">47ns @ 400 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32L031</td>
<td style="text-align: left">cm0+</td>
<td style="text-align: center">3</td>
<td style="text-align: center">16/17</td>
<td style="text-align: right">7629ns @ 2.097 MHz</td>
<td style="text-align: right">531ns @ 32 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32L152</td>
<td style="text-align: left">cm3</td>
<td style="text-align: center">4</td>
<td style="text-align: center">16/17</td>
<td style="text-align: right">7629ns @ 2.097 MHz</td>
<td style="text-align: right">531ns @ 32 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32L432</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">3</td>
<td style="text-align: center">13/15</td>
<td style="text-align: right">812ns @ 16 MHz</td>
<td style="text-align: right">162ns @ 80 MHz</td>
</tr>
<tr>
<td style="text-align: left">STM32L476</td>
<td style="text-align: left">cm4f</td>
<td style="text-align: center">3</td>
<td style="text-align: center">13/15</td>
<td style="text-align: right">812ns @ 16 MHz</td>
<td style="text-align: right">312ns @ 48 MHz*</td>
</tr>
</tbody>
</table>
<center>(* lower than maximum due to software limitations)</center>
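<p>The minimum delay columns are simply the minimum cycle count divided by the clock frequency, truncated to whole nanoseconds. A small sketch of that conversion, with a hypothetical helper name, reproduces a few of the table entries:</p>

```cpp
#include <cstdint>

// Hypothetical helper: converts a cycle count at a given CPU frequency
// into whole nanoseconds (truncating, like the values in the table).
constexpr std::uint32_t min_delay_ns(std::uint32_t cycles, std::uint32_t f_cpu_hz)
{
    return static_cast<std::uint32_t>((cycles * 1'000'000'000ull) / f_cpu_hz);
}

static_assert(min_delay_ns(16,  16'000'000) == 1000, "ATMEGA2560 @ 16MHz");
static_assert(min_delay_ns(19,  48'000'000) == 395,  "STM32F072 @ 48MHz");
static_assert(min_delay_ns(16,   2'097'000) == 7629, "STM32L031 @ 2.097MHz");
static_assert(min_delay_ns(19, 400'000'000) == 47,   "STM32H743 @ 400MHz");
```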
<p>The absolute minimum delay we can achieve is ~50ns and only on the STM32H7 with a very fast clock. You can clearly see the effects of the additional flash wait-states despite the cache on some devices after switching to high frequency.</p>
<p><img invertible="" src="ns_boot.svg" /></p>
<p>The graph of nanosecond delay at boot frequency shows several interesting points:</p>
<ul>
<li>The above mentioned minimum delays are very clear, particularly the ~7600ns minimum delay for the STM32L0 and STM32L1.</li>
<li>An offset error for STM32L0/L1 with different stepping coarseness.</li>
<li>A ~600ns offset error on AVR: This is not surprising as our implementation does not compensate for the calling overhead at all.</li>
<li>A 2.5% error on AVR: At 16MHz the correct divider would be 250 for a 4-cycle loop, however, we’re shifting 8 = divide by 256, which is a 2.5% error. For other frequencies this error will be much higher.</li>
<li>An offset error on STM32F7: There are some cache effects at work that do not allow for precise control of the overhead. We’ve therefore optimized the overhead for high frequencies.</li>
<li>A fast boot clock of 64MHz on the STM32H7 resulting in the lowest minimum delay at boot, however, with a ~3% error over time.</li>
<li>The coarseness of the stepping varies, showing the effect of different clock speeds and cycles per loop.</li>
<li>Most implementations follow the ideal delay line very closely.</li>
</ul>
<p><img invertible="" src="ns_high.svg" /></p>
<p>The graph of nanosecond delay at high frequency shows that all implementations follow the ideal delay very precisely with no significant offset or error.</p>
<p>The notable exceptions are the Cortex-M7 devices STM32F7 with ~7.5% error and STM32H7 with a whopping 25% error. Our delay implementation has a 1-cycle loop on Cortex-M7 due to the built-in L1 cache and branch prediction. Running at 400MHz a 1-cycle loop takes 2.5ns, which gets rounded up to 3ns, and this rounded value is then subtracted on every pass through the loop, thus yielding this error. This creates an interesting failure mode for this delay algorithm: At around 667MHz the error is highest at 50%, since a delay of 1.5ns per loop (= 1 cycle / 667MHz) must be rounded to either 1ns or 2ns.</p>
<p>The delay implementation on other devices has the same problem, however, since the loop takes 3-4 cycles the error is much smaller. For example, the 3-cycle loop on the STM32G4 running at a comparable 170MHz takes ~17.6ns (= 3 cycles / 170MHz) ≈ 18ns per loop, which is an error of just ~2%. Similarly, the 4-cycle loop on the 64MHz STM32F1 takes 62.5ns (= 4 cycles / 64MHz) ≈ 63ns with an error of ~1%.</p>
<p>It becomes clear that the subtraction spreads the rounding error over 3-4 cycles, which essentially functions as a fractional integer division. So an easy fix for this error on the Cortex-M7 is to lengthen the loop with some NOPs to reduce the overall error at the cost of resolution. Since two 4B-aligned NOPs get folded by the pipeline into one cycle, we need to add six NOPs to get a 4-cycle loop:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "1: subs r0, r0, r2" // subtract the nanoseconds per loop
"nop" // folded into previous subs
"nop" // +1 cycle
"nop" // folded into previous nop
"nop" // +1 cycle
"nop" // folded into previous nop
"nop" // +1 cycle folded into next bpl
"bpl 1b" // loop while nanoseconds are positive
</code></pre></div></div>
<p>With this fix the error is reduced to a maximum of ~6% @ 533MHz (=7.5ns ≈ 8ns), which is much more acceptable. For the STM32F7 @216MHz the new error is ~2.5% and for the STM32H7 @400MHz the error is ~0%. This is comparable to all the other devices.</p>
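<p>The rounding contribution to these errors can be estimated on a host machine. The sketch below uses hypothetical helper names and only models the rounding of the nanoseconds per loop to a whole number; measured values in the graphs may include other effects:</p>

```cpp
#include <cmath>
#include <cstdint>

// Nanoseconds per loop, rounded to whole nanoseconds as described above.
inline std::uint16_t ns_per_loop(std::uint8_t cycles_per_loop, double fcpu_hz)
{
    return static_cast<std::uint16_t>(std::round(1e9 * cycles_per_loop / fcpu_hz));
}

// Relative delay error caused only by that rounding: the loop runs
// ns / rounded iterations, but each iteration really lasts `exact` ns.
// Negative means the delay is too short. Assumes the rounded value is not 0.
inline double rounding_error(std::uint8_t cycles_per_loop, double fcpu_hz)
{
    const double exact = 1e9 * cycles_per_loop / fcpu_hz;
    return exact / ns_per_loop(cycles_per_loop, fcpu_hz) - 1.0;
}

// rounding_error(1, 216e6) is about -7.4% (1-cycle loop on the STM32F7)
// rounding_error(4, 216e6) is about -2.5% (4-cycle loop on the STM32F7)
// rounding_error(4, 533e6) is about -6.2% (worst case named above)
// rounding_error(4, 400e6) is exactly 0  (10ns per loop on the STM32H7)
```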
<p><img invertible="" src="ns_high_detail.svg" /></p>
<p>A detailed version of the nanosecond delay graph at high frequencies from 0ns to 1000ns shows the same properties as the boot frequency graph, however with much smaller minimum delays and stepping.</p>
<p><img invertible="" src="us_boot.svg" /></p>
<p>For completeness we’ve also measured microsecond delay from 0us to 1000us at boot frequency. The results have almost no error over time due to our fractional frequency encoding, however, we don’t compensate for the calling overhead in the non-DWT implementation, therefore ARMv6-M devices have a slight offset error. In the future this could be improved if required.</p>
<p>The microsecond delay measurements at high frequency show no errors at all and are therefore omitted.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Very accurate delays, even at nanosecond resolution, are possible on AVR and Cortex-M devices if the call overhead is compensated and the error over time is bounded. The delay implementations are not as trivial as expected, but with some simple tricks they can be made to work very well.</p>
<p><a href="https://xkcd.com/598"><img dimmable="" src="xkcd.png" /></a></p>
<p>The code presented here is slightly simplified, so please also check the real delay implementations inside modm:</p>
<ul>
<li><a href="https://github.com/modm-io/modm/blob/develop/src/modm/platform/core/avr/delay_impl.hpp.in">AVR <code class="language-plaintext highlighter-rouge">modm::delay_us</code> and <code class="language-plaintext highlighter-rouge">modm::delay_ns</code></a></li>
<li><a href="https://github.com/modm-io/modm/blob/develop/src/modm/platform/core/cortex/delay.cpp.in">Cortex-M <code class="language-plaintext highlighter-rouge">modm::delay_us</code> using DWT</a></li>
<li><a href="https://github.com/modm-io/modm/blob/develop/src/modm/platform/core/cortex/delay_ns.cpp.in">Cortex-M <code class="language-plaintext highlighter-rouge">modm::delay_us</code> and <code class="language-plaintext highlighter-rouge">modm::delay_ns</code> using Cycle Counting</a></li>
</ul>
<p>The <a href="https://github.com/modm-io/modm/blob/develop/examples/generic/delay/main.cpp">example used to measure the delay in hardware can be found here</a>.</p>
<p>The <a href="https://github.com/salkinium/blog/tree/master/_posts/modm-delay">data of all measurements and graphing scripts can be found here</a>.</p>
<p>Special thanks to <a href="https://github.com/chris-durand">Christopher Durand</a> for helping with the measurements!</p>Niklas Hauserniklas@salkinium.comAccurately spinning for short and long time durations is an essential part of an embedded application. In the modm embedded library we provide blocking delay functions in the resolution of milli-, micro- and even nanoseconds. Let me show you how we used the available hardware to implement a fast, efficient and flexible API that works with thousands of devices all with different clock configurations. The most prominent uses for blocking delays in modm are during initialization of internal peripherals and external drivers that may require a few micro- to milliseconds to stabilize their hardware, and when bit-banging protocols in software with kHz and MHz baudrates requiring micro- or even nanosecond delay. The delay functions must be as accurate as possible. In particular they must have the shortest possible overhead and a low error over at least 1s of delay. They must already work before main (during the global constructor calls) and remain accurate if the clock configuration and therefore the CPU frequency dynamically changes. They must also be reentrant so they can be called from inside an interrupt if needed. And lastly they should be compatible with the std::chrono time units, so that we can pass them literals for ease of use: modm::delay(1s); // non-literal version modm::delay(10ms); modm::delay_ms(10); modm::delay(100us); modm::delay_us(100); modm::delay(1000ns); modm::delay_ns(1000); Computing Cycles The simplest delay function converts the input time to CPU cycles and then spins in place counting them down. For the conversion we need to know the CPU frequency and have some mechanism for keeping track of elapsed CPU cycles. For microsecond and longer delays the conversion is simple: 1µs = 1MHz-1, so you can just take the CPU frequency in MHz and multiply it with the input to get the cycles. 
We store the frequency in a global uint16_t already scaled down to MHz and initialized with the boot frequency during startup. // microcontroller boots with a 8MHz clock uint16_t fcpu_MHz = 8; // simple conversion with multiplication uint32_t cycles = us * fcpu_MHz; This works well for frequencies that divide 1MHz cleanly, however, the STM32L0/L1 microcontrollers boot at 2.097MHz for example, which results in a 5% error right after boot. We therefore binary scale the MHz value to achieve a much lower error, which can be done very efficiently with bit shifting: // multiply MHz with power-of-two 2^5 = 32 constexpr uint8_t shift = 5; // 2.097MHz * 32 -> 67 = 2.09375MHz -> ~0.2% error constinit uint16_t fcpu_MHz = std::round(2.097f * (1ul << shift)); // divide with simple bit shift uint32_t cycles = (us * fcpu_MHz) >> shift; To keep the 32-bit multiplication from overflowing and to maintain at least 1s = 1’000’000µs of delay, we must limit the scaling so that 232 - shift / max_fcpu ≥ 1s. A scalar of 32 (shift 5) is only good up to 134MHz, while the fastest STM32H7 running at 480MHz limits the scalar to only 8 (shift 3). For nanosecond delay we need a different algorithm, since the microcontrollers all run below 1GHz so one CPU cycle is several nanoseconds long. For example, the STM32F7 runnning at 216MHz will take ~4.6ns per cycle. To get the cycles from a nanosecond input we would need to divide: float ns_per_cycle = 4.6f; uint32_t cycles = ns / ns_per_cycle; This is obviously way too slow to compute, but we first need to understand how to accurately count cycles to find a better solution to this problem. Counting Cycles Wouldn’t it be nice if we could just delegate counting cycles to some hardware counter? Well, look no further than the Data Watchpoint and Trace Unit (DWT) and its 32-bit CYCCNT counter free running at CPU frequency! 
// Enable Tracing Debug Unit CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // Enable CPU cycle counter DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk; By reading DWT->CYCCNT once at the beginning and then comparing this constantly in a loop until the number of cycles have passed, we can implement a very simple, yet very accurate delay function: void modm::delay_us(uint32_t us) { const uint32_t start = DWT->CYCCNT; const uint32_t cycles = fcpu_MHz * us >> shift; while (true) { const uint32_t now = DWT->CYCCNT; if (now - start >= cycles) break; } } Bonus win for this solution: time spent in interrupts during the delay is compensated for, since the hardware counter continues counting throughout. Counting Loops Unfortunately, the DWT peripheral is not accessible in all ARMv7-M devices (NRF52 only allows the debugger to access it) and it’s not even implemented on ARM Cortex-M0(+) aka. ARMv6-M devices, so we have to count cycles a different way. We could use the SysTick->VAL, however it’s just a 24-bit counter, which limits us to ~16.8 million cycles: a ~1s delay at 16MHz or a maximum 35ms delay (!) at 480MHz. In addition, the SysTick is often used for preemptive scheduling (in FreeRTOS) or to create a global clock (for software timers), so we cannot use it as a replacement. Instead we will count cycles the old fashioned way: in a tight assembly loop with a known timing. We use two 16-bit Thumb-2 instructions: subtraction with condition flags update and branch back if positive. 
They are aligned so they fit into a single 32-bit instruction fetch and fill the pipeline entirely, giving us the maximum performance: 1 cycle for the subtraction and 2-cycles to branch back, so the loop takes 3 cycles total: void modm::delay_us(uint32_t us) modm_fastcode { const uint32_t cycles = fcpu_MHz * us >> shift; asm volatile ( ".align 4" // align for *one* 32-bit instruction fetch "1: subs %0, %0, #3" // subtract the loop cycles "bpl 1b" // loop while cycles are positive :: "l" (cycles)); } The instruction fetch timings for executing directly from Flash depends on the CPU speed, the currently configured wait states and the state of the instruction cache (if available and configured) and finally the branch speculation of the cache implementation. We therefore place the entire function into SRAM using the modm_fastcode attribute, which gives us predictable timings for instruction fetches across all Cortex-M cores, since we’re bypassing the Flash wait states and the (vendor supplied) cache entirely. Predictable, but not consistent: In my experiments I’ve discovered the loop to take 3 cycles on STM32{F3, G0, G4, L0, L4}, 4 cycles on STM32{L1, F0, F1, F4, F2} and just 1 cycle (!) on STM32F7. The timings depend on the (vendor defined) bus matrix implementation and the system configuration and are mainly about whether the Instruction Bus (I-Code) can access SRAM directly or whether the access is performed by the slower System Bus (S-Bus). The STM32F4 reference manual states in section 2.3.1 Embedded SRAM: The CPU can access the SRAM1, SRAM2, and SRAM3 through the System Bus or through the I-Code/D-Code buses when boot from SRAM is selected or when physical remap is selected. To get the max performance on SRAM execution, physical remap should be selected (boot or software selection). 
It seems that access through the I-Code takes 2-cycles, but the S-Bus takes 4-cycles, while the Cortex-M7 has a dual issue pipeline and native instruction cache with native branch prediction, so it’s just… really fast ? As confusing as it might be, at least the instruction fetch timing from SRAM is independent from the configured CPU frequency, which allows us to hardcode the loop cycles to subtract as an immediate value encoded in the instruction. The upper bound on the error is at most 3 cycles plus the error of the binary scaling, which together is good enough for our purpose. However, interrupts are not compensated, so the real delay may be significantly longer. If an accurate delay is absolutely necessary it can be wrapped into modm::atomic::Lock to disable interrupts during the delay. Counting Nanoseconds To delay for nanoseconds we need to do something a little different, since the naive approach involves division, which would be way too slow. We can, however, approximate this division with a loop of subtractions! So we input the nanoseconds into the subs bpl loop and subtract the nanoseconds each loop takes. We store this value in SRAM and update it on every clock change: constexpr uint8_t cycles_per_loop = 3; // 1-4 cycles, depends on device // round the nanoseconds to minimize error uint16_t ns_per_loop = std::round(1e9 * cycles_per_loop / fcpu); void modm::delay_ns(uint32_t ns) modm_fastcode { asm volatile ( ".align 4" // align for *one* 32-bit instruction fetch "1: subs %0, %0, %1" // subtract the nanoseconds per loop "bpl 1b" // loop while nanoseconds are positive :: "l" (ns), "l" (ns_per_loop)); } This works, however, there is a large overhead until execution arrives at the loop. The reason is that the compiler uses a bl (branch and link) instruction to jump to an address encoded as an immediate value. 
This is fast and efficient, however, it limits us to a relative address range of ±16MBs and our delay function in SRAM is waaaaay out there (SRAM starts @0x20000000 vs Flash @0x08000000). So the linker has to add a veneer, that does nothing else but jump further by loading the address into a register and loading it into the program counter therefore jumping: modm::delay_ns(ns); 8000214: f002 fbf4 bl 8002a00 <___ZN4modm8delay_nsEm_veneer> 08002a00 <___ZN4modm8delay_nsEm_veneer>: 8002a00: f85f f000 ldr.w pc, [pc] ; 8002a04 8002a04: 20000189 .word 0x20000189 20000188 <_ZN4modm8delay_nsEm>: void modm_fastcode modm::delay_ns(uint32_t us) Since Flash access is very slow (up to a dozen wait states for fast devices), vendors supply a cache implementation with a large, but limited buffer size (the STM32F4 cache has 64 cache lines of 128-bit = 1kB!). So the jump to a veneer outside of the 1kB range spends many cycles just waiting on the Flash and this time depends on the current clock configuration. Can we do better? Yes, with inline assembly! 
We move the actual implementation to modm::platform::delay_ns and then use a forced-inline forwarding function that uses the blx instruction to jump there directly instead of through a veneer: modm_always_inline void modm::delay_ns(uint32_t ns) { asm volatile( "mov r0, %0" // Pass the argument in r0 as per EABI "blx %1" // Jump there directly :: "r" (ns), "l" (modm::platform::delay_ns) : "r0", "r1", "r2"); } This reduces the overhead by eliminating the unnecessary jump and loading a literal from Flash that’s stored much closer to the execution site (here its just #148 bytes away) and therefore most likely already in the cache: modm::delay_ns(ns); 80002c6: 4c25 ldr r4, [pc, #148] ; 800035c 80002ca: 4628 mov r0, r5 80002cc: 47a0 blx r4 800035c: 200001a9 .word 0x200001a9 200001a8 <_ZN4modm8platform8delay_nsEm>: void modm_fastcode modm::platform::delay_ns(uint32_t us) However, we still need to actually compensate for this overhead, even if it’s just a few cycles, there should not be an offset in the delay function. To have maximum control we declare the function to be naked and implement the whole function in inline assembly: void __attribute__((naked, aligned(4))) modm_fastcode modm::platform::delay_ns(uint32_t ns) // passed in r0 { asm volatile ( "ldr r2, =ns_per_loop" // address of ns_per_loop "ldrh r2, [r2, #0]" // load the actual 16-bit ns_per_loop value "lsls r1, r2, #2" // approximate overhead in ns by shifting "subs r0, r0, r1" // subtract the overhead in nanoseconds "1: subs r0, r0, r2" // subtract the nanoseconds per loop "bpl 1b" // loop while nanoseconds are positive "bx lr" // return to execution ); } The overhead is measured experimentally and expressed in loops, which we can convert to nanoseconds by multiplying with the ns_per_loop variable. 
However, the mul instruction requires passing the operands in registers, which would require an additional mov instruction to put the value into a register, so instead we use the lsl instruction to shift the value left with the same effect. This limits the “overhead loop count” to powers of two, which in practice is not an issue. In the above code we’re using 4 loops as overhead (so about 12-16 cycles at 3-4 cycles per loop), which is equivalent to shifting left by 2, hence the #2 immediate value in the lsl instruction.

Counting Cycles on AVR

AVRs cannot change their CPU frequency at runtime, instead it is fixed at compile time via the F_CPU macro, so we don’t have to worry about that. The avr-libc provides implementations of _delay_ms(double) and _delay_us(double) in the <util/delay.h> header. However, the implementations use floating point math to calculate the delay cycles for runtime arguments. But fear not, for there is a very sternly worded warning against passing a dynamic value to this incredibly powerful foot gun:

“In order for these functions to work as intended, compiler optimizations must be enabled, and the delay time must be an expression that is a known constant at compile-time. If these requirements are not met, the resulting delay will be much longer (and basically unpredictable), and applications that otherwise do not use floating-point calculations will experience severe code bloat by the floating-point library routines linked into the application.”

Of course this is a completely unacceptable implementation, since avr-gcc provides __builtin_constant_p() to detect constant arguments, which together with __builtin_avr_delay_cycles(uint32_t) can generate very accurate delays down to a single cycle for constant inputs at any clock rate.
For a delay with a runtime argument we can loop over a 1ms or 1us constant delay and compensate for the loop overhead:

    modm_always_inline // <- must be force inlined to work
    void modm::delay_ms(uint32_t ms)
    {
        __builtin_constant_p(ms) ? ({
            const uint32_t cycles = ceil((F_CPU * double(ms)) / 1e3);
            __builtin_avr_delay_cycles(cycles);
        }) : ({
            while(ms--) __builtin_avr_delay_cycles((F_CPU / 1e3) - 10);
        });
    }

    modm_always_inline
    void modm::delay_us(uint32_t us)
    {
        __builtin_constant_p(us) ? ({
            const uint32_t cycles = ceil((F_CPU * double(us)) / 1e6);
            __builtin_avr_delay_cycles(cycles);
        }) : ({
            // slightly lower overhead due to 16-bit delay         vvv
            while(us--) __builtin_avr_delay_cycles((F_CPU / 1e6) - 6);
        });
    }

For dynamic nanosecond delay we approximate the division again with a shift, however, this time without multiplication, since that operation is very expensive on AVRs (dozens of cycles). The shift value is computed at compile time by rounding to the nearest power-of-two. The result is passed to the 4-cycle _delay_loop_2(uint16_t), which does the actual delay. This solution only yields accurate delays at 16MHz (shift 8), 8MHz (shift 9) and 4MHz (shift 10), and has a significant error plus additional overhead of a few cycles for shifts > 8. It’s also limited to 24 bits of input or about 16ms. It’s not an ideal solution, but all other ideas yielded significantly worse results, incl. using the Cortex-M method of subtraction in a loop.

    modm_always_inline
    void modm::delay_ns(uint32_t ns)
    {
        __builtin_constant_p(ns) ? ({
            const uint32_t cycles = ceil((F_CPU * double(ns)) / 1e9);
            __builtin_avr_delay_cycles(cycles);
        }) : ({
            const uint16_t loops = ns >> 8;
            if (loops) _delay_loop_2(loops);
        });
    }

Using std::chrono

We want these functions to be compatible with using namespace std::chrono_literals, so we overload the modm::delay() function with the appropriate durations. The conversion gets completely inlined and optimized away, so even for dynamic arguments there’s no code generated.
A notable exception is the millisecond delay on Cortex-M, which gets converted to microseconds via a fast multiplication.

    template<class Rep>
    void modm::delay(std::chrono::duration<Rep, std::nano> ns)
    {
        const auto ns_{std::chrono::duration_cast<std::chrono::nanoseconds>(ns)};
        modm::delay_ns(ns_.count());
    }

    template<class Rep>
    void modm::delay(std::chrono::duration<Rep, std::micro> us)
    {
        const auto us_{std::chrono::duration_cast<std::chrono::microseconds>(us)};
        modm::delay_us(us_.count());
    }

    template<class Rep>
    void modm::delay(std::chrono::duration<Rep, std::milli> ms)
    {
        // converted to us on Cortex-M, but AVR just forwards to modm::delay_ms
        const auto us{std::chrono::duration_cast<std::chrono::microseconds>(ms)};
        modm::delay_us(us.count());
    }

Evaluation

We can test the performance of our delay functions with DWT->CYCCNT on ARMv7-M, which has a fixed 4 cycle overhead:

    const uint32_t start = DWT->CYCCNT;
    modm::delay(time);
    const uint32_t stop = DWT->CYCCNT;
    const uint32_t cycles = (stop - start) - 4; // 4 cycles overhead

ARMv6-M has no DWT module, so we use the SysTick->VAL instead. The value counts down (!) and gets reloaded to SysTick->LOAD on underrun. We need to make sure the underrun does not happen during measurement, so we reload SysTick->VAL before it. The 24-bit value limits our measurement duration to 262ms @ 64MHz (the fastest ARMv6-M tested).

    SysTick->VAL = SysTick->LOAD;
    const uint32_t start = SysTick->VAL;
    modm::delay(time);
    const uint32_t stop = SysTick->VAL;
    const uint32_t cycles = (start - stop) - 4; // swapped subtraction!

And finally on AVRs we use the 16-bit Timer/Counter 1, which limits the measurement duration (but not the delay functions) to 4ms @ 16MHz.

    const uint16_t start = TCNT1;
    modm::delay(time);
    const uint16_t stop = TCNT1;
    const uint16_t cycles = (stop - start) - 4;

In total 20 devices were tested by passing the modm::delay_ns() function durations from 0ns to 10000ns in 10ns steps.
The Cortex-M devices were tested once at boot frequency and then again at their highest frequency.

Device | Core Type | Cycles per Loop | Minimum Cycles at Boot/High Frequency | Minimum Delay at Boot Frequency | Minimum Delay at High Frequency
ATMEGA2560 | avr8 | 4 | 16 | 1000ns @ 16 MHz |
SAMD21 | cm0+ | 3 | 15 | 312ns @ 48 MHz |
STM32F072 | cm0 | 4 | 18/19 | 1125ns @ 16 MHz | 395ns @ 48 MHz
STM32F091 | cm0 | 4 | 18/19 | 1125ns @ 16 MHz | 395ns @ 48 MHz
STM32F103 | cm3 | 4 | 16 | 2000ns @ 8 MHz | 250ns @ 64 MHz
STM32F303 | cm4f | 3 | 13 | 1625ns @ 8 MHz | 203ns @ 64 MHz
STM32F334 | cm4f | 3 | 13 | 1625ns @ 8 MHz | 203ns @ 64 MHz
STM32F401 | cm4f | 4 | 16 | 1000ns @ 16 MHz | 190ns @ 84 MHz
STM32F411 | cm4f | 4 | 16 | 1000ns @ 16 MHz | 166ns @ 96 MHz
STM32F429 | cm4f | 4 | 16 | 1000ns @ 16 MHz | 95ns @ 168 MHz*
STM32F446 | cm4f | 4 | 16 | 1000ns @ 16 MHz | 88ns @ 180 MHz
STM32F469 | cm4f | 4 | 16 | 1000ns @ 16 MHz | 88ns @ 180 MHz
STM32F746 | cm7fd | 4 | 17 | 1062ns @ 16 MHz | 78ns @ 216 MHz
STM32G071 | cm0+ | 3 | 16/18 | 1000ns @ 16 MHz | 281ns @ 64 MHz
STM32G474 | cm4f | 3 | 17/21 | 1062ns @ 16 MHz | 123ns @ 170 MHz
STM32H743 | cm7fd | 4 | 19 | 296ns @ 64 MHz | 47ns @ 400 MHz
STM32L031 | cm0 | 3 | 16/17 | 7629ns @ 2.097 MHz | 531ns @ 32 MHz
STM32L152 | cm3 | 4 | 16/17 | 7629ns @ 2.097 MHz | 531ns @ 32 MHz
STM32L432 | cm4f | 3 | 13/15 | 812ns @ 16 MHz | 162ns @ 80 MHz
STM32L476 | cm4f | 3 | 13/15 | 812ns @ 16 MHz | 312ns @ 48 MHz*

(* lower than maximum due to software limitations)

The absolute minimum delay we can achieve is ~50ns and only on the STM32H7 with a very fast clock. You can clearly see the effects of the additional flash wait-states despite the cache on some devices after switching to high frequency.

The graph of nanosecond delay at boot frequency shows several interesting points:

- The above mentioned minimum delays are very clear, particularly the ~7600ns minimum delay for the STM32L0 and STM32L1.
- An offset error for STM32L0/L1 with different stepping coarseness.
- A ~600ns offset error on AVR: This is not surprising as our implementation does not compensate for the calling overhead at all.
- A 2.5% error on AVR: At 16MHz the correct divider would be 250 for a 4-cycle loop, however, we’re shifting by 8 = divide by 256, which is a 2.5% error. For other frequencies this error will be much higher.
- An offset error on STM32F7: There are some cache effects at work that do not allow for precise control of the overhead. We’ve therefore optimized the overhead for high frequencies.
- A fast boot clock of 64MHz on the STM32H7 resulting in the lowest minimum delay at boot, however, with a ~3% error over time.
- The coarseness of the stepping varies, showing the effect of different clock speeds and cycles per loop.
- Most implementations follow the ideal delay line very closely.

The graph of nanosecond delay at high frequency shows that all implementations follow the ideal delay very precisely with no significant offset or error. The notable exceptions are the Cortex-M7 devices STM32F7 with ~7.5% error and STM32H7 with a whopping 25% error. Our delay implementation has a 1-cycle loop on Cortex-M7 due to the built-in L1 cache and branch prediction. Running at 400MHz a 1-cycle loop takes 2.5ns, which gets rounded up to 3ns, which is then subtracted on every 1-cycle loop, thus yielding this error.

This creates an interesting failure mode for this delay algorithm: At around 667MHz the error is highest at 50%, since a 1.5ns per loop (=1 cycle/667MHz) delay must be rounded to either 1ns or 2ns. The delay implementation on other devices has the same problem, however, since the loop takes 3-4 cycles the error is much smaller. For example, the 3-cycle loop on the STM32G4 running at a comparable 170MHz takes ~17.6ns (=3 cycles/170MHz) ≈ 18ns per loop, which is an error of just ~2%. In contrast, the 4-cycle loop on the 64MHz STM32F1 takes 62.5ns (=4 cycles/64MHz) ≈ 63ns with an error of ~1%. It becomes clear that the subtraction spreads the rounding error over 3-4 cycles, which essentially functions as a fractional integer division.
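To make the rounding argument concrete, here is a minimal sketch that recomputes the per-loop rounding error from the clock frequency and the loop length. The helper names exact_ns_per_loop and rounding_error are made up for this illustration and are not part of modm. The model only rounds to the nearest nanosecond and ignores call overhead and cache effects, so it will not reproduce the measured errors exactly.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Exact duration of one loop iteration in nanoseconds.
// Hypothetical helper for illustration, not a modm function.
inline double exact_ns_per_loop(uint32_t cycles_per_loop, double cpu_hz)
{
    return cycles_per_loop * 1e9 / cpu_hz;
}

// Relative error introduced by storing the loop duration as a whole
// number of nanoseconds (rounded to nearest, an assumption of this sketch).
inline double rounding_error(uint32_t cycles_per_loop, double cpu_hz)
{
    const double exact = exact_ns_per_loop(cycles_per_loop, cpu_hz);
    const double rounded = std::round(exact);
    return std::fabs(rounded - exact) / exact;
}
```

With this model a 3-cycle loop at 170MHz gives 18ns instead of ~17.6ns (2%), a 4-cycle loop at 64MHz gives 63ns instead of 62.5ns (0.8%), and a 1-cycle loop at 400MHz gives 3ns instead of 2.5ns, which is a far larger relative error than on the slower cores.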
So an easy fix for this error on the Cortex-M7 is to lengthen the loop with some NOPs to reduce the overall error at the cost of resolution. Since two 4B-aligned NOPs get folded by the pipeline into one cycle, we need to add six NOPs to get a 4-cycle loop:

    "1:  subs r0, r0, r2  \n\t"  // subtract the nanoseconds per loop
        "nop              \n\t"  // folded into previous subs
        "nop              \n\t"  // +1 cycle
        "nop              \n\t"  // folded into previous nop
        "nop              \n\t"  // +1 cycle
        "nop              \n\t"  // folded into previous nop
        "nop              \n\t"  // +1 cycle folded into next bpl
        "bpl 1b           \n\t"  // loop while nanoseconds are positive

With this fix the error is reduced to a maximum of ~6% @ 533MHz (=7.5ns ≈ 8ns), which is much more acceptable. For the STM32F7 @ 216MHz the new error is ~2.5% and for the STM32H7 @ 400MHz the error is ~0%. This is comparable to all the other devices.

A detailed version of the nanosecond delay graph at high frequencies from 0ns to 1000ns shows the same properties as the boot frequency graph, however, with much smaller minimal delays and stepping.

For completeness we’ve also measured microsecond delay from 0us to 1000us at both boot and high frequency. The results have almost no error over time due to our fractional frequency encoding, however, we don’t compensate for calling overhead in the non-DWT implementation, therefore ARMv6-M devices have a slight offset error. In future this could be improved if required. The microsecond delay measurements at high frequency show no errors at all and are therefore omitted.

Conclusion

Very accurate delays even at nanosecond resolution on AVR and Cortex-M devices are possible if the call overhead is compensated and the error over time is bounded. However, the delay implementations are not as trivial as expected, but with some simple tricks can be made to work very well.

The code presented here is slightly simplified, so please also check the real delay implementations inside modm:

- AVR modm::delay_us and modm::delay_ns.
- Cortex-M modm::delay_us using DWT
- Cortex-M modm::delay_us and modm::delay_ns using Cycle Counting

The example used to measure the delay in hardware can be found here. The data of all measurements and graphing scripts can be found here.

Special thanks to Christopher Durand for helping with the measurements!

Introducing modm-devices: hardware descriptions for AVR and STM32 devices
2018-03-07T00:00:00+01:00
http://blog.salkinium.com/modm-devices
<p>For the last 2 years <a href="https://github.com/dergraaf">Fabian Greif</a> and I have been
working on a secret project called <a href="https://github.com/modm-io/">modm: a toolkit for data-driven code generation</a>.
In a nutshell, we feed detailed hardware description data for almost all AVR
and STM32 targets into a code generator to create a C++ Hardware Abstraction
Layer (HAL), startup & linkerscript code, documentation and support tools.</p>
<p>This isn’t exactly a new idea, after all very similar ideas have been floating
around before, most notably in the Linux Kernel with its
<a href="https://www.devicetree.org">Device Tree (DT) effort</a>. In fact, modm itself is based
entirely on <a href="https://github.com/roboterclubaachen/xpcc">xpcc</a> which matured the
idea of data-driven HAL generation in the first place.</p>
<p>However, for modm we focused on what goes on behind the scenes: how to <em>acquire</em>
detailed target description data and how to <em>use</em> it with reasonable effort.
We now have a toolbox that transcends its use as our C++ HAL generator and
instead can be applied generically to any project in any language
(*awkwardly winks at the Rust community*). That’s pretty powerful stuff.</p>
<p>So let me first ease you into this topic with some historic background and then
walk you through the data sources we use and the design decisions of our data
engine.
All with plenty of examples for you to follow along, just stay well clear of
those hairy yaks in the distance.</p>
<h2 id="the-origin-story">The Origin Story</h2>
<p>All the usual suspects in this case were members of the
<a href="http://www.roboterclub.rwth-aachen.de/">Roboterclub Aachen e. V.</a>
(<a href="https://twitter.com/RCA_eV">@RCA_eV</a>). Around 2006 the team surrounding
Fabian had built a communication library called RCCP for doing remote procedure
calls over CAN. Back then the only affordable microcontrollers were AVRs, but
neither were they powerful enough to perform all the computations needed for
autonomy nor did they have enough pins to interface with all the motors and
sensors we stuffed in our robots. So an embedded PC programmed in various
languages did all the heavy lifting and talked via CAN to the AVR actuators and sensors.</p>
<p>(It has been passed on for many generations of robot builders, that the embedded
PC did a disk check once during its boot process, which rendered the robot
unresponsive for a few minutes. Unfortunately it did this during a
<a href="http://www.eurobot.org">Eurobot</a> finals game and we lost due to that.
Since then our robots don’t have a kernel in their critical path anymore.)</p>
<p>RCCP was eventually refactored into the Cross Platform Component Communication
(XPCC) library and open-sourced on Sourceforge in 2009.
Around 2012 when Fabian was leaving us to go work on satellites at the German
space agency (DLR), I took over stewardship of the project and moved it over to
<a href="https://github.com/roboterclubaachen/xpcc">GitHub where it exists to this day</a>.
It’s the foundation of all the RCA’s robots.</p>
<h3 id="from-avr-to-stm32">From AVR to STM32</h3>
<p>By the time I joined in 2010, the team had been using C++ on AVRs for years.
Around 2012 we finally outgrew the AVRs used to control our autonomous robots
and switched over to Arm Cortex-M devices, specifically the STM32 series. So
began the cumbersome task of porting the HAL that worked so well on the AVRs to
the STM32F1 and F4 families, both of which have much more capable peripherals.</p>
<p>We had inherited a C++ API that passed around static classes containing the
peripheral abstraction to template classes wrapping these classes. It’s the
clear antithesis of polymorphic interface design, almost a form of “compile
time duck-typing”:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">GpioB0</span> <span class="p">{</span>
<span class="nl">public:</span> <span class="c1">// one class for every GPIO on the device</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">set</span><span class="p">(</span><span class="kt">bool</span> <span class="n">state</span><span class="p">);</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">SpiMaster0</span> <span class="p">{</span>
<span class="nl">public:</span> <span class="c1">// one class for every Spi peripheral</span>
<span class="k">static</span> <span class="kt">uint8_t</span> <span class="n">swap</span><span class="p">(</span><span class="kt">uint8_t</span> <span class="n">data</span><span class="p">);</span>
<span class="p">};</span>
<span class="k">template</span><span class="o"><</span> <span class="k">class</span> <span class="nc">SpiMaster</span><span class="p">,</span> <span class="k">class</span> <span class="nc">ChipSelect</span> <span class="p">></span>
<span class="k">class</span> <span class="nc">SensorDriver</span> <span class="p">{</span>
<span class="nl">public:</span>
<span class="kt">uint8_t</span> <span class="n">read</span><span class="p">()</span> <span class="p">{</span>
<span class="n">ChipSelect</span><span class="o">::</span><span class="n">set</span><span class="p">(</span><span class="n">Gpio</span><span class="o">::</span><span class="n">Low</span><span class="p">);</span>
<span class="kt">uint8_t</span> <span class="n">result</span> <span class="o">=</span> <span class="n">SpiMaster</span><span class="o">::</span><span class="n">swap</span><span class="p">(</span><span class="n">foobar</span><span class="p">);</span>
<span class="n">ChipSelect</span><span class="o">::</span><span class="n">set</span><span class="p">(</span><span class="n">Gpio</span><span class="o">::</span><span class="n">High</span><span class="p">);</span>
<span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="c1">// Hey look, a generic sensor driver</span>
<span class="n">SensorDriver</span><span class="o"><</span> <span class="n">SpiMaster0</span><span class="p">,</span> <span class="n">GpioB0</span> <span class="o">></span> <span class="n">compass</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="n">heading</span> <span class="o">=</span> <span class="n">compass</span><span class="p">.</span><span class="n">read</span><span class="p">();</span>
</code></pre></div></div>
<center>
<p><a href="http://www.stroustrup.com/good_concepts.pdf">C++ concepts</a> sure would be useful
here for asserting <code class="language-plaintext highlighter-rouge">SpiMaster</code> traits. *cough*</p>
</center>
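<p>Without concepts, a similar check can be sketched with SFINAE and <code class="language-plaintext highlighter-rouge">static_assert</code>. The following is only an illustration under stated assumptions: the trait <code class="language-plaintext highlighter-rouge">is_spi_master</code> and the stand-in classes <code class="language-plaintext highlighter-rouge">LoopbackSpi</code> and <code class="language-plaintext highlighter-rouge">FakeChipSelect</code> are hypothetical and not part of xpcc or modm.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <utility>

// Hypothetical trait (not part of xpcc or modm): true if T provides
// a static member function callable as `uint8_t T::swap(uint8_t)`.
template<class T, class = void>
struct is_spi_master : std::false_type {};

template<class T>
struct is_spi_master<T, typename std::enable_if<std::is_same<
        decltype(T::swap(std::declval<uint8_t>())), uint8_t>::value>::type>
    : std::true_type {};

// Stand-in peripheral that simply echoes the byte back.
struct LoopbackSpi
{
    static uint8_t swap(uint8_t data) { return data; }
};

// Stand-in chip select that only remembers its last level.
struct FakeChipSelect
{
    static bool& state() { static bool level = true; return level; }
    static void set(bool level) { state() = level; }
};

template<class SpiMaster, class ChipSelect>
class SensorDriver
{
    // Fails at compile time with a readable message instead of a
    // template error deep inside read().
    static_assert(is_spi_master<SpiMaster>::value,
                  "SpiMaster must provide static uint8_t swap(uint8_t)");
public:
    static uint8_t read(uint8_t command)
    {
        ChipSelect::set(false);
        const uint8_t result = SpiMaster::swap(command);
        ChipSelect::set(true);
        return result;
    }
};
```

<p>Instantiating <code class="language-plaintext highlighter-rouge">SensorDriver&lt;FakeChipSelect, FakeChipSelect&gt;</code> would then be rejected by the <code class="language-plaintext highlighter-rouge">static_assert</code>, since a GPIO class has no <code class="language-plaintext highlighter-rouge">swap</code> function.</p>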
<p>This technique resulted in a rather unusual HAL, but when used <em>in moderation</em> it
yields ridiculously small binary sizes! And this was absolutely a requirement on
our AVRs, which we wanted to stuff full of control code for our autonomous robots.</p>
<p>The size reduction didn’t so much come from using C++ features like templates,
but from being able to very accurately dissect special cases into their own functions.
This is particularly useful on AVRs where the IO memory map is very irregular and
differs quite a bit between devices. Writing one function to handle all variations
at runtime can be more expensive than writing a couple of specialized functions and
letting the linker throw away all the unused ones.</p>
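<p>A minimal sketch of that trade-off, assuming two simulated registers in place of real IO memory so that it also runs on a PC. All names here (<code class="language-plaintext highlighter-rouge">fake_portb</code>, <code class="language-plaintext highlighter-rouge">port_register</code>, <code class="language-plaintext highlighter-rouge">gpio_set_runtime</code>) are made up for illustration and do not come from xpcc:</p>

```cpp
#include <cassert>
#include <cstdint>

// Simulated IO registers standing in for real hardware (an assumption
// of this sketch so that it runs without a microcontroller).
static uint8_t fake_portb = 0;
static uint8_t fake_portc = 0;

enum class Port { B, C };

// Variant 1: one generic function that resolves the port at runtime
// on every call and must know about every port of every device.
inline void gpio_set_runtime(Port port, uint8_t pin, bool state)
{
    uint8_t &reg = (port == Port::B) ? fake_portb : fake_portc;
    const uint8_t mask = static_cast<uint8_t>(1u << pin);
    reg = state ? static_cast<uint8_t>(reg | mask)
                : static_cast<uint8_t>(reg & ~mask);
}

// Variant 2: one specialized lookup per port. Specializations that are
// never referenced do not end up in the binary.
template<Port P> uint8_t& port_register();
template<> inline uint8_t& port_register<Port::B>() { return fake_portb; }
template<> inline uint8_t& port_register<Port::C>() { return fake_portc; }

// One class per pin: port and pin number are compile-time constants.
template<Port P, uint8_t Pin>
struct Gpio
{
    static void set(bool state)
    {
        uint8_t &reg = port_register<P>();
        const uint8_t mask = static_cast<uint8_t>(1u << Pin);
        reg = state ? static_cast<uint8_t>(reg | mask)
                    : static_cast<uint8_t>(reg & ~mask);
    }
};

using GpioB0 = Gpio<Port::B, 0>;
using GpioC3 = Gpio<Port::C, 3>;
```

<p>Both variants produce the same register contents; the difference is that the second one leaves the compiler and linker free to keep only the code paths that the application actually uses.</p>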
<p>But it does have one significant and obvious disadvantage: Our HAL had to <em>have</em> a
class for every peripheral you want to use. Adding these classes manually didn’t
scale very well for us, and it proved an even bigger problem for a device with the
number and features of the peripherals of an STM32. And so the inevitable happened: we started
using preprocessor macros to “instantiate” these peripheral classes, or switched
between different implementations with extensive, often nested, <code class="language-plaintext highlighter-rouge">#if/#else/#endif</code>
trees. It was such an ugly solution.</p>
<p>We also had a mechanism for generating code by manually calling a Jinja2 template
engine and committing the result, in fact, already
<a href="https://github.com/roboterclubaachen/xpcc/commit/e239176#diff-41dfb98586123c4821a51af70cf93ae8">since Nov. 2009</a>.
It was first used to create the AVR’s UART classes and slowly expanded to other
platforms. But it didn’t really scale either because you still had to explicitly
provide all the substitution data to the engine, which usually only was the number,
or letter, identifying the peripheral.</p>
<p>It wasn’t until 2013 that <a href="https://github.com/ekiwi">Kevin Läufer</a> generalized
this idea by moving it into our <a href="http://scons.org">SCons-based</a> build system and
collecting all template substitution data into one common file per target, which
we just called “The Device File” (naming things is hard, ok?). This made it much
easier to generate new peripheral drivers and it even did so on-the-fly during the
build process due to being included into SCons’ dependency graph, which eliminated
the need for manually committing these generated files and keeping them up-to-date.</p>
<h3 id="first-steps">First Steps</h3>
<p>The first draft of the <a href="https://github.com/roboterclubaachen/xpcc/commit/3fcf8cb">STM32F407’s device file</a>
was assembled by hand and lacked a clear structure. In retrospect, we also had
trouble deciding which data goes in the device file and which
<a href="https://github.com/roboterclubaachen/xpcc/blob/826c43797d31513d128760c190b19bdc61ca2f6b/src/xpcc/architecture/platform/core/cortex/stm32/stm32.macros#L52-L168">stays embedded in the templates</a>,
but we didn’t sweat the details, since we had an entire library to refactor and
a robot to build.</p>
<p>The major limitation of our system was, of course, getting the required data:
manually assembling it didn’t scale, and so we were stuck in the same bottleneck
as before, albeit with a slightly better build process.
And then, after researching how avr-gcc actually generates the <code class="language-plaintext highlighter-rouge"><avr/io.h></code> headers,
a solution presented itself:
<a href="http://packs.download.atmel.com">Atmel publishes a bunch of XML files called Part Description Files</a>,
or PDFs (lolwut?), containing the memory map of their AVR devices, and we just had
to reformat this a little bit. Right? If only I knew what I was getting into…</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><module</span> <span class="na">name=</span><span class="s">"USART"</span><span class="nt">></span>
<span class="nt"><instance</span> <span class="na">name=</span><span class="s">"USART0"</span> <span class="na">caption=</span><span class="s">"USART"</span><span class="nt">></span>
<span class="nt"><register-group</span> <span class="na">name=</span><span class="s">"USART0"</span> <span class="na">name-in-module=</span><span class="s">"USART0"</span> <span class="na">offset=</span><span class="s">"0x00"</span> <span class="na">address-space=</span><span class="s">"data"</span> <span class="na">caption=</span><span class="s">"USART"</span><span class="nt">/></span>
<span class="nt"><signals></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"TXD"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PD1"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"RXD"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PD0"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"XCK"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PD4"</span><span class="nt">/></span>
<span class="nt"></signals></span>
<span class="nt"></instance></span>
<span class="nt"></module></span>
<span class="nt"><module</span> <span class="na">name=</span><span class="s">"TWI"</span><span class="nt">></span>
<span class="nt"><instance</span> <span class="na">name=</span><span class="s">"TWI"</span> <span class="na">caption=</span><span class="s">"Two Wire Serial Interface"</span><span class="nt">></span>
<span class="nt"><register-group</span> <span class="na">name=</span><span class="s">"TWI"</span> <span class="na">name-in-module=</span><span class="s">"TWI"</span> <span class="na">offset=</span><span class="s">"0x00"</span> <span class="na">address-space=</span><span class="s">"data"</span> <span class="na">caption=</span><span class="s">"Two Wire Serial Interface"</span><span class="nt">/></span>
<span class="nt"><signals></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"SDA"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PC4"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"SCL"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PC5"</span><span class="nt">/></span>
<span class="nt"></signals></span>
<span class="nt"></instance></span>
<span class="nt"></module></span>
<span class="nt"><module</span> <span class="na">name=</span><span class="s">"PORT"</span><span class="nt">></span>
<span class="nt"><instance</span> <span class="na">name=</span><span class="s">"PORTB"</span> <span class="na">caption=</span><span class="s">"I/O Port"</span><span class="nt">></span>
<span class="nt"><register-group</span> <span class="na">name=</span><span class="s">"PORTB"</span> <span class="na">name-in-module=</span><span class="s">"PORTB"</span> <span class="na">offset=</span><span class="s">"0x00"</span> <span class="na">address-space=</span><span class="s">"data"</span> <span class="na">caption=</span><span class="s">"I/O Port"</span><span class="nt">/></span>
<span class="nt"><signals></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"P"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PB0"</span> <span class="na">index=</span><span class="s">"0"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"P"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PB1"</span> <span class="na">index=</span><span class="s">"1"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"P"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PB2"</span> <span class="na">index=</span><span class="s">"2"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"P"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PB3"</span> <span class="na">index=</span><span class="s">"3"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"P"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PB4"</span> <span class="na">index=</span><span class="s">"4"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"P"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PB5"</span> <span class="na">index=</span><span class="s">"5"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"P"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PB6"</span> <span class="na">index=</span><span class="s">"6"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">group=</span><span class="s">"P"</span> <span class="na">function=</span><span class="s">"default"</span> <span class="na">pad=</span><span class="s">"PB7"</span> <span class="na">index=</span><span class="s">"7"</span><span class="nt">/></span>
<span class="nt"></signals></span>
<span class="nt"></instance></span>
</code></pre></div></div>
<center>
<p>Excerpt of the <code class="language-plaintext highlighter-rouge">ATmega328P.atdf</code> part description file.</p>
</center>
<p>It really turned out to be a great, but very much incomplete, source of information
about AVRs. Even today, over 4 years later,
<a href="https://github.com/modm-io/modm/blob/29f73690f43df87030a6dc2a8df56df1fa65ea6f/test/all/ignored.txt#L1-L114">110 AVR memory maps are still missing GPIO signal definitions</a>.
So I did what any student with too much time on their hands would do:
I began to <em>manually assemble</em> the missing information by downloading <em>all</em>
existing AVR device datasheets, reading through <em>all</em> of them and collecting
the pinouts in a spreadsheet. I then <em>manually reformatted</em> this data into a
<a href="https://github.com/modm-io/modm-devices/blob/64ebb6cdc99e79e3cf405f10d4d00d21f095cf1b/tools/generator/dfg/avr/avr_io.py#L222-L1868">Python data structure, where it still exists today</a>.
Don’t do this! I did get the job done, but I wasted two weeks of my life with this
crap and even though I was being really diligent, I still made a lot of mistakes.</p>
<center>
<p><img dimmable="" src="atmega_io.png" /></p>
<p>Ah, the insanities of youth 🙄</p>
</center>
<p>I also wrote a memory map comparison tool, which was really useful for understanding
the batshit-insane AVR IO maps. Since the AVR can only address a certain amount of
IO memory directly, the hardware engineers have to “compress” (more like “forcefully
stuff”) the IO map and this quickly becomes very ugly. For example, the ATtiny*61
series features differential ADC inputs with selectable gains, configurable in 64
combinations, but register <code class="language-plaintext highlighter-rouge">ADMUX</code> only has space for 5 bits (<code class="language-plaintext highlighter-rouge">MUX0</code> - <code class="language-plaintext highlighter-rouge">MUX4</code>).
So Atmel decided to cram <code class="language-plaintext highlighter-rouge">MUX5</code> into register <code class="language-plaintext highlighter-rouge">ADCSRB</code>:</p>
<center>
<p><img src="attiny_adc_mux.png" alt="" /></p>
<p>Wait, did the <code class="language-plaintext highlighter-rouge">ADLAR</code> bit just move around? Nah, must be an illusion. 😒</p>
</center>
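<p>For a driver this means that selecting one logical ADC channel touches two registers. A minimal sketch of such a split write, using a plain struct instead of real registers. The struct, the function name and the position of <code class="language-plaintext highlighter-rouge">MUX5</code> at bit 3 of <code class="language-plaintext highlighter-rouge">ADCSRB</code> are assumptions of this example and should be checked against the datasheet of the actual device:</p>

```cpp
#include <cassert>
#include <cstdint>

// Plain struct standing in for the two IO registers (hypothetical,
// so the sketch runs without an AVR).
struct FakeAdcRegisters
{
    uint8_t admux;
    uint8_t adcsrb;
};

constexpr uint8_t kMuxLowMask = 0x1F; // MUX0..MUX4 in ADMUX bits 0..4
constexpr uint8_t kMux5Bit = 3;       // assumed position of MUX5 in ADCSRB

// Writes a 6-bit channel selection that is split over two registers,
// leaving all unrelated bits of both registers untouched.
inline void select_adc_channel(FakeAdcRegisters &adc, uint8_t channel)
{
    adc.admux = static_cast<uint8_t>((adc.admux & ~kMuxLowMask) |
                                     (channel & kMuxLowMask));
    const uint8_t mux5 = static_cast<uint8_t>(1u << kMux5Bit);
    adc.adcsrb = (channel & 0x20) ? static_cast<uint8_t>(adc.adcsrb | mux5)
                                  : static_cast<uint8_t>(adc.adcsrb & ~mux5);
}
```

<p>On a device family where all six bits live in one register the same function collapses to a single masked write, which is exactly the kind of difference the comparison tool made visible.</p>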
<p>This memory map comparison tool was vital in understanding how all the AVR memory
maps differ and coming up with strategies on how to map this functionality into our HAL.
<a href="https://www.youtube.com/watch?v=KMU0tzLwhbE">It’s all about tools, tools, tools, tools</a>!</p>
<h2 id="peeking-into-stm32cubemx">Peeking into STM32CubeMX</h2>
<p>ST maintains the <a href="http://www.st.com/en/development-tools/stm32cubemx.html">CubeMX initialization code generator</a>,
which contains “a pinout-conflict solver, a clock-tree setting helper, a power-consumption
calculator, and an utility performing MCU peripheral configuration”. Hm, doesn’t that
sound interesting? How did they implement these features, we wondered.</p>
<p>Back in 2013 CubeMX was still called MicroXplorer and wasn’t nearly as nice to use
as today. It also launched as a Windows-only application, even though it was clearly
written in Java (those “beautiful” GUI elements give it away). Nevertheless, CubeMX
indeed is a very useful application, giving you a number of visual configuration editors:</p>
<center>
<p><img dimmable="" src="stm32f103_cube_gpio.png" /></p>
<p>Configuring the USART1_TX signal on pin PB6 on the popular STM32F103RBT.</p>
</center>
<p>During installation, CubeMX kindly unpacks a <em>huge</em> plaintext (!) database to disk
at <code class="language-plaintext highlighter-rouge">STM32CubeMX.app/Contents/Resources/db</code> (on OSX) and even updates it for
you on every app launch. This database consists of a lot of XML files, one
for every STM32 device in ST’s portfolio, plus detailed descriptions of peripheral
configurations. It really is an insane amount of data.</p>
<p>So I invite you to join me on a stroll through the colorful fields of XML that
power the core of CubeMX’s configurators.
I’ll be using the STM32F103RBT, which is a very popular controller that can be
found on all ST-Links and on the Blue Pill board available on eBay for a few bucks.</p>
<h3 id="gpio-alternate-functions">GPIO Alternate Functions</h3>
<p>We start by searching for the unique device identifier <code class="language-plaintext highlighter-rouge">STM32F103RBTx</code> in <code class="language-plaintext highlighter-rouge">mcu/families.xml</code>
(which is >30,000 lines long, btw). The minimal information about the device here
is used by the parametric search engine in CubeMX.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><Mcu</span> <span class="na">Name=</span><span class="s">"STM32F103R(8-B)Tx"</span> <span class="na">PackageName=</span><span class="s">"LQFP64"</span> <span class="na">RefName=</span><span class="s">"STM32F103RBTx"</span><span class="nt">></span>
<span class="nt"><Core></span>ARM Cortex-M3<span class="nt"></Core></span>
<span class="nt"><Frequency></span>72<span class="nt"></Frequency></span>
<span class="nt"><Ram></span>20<span class="nt"></Ram></span>
<span class="nt"><Flash></span>128<span class="nt"></Flash></span>
<span class="nt"><Voltage</span> <span class="na">Max=</span><span class="s">"3.6"</span> <span class="na">Min=</span><span class="s">"2.0"</span><span class="nt">/></span>
<span class="nt"><Current</span> <span class="na">Lowest=</span><span class="s">"1.7"</span> <span class="na">Run=</span><span class="s">"373.0"</span><span class="nt">/></span>
<span class="nt"><Temperature</span> <span class="na">Max=</span><span class="s">"105.0"</span> <span class="na">Min=</span><span class="s">"-40.0"</span><span class="nt">/></span>
<span class="nt"><Peripheral</span> <span class="na">Type=</span><span class="s">"ADC 12-bit"</span> <span class="na">MaxOccurs=</span><span class="s">"16"</span><span class="nt">/></span>
<span class="nt"><Peripheral</span> <span class="na">Type=</span><span class="s">"CAN"</span> <span class="na">MaxOccurs=</span><span class="s">"1"</span><span class="nt">/></span>
<span class="nt"><Peripheral</span> <span class="na">Type=</span><span class="s">"I2C"</span> <span class="na">MaxOccurs=</span><span class="s">"2"</span><span class="nt">/></span>
<span class="nt"><Peripheral</span> <span class="na">Type=</span><span class="s">"RTC"</span> <span class="na">MaxOccurs=</span><span class="s">"1"</span><span class="nt">/></span>
<span class="nt"><Peripheral</span> <span class="na">Type=</span><span class="s">"SPI"</span> <span class="na">MaxOccurs=</span><span class="s">"2"</span><span class="nt">/></span>
<span class="nt"><Peripheral</span> <span class="na">Type=</span><span class="s">"Timer 16-bit"</span> <span class="na">MaxOccurs=</span><span class="s">"4"</span><span class="nt">/></span>
<span class="nt"><Peripheral</span> <span class="na">Type=</span><span class="s">"USART"</span> <span class="na">MaxOccurs=</span><span class="s">"3"</span><span class="nt">/></span>
<span class="nt"><Peripheral</span> <span class="na">Type=</span><span class="s">"USB Device"</span> <span class="na">MaxOccurs=</span><span class="s">"1"</span><span class="nt">/></span>
<span class="nt"></Mcu></span>
</code></pre></div></div>
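<p>The lookup described above can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually use: the <code class="language-plaintext highlighter-rouge">Families</code> wrapper element and the <code class="language-plaintext highlighter-rouge">find_mcu</code> helper are assumptions made so that a shortened copy of the excerpt parses on its own, and the real file may use XML namespaces that need extra handling.</p>

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <Mcu> entry shown above. The <Families> root element
# is a hypothetical wrapper so that the excerpt parses stand-alone.
FAMILIES_XML = """
<Families>
  <Mcu Name="STM32F103R(8-B)Tx" PackageName="LQFP64" RefName="STM32F103RBTx">
    <Core>ARM Cortex-M3</Core>
    <Frequency>72</Frequency>
    <Ram>20</Ram>
    <Flash>128</Flash>
    <Peripheral Type="USART" MaxOccurs="3"/>
  </Mcu>
</Families>
"""


def find_mcu(xml_text, ref_name):
    """Return the <Mcu> element whose RefName matches, or None."""
    root = ET.fromstring(xml_text)
    for mcu in root.iter("Mcu"):
        if mcu.get("RefName") == ref_name:
            return mcu
    return None


mcu = find_mcu(FAMILIES_XML, "STM32F103RBTx")
# Mcu/@Name doubles as the base name of the per-device file.
device_file = mcu.get("Name") + ".xml"
print(device_file)  # STM32F103R(8-B)Tx.xml
```

<p>Note that <code class="language-plaintext highlighter-rouge">RefName</code> identifies one concrete device, while <code class="language-plaintext highlighter-rouge">Name</code> covers both flash sizes (8 and B) that share a device file.</p>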
<p>Following the <code class="language-plaintext highlighter-rouge">Mcu/@Name</code> leads us to <code class="language-plaintext highlighter-rouge">STM32F103R(8-B)Tx.xml</code>, which lists the
peripherals and their instances (<code class="language-plaintext highlighter-rouge">mcu/IP/@InstanceName</code>), as well as which pins exist on this
package, where they are located and which alternate functions they can be connected to.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><Core></span>ARM Cortex-M3<span class="nt"></Core></span>
<span class="nt"><Ram></span>20<span class="nt"></Ram></span>
<span class="nt"><Flash></span>64<span class="nt"></Flash></span>
<span class="nt"><Flash></span>128<span class="nt"></Flash></span>
<span class="c"><!-- ... --></span>
<span class="nt"><IP</span> <span class="na">InstanceName=</span><span class="s">"USART3"</span> <span class="na">Name=</span><span class="s">"USART"</span> <span class="na">Version=</span><span class="s">"sci2_v1_1_Cube"</span><span class="nt">/></span>
<span class="nt"><IP</span> <span class="na">InstanceName=</span><span class="s">"RCC"</span> <span class="na">Name=</span><span class="s">"RCC"</span> <span class="na">Version=</span><span class="s">"STM32F102_rcc_v1_0"</span><span class="nt">/></span>
<span class="nt"><IP</span> <span class="na">InstanceName=</span><span class="s">"NVIC"</span> <span class="na">Name=</span><span class="s">"NVIC"</span> <span class="na">Version=</span><span class="s">"STM32F103G"</span><span class="nt">/></span>
<span class="nt"><IP</span> <span class="na">InstanceName=</span><span class="s">"GPIO"</span> <span class="na">Name=</span><span class="s">"GPIO"</span> <span class="na">Version=</span><span class="s">"STM32F103x8_gpio_v1_0"</span><span class="nt">/></span>
<span class="c"><!-- ... --></span>
<span class="nt"><Pin</span> <span class="na">Name=</span><span class="s">"PB5"</span> <span class="na">Position=</span><span class="s">"57"</span> <span class="na">Type=</span><span class="s">"I/O"</span><span class="nt">></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"I2C1_SMBA"</span><span class="nt">/></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"SPI1_MOSI"</span><span class="nt">/></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"TIM3_CH2"</span><span class="nt">/></span>
<span class="nt"></Pin></span>
<span class="nt"><Pin</span> <span class="na">Name=</span><span class="s">"PB6"</span> <span class="na">Position=</span><span class="s">"58"</span> <span class="na">Type=</span><span class="s">"I/O"</span><span class="nt">></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"I2C1_SCL"</span><span class="nt">/></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"TIM4_CH1"</span><span class="nt">/></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"USART1_TX"</span><span class="nt">/></span>
<span class="nt"></Pin></span>
<span class="nt"><Pin</span> <span class="na">Name=</span><span class="s">"PB7"</span> <span class="na">Position=</span><span class="s">"59"</span> <span class="na">Type=</span><span class="s">"I/O"</span><span class="nt">></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"I2C1_SDA"</span><span class="nt">/></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"TIM4_CH2"</span><span class="nt">/></span>
<span class="nt"><Signal</span> <span class="na">Name=</span><span class="s">"USART1_RX"</span><span class="nt">/></span>
<span class="nt"></Pin></span>
</code></pre></div></div>
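<p>To get a feeling for how this pin data can be queried, here is a minimal Python sketch that searches the three pins from the excerpt for a given signal. The <code class="language-plaintext highlighter-rouge">Mcu</code> root element and the <code class="language-plaintext highlighter-rouge">pins_with_signal</code> helper are assumptions for illustration only, and the data is limited to what is shown above.</p>

```python
import xml.etree.ElementTree as ET

# The three <Pin> entries from the excerpt above, wrapped in an assumed
# <Mcu> root element so that the snippet parses on its own.
DEVICE_XML = """
<Mcu>
  <Pin Name="PB5" Position="57" Type="I/O">
    <Signal Name="I2C1_SMBA"/>
    <Signal Name="SPI1_MOSI"/>
    <Signal Name="TIM3_CH2"/>
  </Pin>
  <Pin Name="PB6" Position="58" Type="I/O">
    <Signal Name="I2C1_SCL"/>
    <Signal Name="TIM4_CH1"/>
    <Signal Name="USART1_TX"/>
  </Pin>
  <Pin Name="PB7" Position="59" Type="I/O">
    <Signal Name="I2C1_SDA"/>
    <Signal Name="TIM4_CH2"/>
    <Signal Name="USART1_RX"/>
  </Pin>
</Mcu>
"""


def pins_with_signal(xml_text, signal):
    """Return (pin name, package position) for every pin offering `signal`."""
    root = ET.fromstring(xml_text)
    return [
        (pin.get("Name"), int(pin.get("Position")))
        for pin in root.iter("Pin")
        if any(s.get("Name") == signal for s in pin.findall("Signal"))
    ]


print(pins_with_signal(DEVICE_XML, "USART1_TX"))  # [('PB6', 58)]
```

<p>Only PB6 shows up here because the excerpt covers just three pins; the device file does not say under which conditions a signal is actually routed to a pin.</p>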
<p>Each peripheral has an <code class="language-plaintext highlighter-rouge">IP/@Version</code>, which leads to a configuration file containing
<em>even more</em> data. Don’t cha just love the smell of freshly unpacked data in the morning?
For this device’s GPIO peripheral we’ll look for any pins with the <code class="language-plaintext highlighter-rouge">USART1_TX</code>
signal in the <code class="language-plaintext highlighter-rouge">mcu/IP/GPIO-STM32F103x8_gpio_v1_0_Modes.xml</code> file:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><GPIO_Pin</span> <span class="na">PortName=</span><span class="s">"PB"</span> <span class="na">Name=</span><span class="s">"PB6"</span><span class="nt">></span>
<span class="nt"><PinSignal</span> <span class="na">Name=</span><span class="s">"USART1_TX"</span><span class="nt">></span>
<span class="nt"><RemapBlock</span> <span class="na">Name=</span><span class="s">"USART1_REMAP1"</span><span class="nt">></span>
<span class="nt"><SpecificParameter</span> <span class="na">Name=</span><span class="s">"GPIO_AF"</span><span class="nt">></span>
<span class="nt"><PossibleValue></span>__HAL_AFIO_REMAP_USART1_ENABLE<span class="nt"></PossibleValue></span>
<span class="nt"></SpecificParameter></span>
<span class="nt"></RemapBlock></span>
<span class="nt"></PinSignal></span>
<span class="nt"></GPIO_Pin></span>
<span class="c"><!-- ... --></span>
<span class="nt"><GPIO_Pin</span> <span class="na">PortName=</span><span class="s">"PA"</span> <span class="na">Name=</span><span class="s">"PA9"</span><span class="nt">></span>
<span class="nt"><PinSignal</span> <span class="na">Name=</span><span class="s">"USART1_TX"</span><span class="nt">></span>
<span class="nt"><RemapBlock</span> <span class="na">Name=</span><span class="s">"USART1_REMAP0"</span> <span class="na">DefaultRemap=</span><span class="s">"true"</span><span class="nt">/></span>
<span class="nt"></PinSignal></span>
<span class="nt"></GPIO_Pin></span>
</code></pre></div></div>
<p>So <code class="language-plaintext highlighter-rouge">USART1_TX</code> maps to pin PB6 with <code class="language-plaintext highlighter-rouge">USART1_REMAP1</code> or pin PA9 with <code class="language-plaintext highlighter-rouge">USART1_REMAP0</code>.
The STM32F1 series remap signals either in (overlapping) groups or not at all.
This is controlled by the <code class="language-plaintext highlighter-rouge">AFIO_MAPRx</code> registers, where we can find PB6/PA9 again:</p>
<p><img invertible="" src="stm32f103_usart1_remap.png" /></p>
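<p>As a rough sketch of how the remap groups could be pulled out of this file, the following Python snippet collects every <code class="language-plaintext highlighter-rouge">RemapBlock</code> for one signal. The <code class="language-plaintext highlighter-rouge">IP</code> root element and the <code class="language-plaintext highlighter-rouge">remaps_for_signal</code> helper are assumptions for illustration; the snippet only sees the two pins from the excerpt.</p>

```python
import xml.etree.ElementTree as ET

# The two <GPIO_Pin> entries from the excerpt above, wrapped in an assumed
# <IP> root element so that the snippet parses on its own.
GPIO_MODES_XML = """
<IP>
  <GPIO_Pin PortName="PB" Name="PB6">
    <PinSignal Name="USART1_TX">
      <RemapBlock Name="USART1_REMAP1">
        <SpecificParameter Name="GPIO_AF">
          <PossibleValue>__HAL_AFIO_REMAP_USART1_ENABLE</PossibleValue>
        </SpecificParameter>
      </RemapBlock>
    </PinSignal>
  </GPIO_Pin>
  <GPIO_Pin PortName="PA" Name="PA9">
    <PinSignal Name="USART1_TX">
      <RemapBlock Name="USART1_REMAP0" DefaultRemap="true"/>
    </PinSignal>
  </GPIO_Pin>
</IP>
"""


def remaps_for_signal(xml_text, signal):
    """Map remap group name -> (pin, is_default, HAL name or None)."""
    root = ET.fromstring(xml_text)
    result = {}
    for pin in root.iter("GPIO_Pin"):
        for sig in pin.findall("PinSignal"):
            if sig.get("Name") != signal:
                continue
            for block in sig.findall("RemapBlock"):
                is_default = block.get("DefaultRemap") == "true"
                # The default group carries no GPIO_AF parameter at all.
                hal_name = block.findtext("SpecificParameter/PossibleValue")
                result[block.get("Name")] = (pin.get("Name"), is_default, hal_name)
    return result


for group, info in remaps_for_signal(GPIO_MODES_XML, "USART1_TX").items():
    print(group, info)
```

<p>The result contains names only: nothing in it says which bit of which <code class="language-plaintext highlighter-rouge">AFIO_MAPRx</code> register selects <code class="language-plaintext highlighter-rouge">USART1_REMAP1</code>.</p>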
<p>The <code class="language-plaintext highlighter-rouge">__HAL_AFIO_REMAP_USART1_ENABLE</code> in the XML is actually just a C function name,
and is placed by CubeMX in the generated init code.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">HAL_UART_MspInit</span><span class="p">(</span><span class="n">UART_HandleTypeDef</span><span class="o">*</span> <span class="n">huart</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">GPIO_InitTypeDef</span> <span class="n">GPIO_InitStruct</span><span class="p">;</span>
<span class="k">if</span><span class="p">(</span><span class="n">huart</span><span class="o">-></span><span class="n">Instance</span><span class="o">==</span><span class="n">USART1</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* Peripheral clock enable */</span>
<span class="n">__HAL_RCC_USART1_CLK_ENABLE</span><span class="p">();</span>
<span class="cm">/**USART1 GPIO Configuration
PB6 ------> USART1_TX
PB7 ------> USART1_RX
*/</span>
<span class="n">GPIO_InitStruct</span><span class="p">.</span><span class="n">Pin</span> <span class="o">=</span> <span class="n">GPIO_PIN_6</span><span class="p">;</span>
<span class="n">GPIO_InitStruct</span><span class="p">.</span><span class="n">Mode</span> <span class="o">=</span> <span class="n">GPIO_MODE_AF_PP</span><span class="p">;</span>
<span class="n">GPIO_InitStruct</span><span class="p">.</span><span class="n">Speed</span> <span class="o">=</span> <span class="n">GPIO_SPEED_FREQ_HIGH</span><span class="p">;</span>
<span class="n">HAL_GPIO_Init</span><span class="p">(</span><span class="n">GPIOB</span><span class="p">,</span> <span class="o">&</span><span class="n">GPIO_InitStruct</span><span class="p">);</span>
<span class="n">GPIO_InitStruct</span><span class="p">.</span><span class="n">Pin</span> <span class="o">=</span> <span class="n">GPIO_PIN_7</span><span class="p">;</span>
<span class="n">GPIO_InitStruct</span><span class="p">.</span><span class="n">Mode</span> <span class="o">=</span> <span class="n">GPIO_MODE_INPUT</span><span class="p">;</span>
<span class="n">GPIO_InitStruct</span><span class="p">.</span><span class="n">Pull</span> <span class="o">=</span> <span class="n">GPIO_NOPULL</span><span class="p">;</span>
<span class="n">HAL_GPIO_Init</span><span class="p">(</span><span class="n">GPIOB</span><span class="p">,</span> <span class="o">&</span><span class="n">GPIO_InitStruct</span><span class="p">);</span>
<span class="n">__HAL_AFIO_REMAP_USART1_ENABLE</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The IP files do contain a very large amount of information; however, it’s mostly
directed at the code generation capabilities of the CubeMX project exporter, and
as such, not very useful as stand-alone information. For example, the above
GPIO signal information relies on the existence of a <code class="language-plaintext highlighter-rouge">__HAL_AFIO_REMAP_USART1_ENABLE()</code>
function that performs the remapping. The mapping between the bits in the <code class="language-plaintext highlighter-rouge">AFIO_MAPRx</code>
registers and the remap groups is therefore encoded in two separate places:
these XML files, and the family’s CubeHAL.</p>
<p>The <code class="language-plaintext highlighter-rouge">mcu/IP/NVIC-STM32F103G_Modes.xml</code> configuration file, used to configure the NVIC in
CubeMX, exemplifies this quite well: here we see the first 10 interrupt vectors
paired with additional metadata (<code class="language-plaintext highlighter-rouge">PossibleValue/@Value</code> seems to contain some <code class="language-plaintext highlighter-rouge">:</code>
separated conditionals for visibility inside the GUI tool).</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><RefParameter</span> <span class="na">Comment=</span><span class="s">"Interrupt Table"</span> <span class="na">Name=</span><span class="s">"IRQn"</span> <span class="na">Type=</span><span class="s">"list"</span><span class="nt">></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"Non maskable interrupt"</span> <span class="na">Value=</span><span class="s">"NonMaskableInt_IRQn:N,IF_HAL::HAL_RCC_NMI_IRQHandler:CSSEnabled"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"Hard fault interrupt"</span> <span class="na">Value=</span><span class="s">"HardFault_IRQn:N,W1:::"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"Memory management fault"</span> <span class="na">Value=</span><span class="s">"MemoryManagement_IRQn:Y,W1:::"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"Prefetch fault, memory access fault"</span> <span class="na">Value=</span><span class="s">"BusFault_IRQn:Y,W1:::"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"Undefined instruction or illegal state"</span> <span class="na">Value=</span><span class="s">"UsageFault_IRQn:Y,W1:::"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"System service call via SWI instruction"</span> <span class="na">Value=</span><span class="s">"SVCall_IRQn:Y,RTOS::NONE:"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"Debug monitor"</span> <span class="na">Value=</span><span class="s">"DebugMonitor_IRQn:Y::NONE:"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"Pendable request for system service"</span> <span class="na">Value=</span><span class="s">"PendSV_IRQn:Y,RTOS::NONE:"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"System tick timer"</span> <span class="na">Value=</span><span class="s">"SysTick_IRQn:Y:::"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"Window watchdog interrupt"</span> <span class="na">Value=</span><span class="s">"WWDG_IRQn:Y:WWDG:HAL_WWDG_IRQHandler:"</span><span class="nt">/></span>
</code></pre></div></div>
<p>However, their actual position in the interrupt vector table is missing, and so
this data cannot be used to extract a valid interrupt table. Instead an alias is
used here to pair the interrupt with its actual table position, as defined in the
<a href="https://github.com/modm-io/cmsis-header-stm32/blob/master/stm32f1xx/Include/stm32f103xb.h#L86-L144">STM32F103xB CMSIS header file</a>.</p>
<p>For example, the <code class="language-plaintext highlighter-rouge">WWDG</code> interrupt vector is located at position 16 (=16+0), while
the <code class="language-plaintext highlighter-rouge">SVCall</code> vector is located at position 11 (=16-5), or 5 positions behind
the <code class="language-plaintext highlighter-rouge">UsageFault</code> vector:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/*!< Interrupt Number Definition */</span>
<span class="k">typedef</span> <span class="k">enum</span> <span class="p">{</span>
<span class="n">NonMaskableInt_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">14</span><span class="p">,</span> <span class="cm">/*!< 2 Non Maskable Interrupt */</span>
<span class="n">HardFault_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">13</span><span class="p">,</span> <span class="cm">/*!< 3 Cortex-M3 Hard Fault Interrupt */</span>
<span class="n">MemoryManagement_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">12</span><span class="p">,</span> <span class="cm">/*!< 4 Cortex-M3 Memory Management Interrupt */</span>
<span class="n">BusFault_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">11</span><span class="p">,</span> <span class="cm">/*!< 5 Cortex-M3 Bus Fault Interrupt */</span>
<span class="n">UsageFault_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="cm">/*!< 6 Cortex-M3 Usage Fault Interrupt */</span>
<span class="n">SVCall_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="cm">/*!< 11 Cortex-M3 SV Call Interrupt */</span>
<span class="n">DebugMonitor_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="cm">/*!< 12 Cortex-M3 Debug Monitor Interrupt */</span>
<span class="n">PendSV_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="cm">/*!< 14 Cortex-M3 Pend SV Interrupt */</span>
<span class="n">SysTick_IRQn</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="cm">/*!< 15 Cortex-M3 System Tick Interrupt */</span>
<span class="n">WWDG_IRQn</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="cm">/*!< Window WatchDog Interrupt */</span>
<span class="c1">// ...</span>
<span class="p">}</span> <span class="n">IRQn_Type</span><span class="p">;</span>
</code></pre></div></div>
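<p>The pairing can be written down as a small Python sketch. It takes the first colon-separated field of <code class="language-plaintext highlighter-rouge">PossibleValue/@Value</code> as the alias, looks up its number in a table copied by hand from the enum above, and adds the 16 system exception slots to get the vector table position. The names <code class="language-plaintext highlighter-rouge">IRQN</code>, <code class="language-plaintext highlighter-rouge">SYSTEM_VECTORS</code> and <code class="language-plaintext highlighter-rouge">vector_position</code> are made up for this example.</p>

```python
# Offset between the IRQn numbers and the vector table positions:
# position = 16 + IRQn, as in the WWDG (16+0) and SVCall (16-5) examples.
SYSTEM_VECTORS = 16

# IRQn values copied by hand from the CMSIS enum shown above.
IRQN = {
    "NonMaskableInt_IRQn": -14,
    "HardFault_IRQn": -13,
    "MemoryManagement_IRQn": -12,
    "BusFault_IRQn": -11,
    "UsageFault_IRQn": -10,
    "SVCall_IRQn": -5,
    "DebugMonitor_IRQn": -4,
    "PendSV_IRQn": -2,
    "SysTick_IRQn": -1,
    "WWDG_IRQn": 0,
}

# A few PossibleValue/@Value strings from the NVIC file shown earlier.
XML_VALUES = [
    "NonMaskableInt_IRQn:N,IF_HAL::HAL_RCC_NMI_IRQHandler:CSSEnabled",
    "UsageFault_IRQn:Y,W1:::",
    "SVCall_IRQn:Y,RTOS::NONE:",
    "WWDG_IRQn:Y:WWDG:HAL_WWDG_IRQHandler:",
]


def vector_position(value):
    """Return (alias, vector table position) for one @Value string."""
    name = value.split(":", 1)[0]  # the alias is the first field
    return name, IRQN[name] + SYSTEM_VECTORS


positions = dict(vector_position(v) for v in XML_VALUES)
print(positions)
# {'NonMaskableInt_IRQn': 2, 'UsageFault_IRQn': 6, 'SVCall_IRQn': 11, 'WWDG_IRQn': 16}
```

<p>The results match the position numbers in the header comments (2, 6 and 11), which is a handy sanity check for the offset.</p>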
<p>So keep in mind that this data is not meant to be a sensible hardware
description format, and it often lacks basic information that would make it
much more useful. Then again, the only consumer of this information is supposed
to be CubeMX for its fairly narrow goal of code generation.</p>
<h3 id="clock-tree">Clock Tree</h3>
<p>Let’s look at another very interesting data source in CubeMX: the clock
configuration wizard:</p>
<p><img dimmable="" src="stm32f103_cube_clock.gif" /></p>
<p>What’s so interesting about this configurator is that it <em>knows</em> what the maximum
frequencies of the respective clock segments are, and more importantly, how to
set the prescalers to resolve violations of these limits, and it does this for every device.
You surely know where this is going by now. Yup, it’s backed by data, and here
is what it looks like rendered with Graphviz.</p>
<p><img invertible="" src="stm32f100_clock.png" /></p>
<p>Here is a beautified excerpt from <code class="language-plaintext highlighter-rouge">plugins/clock/STM32F102.xml</code>, which only
shows the connections highlighted in red. Note how the text in the nodes maps to
the <code class="language-plaintext highlighter-rouge">Element/@type</code> and <code class="language-plaintext highlighter-rouge">Element/@id</code> attributes, and how the <code class="language-plaintext highlighter-rouge">Element/Output</code>
and <code class="language-plaintext highlighter-rouge">Element/Input</code> children declare a (unique) <code class="language-plaintext highlighter-rouge">@signalId</code> and which node they
are connecting to:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><Tree</span> <span class="na">id=</span><span class="s">"ClockTree"</span><span class="nt">></span>
<span class="c"><!-- HSE --></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"HSEOSC"</span> <span class="na">type=</span><span class="s">"variedSource"</span> <span class="na">refParameter=</span><span class="s">"HSE_VALUE"</span><span class="nt">></span>
<span class="nt"><Output</span> <span class="na">signalId=</span><span class="s">"HSE"</span> <span class="na">to=</span><span class="s">"HSEDivPLL"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="c"><!-- PLL div input from HSE --></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"HSEDivPLL"</span> <span class="na">type=</span><span class="s">"devisor"</span> <span class="na">refParameter=</span><span class="s">"HSEDivPLL"</span><span class="nt">></span>
<span class="nt"><Input</span> <span class="na">signalId=</span><span class="s">"HSE"</span> <span class="na">from=</span><span class="s">"HSEOSC"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">signalId=</span><span class="s">"HSE_PLL"</span> <span class="na">to=</span><span class="s">"PLLSource"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="nt"><Tree</span> <span class="na">id=</span><span class="s">"PLL"</span><span class="nt">></span>
<span class="c"><!-- PLLsource MUX source pour PLL mul --></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"PLLSource"</span> <span class="na">type=</span><span class="s">"multiplexor"</span> <span class="na">refParameter=</span><span class="s">"PLLSourceVirtual"</span><span class="nt">></span>
<span class="nt"><Input</span> <span class="na">signalId=</span><span class="s">"HSE_PLL"</span> <span class="na">from=</span><span class="s">"HSEDivPLL"</span> <span class="na">refValue=</span><span class="s">"RCC_PLLSOURCE_HSE"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">signalId=</span><span class="s">"VCOInput"</span> <span class="na">to=</span><span class="s">"VCO2output"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"VCO2output"</span> <span class="na">type=</span><span class="s">"output"</span> <span class="na">refParameter=</span><span class="s">"VCOOutput2Freq_Value"</span><span class="nt">></span>
<span class="nt"><Input</span> <span class="na">signalId=</span><span class="s">"VCOInput"</span> <span class="na">from=</span><span class="s">"PLLSource"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">signalId=</span><span class="s">"VCO2Input"</span> <span class="na">to=</span><span class="s">"PLLMUL"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"PLLMUL"</span> <span class="na">type=</span><span class="s">"multiplicator"</span> <span class="na">refParameter=</span><span class="s">"PLLMUL"</span><span class="nt">></span>
<span class="nt"><Input</span> <span class="na">signalId=</span><span class="s">"VCO2Input"</span> <span class="na">from=</span><span class="s">"VCO2output"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">signalId=</span><span class="s">"PLLCLK"</span> <span class="na">to=</span><span class="s">"SysClkSource"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="nt"></Tree></span>
<span class="c"><!--Sysclock mux --></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"SysClkSource"</span> <span class="na">type=</span><span class="s">"multiplexor"</span> <span class="na">refParameter=</span><span class="s">"SYSCLKSource"</span><span class="nt">></span>
<span class="nt"><Input</span> <span class="na">signalId=</span><span class="s">"PLLCLK"</span> <span class="na">from=</span><span class="s">"PLLMUL"</span> <span class="na">refValue=</span><span class="s">"RCC_SYSCLKSOURCE_PLLCLK"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">signalId=</span><span class="s">"SYSCLK"</span> <span class="na">to=</span><span class="s">"SysCLKOutput"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"SysCLKOutput"</span> <span class="na">type=</span><span class="s">"output"</span> <span class="na">refParameter=</span><span class="s">"SYSCLKFreq_VALUE"</span><span class="nt">></span>
<span class="nt"><Input</span> <span class="na">signalId=</span><span class="s">"SYSCLK"</span> <span class="na">from=</span><span class="s">"SysClkSource"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">signalId=</span><span class="s">"SYSCLKOUT"</span> <span class="na">to=</span><span class="s">"AHBPrescaler"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="c"><!-- AHB input**SYSclock** --></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"AHBPrescaler"</span> <span class="na">type=</span><span class="s">"devisor"</span> <span class="na">refParameter=</span><span class="s">"AHBCLKDivider"</span><span class="nt">></span>
<span class="nt"><Input</span> <span class="na">signalId=</span><span class="s">"SYSCLKOUT"</span> <span class="na">from=</span><span class="s">"SysCLKOutput"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">signalId=</span><span class="s">"HCLK"</span> <span class="na">to=</span><span class="s">"AHBOutput"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="c"><!-- AHB input**SYSclock** output**FHCLK,HCLK,Diviseurcortex,APB1,APB2 --></span>
<span class="nt"><Element</span> <span class="na">id=</span><span class="s">"AHBOutput"</span> <span class="na">type=</span><span class="s">"activeOutput"</span> <span class="na">refParameter=</span><span class="s">"HCLKFreq_Value"</span><span class="nt">></span>
<span class="nt"><Input</span> <span class="na">signalId=</span><span class="s">"HCLK"</span> <span class="na">from=</span><span class="s">"AHBPrescaler"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">to=</span><span class="s">"FCLKCortexOutput"</span> <span class="na">signalId=</span><span class="s">"AHBCLK"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">to=</span><span class="s">"FSMClkOutput"</span> <span class="na">signalId=</span><span class="s">"AHBCLK"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">to=</span><span class="s">"SDIOClkOutput"</span> <span class="na">signalId=</span><span class="s">"AHBCLK"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">to=</span><span class="s">"HCLKDiv2"</span> <span class="na">signalId=</span><span class="s">"AHBCLK"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">to=</span><span class="s">"HCLKOutput"</span> <span class="na">signalId=</span><span class="s">"AHBCLK"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">to=</span><span class="s">"TimSysPresc"</span> <span class="na">signalId=</span><span class="s">"AHBCLK"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">to=</span><span class="s">"APB1Prescaler"</span> <span class="na">signalId=</span><span class="s">"AHBCLK"</span><span class="nt">/></span>
<span class="nt"><Output</span> <span class="na">to=</span><span class="s">"APB2Prescaler"</span> <span class="na">signalId=</span><span class="s">"AHBCLK"</span><span class="nt">/></span>
<span class="nt"></Element></span>
<span class="nt"></Tree></span>
</code></pre></div></div>
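<p>Since every <code class="language-plaintext highlighter-rouge">Output/@to</code> names the next node, turning this XML into a graph takes very little code. Here is a minimal Python sketch that uses only the first four elements of the excerpt; the helpers <code class="language-plaintext highlighter-rouge">build_graph</code>, <code class="language-plaintext highlighter-rouge">walk</code> and <code class="language-plaintext highlighter-rouge">to_dot</code> are made up for this example and are not necessarily how the rendering above was produced.</p>

```python
import xml.etree.ElementTree as ET

# First four <Element> nodes of the excerpt above, including the nested
# <Tree id="PLL">. "devisor" is the spelling used in the data.
CLOCK_XML = """
<Tree id="ClockTree">
  <Element id="HSEOSC" type="variedSource" refParameter="HSE_VALUE">
    <Output signalId="HSE" to="HSEDivPLL"/>
  </Element>
  <Element id="HSEDivPLL" type="devisor" refParameter="HSEDivPLL">
    <Input signalId="HSE" from="HSEOSC"/>
    <Output signalId="HSE_PLL" to="PLLSource"/>
  </Element>
  <Tree id="PLL">
    <Element id="PLLSource" type="multiplexor" refParameter="PLLSourceVirtual">
      <Input signalId="HSE_PLL" from="HSEDivPLL" refValue="RCC_PLLSOURCE_HSE"/>
      <Output signalId="VCOInput" to="VCO2output"/>
    </Element>
    <Element id="VCO2output" type="output" refParameter="VCOOutput2Freq_Value">
      <Input signalId="VCOInput" from="PLLSource"/>
      <Output signalId="VCO2Input" to="PLLMUL"/>
    </Element>
  </Tree>
</Tree>
"""


def build_graph(xml_text):
    """Return {element id: (type, [ids of downstream elements])}."""
    root = ET.fromstring(xml_text)
    graph = {}
    for element in root.iter("Element"):  # iter() also descends into nested <Tree>s
        targets = [out.get("to") for out in element.findall("Output")]
        graph[element.get("id")] = (element.get("type"), targets)
    return graph


def walk(graph, start):
    """Follow the first known, not yet visited output until none is left."""
    path = [start]
    while True:
        known = [t for t in graph[path[-1]][1] if t in graph and t not in path]
        if not known:
            return path
        path.append(known[0])


def to_dot(graph):
    """Emit Graphviz DOT text with one labelled node per element."""
    lines = ["digraph clock {"]
    for src, (kind, targets) in graph.items():
        lines.append(f'  "{src}" [label="{kind}\\n{src}"];')
        lines.extend(f'  "{src}" -> "{dst}";' for dst in targets)
    lines.append("}")
    return "\n".join(lines)


graph = build_graph(CLOCK_XML)
print(" -> ".join(walk(graph, "HSEOSC")))
# HSEOSC -> HSEDivPLL -> PLLSource -> VCO2output
print(to_dot(graph))
```

<p>The walk stops at <code class="language-plaintext highlighter-rouge">VCO2output</code> only because <code class="language-plaintext highlighter-rouge">PLLMUL</code> was left out of this shortened copy; with the full excerpt it would continue towards <code class="language-plaintext highlighter-rouge">AHBOutput</code>.</p>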
<p>We still don’t know how CubeMX is able to do its actual calculations,
because the clock graph above doesn’t contain any numbers at all.
Some digging around later, we can trace the <code class="language-plaintext highlighter-rouge">Element/@refParameter</code> attribute to
the <code class="language-plaintext highlighter-rouge">IP/RCC-STM32F102_rcc_v1_0_Modes.xml</code> which contains *drumroll* numbers,
and lots of ‘em:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"><!-- Les frequences des sources --></span>
<span class="nt"><RefParameter</span> <span class="na">Name=</span><span class="s">"HSE_VALUE"</span> <span class="na">Min=</span><span class="s">"4000000"</span> <span class="na">Max=</span><span class="s">"16000000"</span> <span class="na">Display=</span><span class="s">"value/1000000"</span> <span class="na">Unit=</span><span class="s">"MHz"</span><span class="nt">/></span>
<span class="c"><!-- frequence PLL --></span>
<span class="nt"><RefParameter</span> <span class="na">Name=</span><span class="s">"VCOOutput2Freq_Value"</span> <span class="na">Min=</span><span class="s">"1000000"</span> <span class="na">Max=</span><span class="s">"25000000"</span> <span class="na">Display=</span><span class="s">"value/1000000"</span> <span class="na">Unit=</span><span class="s">"MHz"</span><span class="nt">/></span>
<span class="c"><!-- les diviseurs --></span>
<span class="nt"><RefParameter</span> <span class="na">Name=</span><span class="s">"HSEDivPLL"</span> <span class="na">DefaultValue=</span><span class="s">"RCC_HSE_PREDIV_DIV1"</span><span class="nt">></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"1"</span> <span class="na">Value=</span><span class="s">"RCC_HSE_PREDIV_DIV1"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"2"</span> <span class="na">Value=</span><span class="s">"RCC_HSE_PREDIV_DIV2"</span><span class="nt">/></span>
<span class="nt"></RefParameter></span>
<span class="c"><!-- Les multiplicateurs --></span>
<span class="nt"><RefParameter</span> <span class="na">Name=</span><span class="s">"PLLMUL"</span> <span class="na">DefaultValue=</span><span class="s">"RCC_PLL_MUL2"</span><span class="nt">></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"2"</span> <span class="na">Value=</span><span class="s">"RCC_PLL_MUL2"</span><span class="nt">/></span>
<span class="c"><!-- ... --></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"16"</span> <span class="na">Value=</span><span class="s">"RCC_PLL_MUL16"</span><span class="nt">/></span>
<span class="nt"></RefParameter></span>
<span class="c"><!-- Les frequences des signaux --></span>
<span class="c"><!-- SYS clock freq de l'output --></span>
<span class="nt"><RefParameter</span> <span class="na">Name=</span><span class="s">"SYSCLKFreq_VALUE"</span> <span class="na">Max=</span><span class="s">"72000000"</span> <span class="na">Display=</span><span class="s">"value/1000000"</span> <span class="na">Unit=</span><span class="s">"MHz"</span><span class="nt">/></span>
<span class="c"><!-- diviseur AHB 1..512 --></span>
<span class="nt"><RefParameter</span> <span class="na">Name=</span><span class="s">"AHBCLKDivider"</span> <span class="na">DefaultValue=</span><span class="s">"RCC_SYSCLK_DIV1"</span><span class="nt">></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"1"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV1"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"2"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV2"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"4"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV4"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"8"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV8"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"16"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV16"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"64"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV64"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"128"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV128"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"256"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV256"</span><span class="nt">/></span>
<span class="nt"><PossibleValue</span> <span class="na">Comment=</span><span class="s">"512"</span> <span class="na">Value=</span><span class="s">"RCC_SYSCLK_DIV512"</span><span class="nt">/></span>
<span class="nt"></RefParameter></span>
<span class="c"><!-- AHB out freq --></span>
<span class="nt"><RefParameter</span> <span class="na">Name=</span><span class="s">"HCLKFreq_Value"</span> <span class="na">Max=</span><span class="s">"72000000"</span> <span class="na">Display=</span><span class="s">"value/1000000"</span> <span class="na">Unit=</span><span class="s">"MHz"</span><span class="nt">/></span>
</code></pre></div></div>
<p>Did you know that ST is a French-Italian company? Cos those XML comments clearly
aren’t in English. 🤔 Well, that and they seem keen on calling it a “devisor”
when they really mean “divider”. What is this, I don’t even.</p>
<center>
<p><img src="not_anything_wrong.gif" alt="" /></p>
<p>French comments in XML</p>
</center>
<p>Anyways, here you can see the <code class="language-plaintext highlighter-rouge">RefParameter/@Min</code> and <code class="language-plaintext highlighter-rouge">RefParameter/@Max</code>
frequency values as well as prescaler values encoded as <code class="language-plaintext highlighter-rouge">PossibleValue/@Comment</code>,
which are all used by CubeMX to check and fix your clock tree.
That’s pretty amazing actually.</p>
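<p>To give you an idea of how little code it takes to pull these numbers out, here is a minimal sketch using Python’s <code class="language-plaintext highlighter-rouge">xml.etree.ElementTree</code>. The XML is an abridged copy of the listing above wrapped in a made-up <code class="language-plaintext highlighter-rouge">&lt;IP&gt;</code> root node, and <code class="language-plaintext highlighter-rouge">read_clock_limits</code> is a hypothetical helper, not part of CubeMX or the DFG. The real files may need namespace handling that I’m ignoring here.</p>

```python
import xml.etree.ElementTree as ET

# Abridged copy of the RCC modes data shown above, NOT the full CubeMX file.
# The <IP> root element is an assumption made for this example.
RCC_XML = """
<IP>
  <RefParameter Name="HSE_VALUE" Min="4000000" Max="16000000" Unit="MHz"/>
  <RefParameter Name="SYSCLKFreq_VALUE" Max="72000000" Unit="MHz"/>
  <RefParameter Name="HSEDivPLL" DefaultValue="RCC_HSE_PREDIV_DIV1">
    <PossibleValue Comment="1" Value="RCC_HSE_PREDIV_DIV1"/>
    <PossibleValue Comment="2" Value="RCC_HSE_PREDIV_DIV2"/>
  </RefParameter>
</IP>
"""

def read_clock_limits(xml_text):
    """Return ({name: (min, max)}, {name: [prescaler, ...]})."""
    root = ET.fromstring(xml_text)
    limits, prescalers = {}, {}
    for param in root.iter("RefParameter"):
        name = param.get("Name")
        # Prescaler values are encoded in PossibleValue/@Comment
        values = [int(v.get("Comment")) for v in param.findall("PossibleValue")]
        if values:
            prescalers[name] = values
        elif param.get("Min") or param.get("Max"):
            low, high = param.get("Min"), param.get("Max")
            limits[name] = (int(low) if low else None,
                            int(high) if high else None)
    return limits, prescalers

limits, prescalers = read_clock_limits(RCC_XML)
```

<p>A missing <code class="language-plaintext highlighter-rouge">Min</code> attribute is reported as <code class="language-plaintext highlighter-rouge">None</code> rather than guessed, since the data simply doesn’t say.</p>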
<p>Ok, so I’m not going into the data of their board support packages, because
I don’t think any health insurance covers this much exposure to XML, especially
not XML containing French comments. But feel free to take a look at your own risk,
it’s just waiting there in <code class="language-plaintext highlighter-rouge">plugins/boardmanager/boards</code> for your prying eyes.</p>
<p>Let’s move on to how we can extract this data programmatically and use it to
bring order to chaos, one example at a time. A bit like the Avengers franchise
*drags out blog post to infinity*</p>
<h2 id="generating-device-files">Generating Device Files</h2>
<p>The goal of finding machine-readable device description data obviously was to
write a program to import, clean-up and convert it into a format that’s more
agreeable to our use-case of generating a HAL.
Ironically the Device File Generator (DFG) started out in mid 2013 with the
innocently named commit
<a href="https://github.com/roboterclubaachen/xpcc/commit/1532289">“Cheap and simple parsing of the XML files”</a>.
It’s not cheap and simple anymore.</p>
<p>The DFG started out as a glorified <a href="https://en.wikipedia.org/wiki/XPath">XPath</a>
wrapper in xpcc, but then quickly devolved into some messy monster, that pulled
in data from all over the place and arranged it without much concept.
Back then we were busy porting the HAL, writing sensor drivers and
building robots, so we didn’t approach this problem structurally, and rather
fixed bugs when they occurred.</p>
<p>I won’t talk about xpcc’s DFG architecture issues in detail, instead I’ll be
showing you the problems it caused us. This way, the lessons learned are more
transferable to other formats (*cough* Device Tree *cough*), since the
device data is immutable whereas the DFG’s architecture is not.</p>
<p>Note that I rewrote the DFG from scratch for modm, so <a href="https://github.com/modm-io/modm-devices">you can have a look at the
source code</a> while reading this.
I’m continuing to use the STM32F103RBT6 for illustration, but this all works
very similarly for all STM32 and AVR devices.</p>
<h3 id="device-identifiers">Device Identifiers</h3>
<p>We needed a way to identify what device to build our HAL for, and of course we
use the manufacturer’s identifier, since it’s (hopefully) unique.
We also needed to split up the identifier string, so that the HAL can query
its traits to select what code templates to use.
For example, in xpcc we split <code class="language-plaintext highlighter-rouge">stm32f103rbt6</code> into:</p>
<center>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> stm32 f1 103 r b
{platform}{family}{name}{pin-id}{size-id}
</code></pre></div> </div>
</center>
<p>Note how we forgot the <code class="language-plaintext highlighter-rouge">t6</code> suffix. If we compare this with the documentation
on the ST ordering information scheme, you’ll see why this was a huge mistake:</p>
<center>
<p><img invertible="" src="stm32f1_ordering_info_scheme.png" width="75%" /></p>
</center>
<p>Yup, that’s right, we forgot to encode the package type, causing the DFG to select
the first device matching <code class="language-plaintext highlighter-rouge">STM32F103RB</code>! And that would be the <code class="language-plaintext highlighter-rouge">STM32F103RBHx</code>
device, since it occurs first in <code class="language-plaintext highlighter-rouge">families.xml</code>.</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><Mcu</span> <span class="na">Name=</span><span class="s">"STM32F103R(8-B)Hx"</span> <span class="na">PackageName=</span><span class="s">"TFBGA64"</span> <span class="na">RefName=</span><span class="s">"STM32F103RBHx"</span><span class="nt">></span>
<span class="c"><!-- ... --></span>
<span class="nt"><Mcu</span> <span class="na">Name=</span><span class="s">"STM32F103R(8-B)Tx"</span> <span class="na">PackageName=</span><span class="s">"LQFP64"</span> <span class="na">RefName=</span><span class="s">"STM32F103RBTx"</span><span class="nt">></span>
</code></pre></div></div>
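<p>The failure mode is easy to reproduce. The sketch below assumes the lookup was a simple “first <code class="language-plaintext highlighter-rouge">RefName</code> that starts with the identifier” search, which is my simplification and not xpcc’s actual code. The <code class="language-plaintext highlighter-rouge">&lt;Families&gt;</code> wrapper and the <code class="language-plaintext highlighter-rouge">first_match</code> helper are made up for this example.</p>

```python
import xml.etree.ElementTree as ET

# The two entries from families.xml shown above, in file order.
# The <Families> root element is an assumption made for this example.
FAMILIES_XML = """
<Families>
  <Mcu Name="STM32F103R(8-B)Hx" PackageName="TFBGA64" RefName="STM32F103RBHx"/>
  <Mcu Name="STM32F103R(8-B)Tx" PackageName="LQFP64" RefName="STM32F103RBTx"/>
</Families>
"""

def first_match(xml_text, prefix):
    """Return the package of the first Mcu whose RefName starts with prefix."""
    for mcu in ET.fromstring(xml_text).iter("Mcu"):
        if mcu.get("RefName").startswith(prefix):
            return mcu.get("PackageName")
    return None

# Without the package letter the result depends on the order in the file
without_package = first_match(FAMILIES_XML, "STM32F103RB")
# With the package letter the match is unambiguous
with_package = first_match(FAMILIES_XML, "STM32F103RBT")
```

<p>Here <code class="language-plaintext highlighter-rouge">without_package</code> ends up as the TFBGA64 entry purely because it is listed first, while <code class="language-plaintext highlighter-rouge">with_package</code> finds the LQFP64 device.</p>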
<p>So we actually used the definitions for the TFBGA64 packaged device instead of
the LQFP64 packaged device. 🤦 Incredibly this didn’t cause immediate problems,
since we first focussed on the STM32F3 and F4 families, whose functionality
is almost identical between packages.</p>
<p>However, we did notice some changes when a new version of CubeMX was released
which added or reordered devices in <code class="language-plaintext highlighter-rouge">families.xml</code>.
And then all hell broke loose when I added support for parsing the STM32F1 device
family, which couples peripheral features to memory size <em>and(!)</em> pin count:</p>
<center>
<p><img invertible="" src="stm32f1_feature_package.png" width="80%" /><br />
“32 KB Flash<sup>(1)</sup>” aka. this table isn’t complicated enough already</p>
</center>
<p>If you’re a hardware engineer at $vendor, <em>PLEASE DON’T DO THIS!</em> This is pure
punishment for anyone writing software for these chips. <strong>PLEASE DO NOT DO THIS!</strong>
You should not have to query for <em>combinations</em> of identifier traits to get your
hardware feature set. Expand your device lineup into new (orthogonal) identifier
space instead.</p>
<center>
<p><img src="not_like_this.gif" alt="" /></p>
</center>
<p>To be fair, the STM32F1 family was the first ST product to feature a Cortex-M
processor and they didn’t use this approach for any of their other STM32 families.
I forgive you, ST.</p>
<p>So for modm I looked very carefully at how to split the identifier into traits.
I made the trait composition and naming transparent to the DFG, it only operates
on a dictionary of items, sharing the same identifier mechanism with the AVRs.
Since we currently don’t have any information that depends on the temperature
range, I left it out for now. Similarly, the device revision is not considered
either.</p>
<center>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> stm32 f1 03 r b t
{platform}{family}{name}{pin}{size}{package}
</code></pre></div> </div>
</center>
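<p>As a rough illustration of this split, here is a regular expression that cuts an ordering code into these six traits. The pattern is hypothetical and only covers simple identifiers like the F103. It is not taken from modm-devices, and it silently drops the temperature range digit just like the schema above does.</p>

```python
import re

# Hypothetical pattern for simple STM32 identifiers such as stm32f103rbt6.
# The optional trailing digit (temperature range) is matched but not captured.
STM32_ID = re.compile(
    r"(?P<platform>stm32)(?P<family>[a-z]\d)(?P<name>\d{2})"
    r"(?P<pin>[a-z])(?P<size>[0-9a-z])(?P<package>[a-z])\d?$"
)

def split_identifier(string):
    """Split an identifier string into a dictionary of traits."""
    match = STM32_ID.match(string.lower())
    if match is None:
        raise ValueError(f"not a supported STM32 identifier: {string}")
    return match.groupdict()

traits = split_identifier("STM32F103RBT6")
```

<p>The result is a plain dictionary with the keys <code class="language-plaintext highlighter-rouge">platform</code>, <code class="language-plaintext highlighter-rouge">family</code>, <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">pin</code>, <code class="language-plaintext highlighter-rouge">size</code> and <code class="language-plaintext highlighter-rouge">package</code>, which is all the rest of the pipeline needs to operate on.</p>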
<p>Note how both the xpcc and modm identifier encodings differ from the official ST
ordering scheme. Since we are sharing some code across vendors (like the
Cortex-M startup code), we need to have a common naming scheme, at least for
<code class="language-plaintext highlighter-rouge">{platform}</code> and <code class="language-plaintext highlighter-rouge">{family}</code> or the equivalent for other vendors.</p>
<p>Also note that <code class="language-plaintext highlighter-rouge">{name}</code> no longer contains the trailing <code class="language-plaintext highlighter-rouge">1</code> of the family.
This is to prevent the problem in xpcc where the code template authors only
checked for the <code class="language-plaintext highlighter-rouge">{name}</code> instead of the <code class="language-plaintext highlighter-rouge">{family}</code> <em>and</em> <code class="language-plaintext highlighter-rouge">{name}</code>, for example,
<code class="language-plaintext highlighter-rouge">id["name"] == "103"</code> vs. <code class="language-plaintext highlighter-rouge">id["family"] == "f1" and id["name"] == "03"</code>.
This led to issues when we ported some peripheral drivers to the <code class="language-plaintext highlighter-rouge">L1</code> family
(similar to <code class="language-plaintext highlighter-rouge">F0/L0</code>, <code class="language-plaintext highlighter-rouge">F4/L4</code> and <code class="language-plaintext highlighter-rouge">F7/H7</code>).</p>
<h3 id="encoding-commonality">Encoding Commonality</h3>
<p>You’ve undoubtedly already noticed that the AVR and CubeMX data is quite verbose
and noisy. We didn’t want to use this data directly, hence the DFG.
However, we wanted to go a step further and cut down on duplicated data, so that
we have an easier time verifying the output of the DFG by not having to look
through <em>thousands</em> of files, but rather <em>dozens</em>.</p>
<p>At the time of this writing, <code class="language-plaintext highlighter-rouge">families.xml</code> contains 1171 STM32 devices, but
<a href="https://github.com/modm-io/modm-devices/tree/5d5285ae1b6e889676b6d04a653d26977bf127e8/devices/stm32"><code class="language-plaintext highlighter-rouge">modm-devices/devices/stm32</code></a>
only contains 62 device files, that’s ~19x fewer files than devices.</p>
<p>We observed that ST clusters their devices on their website, in their technical
documentation and in their software offerings. The coarsest regular cluster
pattern is the family, which denotes the type of Cortex-M core used among other
features. The subfamilies are then more or less arbitrarily clustered around
whatever combination of functionality ST wanted to bring to market, but the
cluster patterns of pin count, memory size and package are <em>very</em> regular and
often explicitly called out. We wanted to reflect this in our data structure too.</p>
<center>
<p><img dimmable="" src="stm32f4x9_clusters.jpg" /><br />
This <a href="http://www.st.com/en/microcontrollers/stm32f469-479.html">STM32F4x9 feature matrix</a> is extremely regular.</p>
</center>
<p>The Device Tree format deals with data duplication by allowing data specialization
through an inheritance tree and tree inclusion nodes.
However, you still have to create one leaf node for every device, so in the best
case you’d have one DT per device, or if you moved common data up the inheritance
tree, you’d have more files than devices.</p>
<p>We decided instead to <em>merge</em> our data trees for devices within similar enough
clusters and then filter out the data for <em>one</em> device on access.
We use logical OR (<code class="language-plaintext highlighter-rouge">|</code>) to combine identifier traits to declare what devices
are merged. You’ll recognize the <code class="language-plaintext highlighter-rouge"><naming-schema></code> from the previous chapter:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><device</span> <span class="na">platform=</span><span class="s">"stm32"</span> <span class="na">family=</span><span class="s">"f1"</span> <span class="na">name=</span><span class="s">"03"</span> <span class="na">pin=</span><span class="s">"c|r|t|v"</span> <span class="na">size=</span><span class="s">"8|b"</span> <span class="na">package=</span><span class="s">"h|i|t|u"</span><span class="nt">></span>
<span class="nt"><naming-schema></span>{platform}{family}{name}{pin}{size}{package}<span class="nt"></naming-schema></span>
<span class="nt"><valid-device></span>stm32f103c8t<span class="nt"></valid-device></span>
<span class="c"><!-- ... --></span>
<span class="nt"><valid-device></span>stm32f103rbt<span class="nt"></valid-device></span>
</code></pre></div></div>
<p><a href="https://github.com/modm-io/modm-devices/blob/5d5285ae1b6e889676b6d04a653d26977bf127e8/devices/stm32/stm32f1-03-8_b.xml">This device file for the F103x8/b devices</a>
therefore contains all that match the identifier pattern of
<code class="language-plaintext highlighter-rouge">r"stm32f103[crtv][8b][hitu]"</code>.
<a href="https://github.com/modm-io/modm-devices/blob/5d5285ae1b6e889676b6d04a653d26977bf127e8/tools/device/modm/device_file.py#L33-L51">The engine extracting the data set for a single device</a>
will first construct a list of all possible identifier strings via the
naming schema and the <code class="language-plaintext highlighter-rouge">device</code> combinations: 4*2*4 = 32 identifiers in this example.
It then filters these identifiers by the list in <code class="language-plaintext highlighter-rouge"><valid-device></code>, since not
every combination actually exists. Whatever device file contains the requested
identifier string is then used.</p>
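<p>A minimal sketch of this expansion could look as follows. It is a simplified stand-in for the linked engine, not a copy of it: <code class="language-plaintext highlighter-rouge">expand_identifiers</code> is a hypothetical name and the <code class="language-plaintext highlighter-rouge">valid</code> set is abridged to the two entries shown above.</p>

```python
from itertools import product
import re

def expand_identifiers(schema, traits, valid):
    """Build every identifier string from the schema, then keep the valid ones."""
    # Only the traits named in the schema take part in the expansion
    keys = re.findall(r"\{(\w+)\}", schema)
    options = [traits[key].split("|") for key in keys]
    candidates = [schema.format(**dict(zip(keys, combo)))
                  for combo in product(*options)]
    return candidates, [ident for ident in candidates if ident in valid]

# Attributes of the <device> node shown above
device = {"platform": "stm32", "family": "f1", "name": "03",
          "pin": "c|r|t|v", "size": "8|b", "package": "h|i|t|u"}
schema = "{platform}{family}{name}{pin}{size}{package}"
valid = {"stm32f103c8t", "stm32f103rbt"}  # abridged <valid-device> list

candidates, identifiers = expand_identifiers(schema, device, valid)
```

<p>For this device file <code class="language-plaintext highlighter-rouge">candidates</code> holds all 32 combinations, and <code class="language-plaintext highlighter-rouge">identifiers</code> only those that are actually listed as valid.</p>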
<p>The identifier schema does not have to include all traits either, it only has
to be unambiguous. For example the AVR device identifier schema does not contain
<code class="language-plaintext highlighter-rouge">{platform}</code> but we can infer it anyways:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><device</span> <span class="na">platform=</span><span class="s">"avr"</span> <span class="na">family=</span><span class="s">"mega"</span> <span class="na">name=</span><span class="s">"48|88|168|328"</span> <span class="na">type=</span><span class="s">"|a|p|pa"</span><span class="nt">></span>
<span class="nt"><naming-schema></span>at{family}{name}{type}<span class="nt"></naming-schema></span>
</code></pre></div></div>
<p>At first it seems unnecessary to do this reverse lookup, but it gives us a very
important property for free: The extractor does not need to <em>know</em> anything
about the identifier, and still understands the mapping of string to traits.
So passing <code class="language-plaintext highlighter-rouge">stm32f103rbt</code> is now <em>understood</em> as <code class="language-plaintext highlighter-rouge">stm32 f1 03 r b t</code>.
The disadvantage is having to first build all identifier strings, before
returning the corresponding device file. However, this mapping can be cached.</p>
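<p>Here is a sketch of such a cached reverse lookup, using the AVR schema from above. I’m using <code class="language-plaintext highlighter-rouge">functools.lru_cache</code> as the cache, which is my choice for this example and not necessarily what modm-devices does. <code class="language-plaintext highlighter-rouge">identifier_map</code> is a hypothetical name.</p>

```python
from functools import lru_cache
from itertools import product

# Attributes and schema of the AVR <device> node shown above.
# Tuples are used so that the arguments are hashable for the cache.
AVR_TRAITS = (("platform", "avr"), ("family", "mega"),
              ("name", "48|88|168|328"), ("type", "|a|p|pa"))
AVR_SCHEMA = "at{family}{name}{type}"

@lru_cache(maxsize=None)
def identifier_map(schema, traits):
    """Map every identifier string of one device file back to its traits."""
    keys = [key for key, _ in traits]
    options = [value.split("|") for _, value in traits]
    mapping = {}
    for combo in product(*options):
        ident = dict(zip(keys, combo))
        # Traits missing from the schema (here: platform) are simply ignored
        mapping[schema.format(**ident)] = ident
    return mapping

traits = identifier_map(AVR_SCHEMA, AVR_TRAITS)["atmega328p"]
```

<p>Note that the lookup never parses <code class="language-plaintext highlighter-rouge">atmega328p</code>. It only compares strings, and still returns the <code class="language-plaintext highlighter-rouge">platform</code> trait, which does not even appear in the identifier.</p>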
<p>The device file can now use the traits as filters by prefixing them with <code class="language-plaintext highlighter-rouge">device-</code>.
For our example, the device file continues with declaring the core driver instance,
which contains the memory map and vector table. The devices here only differ in
Flash size, otherwise they are identical:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><driver</span> <span class="na">name=</span><span class="s">"core"</span> <span class="na">type=</span><span class="s">"cortex-m3"</span><span class="nt">></span>
<span class="nt"><memory</span> <span class="na">device-size=</span><span class="s">"8"</span> <span class="na">name=</span><span class="s">"flash"</span> <span class="na">access=</span><span class="s">"rx"</span> <span class="na">start=</span><span class="s">"0x8000000"</span> <span class="na">size=</span><span class="s">"65536"</span><span class="nt">/></span>
<span class="nt"><memory</span> <span class="na">device-size=</span><span class="s">"b"</span> <span class="na">name=</span><span class="s">"flash"</span> <span class="na">access=</span><span class="s">"rx"</span> <span class="na">start=</span><span class="s">"0x8000000"</span> <span class="na">size=</span><span class="s">"131072"</span><span class="nt">/></span>
<span class="nt"><memory</span> <span class="na">name=</span><span class="s">"sram1"</span> <span class="na">access=</span><span class="s">"rwx"</span> <span class="na">start=</span><span class="s">"0x20000000"</span> <span class="na">size=</span><span class="s">"20480"</span><span class="nt">/></span>
<span class="nt"><vector</span> <span class="na">position=</span><span class="s">"0"</span> <span class="na">name=</span><span class="s">"WWDG"</span><span class="nt">/></span>
<span class="nt"><vector</span> <span class="na">position=</span><span class="s">"1"</span> <span class="na">name=</span><span class="s">"PVD"</span><span class="nt">/></span>
<span class="c"><!-- ... --></span>
<span class="nt"><vector</span> <span class="na">position=</span><span class="s">"42"</span> <span class="na">name=</span><span class="s">"USBWakeUp"</span><span class="nt">/></span>
</code></pre></div></div>
<p>By applying some simple combinatorics math we can find the minimal trait set that
uniquely describes this difference and can push this filter as far up the data
tree as possible while still being unambiguous and therefore losslessly
reconstructible for all merged device data.
This is all done for the sole purpose of optimizing for human readability, so an
embedded engineer with some experience can just look at this data and say:
“This filter looks too noisy to me, so something is probably wrong here” 🤓
*sound of datasheet pages flipping*.</p>
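<p>The idea behind this search can be sketched in a few lines. This is a brute-force illustration under my own assumptions and not the actual DFG implementation: <code class="language-plaintext highlighter-rouge">minimal_filter</code> is a hypothetical function, and <code class="language-plaintext highlighter-rouge">subset</code> must only contain devices that are also in <code class="language-plaintext highlighter-rouge">devices</code>.</p>

```python
from itertools import combinations

def minimal_filter(devices, subset, keys):
    """Find the smallest trait filter that selects exactly `subset` from `devices`.

    `keys` is ordered like the identifier, so traits further left are preferred.
    """
    if len(subset) == len(devices):
        return {}  # the node applies to all merged devices, no filter needed
    for size in range(1, len(keys) + 1):
        for chosen in combinations(keys, size):
            allowed = {key: {dev[key] for dev in subset} for key in chosen}
            selected = [dev for dev in devices
                        if all(dev[key] in allowed[key] for key in chosen)]
            # The filter always selects the subset, so equal length means exact
            if len(selected) == len(subset):
                return {key: "|".join(sorted(values))
                        for key, values in allowed.items()}
    return None  # no single filter is unambiguous

# All pin/size combinations of the F103x8/b file, ignoring the package
devices = [{"pin": pin, "size": size} for pin in "crtv" for size in "8b"]
has_64k_flash = [dev for dev in devices if dev["size"] == "8"]

flash_filter = minimal_filter(devices, has_64k_flash, ("pin", "size"))
```

<p>For the Flash example the pin trait alone cannot separate the devices, so the search settles on the size trait. When it returns <code class="language-plaintext highlighter-rouge">None</code>, the data has to be split across several nodes.</p>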
<p>Here is an example of what I so dramatically complained about before: The STM32F1
peripheral feature set is coupled to the device’s pin count: F103 devices with
just <a href="https://github.com/modm-io/modm-devices/blob/5d5285ae1b6e889676b6d04a653d26977bf127e8/devices/stm32/stm32f1-03-8_b.xml#L68-L109">36 pins have fewer instances of these peripherals</a>:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><driver</span> <span class="na">name=</span><span class="s">"i2c"</span> <span class="na">type=</span><span class="s">"stm32"</span><span class="nt">></span>
<span class="nt"><instance</span> <span class="na">value=</span><span class="s">"1"</span><span class="nt">/></span>
<span class="nt"><instance</span> <span class="na">device-pin=</span><span class="s">"c|r|v"</span> <span class="na">value=</span><span class="s">"2"</span><span class="nt">/></span>
<span class="nt"></driver></span>
<span class="nt"><driver</span> <span class="na">name=</span><span class="s">"spi"</span> <span class="na">type=</span><span class="s">"stm32"</span><span class="nt">></span>
<span class="nt"><instance</span> <span class="na">value=</span><span class="s">"1"</span><span class="nt">/></span>
<span class="nt"><instance</span> <span class="na">device-pin=</span><span class="s">"c|r|v"</span> <span class="na">value=</span><span class="s">"2"</span><span class="nt">/></span>
<span class="nt"></driver></span>
<span class="nt"><driver</span> <span class="na">name=</span><span class="s">"usart"</span> <span class="na">type=</span><span class="s">"stm32"</span><span class="nt">></span>
<span class="nt"><instance</span> <span class="na">value=</span><span class="s">"1"</span><span class="nt">/></span>
<span class="nt"><instance</span> <span class="na">value=</span><span class="s">"2"</span><span class="nt">/></span>
<span class="nt"><instance</span> <span class="na">device-pin=</span><span class="s">"c|r|v"</span> <span class="na">value=</span><span class="s">"3"</span><span class="nt">/></span>
<span class="nt"></driver></span>
</code></pre></div></div>
<p>Of course both the pin count and the package influence the number of available
GPIOs and signals. The algorithm here detected that using the pin count as a
filter is enough to safely reconstruct the tree, so the <code class="language-plaintext highlighter-rouge">device-package</code> is
missing (it prioritizes traits further “left” in the identifier):</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><driver</span> <span class="na">name=</span><span class="s">"gpio"</span> <span class="na">type=</span><span class="s">"stm32-f1"</span><span class="nt">></span>
<span class="c"><!-- ... --></span>
<span class="nt"><gpio</span> <span class="na">device-pin=</span><span class="s">"r|v"</span> <span class="na">port=</span><span class="s">"c"</span> <span class="na">pin=</span><span class="s">"10"</span><span class="nt">/></span>
<span class="nt"><gpio</span> <span class="na">device-pin=</span><span class="s">"r|v"</span> <span class="na">port=</span><span class="s">"c"</span> <span class="na">pin=</span><span class="s">"11"</span><span class="nt">></span>
<span class="nt"><signal</span> <span class="na">driver=</span><span class="s">"adc"</span> <span class="na">instance=</span><span class="s">"1"</span> <span class="na">name=</span><span class="s">"exti11"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">driver=</span><span class="s">"adc"</span> <span class="na">instance=</span><span class="s">"2"</span> <span class="na">name=</span><span class="s">"exti11"</span><span class="nt">/></span>
<span class="nt"></gpio></span>
<span class="nt"><gpio</span> <span class="na">device-pin=</span><span class="s">"r|v"</span> <span class="na">port=</span><span class="s">"c"</span> <span class="na">pin=</span><span class="s">"12"</span><span class="nt">/></span>
<span class="nt"><gpio</span> <span class="na">device-pin=</span><span class="s">"c|r|v"</span> <span class="na">port=</span><span class="s">"c"</span> <span class="na">pin=</span><span class="s">"13"</span><span class="nt">></span>
<span class="nt"><signal</span> <span class="na">driver=</span><span class="s">"rtc"</span> <span class="na">name=</span><span class="s">"out"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">driver=</span><span class="s">"rtc"</span> <span class="na">name=</span><span class="s">"tamper"</span><span class="nt">/></span>
<span class="nt"></gpio></span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">device-</code> filter traits are ORed, multiple filters on the same node ANDed,
and the nodes themselves ORed together again. Keen observers will point out that
this can create overly broad filters which would make for incorrect reconstruction.
For these cases we have to create two nodes with the same data, but different
filters to avoid ambiguity. Here is an example from
<a href="https://github.com/modm-io/modm-devices/blob/5d5285ae1b6e889676b6d04a653d26977bf127e8/devices/stm32/stm32f4-27_29_37_39.xml#L586-L587">the STM32F4{27,29,37,39} device file</a>:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nt"><gpio</span> <span class="na">port=</span><span class="s">"c"</span> <span class="na">pin=</span><span class="s">"3"</span><span class="nt">></span>
<span class="c"><!-- ... --></span>
<span class="nt"><signal</span> <span class="na">device-name=</span><span class="s">"27|37"</span> <span class="na">device-pin=</span><span class="s">"a|i|v|z"</span> <span class="na">af=</span><span class="s">"12"</span> <span class="na">driver=</span><span class="s">"fmc"</span> <span class="na">name=</span><span class="s">"sdcke0"</span><span class="nt">/></span>
<span class="nt"><signal</span> <span class="na">device-name=</span><span class="s">"29|39"</span> <span class="na">device-pin=</span><span class="s">"a|b|i|n|z"</span> <span class="na">af=</span><span class="s">"12"</span> <span class="na">driver=</span><span class="s">"fmc"</span> <span class="na">name=</span><span class="s">"sdcke0"</span><span class="nt">/></span>
<span class="nt"></gpio></span>
</code></pre></div></div>
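<p>To see why these two nodes cannot simply be merged into one, here is a small sketch of the filter semantics. The <code class="language-plaintext highlighter-rouge">matches</code> and <code class="language-plaintext highlighter-rouge">has_signal</code> helpers and the <code class="language-plaintext highlighter-rouge">merged_node</code> are made up for illustration. Only the two split nodes come from the device file.</p>

```python
def matches(node, device):
    """Values within a filter are ORed, filters on the same node are ANDed."""
    return all(device[key[len("device-"):]] in value.split("|")
               for key, value in node.items() if key.startswith("device-"))

def has_signal(nodes, device):
    """Nodes carrying the same data are ORed together."""
    return any(matches(node, device) for node in nodes)

# The two sdcke0 nodes from the device file
split_nodes = [
    {"device-name": "27|37", "device-pin": "a|i|v|z"},
    {"device-name": "29|39", "device-pin": "a|b|i|n|z"},
]
# A naive merge of both filters into a single node
merged_node = {"device-name": "27|29|37|39", "device-pin": "a|b|i|n|v|z"}

f427v = {"name": "27", "pin": "v"}
f429v = {"name": "29", "pin": "v"}

correct = (has_signal(split_nodes, f427v), has_signal(split_nodes, f429v))
too_broad = matches(merged_node, f429v)
```

<p>The split nodes include the signal for <code class="language-plaintext highlighter-rouge">name=27, pin=v</code> and exclude it for <code class="language-plaintext highlighter-rouge">name=29, pin=v</code>, whereas the merged node would wrongly hand it to both.</p>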
<p>Hm, but that filter does look suspiciously noisy, doesn’t it? This filter pattern is
repeated for the <a href="https://github.com/modm-io/modm-devices/blob/5d5285ae1b6e889676b6d04a653d26977bf127e8/devices/stm32/stm32f4-27_29_37_39.xml#L457-L458"><code class="language-plaintext highlighter-rouge">sdne[1:0]</code></a>
and <a href="https://github.com/modm-io/modm-devices/blob/5d5285ae1b6e889676b6d04a653d26977bf127e8/devices/stm32/stm32f4-27_29_37_39.xml#L558-L559"><code class="language-plaintext highlighter-rouge">sdnwe</code></a>
signals, which all belong to the SDRAM controller in the FMC.
And according to this data set they seem to be unavailable for the LQFP100
package? Hm, better <del>call Saul</del> check the datasheets:</p>
<center>
<p><img invertible="" src="stm32f4xx_fmc_sdcke0.png" width="65%" /></p>
<p><img invertible="" src="stm32f4xx_fmc_sdcke0_af.png" /><br />
Huh, but the signals <em>do</em> exist for the LQFP100 package!?</p>
<p><img invertible="" src="stm32f4xx_fmc_100.png" /><br />
“FMC: Yes<sup>(1)</sup>”. Oh, FFS!</p>
</center>
<p>I checked with CubeMX and the GPIO configurator doesn’t allow you to set SDRAM
signals in the LQFP100 package, and there are no <code class="language-plaintext highlighter-rouge">STM32F4[23]7[BN]</code> devices, so
everything is fine, I guess? Nothing to see here folks, move along,
the filter algorithm encoded this shit correctly. 🙃</p>
<center>
<p><img dimmable="" src="do_not_want.gif" width="30%" /></p>
</center>
<p>Anyways, I like our device file format a lot, since it describes the device’s
hardware in such a compact and concise form. However, it doesn’t scale gracefully
at all for data that shares fewer commonalities between devices in the current
clusters.</p>
<h3 id="data-pipeline">Data Pipeline</h3>
<p>For my rewrite of the DFG for modm I wanted to improve the correctness of device
merges, remove device specific knowledge as much as possible, support multiple
output formats and rename less data.
I’ve already hinted at solutions to some of these in the previous chapters, so
let’s have a proper look at them now.</p>
<center>
<p><img src="dfg_architecture.png" alt="" /></p>
</center>
<p>The DFG has three parts: frontend, optimizer and backend. Here yellow stands for
<span style="background-color:rgb(255,255,202);">input data</span>, blue for
<span style="background-color:rgb(192,217,254);">data conversion</span>, red for
<span style="background-color:rgb(250,202,199);">intermediate representation (IR)</span> and green for
<span style="background-color:rgb(211,234,205);">output data</span>.
I’ve already covered the vendor input data and the device merging in much detail.</p>
<p><a href="https://github.com/modm-io/modm-devices/blob/8d38650186764c879309fd946b29e94821e6579d/tools/generator/dfg/stm32/stm_device_tree.py#L42-L345">All the ugly is in the parser</a>,
it reads the CubeMX data in the same manner I’ve described previously, performs
plausibility and format checks on it, and finally normalizes it into a simple
Python dictionary. This is mostly mind-numbingly stupid code to write,
since you have to XPath query the CubeMX sources, deal with all the edge cases
in the results and normalize all data relative to all devices.
Ugly to write, ugly to read, but it gets the job done.</p>
<p>Additional curated data gets injected in this step too. The CubeMX data
contains a hardware IP version, which seems to correlate loosely with the peripheral’s
feature set, however, I didn’t find it very useful to distinguish between them.
So instead I looked up how all peripherals work in the documentation and <a href="https://github.com/modm-io/modm-devices/blob/5d5285ae1b6e889676b6d04a653d26977bf127e8/tools/generator/dfg/stm32/stm_peripherals.py#L298-L325">grouped
them again manually</a>.
The device file <code class="language-plaintext highlighter-rouge">driver/@type</code> name comes from this data.</p>
<p>For example, here we can see that the entire STM32 platform only has three
different I<sup>2</sup>C hardware implementations, one of which only differs
with the addition of a digital noise filter.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">'i2c'</span><span class="p">:</span> <span class="p">[{</span>
    <span class="s">'instances'</span><span class="p">:</span> <span class="s">'*'</span><span class="p">,</span>
    <span class="s">'groups'</span><span class="p">:</span> <span class="p">[</span>
        <span class="p">{</span>
            <span class="c1"># This hardware can go up to 1MHz (Fast Mode Plus)
</span>            <span class="s">'hardware'</span><span class="p">:</span> <span class="s">'stm32-extended'</span><span class="p">,</span>
            <span class="s">'features'</span><span class="p">:</span> <span class="p">[],</span>
            <span class="s">'devices'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'family'</span><span class="p">:</span> <span class="p">[</span><span class="s">'f0'</span><span class="p">,</span> <span class="s">'f3'</span><span class="p">,</span> <span class="s">'f7'</span><span class="p">]}]</span>
        <span class="p">},{</span>
            <span class="s">'hardware'</span><span class="p">:</span> <span class="s">'stm32l4'</span><span class="p">,</span>
            <span class="s">'features'</span><span class="p">:</span> <span class="p">[</span><span class="s">'dnf'</span><span class="p">],</span>
            <span class="s">'devices'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'family'</span><span class="p">:</span> <span class="p">[</span><span class="s">'l4'</span><span class="p">]}]</span>
        <span class="p">},{</span>
            <span class="c1"># Some F4 have a digital noise filter
</span>            <span class="s">'hardware'</span><span class="p">:</span> <span class="s">'stm32'</span><span class="p">,</span>
            <span class="s">'features'</span><span class="p">:</span> <span class="p">[</span><span class="s">'dnf'</span><span class="p">],</span>
            <span class="s">'devices'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'family'</span><span class="p">:</span> <span class="p">[</span><span class="s">'f4'</span><span class="p">],</span> <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'27'</span><span class="p">,</span> <span class="s">'29'</span><span class="p">,</span> <span class="s">'37'</span><span class="p">,</span> <span class="s">'39'</span><span class="p">,</span> <span class="s">'46'</span><span class="p">,</span> <span class="s">'69'</span><span class="p">,</span> <span class="s">'79'</span><span class="p">]}]</span>
        <span class="p">},{</span>
            <span class="s">'hardware'</span><span class="p">:</span> <span class="s">'stm32'</span><span class="p">,</span>
            <span class="s">'features'</span><span class="p">:</span> <span class="p">[],</span>
            <span class="s">'devices'</span><span class="p">:</span> <span class="s">'*'</span>
        <span class="p">}</span>
    <span class="p">]</span>
<span class="p">}]</span>
</code></pre></div></div>
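<p>A minimal sketch of how such a grouping table could be queried. The helper names <code>matches</code> and <code>resolve_hardware</code> are made up for illustration, and the sketch assumes that the first matching group wins, which would explain why the catch-all <code>'*'</code> group comes last. The DFG’s real lookup may work differently.</p>

```python
# Trimmed-down copy of the I2C grouping table shown above.
I2C_GROUPS = [
    {"hardware": "stm32-extended", "features": [],
     "devices": [{"family": ["f0", "f3", "f7"]}]},
    {"hardware": "stm32l4", "features": ["dnf"],
     "devices": [{"family": ["l4"]}]},
    {"hardware": "stm32", "features": ["dnf"],
     "devices": [{"family": ["f4"],
                  "name": ["27", "29", "37", "39", "46", "69", "79"]}]},
    {"hardware": "stm32", "features": [], "devices": "*"},
]


def matches(identifier, filters):
    """Return True if the identifier satisfies any of the device filters."""
    if filters == "*":
        return True
    return any(all(identifier.get(key) in allowed
                   for key, allowed in device_filter.items())
               for device_filter in filters)


def resolve_hardware(identifier, groups=I2C_GROUPS):
    """Return (hardware, features) of the first group matching the identifier.

    Assumption: groups are checked in order and the first match wins.
    """
    for group in groups:
        if matches(identifier, group["devices"]):
            return group["hardware"], group["features"]
    raise LookupError("no group matches {}".format(identifier))


print(resolve_hardware({"family": "f1", "name": "03"}))
print(resolve_hardware({"family": "f4", "name": "29"}))
print(resolve_hardware({"family": "f4", "name": "07"}))
```

<p>With this ordering an F429 resolves to the <code>stm32</code> hardware with the <code>dnf</code> feature, while an F407 or F103 falls through to the plain <code>stm32</code> entry.</p>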
<p>All names of peripherals, instances and signals are preserved as they are, so that
they match the documentation. The only exceptions are names that wouldn’t
be valid identifiers in most programming languages.
For our STM32F103RBT example, we split up and duplicate these system signals:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SYS_JTCK-SWCLK => sys.jtck + sys.swclk
SYS_JTDO-TRACESWO => sys.jtdo + sys.traceswo
SYS_JTMS-SWDIO => sys.jtms + sys.swdio
</code></pre></div></div>
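<p>The split itself can be sketched in a few lines. This is only an illustration for the three names listed above: <code>split_signal</code> is a made-up helper, and it does not handle peripheral names with instance numbers or any of the other special cases a real converter has to deal with.</p>

```python
def split_signal(raw_name):
    """Split a raw signal name like 'SYS_JTCK-SWCLK' into (driver, signal) pairs.

    Assumption: the part before the first underscore is the driver and the
    remainder is a dash-separated list of alternative signal names.
    """
    driver, _, signals = raw_name.partition("_")
    return [(driver.lower(), signal.lower()) for signal in signals.split("-")]


for raw in ("SYS_JTCK-SWCLK", "SYS_JTDO-TRACESWO", "SYS_JTMS-SWDIO"):
    pairs = split_signal(raw)
    print(raw, "=>", " + ".join("{}.{}".format(d, s) for d, s in pairs))
```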
<p>The dictionary returned by the parser is then passed onto <a href="https://github.com/modm-io/modm-devices/blob/8d38650186764c879309fd946b29e94821e6579d/tools/generator/dfg/stm32/stm_device_tree.py#L360-L487">a platform specific
converter</a>
that transforms it into the DFG’s intermediate representation.
Here the raw data is formatted into a glorified tree structure, which has similar
semantics to a very restricted form of XML (i.e. a node’s attributes are stored separately
from its children), and each node is annotated with the device’s identifier.</p>
<p>Here the memory maps and the interrupt vector table are added to the <code class="language-plaintext highlighter-rouge">name="core"</code>
driver node we saw before. The raw data already contains the memories and
vectors with the right naming scheme, so it’s easy to just add them here.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">section</span> <span class="ow">in</span> <span class="n">p</span><span class="p">[</span><span class="s">"memories"</span><span class="p">]:</span>
    <span class="n">memory_node</span> <span class="o">=</span> <span class="n">core_driver</span><span class="p">.</span><span class="n">addChild</span><span class="p">(</span><span class="s">"memory"</span><span class="p">)</span>
    <span class="n">memory_node</span><span class="p">.</span><span class="n">setAttributes</span><span class="p">([</span><span class="s">"name"</span><span class="p">,</span> <span class="s">"access"</span><span class="p">,</span> <span class="s">"start"</span><span class="p">,</span> <span class="s">"size"</span><span class="p">],</span> <span class="n">section</span><span class="p">)</span>
<span class="k">for</span> <span class="n">vector</span> <span class="ow">in</span> <span class="n">p</span><span class="p">[</span><span class="s">"interrupts"</span><span class="p">]:</span>
    <span class="n">vector_node</span> <span class="o">=</span> <span class="n">core_driver</span><span class="p">.</span><span class="n">addChild</span><span class="p">(</span><span class="s">"vector"</span><span class="p">)</span>
    <span class="n">vector_node</span><span class="p">.</span><span class="n">setAttributes</span><span class="p">([</span><span class="s">"position"</span><span class="p">,</span> <span class="s">"name"</span><span class="p">],</span> <span class="n">vector</span><span class="p">)</span>
<span class="c1"># sort the node children by start address and size
</span><span class="n">core_driver</span><span class="p">.</span><span class="n">addSortKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">e</span><span class="p">:</span> <span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="s">"start"</span><span class="p">],</span> <span class="mi">16</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="s">"size"</span><span class="p">]))</span>
                       <span class="k">if</span> <span class="n">e</span><span class="p">.</span><span class="n">name</span> <span class="o">==</span> <span class="s">"memory"</span> <span class="k">else</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span>
<span class="c1"># sort the node children by vector number and name
</span><span class="n">core_driver</span><span class="p">.</span><span class="n">addSortKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">e</span><span class="p">:</span> <span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">e</span><span class="p">[</span><span class="s">"position"</span><span class="p">]),</span> <span class="n">e</span><span class="p">[</span><span class="s">"name"</span><span class="p">])</span>
                       <span class="k">if</span> <span class="n">e</span><span class="p">.</span><span class="n">name</span> <span class="o">==</span> <span class="s">"vector"</span> <span class="k">else</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="s">""</span><span class="p">))</span>
</code></pre></div></div>
<p>I’m adding two sort keys to the core driver node here, to bring the entire
tree into canonical order. This is an absolute requirement for the reproducibility of
the results, otherwise I wouldn’t be able to tell what data changed if the
line order came out differently on each invocation.</p>
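<p>To see why two sort keys are enough to get a stable line order, here is a toy model of such a node. The <code>Node</code> class and its method names are invented for this sketch, and it assumes the keys are applied one after another with a stable sort. The DFG’s real tree class may apply them differently.</p>

```python
class Node:
    """Toy stand-in for an IR tree node: a name, attributes and children."""

    def __init__(self, name, attributes=None):
        self.name = name
        self.attributes = dict(attributes or {})
        self.children = []
        self.sort_keys = []

    def __getitem__(self, key):
        return self.attributes[key]

    def add_child(self, name, attributes=None):
        child = Node(name, attributes)
        self.children.append(child)
        return child

    def add_sort_key(self, key):
        self.sort_keys.append(key)

    def sorted_children(self):
        """Apply every sort key in turn; sorted() is stable, so earlier
        orderings survive among children that compare equal later on."""
        result = list(self.children)
        for key in self.sort_keys:
            result = sorted(result, key=key)
        return result


core = Node("driver", {"name": "core"})
# Children are added in a deliberately scrambled order.
core.add_child("vector", {"position": "42", "name": "USBWakeUp"})
core.add_child("memory", {"name": "sram1", "start": "0x20000000", "size": "20480"})
core.add_child("vector", {"position": "0", "name": "WWDG"})
core.add_child("memory", {"name": "flash", "start": "0x8000000", "size": "131072"})

core.add_sort_key(lambda e: (int(e["start"], 16), int(e["size"]))
                  if e.name == "memory" else (-1, -1))
core.add_sort_key(lambda e: (int(e["position"]), e["name"])
                  if e.name == "vector" else (-1, ""))

for child in core.sorted_children():
    print(child.name, child["name"])
```

<p>Regardless of insertion order, the memories come out sorted by start address, followed by the vectors sorted by position.</p>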
<p>It’s time to merge the device IRs now. The device clustering is curated manually via
<a href="https://github.com/modm-io/modm-devices/blob/5d5285ae1b6e889676b6d04a653d26977bf127e8/tools/generator/dfg/stm32/stm_groups.py">a large list of identifier trait groups</a>.
I considered using some kind of heuristic to automate this,
but the manual list works really well, particularly for the AVR and STM32F1 devices.
It’s difficult to come up with a metric that accurately describes how annoyed
I feel when looking at wrongfully merged device files with lotsa noisy filters. 😤</p>
<p>The STM32F103 devices are split into these four groups:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
    <span class="s">'family'</span><span class="p">:</span> <span class="p">[</span><span class="s">'f1'</span><span class="p">],</span>
    <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'03'</span><span class="p">],</span>
    <span class="s">'size'</span><span class="p">:</span> <span class="p">[</span><span class="s">'4'</span><span class="p">,</span> <span class="s">'6'</span><span class="p">]</span>
<span class="p">},{</span>
    <span class="s">'family'</span><span class="p">:</span> <span class="p">[</span><span class="s">'f1'</span><span class="p">],</span>
    <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'03'</span><span class="p">],</span>
    <span class="s">'size'</span><span class="p">:</span> <span class="p">[</span><span class="s">'8'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">]</span>
<span class="p">},{</span>
    <span class="s">'family'</span><span class="p">:</span> <span class="p">[</span><span class="s">'f1'</span><span class="p">],</span>
    <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'03'</span><span class="p">],</span>
    <span class="s">'size'</span><span class="p">:</span> <span class="p">[</span><span class="s">'c'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">,</span> <span class="s">'e'</span><span class="p">]</span>
<span class="p">},{</span>
    <span class="s">'family'</span><span class="p">:</span> <span class="p">[</span><span class="s">'f1'</span><span class="p">],</span>
    <span class="s">'name'</span><span class="p">:</span> <span class="p">[</span><span class="s">'03'</span><span class="p">],</span>
    <span class="s">'size'</span><span class="p">:</span> <span class="p">[</span><span class="s">'f'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In case you’re curious how bad it would be with just one large F103 group,
<a href="https://gist.github.com/salkinium/95e3bf6322468c56beef9dc6c7bbaa3f">here is a gist with the resulting device file</a>.
It’s not as bad as it could be, but still much harder to read.</p>
<p>At this point the merged IR for our F103RBT device basically already looks like
the finished device file, including identifier filters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>device <> stm32f103[c|r|t|v][8|b][h|i|t|u]
. driver <name:core type:cortex-m3>
. memory <name:flash access:rx start:0x8000000 size:65536> stm32f103[c|r|t|v]8[h|t|u]
. memory <name:flash access:rx start:0x8000000 size:131072> stm32f103[c|r|t|v]b[h|i|t|u]
. memory <name:sram1 access:rwx start:0x20000000 size:20480>
. vector <position:0 name:WWDG>
...
. vector <position:42 name:USBWakeUp>
. driver <name:i2c type:stm32>
. instance <value:1>
. instance <value:2> stm32f103[c|r|v][8|b][h|i|t|u]
. driver <name:spi type:stm32>
. instance <value:1>
. instance <value:2> stm32f103[c|r|v][8|b][h|i|t|u]
</code></pre></div></div>
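<p>The bracketed identifier filters at the end of some lines can be read as compact alternations. As a rough sketch of that reading (not the parser’s actual code; <code>filter_to_regex</code> and <code>applies_to</code> are made-up names), a filter can be turned into a regular expression and tested against a part name:</p>

```python
import re


def filter_to_regex(device_filter):
    """Turn a filter like 'stm32f103[c|r|v][8|b]' into a compiled regex."""
    pattern = ""
    # re.split with a capture group keeps the bracketed parts in the result.
    for part in re.split(r"(\[[^\]]*\])", device_filter):
        if part.startswith("["):
            options = part[1:-1].split("|")
            pattern += "(?:" + "|".join(re.escape(o) for o in options) + ")"
        else:
            pattern += re.escape(part)
    return re.compile(pattern)


def applies_to(device_filter, partname):
    """Return True if the whole part name is covered by the filter."""
    return filter_to_regex(device_filter).fullmatch(partname) is not None


second_instance = "stm32f103[c|r|v][8|b][h|i|t|u]"
print(applies_to(second_instance, "stm32f103rbt"))   # 64-pin device
print(applies_to(second_instance, "stm32f103tbu"))   # 36-pin device
print(applies_to("stm32f103[c|r|t|v]8[h|t|u]", "stm32f103rbt"))
```

<p>Under this reading the second I<sup>2</sup>C and SPI instances apply to the <code>r</code> pin variant but not to the <code>t</code> one, and the 64 kB flash entry does not apply to a size <code>b</code> device.</p>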
<p>I’ve already described the device file format above. However, one additional
testing step is done before the DFG is finished: a copy of every single device
file is taken before merging, so that it can be compared with the device files
that are extracted from the merged one. This is a brute-force test to make sure
the filter algorithms performed correctly.</p>
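<p>The idea behind this round-trip test can be shown with a heavily simplified model, in which a device is just a set of strings and the “merged file” records which devices each item applies to. The <code>merge</code> and <code>extract</code> functions here are illustrative stand-ins, not the DFG’s algorithms.</p>

```python
import copy


def merge(devices):
    """Merge per-device item sets into one table: item -> devices that have it."""
    merged = {}
    for partname, items in devices.items():
        for item in items:
            merged.setdefault(item, set()).add(partname)
    return merged


def extract(merged, partname):
    """Recover one device's item set from the merged table."""
    return {item for item, partnames in merged.items() if partname in partnames}


devices = {
    "stm32f103rbt": {"i2c:1", "i2c:2", "spi:1", "spi:2"},
    "stm32f103tbu": {"i2c:1", "spi:1"},
}
before = copy.deepcopy(devices)   # snapshot taken before merging
merged = merge(devices)

# Brute force: every device extracted from the merged data must equal its snapshot.
for partname, expected in before.items():
    assert extract(merged, partname) == expected, partname
print("round trip ok for", len(before), "devices")
```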
<p>On a side note, the conversion from IR to device file format can be performed at
any time, so that last merge step is strictly speaking optional. This is useful for debugging,
but also if you want to output this data in a format that has no
merge mechanism like the device file’s, such as plain JSON.</p>
<h2 id="using-device-files">Using Device Files</h2>
<p>So now that we have all this data, let’s have some fun with it.
modm-devices comes not only with the DFG but also with a device file parser,
which can be used like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">modm.parser</span><span class="p">,</span> <span class="n">glob</span>
<span class="o">>>></span> <span class="n">devices</span> <span class="o">=</span> <span class="p">{}</span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">glob</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">"path/to/modm-devices/devices/**/*.xml"</span><span class="p">):</span>
<span class="p">...</span>     <span class="k">for</span> <span class="n">device</span> <span class="ow">in</span> <span class="n">modm</span><span class="p">.</span><span class="n">parser</span><span class="p">.</span><span class="n">DeviceParser</span><span class="p">().</span><span class="n">parse</span><span class="p">(</span><span class="n">filename</span><span class="p">).</span><span class="n">get_devices</span><span class="p">():</span>
<span class="p">...</span>         <span class="n">devices</span><span class="p">[</span><span class="n">device</span><span class="p">.</span><span class="n">partname</span><span class="p">]</span> <span class="o">=</span> <span class="n">device</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">devices</span><span class="p">[</span><span class="s">"stm32f103rbt"</span><span class="p">].</span><span class="n">properties</span>
<span class="p">{</span><span class="s">'driver'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'memory'</span><span class="p">:</span> <span class="p">[{</span><span class="s">'access'</span><span class="p">:</span> <span class="s">'rx'</span><span class="p">,</span>
<span class="s">'name'</span><span class="p">:</span> <span class="s">'flash'</span><span class="p">,</span>
<span class="s">'size'</span><span class="p">:</span> <span class="s">'131072'</span><span class="p">,</span>
<span class="s">'start'</span><span class="p">:</span> <span class="s">'0x8000000'</span><span class="p">},</span>
<span class="p">{</span><span class="s">'access'</span><span class="p">:</span> <span class="s">'rwx'</span><span class="p">,</span>
<span class="s">'name'</span><span class="p">:</span> <span class="s">'sram1'</span><span class="p">,</span>
<span class="s">'size'</span><span class="p">:</span> <span class="s">'20480'</span><span class="p">,</span>
<span class="s">'start'</span><span class="p">:</span> <span class="s">'0x20000000'</span><span class="p">}],</span>
<span class="s">'name'</span><span class="p">:</span> <span class="s">'core'</span><span class="p">,</span>
<span class="s">'type'</span><span class="p">:</span> <span class="s">'cortex-m3'</span><span class="p">,</span>
<span class="p">...</span> <span class="p">}]</span>
<span class="p">}</span>
</code></pre></div></div>
<p>There are some built-in convenience functions for accessing some of the common
data in the device files:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">device</span> <span class="o">=</span> <span class="n">devices</span><span class="p">[</span><span class="s">"stm32f103rbt"</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">device</span><span class="p">.</span><span class="n">identifier</span>
<span class="n">OrderedDict</span><span class="p">([(</span><span class="s">'platform'</span><span class="p">,</span> <span class="s">'stm32'</span><span class="p">),</span> <span class="p">(</span><span class="s">'family'</span><span class="p">,</span> <span class="s">'f1'</span><span class="p">),</span> <span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'03'</span><span class="p">),</span> <span class="p">(</span><span class="s">'pin'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">),</span> <span class="p">(</span><span class="s">'size'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">),</span> <span class="p">(</span><span class="s">'package'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">)])</span>
<span class="o">>>></span> <span class="n">device</span><span class="p">.</span><span class="n">has_driver</span><span class="p">(</span><span class="s">"usart:avr"</span><span class="p">)</span>
<span class="bp">False</span>
<span class="o">>>></span> <span class="n">device</span><span class="p">.</span><span class="n">has_driver</span><span class="p">(</span><span class="s">"usart:stm32"</span><span class="p">)</span>
<span class="bp">True</span>
<span class="o">>>></span> <span class="n">device</span><span class="p">.</span><span class="n">get_driver</span><span class="p">(</span><span class="s">"usart:stm32"</span><span class="p">)</span>
<span class="p">{</span><span class="s">'instance'</span><span class="p">:</span> <span class="p">[</span><span class="s">'1'</span><span class="p">,</span> <span class="s">'2'</span><span class="p">,</span> <span class="s">'3'</span><span class="p">],</span> <span class="s">'name'</span><span class="p">:</span> <span class="s">'usart'</span><span class="p">,</span> <span class="s">'type'</span><span class="p">:</span> <span class="s">'stm32'</span><span class="p">}</span>
</code></pre></div></div>
<p>I’ve also written a short <code class="language-plaintext highlighter-rouge">stats</code> script that allows you to compute some very basic
information about the device file collection:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nv">$ </span>python3 tools/device/scripts/stats <span class="nt">--count</span>
1355 devices
<span class="nv">$ </span>python3 tools/device/scripts/stats <span class="nt">--driver</span>
<span class="o">{</span>
<span class="s2">"ac"</span>: 234,
<span class="s2">"adc"</span>: 1339,
<span class="s2">"aes"</span>: 133,
<span class="s2">"awex"</span>: 26,
<span class="s2">"bandgap"</span>: 8,
<span class="s2">"battery_protection"</span>: 7,
<span class="s2">"bdma"</span>: 20,
<span class="s2">"bod"</span>: 30,
<span class="s2">"can"</span>: 683,
<span class="s2">"ccl"</span>: 30,
<span class="s2">"cell_balancing"</span>: 5,
<span class="s2">"cfd"</span>: 2,
<span class="s2">"charger_detect"</span>: 4,
<span class="s2">"clk"</span>: 45,
<span class="s2">"clock"</span>: 242,
<span class="s2">"comp"</span>: 577,
<span class="s2">"core"</span>: 1355,
...
<span class="o">}</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">stats</code> also allows you to dump expanded JSON for a prefix of devices and then
query that with the tool of your choice to, for example, get all the I<sup>2</sup>C
related signals on port B for the STM32F4 device family.
Not sure why you’d want that, but it’s possible.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nv">$ </span>python3 tools/device/scripts/stats <span class="nt">--json</span> stm32f4 | jq <span class="s1">'[.[] | .device.driver[] | select(.name == "gpio").gpio[] | . as $gpio | .signal[]? | select(.driver == "i2c" and $gpio.port == "b") | ($gpio.port + $gpio.pin + ":" + .name)] | unique'</span>
<span class="o">[</span>
<span class="s2">"b10:scl"</span>,
<span class="s2">"b11:sda"</span>,
<span class="s2">"b12:smba"</span>,
<span class="s2">"b3:sda"</span>,
<span class="s2">"b4:sda"</span>,
<span class="s2">"b5:smba"</span>,
<span class="s2">"b6:scl"</span>,
<span class="s2">"b7:sda"</span>,
<span class="s2">"b8:scl"</span>,
<span class="s2">"b8:sda"</span>,
<span class="s2">"b9:sda"</span>
<span class="o">]</span>
</code></pre></div></div>
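<p>If you’d rather stay in Python than use <code>jq</code>, the same query is a handful of nested loops. The function below mirrors the filter above step by step. The JSON shape is inferred from that filter, and the sample data is hand-written for illustration rather than real tool output.</p>

```python
def i2c_signals_on_port(devices, port):
    """Collect unique 'port+pin:signal' strings for all I2C signals on a port."""
    found = set()
    for entry in devices:
        for driver in entry["device"]["driver"]:
            if driver.get("name") != "gpio":
                continue
            for gpio in driver["gpio"]:
                for signal in gpio.get("signal", []):   # like jq's .signal[]?
                    if signal.get("driver") == "i2c" and gpio["port"] == port:
                        found.add(gpio["port"] + gpio["pin"] + ":" + signal["name"])
    return sorted(found)   # jq's unique also sorts its output


# Tiny hand-written sample in the shape the jq filter implies (made-up data).
sample = [{"device": {"driver": [
    {"name": "core"},
    {"name": "gpio", "gpio": [
        {"port": "b", "pin": "6", "signal": [
            {"driver": "i2c", "name": "scl"},
            {"driver": "tim", "name": "ch1"}]},
        {"port": "b", "pin": "7", "signal": [
            {"driver": "i2c", "name": "sda"}]},
        {"port": "a", "pin": "8", "signal": [
            {"driver": "i2c", "name": "scl"}]},
        {"port": "c", "pin": "13"},
    ]},
]}}]

print(i2c_signals_on_port(sample, "b"))
```

<p>To run it on real data, the sample list would be replaced by the parsed output of the <code>--json</code> dump, for example via <code>json.load(sys.stdin)</code>.</p>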
<p>I’ll discuss in more detail how we use the device files in the next blog post
about the modm library.</p>
<h3 id="try-it-yourself">Try it Yourself</h3>
<p>The <a href="https://github.com/modm-io/modm-devices">device files as well as the DFG are available on GitHub</a>
for you to play with. The DFG automatically downloads and extracts all the raw data
into the <code class="language-plaintext highlighter-rouge">modm-devices/tools/generator/raw-device-data</code> folder.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone <span class="nt">--recursive</span> <span class="nt">--depth</span><span class="o">=</span>1 https://github.com/modm-io/modm-devices.git
<span class="nb">cd </span>modm-devices/tools/generator
<span class="c"># Extract and generate STM32 device data</span>
make extract-data-stm32
make generate-stm32
<span class="c"># Extract and generate AVR device data</span>
make extract-data-avr
make generate-avr
</code></pre></div></div>
<p>Not everything I described here is fully implemented, for example, the <a href="https://github.com/salkinium/save-the-clocktrees">clock
graph extractor is just a proof-of-concept</a>
for now. modm-devices is also supposed to be a Python package installable via pip,
but that’s not implemented yet.</p>
<p>Please help me maintain this project. I have only used devices from a few STM32 families,
so it’s difficult for me to judge the correctness of some of this data.
If you know of any other machine-readable data, please open an issue or preferably
a pull request.</p>
<p>Two more device file checks are currently not implemented:
an XML schema validation and a semantic checker that verifies the consistency
of the contents. For example, every GPIO signal should be associable with a driver,
and no signal name should start with a number (otherwise it is difficult to map into
most programming languages). These are ideas for the future.</p>
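<p>As a sketch of what such a semantic checker might look like, the two example rules can be written as a small function over a list of driver dictionaries. The data layout and the name <code>check_signals</code> are assumptions made for this illustration, and the sample contains deliberately broken entries.</p>

```python
def check_signals(drivers):
    """Return a list of human-readable problems found in the GPIO signals."""
    known = {driver["name"] for driver in drivers}
    problems = []
    for driver in drivers:
        if driver["name"] != "gpio":
            continue
        for gpio in driver.get("gpio", []):
            pin = gpio["port"] + gpio["pin"]
            for signal in gpio.get("signal", []):
                # Rule 1: every signal must belong to a driver of this device.
                if signal["driver"] not in known:
                    problems.append("{}: signal '{}' references unknown driver '{}'"
                                    .format(pin, signal["name"], signal["driver"]))
                # Rule 2: names starting with a digit are not valid identifiers.
                if signal["name"][:1].isdigit():
                    problems.append("{}: signal name '{}' starts with a digit"
                                    .format(pin, signal["name"]))
    return problems


drivers = [
    {"name": "i2c"},
    {"name": "gpio", "gpio": [
        {"port": "b", "pin": "6", "signal": [
            {"driver": "i2c", "name": "scl"},
            {"driver": "tim", "name": "ch1"},     # no 'tim' driver listed above
            {"driver": "i2c", "name": "1wire"},   # made-up invalid name
        ]},
    ]},
]

for problem in check_signals(drivers):
    print(problem)
```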
<p>With some effort and additional data sources (CMSIS-SVD files for example),
directly outputting to Device Tree format should be possible too. I leave that
one to the experts though. 😇</p>
<h2 id="conclusion">Conclusion</h2>
<p>It was important to us not to bind this data to any preconceptions of its use
by, for example, integrating it tightly into our HAL generator. Instead we’ve very
carefully separated modm-devices from our use of it, so that it can stand on its
own and be integrated into all sorts of projects by the community.
You’re not bound to using this in code either: you can also generate Markdown
documentation, or maybe build your own GPIO configurator as a web UI.</p>
<p>You can go and use it as is with its Python <code class="language-plaintext highlighter-rouge">DeviceFile</code> interface. However,
for larger projects, I’d recommend you write your own wrapper class that can
format the data as you need it.
The Device File format may change at any time, so that I can fit in new data
or, once I don’t like the format anymore, change it completely. So don’t
depend on the format directly.</p>
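<p>Such a wrapper can be very small. The sketch below assumes the properties dictionary has the shape printed earlier for the stm32f103rbt. <code>DeviceView</code> and its methods are hypothetical names, not part of modm-devices.</p>

```python
class DeviceView:
    """Hypothetical project-side wrapper: only this class knows the raw layout."""

    def __init__(self, properties):
        self._properties = properties

    def _driver(self, name):
        for driver in self._properties.get("driver", []):
            if driver.get("name") == name:
                return driver
        raise KeyError(name)

    @property
    def core(self):
        """Core type string, e.g. 'cortex-m3'."""
        return self._driver("core")["type"]

    def memory_size(self, name):
        """Size in bytes of the named memory section of the core driver."""
        for memory in self._driver("core").get("memory", []):
            if memory["name"] == name:
                return int(memory["size"])
        raise KeyError(name)


# Trimmed copy of the properties dictionary printed earlier.
properties = {"driver": [{
    "name": "core", "type": "cortex-m3",
    "memory": [
        {"access": "rx", "name": "flash", "size": "131072", "start": "0x8000000"},
        {"access": "rwx", "name": "sram1", "size": "20480", "start": "0x20000000"},
    ],
}]}

view = DeviceView(properties)
print(view.core, view.memory_size("flash") // 1024, view.memory_size("sram1") // 1024)
```

<p>If the device file format changes, only the wrapper needs to be adapted while the rest of the project keeps calling <code>memory_size()</code> and <code>core</code>.</p>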
<p>The next few blog posts will be about applying this data in our own modm library,
how CMSIS-SVD compares to CMSIS Headers as additional data sources, and what it
means to model check your HAL with this data.</p>
<h3 id="on-a-personal-note">On a Personal Note</h3>
<p>The last 5 years working on this have been quite a ride. It has completely
changed my view on embedded software engineering and it took a while for me
to understand this different way of thinking. As far as I know, nobody has deployed
hardware description methods on such a large and diverse device base. And we’re
just getting started.</p>
<p>I’ve been fortunate to have found similarly minded people in the RCA, who
provided me with valuable feedback and thoughtful discussions, who mentored me
and tolerated my rants about our robot’s code quality. The RCA is self-organized,
so we don’t have anyone telling us what to do, or <em>how</em> to do it.
As a result, we do reinvent the wheel a lot, sometimes for worse,
but mostly for the better, like with this project.</p>
<p>During this time I’ve not had the best experience with the “professional” C/C++
embedded community. There are too many established developers convinced of their
own opinions that won’t stop arguing until they’ve “won” (just ask about using
<a href="https://gist.github.com/salkinium/cc7236328a532c8c0f05f74c9ceb30a4">C++ on µCs</a>
and bring some 🍿).
Together with the growth in amateur interest in embedded software (absolutely
<em>not</em> a bad thing), this completely drowned out any worthwhile online discussions
on new approaches to embedded software that are different from the “approved”
norm. I’m not talking so much about the programming language itself, which is
relatively exchangeable for HALs (a rather unpopular opinion), but about HAL
design concepts and perhaps most importantly, support tools.</p>
<p>Let me give you an example: ST has committed <a href="https://github.com/ARMmbed/mbed-os/graphs/contributors">at least 4-6 engineers</a>
to porting its devices to Arm Mbed OS. Good for ST, that’s a lot of money.
But: ST only supports <a href="https://gist.github.com/salkinium/f2140b4ba2bbf7cb3c9a99c215392048#file-targets-md">55 of their ~1100 STM32 targets</a>
on Mbed OS, with every single one of them ported <em>by hand</em>.
This means at least all <a href="https://gist.github.com/salkinium/f2140b4ba2bbf7cb3c9a99c215392048#file-startuplinker-md">startup code and linkerscripts</a>
are mostly duplicated for each target and <a href="https://gist.github.com/salkinium/f2140b4ba2bbf7cb3c9a99c215392048#file-gpio_signals-md">all GPIO signal data is added manually</a>
by an unfortunate soul with all the <a href="https://github.com/ARMmbed/mbed-os/blob/8f647beacb6f14ce1af7f2eff01d0a497f94f7ae/targets/TARGET_STM/TARGET_STM32F1/TARGET_NUCLEO_F103RB/PeripheralPins.c#L35-L37">side-effects of manual labor</a>.
That’s insane: as you’ve seen above, ST is already maintaining and using this data
to generate code with CubeMX. How is this not automated?</p>
<p>Fortunately, in the last few years there was some significant progress in enabling
(new) programming languages on embedded, like <a href="http://micropython.org">MicroPython</a>,
<a href="https://www.espruino.com">JavaScript runtimes</a> and perhaps the most significant
of them: <a href="http://blog.japaric.io">Embedded in Rust</a>.
I’ve been particularly impressed with the progress of the community surrounding
<a href="https://twitter.com/japaricious">@japaricious</a>, who are currently tackling some
very hard issues, <a href="http://blog.japaric.io/brave-new-io/#no-pin-overlap">like IO signal grouping</a>
or <a href="http://blog.japaric.io/safe-dma/">safe DMA APIs</a>.
I’ve kinda written this blog post for them, since I think they are <a href="https://internals.rust-lang.org/t/announcing-the-embedded-devices-working-group/">best organized
to actually use it</a>
and they don’t seem afraid to tackle these issues. (Your move, C++ people!)</p>
database to disk at STM32CubeMX.app/Contents/Resources/db (on OSX) and even updates it for you on every app launch. This database consists of a lot of XML files, one for every STM32 device in ST’s portfolio, plus detailed descriptions of peripheral configurations. It really is an insane amount of data. So I invite you to join me on a stroll through the colorful fields of XML that power the core of the CubeMX’s configurators. I’ll be using the STM32F103RBT, which is a very popular controller that can be found on all ST Links and on the Blue Pill board available on ebay for a few bucks. GPIO Alternate Functions We start by searching for the unique device identifier STM32F103RBTx in mcu/families.xml (which is >30.000 lines long, btw). The minimal information about the device here is used by the parametric search engine in CubeMX. <Mcu Name="STM32F103R(8-B)Tx" PackageName="LQFP64" RefName="STM32F103RBTx"> <Core>ARM Cortex-M3</Core> <Frequency>72</Frequency> <Ram>20</Ram> <Flash>128</Flash> <Voltage Max="3.6" Min="2.0"/> <Current Lowest="1.7" Run="373.0"/> <Temperature Max="105.0" Min="-40.0"/> <Peripheral Type="ADC 12-bit" MaxOccurs="16"/> <Peripheral Type="CAN" MaxOccurs="1"/> <Peripheral Type="I2C" MaxOccurs="2"/> <Peripheral Type="RTC" MaxOccurs="1"/> <Peripheral Type="SPI" MaxOccurs="2"/> <Peripheral Type="Timer 16-bit" MaxOccurs="4"/> <Peripheral Type="USART" MaxOccurs="3"/> <Peripheral Type="USB Device" MaxOccurs="1"/> </Mcu> Following the Mcu/@Name leads us to STM32F103R(8-B)Tx.xml, which lists the peripherals and their instances (mcu/IP/@InstanceName), as well as which pins exist on this package, where they are located and which alternate functions they can be connected to. <Core>ARM Cortex-M3</Core> <Ram>20</Ram> <Flash>64</Flash> <Flash>128</Flash> <!-- ... 
--> <IP InstanceName="USART3" Name="USART" Version="sci2_v1_1_Cube"/> <IP InstanceName="RCC" Name="RCC" Version="STM32F102_rcc_v1_0"/> <IP InstanceName="NVIC" Name="NVIC" Version="STM32F103G"/> <IP InstanceName="GPIO" Name="GPIO" Version="STM32F103x8_gpio_v1_0"/> <!-- ... --> <Pin Name="PB5" Position="57" Type="I/O"> <Signal Name="I2C1_SMBA"/> <Signal Name="SPI1_MOSI"/> <Signal Name="TIM3_CH2"/> </Pin> <Pin Name="PB6" Position="58" Type="I/O"> <Signal Name="I2C1_SCL"/> <Signal Name="TIM4_CH1"/> <Signal Name="USART1_TX"/> </Pin> <Pin Name="PB7" Position="59" Type="I/O"> <Signal Name="I2C1_SDA"/> <Signal Name="TIM4_CH2"/> <Signal Name="USART1_RX"/> </Pin> Each peripheral has a IP/@Version, which leads to a configuration file containing even more data. Don’t cha just love the smell of freshly unpacked data in the morning? For this device’s GPIO peripheral we’ll look for any pins with the USART1_TX signal in the mcu/IP/GPIO-STM32F103x8_gpio_v1_0_Modes.xml file: <GPIO_Pin PortName="PB" Name="PB6"> <PinSignal Name="USART1_TX"> <RemapBlock Name="USART1_REMAP1"> <SpecificParameter Name="GPIO_AF"> <PossibleValue>__HAL_AFIO_REMAP_USART1_ENABLE</PossibleValue> </SpecificParameter> </RemapBlock> </PinSignal> </GPIO_Pin> <!-- ... --> <GPIO_Pin PortName="PA" Name="PA9"> <PinSignal Name="USART1_TX"> <RemapBlock Name="USART1_REMAP0" DefaultRemap="true"/> </PinSignal> </GPIO_Pin> So USART1_TX maps to pin PB6 with USART1_REMAP1 or pin PA9 with USART1_REMAP0. The STM32F1 series remap signals either in (overlapping) groups or not at all. This is controlled by the AFIO_MAPRx registers, where we can find PB6/PA9 again: The __HAL_AFIO_REMAP_USART1_ENABLE in the XML is actually just a C function name, and is placed by CubeMX in the generated init code. 
void HAL_UART_MspInit(UART_HandleTypeDef* huart) { GPIO_InitTypeDef GPIO_InitStruct; if(huart->Instance==USART1) { /* Peripheral clock enable */ __HAL_RCC_USART1_CLK_ENABLE(); /**USART1 GPIO Configuration PB6 ------> USART1_TX PB7 ------> USART1_RX */ GPIO_InitStruct.Pin = GPIO_PIN_6; GPIO_InitStruct.Mode = GPIO_MODE_AF_PP; GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH; HAL_GPIO_Init(GPIOB, &GPIO_InitStruct); GPIO_InitStruct.Pin = GPIO_PIN_7; GPIO_InitStruct.Mode = GPIO_MODE_INPUT; GPIO_InitStruct.Pull = GPIO_NOPULL; HAL_GPIO_Init(GPIOB, &GPIO_InitStruct); __HAL_AFIO_REMAP_USART1_ENABLE(); } } The IP files do contain a very large amount of information, however, it’s mostly directed at the code generation capabilities of the CubeMX project exporter, and as such, not very useful as stand-alone information. For example, the above GPIO signal information relies on the existence of a __HAL_AFIO_REMAP_USART1_ENABLE() function that performs the remapping. The mapping between the bits in the AFIO_MAPRx registers and the remap groups is therefore encoded in two separate places: these xml files, and the family’s CubeHAL. The mcu/IP/NVIC-STM32F103G_Modes.xml configuration file, used to configure the NVIC in the CubeMX, exemplifies this quite well: here we see the first 10 interrupt vectors paired with additional metadata (PossibleValue/@Value seems to contain some : separated conditionals for visibility inside the GUI tool). 
<RefParameter Comment="Interrupt Table" Name="IRQn" Type="list"> <PossibleValue Comment="Non maskable interrupt" Value="NonMaskableInt_IRQn:N,IF_HAL::HAL_RCC_NMI_IRQHandler:CSSEnabled"/> <PossibleValue Comment="Hard fault interrupt" Value="HardFault_IRQn:N,W1:::"/> <PossibleValue Comment="Memory management fault" Value="MemoryManagement_IRQn:Y,W1:::"/> <PossibleValue Comment="Prefetch fault, memory access fault" Value="BusFault_IRQn:Y,W1:::"/> <PossibleValue Comment="Undefined instruction or illegal state" Value="UsageFault_IRQn:Y,W1:::"/> <PossibleValue Comment="System service call via SWI instruction" Value="SVCall_IRQn:Y,RTOS::NONE:"/> <PossibleValue Comment="Debug monitor" Value="DebugMonitor_IRQn:Y::NONE:"/> <PossibleValue Comment="Pendable request for system service" Value="PendSV_IRQn:Y,RTOS::NONE:"/> <PossibleValue Comment="System tick timer" Value="SysTick_IRQn:Y:::"/> <PossibleValue Comment="Window watchdog interrupt" Value="WWDG_IRQn:Y:WWDG:HAL_WWDG_IRQHandler:"/> However, their actual position in the interrupt vector table is missing, and so this data cannot be used to extract a valid interrupt table. Instead an alias is used here to pair the interrupt with its actual table position, as defined in the STM32F103xB CMSIS header file. 
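The pairing of alias and table position can be sketched in a few lines of Python. This is only an illustration, not code from the DFG: the function names `alias_of` and `vector_position` are made up, the `IRQN` dictionary is a hand-copied subset of the CMSIS enum, and the offset of 16 reflects that the first 16 entries of a Cortex-M vector table hold the initial stack pointer and the system exceptions.

```python
# Hand-copied subset of the IRQn values from the STM32F103xB CMSIS header
IRQN = {
    "NonMaskableInt_IRQn": -14,
    "HardFault_IRQn": -13,
    "UsageFault_IRQn": -10,
    "SVCall_IRQn": -5,
    "SysTick_IRQn": -1,
    "WWDG_IRQn": 0,
}


def alias_of(possible_value):
    """Return the interrupt alias: the first ':'-separated field of PossibleValue/@Value."""
    return possible_value.split(":", 1)[0]


def vector_position(alias, irqn=IRQN):
    """Translate a CMSIS IRQn alias into its position in the vector table.

    The first 16 table entries (initial stack pointer plus system exceptions)
    come before the first device interrupt, hence the offset.
    """
    return irqn[alias] + 16


# Values taken verbatim from the NVIC XML excerpt above
for value in ("WWDG_IRQn:Y:WWDG:HAL_WWDG_IRQHandler:", "SVCall_IRQn:Y,RTOS::NONE:"):
    print(alias_of(value), vector_position(alias_of(value)))
```

The XML alone only gives you the alias, so the header (or some other source of the numeric values) is still required to build an actual table.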
For example, the WWDG interrupt vector is located at position 16 (=16+0), while the SVCall vector is located at position 11 (=16-5), or 5 positions behind the UsageFault vector: /*!< Interrupt Number Definition */ typedef enum { NonMaskableInt_IRQn = -14, /*!< 2 Non Maskable Interrupt */ HardFault_IRQn = -13, /*!< 3 Cortex-M3 Hard Fault Interrupt */ MemoryManagement_IRQn = -12, /*!< 4 Cortex-M3 Memory Management Interrupt */ BusFault_IRQn = -11, /*!< 5 Cortex-M3 Bus Fault Interrupt */ UsageFault_IRQn = -10, /*!< 6 Cortex-M3 Usage Fault Interrupt */ SVCall_IRQn = -5, /*!< 11 Cortex-M3 SV Call Interrupt */ DebugMonitor_IRQn = -4, /*!< 12 Cortex-M3 Debug Monitor Interrupt */ PendSV_IRQn = -2, /*!< 14 Cortex-M3 Pend SV Interrupt */ SysTick_IRQn = -1, /*!< 15 Cortex-M3 System Tick Interrupt */ WWDG_IRQn = 0, /*!< Window WatchDog Interrupt */ // ... } IRQn_Type; So keep in mind that this data is not meant to be a sensible hardware description format, and it often lacks basic information that would make it much more useful. Then again, the only consumer of this information is supposed to be CubeMX for its fairly narrow goal of code generation. Clock Tree Let’s look at another very interesting data source in CubeMX: the clock configuration wizard: What’s so interesting about this configurator is that it knows what the maximum frequencies of the respective clock segments are, and more importantly, how to set the prescalers to resolve any violations, and it knows this for every device. You surely know where this is going by now. Yup, it’s backed by data, and here is what it looks like rendered with graphviz. Here is a beautified excerpt from plugins/clock/STM32F102.xml, which only shows the connections highlighted in red. 
Note how the text in the nodes maps to the Element/@type and Element/@id attributes, and how the Element/Output and Element/Input children declare a (unique) @signalId and which node they are connecting to: <Tree id="ClockTree"> <!-- HSE --> <Element id="HSEOSC" type="variedSource" refParameter="HSE_VALUE"> <Output signalId="HSE" to="HSEDivPLL"/> </Element> <!-- PLL div input from HSE --> <Element id="HSEDivPLL" type="devisor" refParameter="HSEDivPLL"> <Input signalId="HSE" from="HSEOSC"/> <Output signalId="HSE_PLL" to="PLLSource"/> </Element> <Tree id="PLL"> <!-- PLLsource MUX source pour PLL mul --> <Element id="PLLSource" type="multiplexor" refParameter="PLLSourceVirtual"> <Input signalId="HSE_PLL" from="HSEDivPLL" refValue="RCC_PLLSOURCE_HSE"/> <Output signalId="VCOInput" to="VCO2output"/> </Element> <Element id="VCO2output" type="output" refParameter="VCOOutput2Freq_Value"> <Input signalId="VCOInput" from="PLLSource"/> <Output signalId="VCO2Input" to="PLLMUL"/> </Element> <Element id="PLLMUL" type="multiplicator" refParameter="PLLMUL"> <Input signalId="VCO2Input" from="VCO2output"/> <Output signalId="PLLCLK" to="SysClkSource"/> </Element> </Tree> <!--Sysclock mux --> <Element id="SysClkSource" type="multiplexor" refParameter="SYSCLKSource"> <Input signalId="PLLCLK" from="PLLMUL" refValue="RCC_SYSCLKSOURCE_PLLCLK"/> <Output signalId="SYSCLK" to="SysCLKOutput"/> </Element> <Element id="SysCLKOutput" type="output" refParameter="SYSCLKFreq_VALUE"> <Input signalId="SYSCLK" from="SysClkSource"/> <Output signalId="SYSCLKOUT" to="AHBPrescaler"/> </Element> <!-- AHB input**SYSclock** --> <Element id="AHBPrescaler" type="devisor" refParameter="AHBCLKDivider"> <Input signalId="SYSCLKOUT" from="SysCLKOutput"/> <Output signalId="HCLK" to="AHBOutput"/> </Element> <!-- AHB input**SYSclock** output**FHCLK,HCLK,Diviseurcortex,APB1,APB2 --> <Element id="AHBOutput" type="activeOutput" refParameter="HCLKFreq_Value"> <Input signalId="HCLK" from="AHBPrescaler"/> <Output 
to="FCLKCortexOutput" signalId="AHBCLK"/> <Output to="FSMClkOutput" signalId="AHBCLK"/> <Output to="SDIOClkOutput" signalId="AHBCLK"/> <Output to="HCLKDiv2" signalId="AHBCLK"/> <Output to="HCLKOutput" signalId="AHBCLK"/> <Output to="TimSysPresc" signalId="AHBCLK"/> <Output to="APB1Prescaler" signalId="AHBCLK"/> <Output to="APB2Prescaler" signalId="AHBCLK"/> </Element> </Tree> We still don’t know how CubeMX is able to do it actual calculations, because the clock graph above doesn’t contain any numbers at all. Some digging around later we can trace the Element/@refParameter attribute to the IP/RCC-STM32F102_rcc_v1_0_Modes.xml which contains *drumroll* numbers, and lots of ‘em: <!-- Les frequences des sources --> <RefParameter Name="HSE_VALUE" Min="4000000" Max="16000000" Display="value/1000000" Unit="MHz"/> <!-- frequence PLL --> <RefParameter Name="VCOOutput2Freq_Value" Min="1000000" Max="25000000" Display="value/1000000" Unit="MHz"/> <!-- les diviseurs --> <RefParameter Name="HSEDivPLL" DefaultValue="RCC_HSE_PREDIV_DIV1"> <PossibleValue Comment="1" Value="RCC_HSE_PREDIV_DIV1"/> <PossibleValue Comment="2" Value="RCC_HSE_PREDIV_DIV2"/> </RefParameter> <!-- Les multiplicateurs --> <RefParameter Name="PLLMUL" DefaultValue="RCC_PLL_MUL2"> <PossibleValue Comment="2" Value="RCC_PLL_MUL2"/> <!-- ... 
--> <PossibleValue Comment="16" Value="RCC_PLL_MUL16"/> </RefParameter> <!-- Les frequences des signaux --> <!-- SYS clock freq de l'output --> <RefParameter Name="SYSCLKFreq_VALUE" Max="72000000" Display="value/1000000" Unit="MHz"/> <!-- diviseur AHB 1..512 --> <RefParameter Name="AHBCLKDivider" DefaultValue="RCC_SYSCLK_DIV1"> <PossibleValue Comment="1" Value="RCC_SYSCLK_DIV1"/> <PossibleValue Comment="2" Value="RCC_SYSCLK_DIV2"/> <PossibleValue Comment="4" Value="RCC_SYSCLK_DIV4"/> <PossibleValue Comment="8" Value="RCC_SYSCLK_DIV8"/> <PossibleValue Comment="16" Value="RCC_SYSCLK_DIV16"/> <PossibleValue Comment="64" Value="RCC_SYSCLK_DIV64"/> <PossibleValue Comment="128" Value="RCC_SYSCLK_DIV128"/> <PossibleValue Comment="256" Value="RCC_SYSCLK_DIV256"/> <PossibleValue Comment="512" Value="RCC_SYSCLK_DIV512"/> </RefParameter> <!-- AHB out freq --> <RefParameter Name="HCLKFreq_Value" Max="72000000" Display="value/1000000" Unit="MHz"/> Did you know that ST is a French-Italian company? Cos those XML comments clearly aren’t in English. 🤔 Well, that and they seem keen on calling it a “devisor” when they really mean “divider”. What is this, I don’t even. French comments in XML Anyways, here you can see the RefParameter/@min and RefParameter/@max frequency values as well as prescaler values encoded as PossibleValue/@Comment, which are all used by CubeMX to check and fix your clock tree. That’s pretty amazing actually. Ok, so I’m not going into the data of their board support packages, because I don’t think any health insurance covers this much exposure to XML, especially not XML containing French comments. But feel free to take a look at your own risk, it’s just waiting there in plugins/boardmanager/boards for your prying eyes. Let’s move on to how we can extract this data programmatically and use it to bring order to chaos, one example at a time. 
A bit like the Avengers franchise *drags out blog post to infinity* Generating Device Files The goal of finding machine-readable device description data obviously was to write a program to import, clean-up and convert it into a format that’s more agreeable to our use-case of generating a HAL. Ironically the Device File Generator (DFG) started out in mid 2013 with the innocently named commit “Cheap and simple parsing of the XML files”. It’s not cheap and simple anymore. The DFG started out as a glorified XPath wrapper in xpcc, but then quickly devolved into some messy monster that pulled in data from all over the place and arranged it without much concept. Back then we were busy porting the HAL, writing sensor drivers and building robots, so we didn’t approach this problem structurally, and rather fixed bugs when they occurred. I won’t talk about xpcc’s DFG architecture issues in detail, instead I’ll be showing you the problems it caused us. This way, the lessons learned are more transferable to other formats (*cough* Device Tree *cough*), since the device data is immutable whereas the DFG’s architecture is not. Note that I rewrote the DFG from scratch for modm, so you can have a look at the source code while reading this. I’m continuing to use the STM32F103RBT6 for illustration, but this all works very similarly for all STM32 and AVR devices. Device Identifiers We needed a way to identify what device to build our HAL for, and of course we use the manufacturer’s identifier, since it’s (hopefully) unique. We also needed to split up the identifier string, so that the HAL can query its traits to select what code templates to use. For example, in xpcc we split stm32f103rbt6 into: stm32 f1 103 r b {platform}{family}{name}{pin-id}{size-id} Note how we forgot the t6 suffix. 
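To make the consequence of that omission concrete, here is a small Python sketch of such a split. It is not xpcc’s actual code: the regular expression and the function name `split_xpcc` are assumptions that merely reproduce the {platform}{family}{name}{pin-id}{size-id} scheme shown above.

```python
import re

# Hypothetical pattern for the xpcc-style split; everything after the
# size-id (package and temperature range) is simply never looked at.
XPCC_PATTERN = re.compile(r"(stm32)([a-z])(\d)(\d\d)([a-z])([0-9a-z])")


def split_xpcc(identifier):
    """Split an identifier into xpcc-style traits, ignoring any suffix."""
    match = XPCC_PATTERN.match(identifier.lower())
    if match is None:
        raise ValueError("not an STM32 identifier: " + identifier)
    platform, series, digit, rest, pin_id, size_id = match.groups()
    return {
        "platform": platform,
        "family": series + digit,  # e.g. "f1"
        "name": digit + rest,      # e.g. "103", shares the "1" with the family
        "pin-id": pin_id,
        "size-id": size_id,
    }


print(split_xpcc("stm32f103rbt6"))
print(split_xpcc("stm32f103rbh6") == split_xpcc("stm32f103rbt6"))
```

Two devices in different packages end up with identical traits, so nothing downstream can tell them apart anymore.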
If we compare this with the documentation on the ST ordering information scheme, you’ll see why this was a huge mistake: Yup, that’s right, we forgot to encode the package type, causing the DFG to select the first device matching STM32F103RB! And that would be the STM32F103RBHx device, since it occurs first in families.xml. <Mcu Name="STM32F103R(8-B)Hx" PackageName="TFBGA64" RefName="STM32F103RBHx"> <!-- ... --> <Mcu Name="STM32F103R(8-B)Tx" PackageName="LQFP64" RefName="STM32F103RBTx"> So we actually used the definitions for the TFBGA64 packaged device instead of the LQFP64 packaged device. 🤦 Incredibly this didn’t cause immediate problems, since we first focussed on the STM32F3 and F4 families, whose functionality is almost identical between packages. However, we did notice some changes when a new version of CubeMX was released which added or reordered devices in families.xml. And then all hell broke loose when I added support for parsing the STM32F1 device family, which couples peripheral features to memory size and(!) pin count: “32 KB Flash(1)” aka. this table isn’t complicated enough already If you’re a hardware engineer at $vendor, PLEASE DON’T DO THIS! This is pure punishment for anyone writing software for these chips. PLEASE DO NOT DO THIS! You should not have to query for combinations of identifier traits to get your hardware feature set. Expand your device lineup into new (orthogonal) identifier space instead. To be fair, the STM32F1 family was the first ST product to feature a Cortex-M processor and they didn’t use this approach for any of their other STM32 families. I forgive you, ST. So for modm I looked very carefully at how to split the identifier into traits. I made the trait composition and naming transparent to the DFG, it only operates on a dictionary of items, sharing the same identifier mechanism with the AVRs. Since we currently don’t have any information that depends on the temperature range, I left it out for now. 
Similarly, the device revision is not considered either. stm32 f1 03 r b t {platform}{family}{name}{pin}{size}{package} Note how both the xpcc and modm identifier encodings differ from the official ST ordering scheme. Since we are sharing some code across vendors (like the Cortex-M startup code), we need to have a common naming scheme, at least for {platform} and {family} or the equivalent for other vendors. Also note that {name} no longer contains the trailing 1 of the family. This is to prevent the problem in xpcc where the code template authors only checked for the {name} instead of the {family} and {name}, for example, id["name"] == "103" vs. id["family"] == "f1" and id["name"] == "03". This led to issues when we ported some peripheral drivers to the L1 family (similar to F0/L0, F4/L4 and F7/H7). Encoding Commonality You’ve undoubtedly already noticed that the AVR and CubeMX data is quite verbose and noisy. We didn’t want to use this data directly, hence the DFG. However, we wanted to go a step further and cut down on duplicated data, so that we have an easier time verifying the output of the DFG by not having to look through thousands of files, but rather dozens. At the time of this writing, families.xml contains 1171 STM32 devices, but modm-devices/devices/stm32 only contains 62 device files, that’s ~19x fewer files than devices. We observed that ST clusters their devices on their website, in their technical documentation and in their software offerings. The coarsest regular cluster pattern is the family, which denotes the type of Cortex-M code used among other features. The subfamilies are then more or less arbitrarily clustered around whatever combination of functionality ST wanted to bring to market, but the cluster patterns of pin count, memory size and package are very regular and often explicitly called out. We wanted to reflect this in our data structure too. This STM32F4x9 feature matrix is extremely regular. 
The Device Tree format deals with data duplication by allowing data specialization through an inheritance tree and tree inclusion nodes. However, you still have to create one leaf node for every device, so in the best case you’d have one DT per device, or if you moved common data up the inheritance tree, you’d have more files than devices. We decided instead to merge our data trees for devices within similar enough clusters and then filter out the data for one device on access. We use logical OR (|) to combine identifier traits to declare what devices are merged. You’ll recognize the <naming-schema> from the previous chapter: <device platform="stm32" family="f1" name="03" pin="c|r|t|v" size="8|b" package="h|i|t|u"> <naming-schema>{platform}{family}{name}{pin}{size}{package}</naming-schema> <valid-device>stm32f103c8t</valid-device> <!-- ... --> <valid-device>stm32f103rbt</valid-device> This device file for the F103x8/b devices therefore contains all that match the identifier pattern of r"stm32f103[crtv][8b][hitu]". The engine extracting the data set for a single device will first construct a list of all possible identifier strings via the naming schema and the device combinations: 4*2*4 = 32 identifiers in this example. It then filters these identifiers by the list in <valid-device>, since not every combination actually exists. Whatever device file contains the requested identifier string is then used. The identifier schema does not have to include all traits either, it only has to be unambiguous. For example the AVR device identifier schema does not contain {platform} but we can infer it anyways: <device platform="avr" family="mega" name="48|88|168|328" type="|a|p|pa"> <naming-schema>at{family}{name}{type}</naming-schema> It first seems unnecessary to do this reverse lookup, but it gives us a very important property for free: The extractor does not need to know anything about the identifier, and still understands the mapping of string to traits. 
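The reverse lookup described above can be sketched roughly like this in Python. It is a simplified illustration rather than the real extractor: `expand_identifiers` and `lookup` are hypothetical names, and the list of valid devices only holds the two entries that are visible in the truncated excerpt.

```python
from itertools import product


def expand_identifiers(schema, traits):
    """Map every identifier string the ORed traits can form to its trait values."""
    keys = list(traits)
    options = [traits[key].split("|") for key in keys]
    table = {}
    for combination in product(*options):
        values = dict(zip(keys, combination))
        # Traits that the schema does not mention (like the AVR platform) are ignored
        table[schema.format(**values)] = values
    return table


def lookup(identifier, schema, traits, valid_devices):
    """Return the traits for an identifier, or None if this file does not contain it."""
    if identifier not in valid_devices:
        return None
    return expand_identifiers(schema, traits).get(identifier)


STM32_SCHEMA = "{platform}{family}{name}{pin}{size}{package}"
STM32_TRAITS = {"platform": "stm32", "family": "f1", "name": "03",
                "pin": "c|r|t|v", "size": "8|b", "package": "h|i|t|u"}
# Only the two <valid-device> entries shown in the excerpt, the rest is elided there
STM32_VALID = ["stm32f103c8t", "stm32f103rbt"]

AVR_SCHEMA = "at{family}{name}{type}"
AVR_TRAITS = {"platform": "avr", "family": "mega",
              "name": "48|88|168|328", "type": "|a|p|pa"}

print(len(expand_identifiers(STM32_SCHEMA, STM32_TRAITS)))
print(lookup("stm32f103rbt", STM32_SCHEMA, STM32_TRAITS, STM32_VALID))
print(expand_identifiers(AVR_SCHEMA, AVR_TRAITS)["atmega328p"])
```

Note that the lookup never parses the identifier string itself. It only compares it against strings it generated, which is why the AVR platform can be recovered even though the schema does not contain it.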
So passing stm32f103rbt is now understood as stm32 f1 03 r b t. The disadvantage is having to first build all identifier strings, before returning the corresponding device file. However, this mapping can be cached. The device file can now use the traits as filters by prefixing them with device-. For our example, the device file continues with declaring the core driver instance, which contains the memory map and vector table. The devices here only differ in Flash size, otherwise they are identical: <driver name="core" type="cortex-m3"> <memory device-size="8" name="flash" access="rx" start="0x8000000" size="65536"/> <memory device-size="b" name="flash" access="rx" start="0x8000000" size="131072"/> <memory name="sram1" access="rwx" start="0x20000000" size="20480"/> <vector position="0" name="WWDG"/> <vector position="1" name="PVD"/> <!-- ... --> <vector position="42" name="USBWakeUp"/> By applying some simple combinatorics we can find the minimal trait set that uniquely describes this difference and can push this filter as far up the data tree as possible while still being unambiguous and therefore losslessly reconstructible for all merged device data. This is all done for the sole purpose of optimizing for human readability, so an embedded engineer with some experience can just look at this data and say: “This filter looks too noisy to me, so something is probably wrong here” 🤓 *sound of datasheet pages flipping*. 
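Extracting the data for a single device then boils down to evaluating these device- filters against its traits. The following Python sketch assumes that nodes are plain attribute dictionaries and uses made-up helper names (`matches`, `select`); alternatives inside one filter are treated as ORed and several filters on one node as ANDed, which is how the post describes the semantics a bit further down.

```python
def matches(attributes, traits):
    """Check all device- filters of a node against the traits of one device."""
    for key, value in attributes.items():
        if key.startswith("device-"):
            trait = key[len("device-"):]
            # Alternatives within one filter are ORed, separate filters are ANDed
            if traits[trait] not in value.split("|"):
                return False
    return True


def select(nodes, traits):
    """Keep the nodes that apply to this device and strip their filter attributes."""
    return [
        {key: value for key, value in node.items() if not key.startswith("device-")}
        for node in nodes
        if matches(node, traits)
    ]


# The three <memory> nodes from the core driver excerpt above
MEMORIES = [
    {"device-size": "8", "name": "flash", "access": "rx", "start": "0x8000000", "size": "65536"},
    {"device-size": "b", "name": "flash", "access": "rx", "start": "0x8000000", "size": "131072"},
    {"name": "sram1", "access": "rwx", "start": "0x20000000", "size": "20480"},
]

STM32F103RBT = {"platform": "stm32", "family": "f1", "name": "03",
                "pin": "r", "size": "b", "package": "t"}

for memory in select(MEMORIES, STM32F103RBT):
    print(memory["name"], memory["size"])
```

Nodes without any device- attribute apply to every device in the file, which is what keeps the common case free of filter noise.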
Here is an example of what I so dramatically complained about before: The STM32F1 peripheral feature set is coupled to the device’s pin count: F103 devices with just 36 pins have fewer instances of these peripherals: <driver name="i2c" type="stm32"> <instance value="1"/> <instance device-pin="c|r|v" value="2"/> </driver> <driver name="spi" type="stm32"> <instance value="1"/> <instance device-pin="c|r|v" value="2"/> </driver> <driver name="usart" type="stm32"> <instance value="1"/> <instance value="2"/> <instance device-pin="c|r|v" value="3"/> </driver> Of course both the pin count and the package influence the number of available GPIOs and signals. The algorithm here detected that using the pin count as a filter is enough to safely reconstruct the tree, so the device-package is missing (it prioritizes traits further “left” in the identifier): <driver name="gpio" type="stm32-f1"> <!-- ... --> <gpio device-pin="r|v" port="c" pin="10"/> <gpio device-pin="r|v" port="c" pin="11"> <signal driver="adc" instance="1" name="exti11"/> <signal driver="adc" instance="2" name="exti11"/> </gpio> <gpio device-pin="r|v" port="c" pin="12"/> <gpio device-pin="c|r|v" port="c" pin="13"> <signal driver="rtc" name="out"/> <signal driver="rtc" name="tamper"/> </gpio> The device- filter traits are ORed, multiple filters on the same node ANDed, and the nodes themselves ORed together again. Keen observers will point out that this can create overly broad filters which would make for incorrect reconstruction. For these cases we have to create two nodes with the same data, but different filters to avoid ambiguity. Here is an example from the STM32F4{27,29,37,39} device file: <gpio port="c" pin="3"> <!-- ... --> <signal device-name="27|37" device-pin="a|i|v|z" af="12" driver="fmc" name="sdcke0"/> <signal device-name="29|39" device-pin="a|b|i|n|z" af="12" driver="fmc" name="sdcke0"/> </gpio> Hm, but that filter does look suspiciously noisy, doesn’t it? 
This filter pattern is repeated for the sdne[1:0] and sdnwe signals, which all belong to the SDRAM controller in the FMC. And according to this data set they seem to be unavailable for the LQFP100 package? Hm, better check the datasheets: Huh, but the signals do exist for the LQFP100 package!? “FMC: Yes(1)”. Oh, FFS! I checked with CubeMX and the GPIO configurator doesn’t allow you to set SDRAM signals in the LQFP100 package, and there are no STM32F4[23]7[BN] devices, so everything is fine, I guess? Nothing to see here folks, move along, the filter algorithm encoded this shit correctly. 🙃 Anyways, I like our device file format a lot, since it describes the device’s hardware in such a compact and concise form. However, it doesn’t scale gracefully at all for data that shares fewer commonalities between devices in the current clusters. Data Pipeline For my rewrite of the DFG for modm I wanted to improve the correctness of device merges, remove device specific knowledge as much as possible, support multiple output formats and rename less data. I’ve already hinted at solutions to some of these in the previous chapters, so let’s have a proper look at them now. The DFG has three parts: frontend, optimizer and backend. Here yellow stands for input data, blue for data conversion, red for intermediate representation (IR) and green for output data. I’ve already covered the vendor input data and the device merging in much detail. All the ugly is in the parser: it reads the CubeMX data in the same manner I’ve described previously, performs plausibility and format checks on it, and finally normalizes it into a simple Python dictionary. This is mostly just mind-numbingly stupid code to write, since you have to XPath query the CubeMX sources, deal with all the edge cases in the results and normalize all data relative to all devices. Ugly to write, ugly to read, but it gets the job done. Additional curated data gets injected in this step too. 
The CubeMX data contains a hardware IP version, which seems to correlate loosely to the peripheral’s feature set, however, I didn’t find it very useful to distinguish between them. So instead I looked up how all peripherals work in the documentation and grouped them again manually. The device file driver/@type name comes from this data. For example, here we can see that the entire STM32 platform only has three different I2C hardware implementations, one of which only differs with the addition of a digital noise filter. 'i2c': [{ 'instances': '*', 'groups': [ { # This hardware can go up to 1MHz (Fast Mode Plus) 'hardware': 'stm32-extended', 'features': [], 'devices': [{'family': ['f0', 'f3', 'f7']}] },{ 'hardware': 'stm32l4', 'features': ['dnf'], 'devices': [{'family': ['l4']}] },{ # Some F4 have a digital noise filter 'hardware': 'stm32', 'features': ['dnf'], 'devices': [{'family': ['f4'], 'name': ['27', '29', '37', '39', '46', '69', '79']}] },{ 'hardware': 'stm32', 'features': [], 'devices': '*' } ] }] All names of peripherals, instances, signals are preserved as they are, so that the name matches the documentation. The only exceptions are names that wouldn’t be valid identifiers in most programming languages. For our STM32F103RBT example, we split up and duplicate these system signals: SYS_JTCK-SWCLK => sys.jtck + sys.swclk SYS_JTDO-TRACESWO => sys.jtdo + sys.traceswo SYS_JTMS-SWDIO => sys.jtms + sys.swdio The dictionary returned by the parser is then passed onto a platform specific converter that transforms it into the DFG’s intermediate representation. Here the raw data is formatted into a glorified tree structure, which has similar semantics to a very restricted form of XML (i.e. a node’s attributes are stored separately from its children) and annotates each node with the device’s identifier. Here the memory maps and the interrupt vector table are added to the name="core" driver node we saw before. 
The raw data already contains the memories and vectors with the right naming scheme, so it’s easy to just add them here. for section in p["memories"]: memory_node = core_driver.addChild("memory") memory_node.setAttributes(["name", "access", "start", "size"], section) for vector in p["interrupts"]: vector_node = core_driver.addChild("vector") vector_node.setAttributes(["position", "name"], vector) # sort the node children by start address and size core_driver.addSortKey(lambda e: (int(e["start"], 16), int(e["size"])) if e.name == "memory" else (-1, -1)) # sort the node children by vector number and name core_driver.addSortKey(lambda e: (int(e["position"]), e["name"]) if e.name == "vector" else (-1, "")) I’m adding two sort keys to the core driver node here, to bring the entire tree into canonical order. This is an absolute requirement for the reproducibility of the results, otherwise I wouldn’t be able to tell what data changed if the line order came out differently on each invocation. It’s time to merge the device IRs now. The device clustering is curated manually, by a large list of identifier trait groups. I considered using some kind of heuristic to automate this, but this works really well, particularly for the AVR and STM32F1 devices. It’s difficult to come up with a metric that accurately describes how annoyed I feel when looking at wrongfully merged device files with lotsa noisy filters. 😤 The STM32F103 devices are split into these four groups: { 'family': ['f1'], 'name': ['03'], 'size': ['4', '6'] },{ 'family': ['f1'], 'name': ['03'], 'size': ['8', 'b'] },{ 'family': ['f1'], 'name': ['03'], 'size': ['c', 'd', 'e'] },{ 'family': ['f1'], 'name': ['03'], 'size': ['f', 'g'] } In case you’re curious how bad it would be with just one large F103 group, here is a gist with the resulting device file. It’s not as bad as it could be, but still much harder to read. 
At this point the merged IR for our F103RBT device basically already looks like the finished device file, including identifier filters: device <> stm32f103[c|r|t|v][8|b][h|i|t|u] . driver <name:core type:cortex-m3> . memory <name:flash access:rx start:0x8000000 size:65536> stm32f103[c|r|t|v]8[h|t|u] . memory <name:flash access:rx start:0x8000000 size:131072> stm32f103[c|r|t|v]b[h|i|t|u] . memory <name:sram1 access:rwx start:0x20000000 size:20480> . vector <position:0 name:WWDG> ... . vector <position:42 name:USBWakeUp> . driver <name:i2c type:stm32> . instance <value:1> . instance <value:2> stm32f103[c|r|v][8|b][h|i|t|u] . driver <name:spi type:stm32> . instance <value:1> . instance <value:2> stm32f103[c|r|v][8|b][h|i|t|u] I’ve already described the device file format above, however, one additional testing step is done before the DFG is finished: A copy of every single device file is taken before merging, so that it can be compared with the device files that are extracted from this merged one. This is a brute-force test to make sure the filter algorithms did perform correctly. On a side note, the conversion from IR to device file format can be performed at any time, so that last merge step is strictly speaking optional. This is useful for debugging but also if you want to output this data in a format that does not support a merge mechanism similar to the device file’s one, like plain JSON. Using Device Files So now that we have all this data, let’s have some fun with it. 
modm-devices comes not only with the DFG but also with a device file parser, which can be used like this: >>> import modm.parser, glob >>> devices = {} >>> for filename in glob.glob("path/to/modm-devices/devices/**/*.xml"): >>> for device in modm.parser.DeviceParser().parse(filename).get_devices(): >>> devices[device.partname] = device >>> devices["stm32f103rbt"].properties {'driver': [{'memory': [{'access': 'rx', 'name': 'flash', 'size': '131072', 'start': '0x8000000'}, {'access': 'rwx', 'name': 'sram1', 'size': '20480', 'start': '0x20000000'}], 'name': 'core', 'type': 'cortex-m3', ... }] } There are some built-in convenience functions for accessing some of the common data in the device files: >>> device = devices["stm32f103rbt"] >>> device.identifier OrderedDict([('platform', 'stm32'), ('family', 'f1'), ('name', '03'), ('pin', 'r'), ('size', 'b'), ('package', 't')]) >>> device.has_driver("usart:avr") False >>> device.has_driver("usart:stm32") True >>> device.get_driver("usart:stm32") {'instance': ['1', '2', '3'], 'name': 'usart', 'type': 'stm32'} I’ve also written a short stats script that allows you to compute some very basic information about the device file collection: $ python3 tools/device/scripts/stats --count 1355 devices $ python3 tools/device/scripts/stats --driver { "ac": 234, "adc": 1339, "aes": 133, "awex": 26, "bandgap": 8, "battery_protection": 7, "bdma": 20, "bod": 30, "can": 683, "ccl": 30, "cell_balancing": 5, "cfd": 2, "charger_detect": 4, "clk": 45, "clock": 242, "comp": 577, "core": 1355, ... } stats also allows you to dump expanded JSON for a prefix of devices and then query that with the tool of your choice to, for example, get all the I2C related signals on port B for the STM32F4 device family. Not sure why you’d want that, but it’s possible. $ python3 tools/device/scripts/stats --json stm32f4 | jq '[.[] | .device.driver[] | select(.name == "gpio").gpio[] | . as $gpio | .signal[]? 
| select(.driver == "i2c" and $gpio.port == "b") | ($gpio.port + $gpio.pin + ":" + .name)] | unique' [ "b10:scl", "b11:sda", "b12:smba", "b3:sda", "b4:sda", "b5:smba", "b6:scl", "b7:sda", "b8:scl", "b8:sda", "b9:sda" ] I’ll discuss in more detail how we use the device files in the next blog post about the modm library. Try it Yourself The device files as well as the DFG are available on GitHub for you to play with. The DFG automatically downloads and extracts all the raw data into the modm-devices/tools/generator/raw-device-data folder. git clone --recursive --depth=1 https://github.com/modm-io/modm-devices.git cd modm-devices/tools/generator # Extract and generate STM32 device data make extract-data-stm32 make generate-stm32 # Extract and generate AVR device data make extract-data-avr make generate-avr Not everything I described here is fully implemented, for example, the clock graph extractor is just a proof-of-concept for now. modm-devices is also supposed to be a Python package installable via pip, but that’s not implemented yet. Please help me maintain this project, I only used devices from a few STM32 families, so it’s difficult to judge the correctness of some of this data. If you know of any other machine readable data, please open an issue or preferably a pull request. Two more device file checks are currently not implemented: an XML schema validation, and a semantic checker that verifies the content’s consistency. For example, every GPIO signal should be associable with a driver, and no signal name should start with a number (otherwise difficult to map into most programming languages). These are ideas for the future. With some effort and additional data sources (CMSIS-SVD files for example), directly outputting to Device Tree format should be possible too. I leave that one to the experts though. 😇 Conclusion It was important to us not to bind this data to any preconceptions of its use by, for example, integrating it tightly into our HAL generator. 
Instead we’ve very carefully separated modm-devices from our use of it, so that it can stand on its own and be integrated into all sorts of projects by the community. You’re not bound to using this in code either, you can also generate Markdown documentation, or maybe build your own GPIO configurator as a web UI. You can go and use it as is with its Python DeviceFile interface, however, for larger projects, I’d recommend you write your own wrapper class, that can format the data as you need it. The Device File format may change at any time, so that I can fit in new data or, once I don’t like the format anymore, change it completely. So don’t depend on the format directly. The next few blog posts will be about applying this data in our own modm library, how CMSIS-SVD compares to CMSIS Headers as additional data sources, and what it means to model check your HAL with this data. On a Personal Note The last 5 years working on this have been quite a ride. It has completely changed my view on embedded software engineering and it took a while for me to understand this different way of thinking. As far as I know, nobody has deployed hardware description methods on such a large and diverse device base. And we’re just getting started. I’ve been fortunate to have found similarly minded people in the RCA, who provided me with valuable feedback and thoughtful discussions, who mentored me and tolerated my rants about our robot’s code quality. The RCA is self organized, so we don’t have anyone telling us what to do, or how to do it. As a result, we do reinvent the wheel a lot, sometimes for worse, but mostly for the better, like with this project. During this time I’ve not had the best experience with the “professional” C/C++ embedded community. There are too many established developers convinced of their own opinions that won’t stop arguing until they’ve “won” (just ask about using C++ on µCs and bring some 🍿). 
Together with the growth in amateur interest in embedded software (absolutely not a bad thing), this completely drowned out any worthwhile online discussions on new approaches to embedded software that are different from the “approved” norm. I’m not talking so much about the programming language itself, which is relatively exchangeable for HALs (a rather unpopular opinion), but about HAL design concepts and perhaps most importantly, support tools. Let me give you an example: ST has committed at least 4-6 engineers to porting its devices to Arm Mbed OS. Good for ST, that’s a lot of money. But: ST only supports 55 of their ~1100 STM32 targets on Mbed OS, with every single one of them ported by hand. This means at least all startup code and linkerscripts are mostly duplicated for each target and all GPIO signal data is added manually by an unfortunate soul with all the side-effects of manual labor. That’s insane, as you’ve seen above, ST is already maintaining and using this data to generate code with CubeMX. How is this not automated? Fortunately, in the last few years there was some significant progress in enabling (new) programming languages on embedded, like MicroPython, JavaScript runtimes and perhaps the most significant of them: Embedded in Rust. I’ve been particularly impressed with the progress of the community surrounding @japaricious, who are currently tackling some very hard issues, like IO signal grouping or safe DMA APIs. I’ve kinda written this blog post for them, since I think they are best organized to actually use it and they don’t seem afraid to tackle these issues. (Your move, C++ people!)The Curious Case of xpcc’s Error Model2017-03-04T00:00:00+01:002017-03-04T00:00:00+01:00http://blog.salkinium.com/xpccs-error-model<p>In hindsight it is quite apparent that <a href="https://github.com/roboterclubaachen/xpcc">xpcc</a> and therefore also the <a href="https://twitter.com/RCA_eV">@RCA_eV robot code</a> were missing a good error model.
Until now xpcc’s way of dealing with failures included using <code class="language-plaintext highlighter-rouge">static_assert</code> at compile time and returning error codes at runtime whenever it was deemed necessary. We never considered runtime assertions, nor catching hardware errors like the ARM Cortex-M Fault exceptions. We crashed and burned, a few times literally.</p>
<p>So what can we do that is simple to use and efficient on AVR and Cortex-M devices, but still powerful enough to be useful? It’s time we thought about our error model.</p>
<p><strong>Update 2019: For <a href="https://modm.io">xpcc’s successor modm</a> this error model got improved for efficiency and flexibility, however, the main principle is still the same. <a href="https://modm.io/reference/module/modm-architecture-assert">See the <code class="language-plaintext highlighter-rouge">modm:architecture:assert</code> docs</a>.</strong></p>
<!--more-->
<h2 id="the-problem">The Problem</h2>
<p><a href="http://www.roboterclub.rwth-aachen.de/">The RCA robots</a> are controlled by a number of software components that communicate by Remote Procedure Calls (RPCs) via an event loop locally or over CAN.
We call this Cross Platform Component Communication (XPCC) and it’s an under-appreciated (and under-documented) part of the xpcc framework.
It allows us to distribute components over many microcontrollers if needed and helps us understand what is happening in the robot at runtime by listening in on the CAN bus.</p>
<p>However, we are constantly fine tuning our robots before and after a match and if we accidentally leave the CAN bus disconnected the robot turns into a (very expensive) paperweight and we lose the game. It is therefore paramount that we detect this situation on CAN initialization and let the robot emit loud and annoying sounds so that the <del>slaves</del> students can fix it. There are several other places in the initialization that must not fail for the same reason.</p>
<p>It wasn’t clear to us how and where to handle this type of failure though. Should the initialization code return an error code? What if we forgot to check it? Isn’t this a recurring problem?
It seemed like a good opportunity to heartily consult The Internet™ on the topic of error models, since surely other, smarter people have solved this problem already. Oh boy.</p>
<h2 id="the-research">The Research</h2>
<p><a href="http://joeduffyblog.com/2016/02/07/the-error-model/">Joe Duffy wrote a fantastically detailed article on the many considerations that went into the error model used in the Midori research project</a>. (You should read his <a href="http://joeduffyblog.com/2015/11/03/blogging-about-midori/">entire series on Midori</a>, there is a lot of gold there.)</p>
<p>There are a couple of points in there that resonated very strongly with me:</p>
<ol>
<li><a href="http://joeduffyblog.com/2016/02/07/the-error-model/#unchecked-exceptions">“Unchecked Exceptions”</a>: We can’t use C++ exceptions since the AVR toolchain does not support them. But even if we could, we wouldn’t, for the many reasons pointed out in this section. It’s actually quite horrifying to me how bad a match C++ exceptions are for a reliable system.</li>
<li><a href="http://joeduffyblog.com/2016/02/07/the-error-model/#to-build-a-reliable-system">“To Build a Reliable System”</a>: XPCC deals with failures prominently: RPC delivery can fail, components can decline RPCs (“I’m busy”) or simply fail during their execution (“I couldn’t grab this object”). We have to deal with these failures in order to get a reliable system that doesn’t get stuck on the first failure. You’d be surprised how many failures there can be during a Eurobot game under real world conditions. The fact that we can relatively simply retry actions or ultimately give up and move on is actually quite amazing.</li>
<li><a href="http://joeduffyblog.com/2016/02/07/the-error-model/#bugs-arent-recoverable-errors">“Bugs Aren’t Recoverable Errors!”</a>: This was the most important realization for me. When we are talking about the system clock or the CAN bus not initializing correctly, these are bugs. You cannot recover from them and the robot is stuck. However, XPCC failures as described above are recoverable errors and it’s fine for them to happen in normal operation.</li>
<li><a href="http://joeduffyblog.com/2016/02/07/the-error-model/#abandonment">“Abandonment”</a>: xpcc didn’t have a concept of abandonment and it doesn’t call any libc <code class="language-plaintext highlighter-rouge">exit()</code> functions. There are a couple of <code class="language-plaintext highlighter-rouge">while(1)</code> loops in the vector table (and hard fault handler), but there is no controlled teardown (with reporting) of failures. It’s crash’n’burn all the way down.</li>
</ol>
<p>Of course Midori’s goal of writing an entire operating system from scratch is a little higher on the scale of epicness than us coding our robots. And considering that they rolled their own language and compiler to implement this error model, it’s pretty clear that our solution can’t really compete with their very thorough approach.</p>
<h2 id="the-proposal">The Proposal</h2>
<p>We propose to continue returning error codes for recoverable errors but use assertions for bugs which can lead to abandonment. There is something appealing about the simplicity of using an <code class="language-plaintext highlighter-rouge">assert(condition)</code> in the code, so we decided to expand the function signature:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xpcc_assert</span><span class="p">(</span><span class="n">bool</span> <span class="n">condition</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">module</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">location</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">failure</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">context</span> <span class="o">=</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
<p>Yes, we’re using C-style <code class="language-plaintext highlighter-rouge">"strings"</code> to declare the assertion location and failure type instead of using enumerations or similar.
We came to the conclusion that it is a lot simpler to encode structured information using strings rather than keeping all error enumerations in sync to prevent duplicates.
Strings also consume significantly less memory than using a stringified test condition or a “pretty” function string, or even just <code class="language-plaintext highlighter-rouge">__LINE__</code> and <code class="language-plaintext highlighter-rouge">__FILE__</code> strings. It also makes it trivial to print the failure.
It made sense to us that the developer writing code with assertions categorizes the failure for the developer calling the code. It’s often difficult to assess the exact reason <em>why</em> an assertion failed from the stringified test condition alone.</p>
<p>When an assertion fails, it calls all registered assertion handlers one by one.
Assertion handlers have this signature:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Abandonment</span> <span class="nf">handler</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">module</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">location</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">failure</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">context</span><span class="p">);</span>
</code></pre></div></div>
<p>The identifiers allow these failure handlers to assess the scope and type of failure programmatically and return <code class="language-plaintext highlighter-rouge">Fail</code>, <code class="language-plaintext highlighter-rouge">DontCare</code> or <code class="language-plaintext highlighter-rouge">Ignore</code>.
If any of them returns <code class="language-plaintext highlighter-rouge">Fail</code> or all of them return <code class="language-plaintext highlighter-rouge">DontCare</code>, then execution is abandoned. Otherwise, if at least one of them returns <code class="language-plaintext highlighter-rouge">Ignore</code>, execution continues.
This allows us to ignore some select failures that we don’t care about.</p>
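<p>As a rough illustration of this decision rule, here is a minimal host-side sketch. It is not xpcc’s implementation: the <code class="language-plaintext highlighter-rouge">decide</code> function, the <code class="language-plaintext highlighter-rouge">Handler</code> alias and both example handlers are hypothetical names made up for this example, and only the three return values are taken from the description above.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;cstring&gt;

// The three verdicts named in the text. Everything else in this sketch
// (Handler, decide, the two example handlers) is hypothetical.
enum class Abandonment { DontCare, Ignore, Fail };

using Handler = Abandonment (*)(const char *module, const char *location,
                                const char *failure, std::uintptr_t context);

// Returns true if execution must be abandoned.
// All handlers are called, even if an earlier one already returned Fail.
bool decide(const Handler *handlers, std::size_t count,
            const char *module, const char *location,
            const char *failure, std::uintptr_t context)
{
    bool anyFail = false;
    bool anyIgnore = false;
    for (std::size_t i = 0; i != count; ++i) {
        switch (handlers[i](module, location, failure, context)) {
            case Abandonment::Fail:     anyFail = true;   break;
            case Abandonment::Ignore:   anyIgnore = true; break;
            case Abandonment::DontCare: break;
        }
    }
    // Fail always wins; without any Ignore the default is to abandon.
    return anyFail or not anyIgnore;
}

// Example handler: failures in the CAN module are always fatal.
Abandonment failOnCan(const char *module, const char *, const char *, std::uintptr_t)
{
    return std::strncmp(module, "can", 3) == 0 ? Abandonment::Fail
                                               : Abandonment::DontCare;
}

// Example handler: ignore everything reported by the UART module.
Abandonment ignoreUart(const char *module, const char *, const char *, std::uintptr_t)
{
    return std::strncmp(module, "uart", 4) == 0 ? Abandonment::Ignore
                                                : Abandonment::DontCare;
}
</code></pre></div></div>
<p>With these two handlers registered, a <code class="language-plaintext highlighter-rouge">can.init.timeout</code> failure is abandoned, a <code class="language-plaintext highlighter-rouge">uart.tx.overflow</code> failure is ignored, and a failure that no handler has an opinion about is abandoned by default.</p>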
<p>The abandonment handler is called last and has the same signature as the assertion handler. It is required that all assertion handlers are not blocking, so that they can all get called, and whatever blocking code is required can then run in the abandonment handler, where execution is trapped until the next reset anyway.</p>
<h2 id="the-example">The Example</h2>
<p>For our problem with the CAN bus timeout, an assertion is triggered and the <code class="language-plaintext highlighter-rouge">context</code> contains the instance of the CAN (<code class="language-plaintext highlighter-rouge">1</code> or <code class="language-plaintext highlighter-rouge">2</code>) that failed initialization.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Can1</span><span class="o">::</span><span class="n">initialize</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// [...] initialize CAN peripheral</span>
<span class="c1">// wait for CAN bus to be ready</span>
<span class="kt">int</span> <span class="n">deadlockPreventer</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">;</span> <span class="c1">// max ~1ms</span>
<span class="k">while</span> <span class="p">(</span><span class="n">not</span> <span class="n">busIsReady</span><span class="p">()</span> <span class="n">and</span> <span class="p">(</span><span class="n">deadlockPreventer</span><span class="o">--</span> <span class="o">></span> <span class="mi">0</span><span class="p">))</span>
<span class="n">xpcc</span><span class="o">::</span><span class="n">delayMicroseconds</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="n">xpcc_assert</span><span class="p">(</span><span class="n">deadlockPreventer</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"can"</span><span class="p">,</span> <span class="s">"init"</span><span class="p">,</span> <span class="s">"timeout"</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>An assertion handler then compares the first three characters to <code class="language-plaintext highlighter-rouge">"can"</code> and returns <code class="language-plaintext highlighter-rouge">Fail</code>, so execution is abandoned:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xpcc</span><span class="o">::</span><span class="n">Abandonment</span> <span class="nf">can_assertion_handler</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">module</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uintptr_t</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">module</span><span class="p">,</span> <span class="s">"can"</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">xpcc</span><span class="o">::</span><span class="n">Abandonment</span><span class="o">::</span><span class="n">Fail</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">xpcc</span><span class="o">::</span><span class="n">Abandonment</span><span class="o">::</span><span class="n">DontCare</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Register assertion handler with system</span>
<span class="n">XPCC_ASSERTION_HANDLER</span><span class="p">(</span><span class="n">can_assertion_handler</span><span class="p">);</span>
</code></pre></div></div>
<p>The abandon handler finally prints the failed assertion to the log and makes some loud bleepy noises:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">xpcc_abandon</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">module</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">location</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">failure</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">context</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">XPCC_LOG_ERROR</span><span class="p">.</span><span class="n">printf</span><span class="p">(</span><span class="s">"Assertion '%s.%s.%s' (0x%p) failed! Abandoning!</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">module</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">failure</span><span class="p">,</span> <span class="n">context</span><span class="p">);</span>
<span class="c1">// Make some noise!</span>
<span class="n">PiezoBuzzer</span><span class="o">::</span><span class="n">setOutput</span><span class="p">();</span>
<span class="k">while</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">PiezoBuzzer</span><span class="o">::</span><span class="n">set</span><span class="p">();</span>
<span class="n">xpcc</span><span class="o">::</span><span class="n">delayMilliseconds</span><span class="p">(</span><span class="mi">200</span><span class="p">);</span>
<span class="n">PiezoBuzzer</span><span class="o">::</span><span class="n">reset</span><span class="p">();</span>
<span class="n">xpcc</span><span class="o">::</span><span class="n">delayMilliseconds</span><span class="p">(</span><span class="mi">100</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>On an STM32 this prints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Assertion 'can.init.timeout' (0x00000001) failed! Abandoning!
</code></pre></div></div>
<p>We also log internal robot state via UART backed by a ring buffer of fixed size. If too much is logged at once, the buffer runs out of space, and we lose log output, which is undesirable. However, we cannot wait synchronously for space to become available in the buffer either, as this would impair the timing loops in our robot code.
Since continuing the game is obviously more important than preserving the log, we therefore ignore this failure in game mode:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Abandonment</span> <span class="nf">logger_buffer_overflow</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">module</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">location</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">failure</span><span class="p">,</span> <span class="kt">uintptr_t</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">module</span><span class="p">,</span> <span class="s">"uart"</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span> <span class="n">and</span>
<span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">location</span><span class="p">,</span> <span class="s">"tx"</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="n">and</span>
<span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">failure</span><span class="p">,</span> <span class="s">"overflow"</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">xpcc</span><span class="o">::</span><span class="n">Abandonment</span><span class="o">::</span><span class="n">Ignore</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">xpcc</span><span class="o">::</span><span class="n">Abandonment</span><span class="o">::</span><span class="n">DontCare</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Register assertion handler with system</span>
<span class="n">XPCC_ASSERTION_HANDLER</span><span class="p">(</span><span class="n">logger_buffer_overflow</span><span class="p">);</span>
</code></pre></div></div>
<p>Note how the assertion handlers only react to the failures they care about and otherwise leave the decision to other, potentially more specialized handlers.</p>
<h2 id="the-implementation">The Implementation</h2>
<p>Since we want to use assertions a lot in our code, but still keep the code size overhead as low as possible, we use two optimizations: <code class="language-plaintext highlighter-rouge">xpcc_assert</code> is actually a macro which:</p>
<ol>
<li>moves the condition test out of the function into the calling context, and</li>
<li>concatenates the module, location and failure strings into one big string.</li>
</ol>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define xpcc_assert(condition, module, location, failure, context) \
if (condition) {} else { \
xpcc_assert_fail(FLASH_STORAGE(module "\0" location "\0" failure), (uintptr_t) context); }
</code></pre></div></div>
<p>We cannot change that the test condition has to always be evaluated, but we don’t have to pass it as an argument into the assert function. That would require the compiler to cast the test result into a numeric value and move it into a register to comply with the ABI. If we branch outside of the assertion, the compiler can test the CPU flags directly.</p>
<p>Similarly, by concatenating the assertion identifier strings into one long string, the compiler only has to populate one register so it can save the code that fetches the other two pointers. (ARMv7-M uses literal pools for constants, while AVRs generate them ad-hoc using several load instructions, both actually quite expensive for code size.) The <code class="language-plaintext highlighter-rouge">xpcc_assert_fail</code> function then breaks the long string apart and passes the parts to the failure handlers as individual arguments.</p>
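<p>To make the memory layout of such a concatenated identifier a bit more tangible, here is a small sketch that can be compiled on a host. The <code class="language-plaintext highlighter-rouge">identifier</code> array and the <code class="language-plaintext highlighter-rouge">nextPart</code> helper are hypothetical names for illustration only and leave out <code class="language-plaintext highlighter-rouge">FLASH_STORAGE</code> entirely.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;cstring&gt;

// Adjacent string literals are concatenated by the compiler, so this is one
// array with embedded terminators: "can\0init\0timeout" plus the final '\0'.
static const char identifier[] = "can" "\0" "init" "\0" "timeout";

// 3 + 1 + 4 + 1 + 7 + 1 bytes in a single object, reachable via one pointer.
static_assert(sizeof(identifier) == 17, "all three strings share one array");

// Hypothetical helper: returns the string stored directly behind `part`.
const char *nextPart(const char *part)
{
    return part + std::strlen(part) + 1;
}
</code></pre></div></div>
<p>Starting from <code class="language-plaintext highlighter-rouge">identifier</code>, which reads as <code class="language-plaintext highlighter-rouge">"can"</code>, one call to <code class="language-plaintext highlighter-rouge">nextPart</code> yields <code class="language-plaintext highlighter-rouge">"init"</code> and a second one yields <code class="language-plaintext highlighter-rouge">"timeout"</code>, so only the first pointer has to be passed around.</p>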
<p>Also note the <code class="language-plaintext highlighter-rouge">FLASH_STORAGE</code> macro, which keeps the strings in Flash on AVRs and thus does not use any SRAM as it would normally do. This means that assertion handlers on AVRs need to use the <code class="language-plaintext highlighter-rouge">*_P</code> variants of the string compare functions. This is an acceptable caveat for us, since assertion and abandon handlers are part of the application and not the library and therefore don’t need to be shared across platforms.</p>
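<p>For illustration, an AVR flavored version of the CAN handler might look roughly like the sketch below. It assumes avr-libc’s <code class="language-plaintext highlighter-rouge">strncmp_P</code>, which takes the SRAM string first and the Flash string second. The handler name, the local <code class="language-plaintext highlighter-rouge">Abandonment</code> enum (a stand-in for <code class="language-plaintext highlighter-rouge">xpcc::Abandonment</code>) and the host fallback are made up for this example so that it also compiles off-target.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#ifdef __AVR__
#   include &lt;avr/pgmspace.h&gt;
#else
// Host-only stand-in: without a separate Flash address space the plain
// strncmp does the same job. This fallback is not part of xpcc or avr-libc.
#   define strncmp_P strncmp
#endif

// Stand-in for xpcc::Abandonment so that the sketch is self-contained.
enum class Abandonment { DontCare, Ignore, Fail };

// Hypothetical AVR variant of the CAN handler shown earlier.
Abandonment can_assertion_handler_avr(const char *module, const char *,
                                      const char *, uintptr_t)
{
    // On AVR `module` points into Flash, so it has to be the second
    // (program memory) argument; the literal "can" is read from SRAM.
    if (strncmp_P("can", module, 3) == 0) {
        return Abandonment::Fail;
    }
    return Abandonment::DontCare;
}
</code></pre></div></div>
<p>The only difference to the Cortex-M handler is the compare function and its argument order; the decision logic itself stays the same.</p>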
<h3 id="registering-assertion-handlers">Registering assertion handlers</h3>
<p>The tricky part is how to register the assertion handlers to the <code class="language-plaintext highlighter-rouge">xpcc_assert_fail</code> function. We use the linker to collect all assertion handlers across the entire executable and place pointers to them into the same linker section using the <code class="language-plaintext highlighter-rouge">XPCC_ASSERTION_HANDLER</code> macro. Note how it forces the assertion handler to have the right signature by using the <code class="language-plaintext highlighter-rouge">xpcc::AssertionHandler</code> type:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define XPCC_ASSERTION_HANDLER(handler) \
__attribute__((section(XPCC_ASSERTION_LINKER_SECTION), used)) \
const xpcc::AssertionHandler \
handler ## _assertion_handler_ptr = handler
</code></pre></div></div>
<p>Adding custom linker sections to ARM Cortex-M devices is trivial, especially since xpcc generates the linkerscript from a central template. It’s literally just adding these lines:</p>
<pre><code class="language-ld">.assertion : ALIGN(4)
{
__assertion_table_start = .;
KEEP(*(.assertion))
__assertion_table_end = .;
} >FLASH
</code></pre>
<p>The code for <code class="language-plaintext highlighter-rouge">xpcc_assert_fail</code>, which calls all assertion handlers, is pretty simple. <code class="language-plaintext highlighter-rouge">xpcc_abandon</code> here is a weak function that can be overridden by the application:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="n">AssertionHandler</span> <span class="n">__assertion_table_start</span><span class="p">;</span>
<span class="k">extern</span> <span class="n">AssertionHandler</span> <span class="n">__assertion_table_end</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">xpcc_assert_fail</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">identifier</span><span class="p">,</span> <span class="kt">uintptr_t</span> <span class="n">context</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// split up the identifier back into three pointers</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">module</span> <span class="o">=</span> <span class="n">identifier</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">location</span> <span class="o">=</span> <span class="n">module</span> <span class="o">+</span> <span class="n">strlen</span><span class="p">(</span><span class="n">module</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">failure</span> <span class="o">=</span> <span class="n">location</span> <span class="o">+</span> <span class="n">strlen</span><span class="p">(</span><span class="n">location</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">// initialize with DontCare in case no assertion handlers were registered</span>
<span class="n">Abandonment</span> <span class="n">state</span> <span class="o">=</span> <span class="n">Abandonment</span><span class="o">::</span><span class="n">DontCare</span><span class="p">;</span>
<span class="c1">// call all assertion handlers</span>
<span class="n">AssertionHandler</span> <span class="o">*</span> <span class="n">handler</span> <span class="o">=</span> <span class="o">&</span><span class="n">__assertion_table_start</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">handler</span> <span class="o"><</span> <span class="o">&</span><span class="n">__assertion_table_end</span><span class="p">;</span> <span class="n">handler</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">state</span> <span class="o">|=</span> <span class="p">(</span><span class="o">*</span><span class="n">handler</span><span class="p">)(</span><span class="n">module</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">failure</span><span class="p">,</span> <span class="n">context</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// abandon if all returned DontCare, or any returned Fail</span>
<span class="k">if</span> <span class="p">(</span><span class="n">state</span> <span class="o">==</span> <span class="n">Abandonment</span><span class="o">::</span><span class="n">DontCare</span> <span class="n">or</span>
<span class="n">state</span> <span class="o">&</span> <span class="n">Abandonment</span><span class="o">::</span><span class="n">Fail</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">xpcc_abandon</span><span class="p">(</span><span class="n">module</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">failure</span><span class="p">,</span> <span class="n">context</span><span class="p">);</span>
<span class="k">while</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
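<p>The abandonment decision at the end of that function can be sketched in isolation. The bit values of <code class="language-plaintext highlighter-rouge">Abandonment</code> below are an assumption made for this example (the listing above only requires that the results can be OR-ed together), and <code class="language-plaintext highlighter-rouge">must_abandon</code> is a hypothetical helper:</p>

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Assumed bit layout for this sketch; the real xpcc values may differ.
enum class Abandonment : uint8_t
{
    DontCare = 0x01,
    Ignore   = 0x02,
    Fail     = 0x04,
};

// Converts a result into its raw bit so results can be OR-ed together.
constexpr uint8_t
bits(Abandonment a)
{
    return static_cast<uint8_t>(a);
}

// Returns true if the abandon handler would be called for these results.
inline bool
must_abandon(std::initializer_list<Abandonment> results)
{
    // Start with DontCare in case no handlers were registered at all.
    uint8_t state = bits(Abandonment::DontCare);
    for (Abandonment result : results) state |= bits(result);
    // Abandon if nobody cared, or if any handler voted Fail.
    return state == bits(Abandonment::DontCare) ||
           (state & bits(Abandonment::Fail)) != 0;
}

// Walks through the interesting combinations.
inline void
must_abandon_demo()
{
    assert(must_abandon({}));                                          // no handlers
    assert(must_abandon({Abandonment::DontCare}));                     // nobody cared
    assert(!must_abandon({Abandonment::DontCare, Abandonment::Ignore})); // ignored
    assert(must_abandon({Abandonment::Ignore, Abandonment::Fail}));    // Fail wins
}
```

<p>Note that under these assumptions a single <code class="language-plaintext highlighter-rouge">Fail</code> overrides any number of <code class="language-plaintext highlighter-rouge">Ignore</code> votes.</p>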
<p>This code is the same for Linux and OS X, except we need to adapt the section names, so that the dynamic linker can generate symbols for these custom sections at load time. The section names must not have a period in their name and the symbols follow a certain naming convention, all of which are different for these platforms:</p>
<table>
<thead>
<tr>
<th style="text-align: center">platform</th>
<th style="text-align: center">section name</th>
<th style="text-align: center">symbol names</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">AVR <br /> Cortex-M</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">".assertion"</code></td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">__assertion_table_start</code> <br /> <code class="language-plaintext highlighter-rouge">__assertion_table_end</code></td>
</tr>
<tr>
<td style="text-align: center">OS X</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">"__DATA,xpcc_assertion"</code></td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">"section$start$__DATA$xpcc_assertion"</code><br /><code class="language-plaintext highlighter-rouge">"section$end$__DATA$xpcc_assertion"</code></td>
</tr>
<tr>
<td style="text-align: center">Linux</td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">"xpcc_assertion"</code></td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">__start_xpcc_assertion</code> <br /> <code class="language-plaintext highlighter-rouge">__stop_xpcc_assertion</code></td>
</tr>
</tbody>
</table>
<p>To access the symbols on OS X you need to bind them to their assembly name:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="n">AssertionHandler</span> <span class="n">__assertion_table_start</span> <span class="nf">__asm</span><span class="p">(</span><span class="s">"section$start$__DATA$xpcc_assertion"</span><span class="p">);</span>
<span class="k">extern</span> <span class="n">AssertionHandler</span> <span class="n">__assertion_table_end</span> <span class="nf">__asm</span><span class="p">(</span><span class="s">"section$end$__DATA$xpcc_assertion"</span><span class="p">);</span>
</code></pre></div></div>
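<p>For the Linux row of the table, the whole mechanism fits into a small hosted sketch. It assumes GCC or Clang with a GNU-compatible linker; the <code class="language-plaintext highlighter-rouge">DEMO_ASSERTION_HANDLER</code> macro, the handler names and the <code class="language-plaintext highlighter-rouge">Abandonment</code> values are made up for this example. The start and stop symbols are declared as arrays of unknown bound here (unlike the listing above) so that the compiler makes no assumption about the number of entries:</p>

```cpp
#include <cassert>
#include <cstdint>

// Assumed definitions for this sketch; the real xpcc types may differ.
enum class Abandonment : uint8_t { DontCare = 0x01, Ignore = 0x02, Fail = 0x04 };
using AssertionHandler =
    Abandonment (*)(const char *, const char *, const char *, uintptr_t);

// Places a pointer to the handler into the "xpcc_assertion" section.
// `used` keeps the otherwise unreferenced pointer from being discarded.
#define DEMO_ASSERTION_HANDLER(handler) \
    __attribute__((section("xpcc_assertion"), used)) \
    const AssertionHandler handler ## _assertion_handler_ptr = handler

// The linker generates these symbols because the section name is a
// valid C identifier.
extern const AssertionHandler __start_xpcc_assertion[];
extern const AssertionHandler __stop_xpcc_assertion[];

static Abandonment
first_handler(const char *, const char *, const char *, uintptr_t)
{
    return Abandonment::DontCare;
}
static Abandonment
second_handler(const char *, const char *, const char *, uintptr_t)
{
    return Abandonment::Ignore;
}
DEMO_ASSERTION_HANDLER(first_handler);
DEMO_ASSERTION_HANDLER(second_handler);

// Calls every registered handler once and returns how many were found.
inline int
call_all_handlers()
{
    int count = 0;
    for (const AssertionHandler *handler = __start_xpcc_assertion;
         handler < __stop_xpcc_assertion; handler++)
    {
        (*handler)("demo", "demo", "demo", 0);
        count++;
    }
    return count;
}

// Both handlers registered above must show up in the table.
inline void
linker_section_demo()
{
    assert(call_all_handlers() == 2);
}
```

<p>The handlers never reference each other or a central registry; the linker does all the bookkeeping.</p>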
<p><strong>3 Feb 2018 – Update:</strong> We define some default assertion handlers inside the xpcc library source, which is first compiled into the <code class="language-plaintext highlighter-rouge">libxpcc.a</code> archive, then linked against by the application. However, the linker by default only searches archives for <em>referenced</em> symbols, which our handlers are obviously not, and therefore these handlers are omitted from the final executable. This can cause some very subtle and annoying bugs!</p>
<p>The solution is to wrap the archive in <code class="language-plaintext highlighter-rouge">-Wl,--whole-archive -lxpcc -Wl,--no-whole-archive</code>. The <a href="https://sourceware.org/binutils/docs/ld/Options.html#Options">GNU ld documentation</a> describes this quite well: “For each archive mentioned on the command line after the <code class="language-plaintext highlighter-rouge">--whole-archive</code> option, include every object file in the archive in the link, rather than searching the archive for the required object files.”</p>
<p>Note that this just makes all symbols <em>visible</em> to the linker; it does not force inclusion of all symbols, especially not if you pass the <code class="language-plaintext highlighter-rouge">--gc-sections</code> option as well.</p>
<h4 id="avrs-are-annoying">AVRs are annoying</h4>
<p>The most pain was getting this to work on AVRs though. The issue is that their address space is limited to 16-bit and instructions and data are placed into physically separate memories each with their own 16-bit address space. Or in other words, <a href="https://en.wikipedia.org/wiki/Harvard_architecture">AVRs implement a Harvard architecture</a> and one does not simply read data from the instruction memory on a Harvard architecture. AVRs load their read-only data from Flash to SRAM at boot time, <em>including all strings</em>, since there is no way of telling from a 16-bit address whether it points to the instruction or the data memory. Hey, don’t look at me, it’s an 8-bit CPU, you get what you pay for!</p>
<p>This does, however, mean that there now need to be two versions of the same section in memory. GNU ld deals with this by letting you specify two addresses per section: <a href="https://sourceware.org/binutils/docs/ld/Output-Section-LMA.html">the virtual address (VMA) and the load address (LMA)</a>.
For read-only data the LMA is somewhere in Flash, while the VMA is in SRAM, and these are <em>different</em> memories even when the section addresses overlap numerically!</p>
<p>Let me illustrate the problem with a simplified excerpt of the linkerscript itself.
You can see the <code class="language-plaintext highlighter-rouge">.data</code> section is appended onto the <code class="language-plaintext highlighter-rouge">text</code> memory after the <code class="language-plaintext highlighter-rouge">.text</code> section (LMA), but placed into the <code class="language-plaintext highlighter-rouge">data</code> memory too (VMA):</p>
<pre><code class="language-ld">MEMORY
{
text (rx) : ORIGIN = 0, LENGTH = 8k
data (rw!x) : ORIGIN = 0x800060, LENGTH = 0xffa0
}
/* everything in Flash */
.text :
{
*(.progmem*) /* things tagged with `PROGMEM` go here! */
*(.text*) /* the actual code */
} > text
/* everything in SRAM */
.data :
{
*(.data*) /* modifiable data */
*(.rodata*) /* read-only data */
} > data AT> text
</code></pre>
<p>This is shown more obviously in the listing of the linked executable:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000850 00000000 00000000 000000b4 2**1
1 .data 00000014 00800100 00000850 00000904 2**0
</code></pre></div></div>
<p>So what we need to do is simply™ append our section to the <code class="language-plaintext highlighter-rouge">text</code> memory after the <code class="language-plaintext highlighter-rouge">.data</code> section, right? Well…
<code class="language-plaintext highlighter-rouge">avr-gcc</code> uses its own linkerscripts (which can be found in <code class="language-plaintext highlighter-rouge">avr-binutils/avr/lib/ldscripts</code>), so we cannot just add our custom section as we did for the ARM platform.
Fortunately, GNU ld allows extending the default linkerscript using the <a href="https://sourceware.org/binutils/docs/ld/Miscellaneous-Commands.html"><code class="language-plaintext highlighter-rouge">INSERT [ AFTER | BEFORE ] output_section</code> command</a>.
We can pass this script to <code class="language-plaintext highlighter-rouge">avr-ld</code> via the <code class="language-plaintext highlighter-rouge">-T</code> option:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SECTIONS
{
.xpcc_assertion : ALIGN(2)
{
__assertion_table_start = .;
KEEP(*(.assertion))
__assertion_table_end = .;
}
}
INSERT AFTER .data
</code></pre></div></div>
<p>This places the section exactly where we want it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000850 00000000 00000000 000000b4 2**1
1 .data 00000014 00800100 00000850 00000904 2**0
2 .xpcc_assertion 00000006 00000864 00000864 00000918 2**1
</code></pre></div></div>
<p>The code for <code class="language-plaintext highlighter-rouge">xpcc_assert_fail</code> also needs to be adapted for reading from Flash:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// use *_P string functions from <avr/pgmspace.h></span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">module</span> <span class="o">=</span> <span class="n">identifier</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">location</span> <span class="o">=</span> <span class="n">module</span> <span class="o">+</span> <span class="n">strlen_P</span><span class="p">(</span><span class="n">module</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">failure</span> <span class="o">=</span> <span class="n">location</span> <span class="o">+</span> <span class="n">strlen_P</span><span class="p">(</span><span class="n">location</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">// we can't access the function pointer directly, cos it's not in RAM</span>
<span class="n">AssertionHandler</span> <span class="o">*</span> <span class="n">table_addr</span> <span class="o">=</span> <span class="o">&</span><span class="n">__assertion_table_start</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">table_addr</span> <span class="o"><</span> <span class="o">&</span><span class="n">__assertion_table_end</span><span class="p">;</span> <span class="n">table_addr</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// first fetch the function pointer from flash, then jump to it</span>
<span class="n">AssertionHandler</span> <span class="n">handler</span> <span class="o">=</span> <span class="p">(</span><span class="n">AssertionHandler</span><span class="p">)</span> <span class="n">pgm_read_word</span><span class="p">(</span><span class="n">table_addr</span><span class="p">);</span>
<span class="n">state</span> <span class="o">|=</span> <span class="n">handler</span><span class="p">(</span><span class="n">module</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">failure</span><span class="p">,</span> <span class="n">context</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Well, that was easy. This code works fine until the AVR <code class="language-plaintext highlighter-rouge">.text + .data</code> section size gets so large that it pushes the <code class="language-plaintext highlighter-rouge">.xpcc_assertion</code> section above the 64kB address boundary (AVRs can have up to 128kB Flash, don’t ask /o\). Then <code class="language-plaintext highlighter-rouge">table_addr</code> would wrap around and read garbage. For us this is an acceptable caveat. I mean, if you really get to <em>that</em> point, you should sit down and ask yourself some hard questions about your life.</p>
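<p>The wrap-around is plain 16-bit truncation, which a few compile-time checks can illustrate. The addresses are made-up examples and <code class="language-plaintext highlighter-rouge">as_16bit_pointer</code> is a hypothetical helper that only models what happens to the address, not the actual AVR toolchain behaviour:</p>

```cpp
#include <cstdint>

// Models storing a Flash byte address in a 16-bit pointer:
// everything above bit 15 is lost.
constexpr uint16_t
as_16bit_pointer(uint32_t flash_address)
{
    return static_cast<uint16_t>(flash_address & 0xFFFFu);
}

// Below 64kB the address survives unchanged.
static_assert(as_16bit_pointer(0x00000864u) == 0x0864,
              "address below 64kB must be preserved");
// Above 64kB it aliases an unrelated address in the lower 64kB.
static_assert(as_16bit_pointer(0x00010864u) == 0x0864,
              "address above 64kB wraps around");
```

<p>So a table placed at <code class="language-plaintext highlighter-rouge">0x10864</code> would be read from <code class="language-plaintext highlighter-rouge">0x0864</code> instead, which in the listing above is where the table happens to live in a much smaller binary.</p>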
<h2 id="the-evaluation">The Evaluation</h2>
<p>So what are the properties of our solution?</p>
<h3 id="overhead">Overhead</h3>
<p>Our assertions are a simple concept with a very low overall code size overhead and, when the assertion succeeds, a low execution time penalty, even on AVRs.
There is obviously an unavoidable overhead for checking the test condition; safety doesn’t come for free.
But what is the code size penalty per assertion in the code? We’ll benchmark using this assertion:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xpcc_assert</span><span class="p">(</span><span class="n">timeout</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="s">"can"</span><span class="p">,</span> <span class="s">"init"</span><span class="p">,</span> <span class="s">"timeout"</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>
<p>On AVRs, the assembly shows a simple condition check, a branch over the failure path for when the assertion passes, and otherwise 4 loads and a call to <code class="language-plaintext highlighter-rouge">xpcc_assert_fail</code>:</p>
<pre><code class="language-asm">2e4: 81 11 cpse r24, r1 ; condition check
2e6: 05 c0 rjmp .+10 ; branch over
2e8: 60 e0 ldi r22, 0x01 ; context is 16-bit
2ea: 70 e0 ldi r23, 0x00 ; constant and 1
2ec: 83 ea ldi r24, 0xA3 ; load ptr to progmem string
2ee: 90 e0 ldi r25, 0x00 ; progmem below text, hence 0 here
2f0: 5d d1 rcall .+698 ; call <xpcc_assert_fail>
</code></pre>
<p>On ARMv7-M the assembly is a little different. The simple condition check branches over if the assertion passes, otherwise <code class="language-plaintext highlighter-rouge">mov</code>es and loads the two arguments before loading and calling <code class="language-plaintext highlighter-rouge">xpcc_assert_fail</code>:</p>
<pre><code class="language-asm">80001ca: f003 01ff and.w r1, r3, #255 ; condition check
80001cc: b913 cbnz r3, 80001d8 ; branch over
80001d0: 2100 movs r1, #1 ; context is constant and 1
80001d2: 4803 ldr r0, [pc, #12] ; load value @ 80001e0
80001d4: 4b03 ldr r3, [pc, #12] ; load value @ 80001e4
80001d6: 4798 blx r3 ; call <xpcc_assert_fail>
... ; hey look, a literal pool
80001e0: 08000d8c .word 0x08000d8c ; pointer to string
80001e4: 08000521 .word 0x08000521 ; pointer to function
</code></pre>
<p>The minimal code overhead per assertion call is 14B on AVR and 20B on ARMv7-M, but depending on the complexity of the test condition, more code can be generated.
However, a failed assertion does carry a time penalty: all assertion handlers are always called. Furthermore, everything executes on the currently active stack; maybe we’ll change that in the future.</p>
<h3 id="atomicity">Atomicity</h3>
<p>A failed assert disables interrupts since its implementation is not reentrant!
Keep in mind that our ARMv7-M HardFault handler also eventually calls <code class="language-plaintext highlighter-rouge">xpcc_assert_fail</code> and, due to its hardcoded priority, cannot be interrupted anyway. So it’s best to always have the same behavior everywhere.</p>
<p>The abandon handler may choose to re-enable interrupts if required, for example to allow the UART driver to print the failure reason.
Furthermore if mission critical systems need to continue running, then the abandon handler can keep them alive. For us this would include maybe putting the robot in a mechanically safe configuration before shutting down the motor drivers.</p>
<h3 id="nesting">Nesting</h3>
<p>Failing an assertion while already handling a failed assertion is not allowed and leads to an immediate termination (aka. an infinite loop). This can happen quicker than you think. Remember the abandon handler printing the failure over UART? What if the failure is the UART buffer overflowing? Yeah, that.</p>
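<p>A nesting guard of this kind could look roughly like the following host-side sketch. All names are hypothetical, and instead of spinning in an infinite loop the sketch returns a result so that the nested case can be observed:</p>

```cpp
#include <cassert>

// Hypothetical result type so the sketch can report what happened
// instead of halting like the real implementation would.
enum class FailResult { Abandoned, NestedFailure };
using AbandonHandler = void (*)();

// Set while a failed assertion is being handled.
static bool handling_failure = false;

// Runs the abandon handler unless a failure is already being handled.
inline FailResult
guarded_assert_fail(AbandonHandler abandon)
{
    if (handling_failure) {
        // On the target this is the immediate termination: while(1);
        return FailResult::NestedFailure;
    }
    handling_failure = true;
    abandon();
    // Only reached in this sketch, since the real code never returns.
    handling_failure = false;
    return FailResult::Abandoned;
}

static FailResult nested_result = FailResult::Abandoned;
static void noop_handler() {}

// Models an abandon handler whose UART output overflows the buffer,
// which fails another assertion while the first is still being handled.
static void
failing_uart_handler()
{
    nested_result = guarded_assert_fail(noop_handler);
}

// The outer failure is handled, the inner one is detected as nested.
inline void
nesting_demo()
{
    assert(guarded_assert_fail(failing_uart_handler) == FailResult::Abandoned);
    assert(nested_result == FailResult::NestedFailure);
}
```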
<h3 id="documentation">Documentation</h3>
<p>There is no way of knowing if the function you’re calling can fail an assert, except from documentation. This can be a big issue, especially when inadvertently failing assertions from inside an interrupt context, which would call all assertion handlers and the abandon handler from this context too.</p>
<p>This is a difficult problem to fix in general, but it doesn’t need to be solved perfectly: The application could be compiled in “assertion debug mode” where every assertion calls an “awareness” handler regardless of the test condition. This could also help with profiling assertion usage.</p>
<h3 id="ignoring-assertions">Ignoring Assertions</h3>
<p>It is a bit weird that contrary to C++ exceptions, the caller cannot handle the assertion directly at the call site, but only globally.
We tried to make it easier by allowing declarations of global assertion handlers anywhere, so that they can at least be declared closer to the call site.
But if you ignore an assertion, execution will continue, and there is no way to let the caller know that an assertion occurred, except to set a flag in shared memory:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">bool</span> <span class="n">assertion_failed</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">Abandonment</span> <span class="nf">ignore_uart_buffer</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span> <span class="n">module</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">uintptr_t</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">strncmp</span><span class="p">(</span><span class="n">module</span><span class="p">,</span> <span class="s">"uart"</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span> <span class="p">{</span>
<span class="n">assertion_failed</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="k">return</span> <span class="n">xpcc</span><span class="o">::</span><span class="n">Abandonment</span><span class="o">::</span><span class="n">Ignore</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">xpcc</span><span class="o">::</span><span class="n">Abandonment</span><span class="o">::</span><span class="n">DontCare</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">XPCC_ASSERTION_HANDLER</span><span class="p">(</span><span class="n">ignore_uart_buffer</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">caller_function</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">call_function_with_assertion</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">assertion_failed</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assertion_failed</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
<span class="c1">// do something else</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Admittedly, this is an edge case and the vast majority of assertion failures cannot be ignored, as there is nothing the caller can do and abandonment is exactly the right choice.</p>
<h3 id="abandonment-causes">Abandonment Causes</h3>
<p>As food for thought, here are the causes of abandonment in Midori and the possible implementations in xpcc. Note that AVRs don’t have fault handlers, they just quietly choke on their bits until they die in a plume of blue smoke.</p>
<table>
<thead>
<tr>
<th style="text-align: left">bug description</th>
<th style="text-align: left">xpcc implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">An incorrect cast</td>
<td style="text-align: left">undetectable at runtime</td>
</tr>
<tr>
<td style="text-align: left">An attempt to dereference a <code class="language-plaintext highlighter-rouge">null</code> pointer</td>
<td style="text-align: left">Hard Fault or unpredictable (AVR)</td>
</tr>
<tr>
<td style="text-align: left">An attempt to access an array outside of its bounds</td>
<td style="text-align: left">detectable only with wrapper code</td>
</tr>
<tr>
<td style="text-align: left">Divide-by-zero</td>
<td style="text-align: left">Hard Fault or <code class="language-plaintext highlighter-rouge">xpcc_assert</code> (software)</td>
</tr>
<tr>
<td style="text-align: left">An unintended mathematical over/underflow</td>
<td style="text-align: left">detectable only with wrapper code</td>
</tr>
<tr>
<td style="text-align: left">Out-of-memory</td>
<td style="text-align: left"><code class="language-plaintext highlighter-rouge">xpcc_assert</code> in dynamic allocator</td>
</tr>
<tr>
<td style="text-align: left">Stack overflow</td>
<td style="text-align: left">Hard Fault or undetectable (AVR)</td>
</tr>
<tr>
<td style="text-align: left">Explicit abandonment</td>
<td style="text-align: left"><code class="language-plaintext highlighter-rouge">xpcc_assert(false, ...)</code></td>
</tr>
<tr>
<td style="text-align: left">Contract failures</td>
<td style="text-align: left">not a part of C/C++ (sadly)</td>
</tr>
<tr>
<td style="text-align: left">Assertion failures</td>
<td style="text-align: left">uh, well, <code class="language-plaintext highlighter-rouge">xpcc_assert</code></td>
</tr>
</tbody>
</table>
<h2 id="the-conclusion">The Conclusion</h2>
<p>Our solution isn’t anywhere near as polished and well thought out as Midori’s, but considering our restrictions it’s not completely terrible.
I would claim that it works for enough of our use cases to be useful and it allows for a lot of flexibility in responding to failed assertions.
Our approach of encoding the failure as a string is novel in the context of microcontrollers and is very efficient too.</p>
<p>We see this as a good enough alternative to C++ exceptions and will be using it a lot in xpcc.</p>Niklas Hauserniklas@salkinium.comIn hindsight it is quite apparent that xpcc and therefore also the @RCA_eV robot code was missing a good error model. Until now xpcc’s way of dealing with failures included using static_assert at compile time and returning error codes at runtime whenever it was deemed necessary. We never considered runtime assertions, nor catching hardware errors like the ARM Cortex-M Fault exceptions. We crashed and burned, a few times literally. So what can we do that is simple to use and efficient on AVR and Cortex-M devices, but still powerful enough to be useful? It’s time we thought about our error model. Update 2019: For xpcc’s successor modm this error model got improved for efficiency and flexibility, however, the main principle is still the same. See the modm:architecture:assert docs.Computing and Asserting Baudrate Settings at Compile Time2015-06-08T00:00:00+02:002015-06-08T00:00:00+02:00http://blog.salkinium.com/computing-baudrates-at-compile-time<p>Prescaler and baudrate calculations are a tricky topic.
I have had many situations where the baudrate turned out to be off by a couple of percent, which was enough to render my serial output streams unreadable.
Sure, calculating the baudrate error beforehand would have saved me some hours of useless debugging, however, that would require understanding the often complicated mathematical formula hidden somewhere in the depths of the datasheet describing the prescaler vs. baudrate relationship.</p>
<p>And <em>that</em> seemed to be more work than just using a logic analyzer to measure the resulting error.
Of course, this felt like using a sledgehammer to crack a nut and it was neither a fast nor practical solution.</p>
<p>I think there exists a better solution and I think it can be done using pure C++.
This solution needs to be able to:</p>
<ol>
<li>compute the best possible prescaler settings for the desired baudrate, and</li>
<li>notify me when the desired baudrate cannot be achieved without unreasonable error.</li>
</ol>
<!--more-->
<h2 id="qualifying-baudrates">Qualifying Baudrates</h2>
<p>An important characteristic of clock prescalers is their finite range and resolution, which has an obvious impact on baudrate generation.</p>
<p>Let’s look at the characteristics of the three most commonly used prescalers:</p>
<ol>
<li>the power-of-two prescaler,</li>
<li>the linear prescaler, and</li>
<li>the fractional prescaler.</li>
</ol>
<h4 id="power-of-two">Power of Two</h4>
<p>This type of prescaler is often used to clock peripherals which do not require a high resolution and can operate in a wide range of frequencies, such as ADCs and even SPI.
Its behaviour is described by this formula:</p>
<center>
<p><img invertible="" src="prescaler_power_of_two.svg" /></p>
</center>
<h4 id="linear">Linear</h4>
<p>Linear prescalers are the most common type of prescaler found in microcontrollers. They typically generate clocks for timers and synchronous communication peripherals such as I<sup>2</sup>C and SPI.
Since the divisor must not be zero for obvious reasons, the input values are either mapped so that writing a zero turns the peripheral off, or the hardware adds a one to the input (mapping 0⟶1, 1⟶2, etc…).</p>
<center>
<p><img invertible="" src="prescaler_linear.svg" /></p>
</center>
<h4 id="fractional">Fractional</h4>
<p>This prescaler is used whenever a clock is required that cannot be generated purely by integer division. The most typical application is baudrate generation for asynchronous communication such as UART.
The divisor is usually formatted as a fixed-point binary fraction.
It must be understood that these prescalers cannot generate a <em>true</em> fractional output frequency, but use a <a href="http://en.wikipedia.org/wiki/Dual-modulus_prescaler">dual-modulus hardware logic</a>, so that the desired output frequency is met <strong>on average</strong>!</p>
<center>
<p><img invertible="" src="prescaler_fractional.svg" /></p>
</center>
<h4 id="analysis">Analysis</h4>
<p>Here is the graphical comparison of these three prescaler functions, plotting 10 input values onto the normalized output value for all three functions. The power-of-two prescaler is light gray, the linear prescaler dark gray and is overlaid on the fractional prescaler:</p>
<center>
<p><img invertible="" src="prescaler_graphs.svg" /></p>
</center>
<p>There are three very interesting observations to be made:</p>
<ol>
<li>the power-of-two prescaler falls a lot faster than the others: In 10 steps it reaches <sup>1</sup>/<sub>1024</sub> instead of <sup>1</sup>/<sub>10</sub> for the linear prescaler.</li>
<li>neither the power-of-two nor the linear prescaler can generate anything between 1.0 and 0.5.</li>
<li>the distribution of generatable output frequencies is (obviously) not evenly spaced.</li>
</ol>
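<p>These observations can be checked numerically. The sketch below assumes the usual forms of the three prescalers (the formulas above are only available as images): division by 2<sup>n</sup>, division by <em>n</em>, and division by a fixed-point value. All function names are made up for this example:</p>

```cpp
#include <cassert>

// Normalized output (f_out / f_in) of a power-of-two prescaler.
// Assumed form: division by 2^n, valid for n < 32.
constexpr double
power_of_two_prescaler(unsigned n)
{
    return 1.0 / static_cast<double>(1u << n);
}

// Normalized output of a linear prescaler.
// Assumed form: division by n, valid for n >= 1.
constexpr double
linear_prescaler(unsigned n)
{
    return 1.0 / static_cast<double>(n);
}

// Normalized average output of a fractional prescaler whose divisor is
// the fixed-point value raw / 2^fraction_bits (assumed form, raw >= 1).
constexpr double
fractional_prescaler(unsigned raw, unsigned fraction_bits)
{
    return static_cast<double>(1u << fraction_bits) / static_cast<double>(raw);
}

// Reproduces the three observations from the plot.
inline void
prescaler_demo()
{
    // In 10 steps: 1/1024 versus 1/10.
    assert(power_of_two_prescaler(10) == 1.0 / 1024.0);
    assert(linear_prescaler(10) == 1.0 / 10.0);
    // The first two integer steps jump straight from 1.0 to 0.5.
    assert(linear_prescaler(1) == 1.0 && linear_prescaler(2) == 0.5);
    assert(power_of_two_prescaler(0) == 1.0 && power_of_two_prescaler(1) == 0.5);
    // Only the fractional prescaler lands in between, e.g. divisor 1.5.
    assert(fractional_prescaler(24, 4) > 0.5 && fractional_prescaler(24, 4) < 1.0);
}
```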
<p>None of these prescalers is particularly suited to generating high arbitrary output frequencies.
This also goes for the fractional prescaler, which can only switch between two fixed frequencies.</p>
<p>Assume you have a SPI slave that can be clocked up to 30 MHz, however your primary clock input is 40 MHz.
A fractional prescaler will clock the slave half the time with 40 MHz and the other half with 20 MHz to achieve the desired 30 MHz on average.
However, the slave might start to glitch on the 40 MHz part of the clocking, due to its electrical and timing characteristics, therefore this is not a practical solution.</p>
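<p>The arithmetic behind this example can be sketched as follows, assuming a dual-modulus scheme that alternates between dividing by <em>n</em> and by <em>n+1</em>. The helper name and its parameters are hypothetical:</p>

```cpp
#include <cassert>

// Average output frequency of a dual-modulus prescaler that produces
// `cycles_at_div` output cycles while dividing by `div` and
// `cycles_at_next` output cycles while dividing by `div + 1`.
// Average = f_in * (output cycles) / (input cycles consumed).
constexpr double
dual_modulus_average(double f_in, unsigned div,
                     unsigned cycles_at_div, unsigned cycles_at_next)
{
    return f_in * static_cast<double>(cycles_at_div + cycles_at_next) /
           static_cast<double>(div * cycles_at_div + (div + 1) * cycles_at_next);
}

// 40 MHz input: two output cycles at /1 (40 MHz) and one at /2 (20 MHz)
// each take two input periods, so both speeds are active half the time.
inline void
dual_modulus_demo()
{
    assert(dual_modulus_average(40e6, 1, 2, 1) == 30e6);
}
```

<p>The average is 30 MHz, but two out of every three output cycles still run at the full 40 MHz, which is exactly what the slave cannot tolerate.</p>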
<h2 id="choosing-divisors">Choosing Divisors</h2>
<p>So now that we know the formulas and properties of the most common prescalers, let’s find out how we can choose the best divisor for a desired baudrate.</p>
<p>Between two generatable baudrates <em>B<sub>n</sub></em> and <em>B<sub>n+1</sub></em> lies a half-point for which there is an equal amount of baudrate error when choosing either <em>n</em> or <em>n+1</em> as a divisor.
The formula for calculating this half-point is trivial:</p>
<center>
<p><img invertible="" src="prescaler_half_point.svg" /></p>
</center>
<p>So the general approach here is to find a divisor pair (<em>n, n+1</em>) so that the desired baudrate <em>B<sub>d</sub></em> lies between <em>B<sub>n</sub></em> and <em>B<sub>n+1</sub></em> and then choose the divisor whose baudrate is closer to the desired one.
So if <em>B<sub>d</sub></em> is above the half-point, we choose <em>n</em>, otherwise <em>n+1</em>.</p>
<p>It is important to understand that we <strong>cannot</strong> use this approach on the divisors directly, since there is no linear correlation between the input and output frequency.
This becomes clear in the prescaler plot above, where the half-point between 1.0 and 0.5 for the linear prescaler clearly does not lie on divisor 1.5, but somewhere around 1.3!</p>
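<p>To make this concrete, here is a small sketch (assuming C++14 or newer and a desired divisor of at least 1; all names are hypothetical) that compares naively rounding the desired divisor with comparing baudrates against the half-point. For a desired divisor of 1.4 the two approaches disagree, and only the half-point comparison picks the divisor with the smaller baudrate error:</p>

```cpp
// Linear prescaler: B_n = f_in / n. Everything is normalized to f_in = 1.
constexpr double baud(int n)
{
    return 1.0 / n;
}

// Chooses n or n+1 by comparing the desired baudrate with the
// half-point baudrate (B_n + B_{n+1}) / 2.
constexpr int choose_by_baudrate(double desired_div)
{
    const int n = static_cast<int>(desired_div);  // floor for positive values
    const double desired_baud = 1.0 / desired_div;
    const double half_point = (baud(n) + baud(n + 1)) / 2;
    return (desired_baud > half_point) ? n : n + 1;
}

// Naive approach: round the desired divisor itself.
constexpr int choose_by_rounding(double desired_div)
{
    return static_cast<int>(desired_div + 0.5);
}

// A desired divisor of 1.4 is a normalized baudrate of about 0.714.
// Divisor 1 (baudrate 1.0) is off by about 0.286,
// divisor 2 (baudrate 0.5) only by about 0.214.
static_assert(choose_by_rounding(1.4) == 1, "rounding picks the worse divisor");
static_assert(choose_by_baudrate(1.4) == 2, "the half-point comparison picks the better one");
```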
<h4 id="power-of-two-1">Power of Two</h4>
<p>However, with some more math we can calculate the exact divisor <em>ratio</em> of this half-point.
We start with the power-of-two prescaler, where <em>B<sub>n+1</sub></em> is always half of <em>B<sub>n</sub></em>:</p>
<center>
<p><img invertible="" src="prescaler_p2_1.svg" /></p>
</center>
<p>By entering these into our half-point formula we get:</p>
<center>
<p><img invertible="" src="prescaler_p2_2.svg" /></p>
</center>
<p>However, since we wanted a divisor and not a baudrate, we divide the input frequency with the half-point baudrate:</p>
<center>
<p><img invertible="" src="prescaler_p2_3.svg" /></p>
</center>
<p>Choosing the divisor with the least error for any desired baudrate becomes easy now.
Here is a code example (taken from the <a href="https://github.com/roboterclubaachen/xpcc/blob/develop/src/xpcc/architecture/platform/driver/spi/at90_tiny_mega/spi_master.hpp.in?ts=4#L69">AVR’s SPI module</a>):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">desired_div</span> <span class="o">=</span> <span class="kt">float</span><span class="p">(</span><span class="n">input_frequency</span><span class="p">)</span> <span class="o">/</span> <span class="n">desired_baudrate</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">nearest_div</span> <span class="o">=</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired_div</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">64</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">128</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired_div</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">32</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">64</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired_div</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">16</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">32</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired_div</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">8</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">16</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired_div</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">4</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">8</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired_div</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">2</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">4</span> <span class="o">:</span>
<span class="mi">2</span>
<span class="p">))))));</span>
</code></pre></div></div>
<p>First the divisor of the input frequency and the desired baudrate is computed.
This divisor is then compared with all half-point divisors of our prescaler and the best value is chosen.</p>
<p>Notice how this algorithm will choose a divisor of 128 when the desired baudrate is too slow, and a divisor of 2 when it is too fast.
This mirrors the range limitation of the AVR’s SPI prescaler!</p>
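<p>The same comparison chain can also be written as a loop over the half-point divisors <em>d · 4/3</em>. The following is a sketch rather than the xpcc implementation; it assumes C++14 or newer, and the function name is invented for this example:</p>

```cpp
#include <cstdint>

// Returns the power-of-two divisor in [2, 128] with the least baudrate error.
// The half-point divisor between d and 2*d is d * 4/3.
constexpr std::uint8_t nearest_pow2_divisor(std::uint32_t input_frequency,
                                            std::uint32_t desired_baudrate)
{
    const float desired_div =
        static_cast<float>(input_frequency) / static_cast<float>(desired_baudrate);
    std::uint32_t div = 2;
    // step up as long as the desired divisor lies at or above the half-point
    while (div < 128 && desired_div >= div * 4.f / 3)
        div *= 2;
    return static_cast<std::uint8_t>(div);
}

// 16 MHz input clock:
static_assert(nearest_pow2_divisor(16000000, 8000000) == 2, "exact match");
static_assert(nearest_pow2_divisor(16000000, 4400000) == 4, "4 MHz is closer than 8 MHz");
static_assert(nearest_pow2_divisor(16000000, 1000) == 128, "too slow: clamps to 128");
static_assert(nearest_pow2_divisor(16000000, 16000000) == 2, "too fast: clamps to 2");
```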
<h4 id="linear-1">Linear</h4>
<p>Unfortunately this elegant solution is not available for the properties of the linear prescaler.
Here <em>B<sub>n</sub></em> and <em>B<sub>n+1</sub></em> are defined as follows:</p>
<center>
<p><img invertible="" src="prescaler_lin_1.svg" /></p>
</center>
<p>Using these definitions in our half-point formula gets us nowhere really:</p>
<center>
<p><img invertible="" src="prescaler_lin_2.svg" /></p>
</center>
<p>And the half-point divisor is just insulting:</p>
<center>
<p><img invertible="" src="prescaler_lin_3.svg" /></p>
</center>
<p>However, a quick look at the value table of this formula does reinforce a suspicion:</p>
<table>
<thead>
<tr>
<th style="text-align: center"><em>n</em></th>
<th style="text-align: center"><em>d<sub>half(n)</sub></em></th>
<th style="text-align: center">approx.</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">4/3</td>
<td style="text-align: center">1.33333</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">12/5</td>
<td style="text-align: center">2.40000</td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">24/7</td>
<td style="text-align: center">3.42857</td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">40/9</td>
<td style="text-align: center">4.44444</td>
</tr>
<tr>
<td style="text-align: center">5</td>
<td style="text-align: center">60/11</td>
<td style="text-align: center">5.45455</td>
</tr>
<tr>
<td style="text-align: center">6</td>
<td style="text-align: center">84/13</td>
<td style="text-align: center">6.46154</td>
</tr>
<tr>
<td style="text-align: center">7</td>
<td style="text-align: center">112/15</td>
<td style="text-align: center">7.46667</td>
</tr>
<tr>
<td style="text-align: center">8</td>
<td style="text-align: center">144/17</td>
<td style="text-align: center">8.47059</td>
</tr>
<tr>
<td style="text-align: center">9</td>
<td style="text-align: center">180/19</td>
<td style="text-align: center">9.47368</td>
</tr>
</tbody>
</table>
<p>The divisors seem to approach <em>(n + 1/2)</em> for larger values, which is indeed the case and becomes clear when looking at the series expansion for <em>n</em> to infinity:</p>
<center>
<p><img invertible="" src="prescaler_lin_4.svg" /></p>
</center>
<p>Not that this is of any help to us, it’s just nice to know ☺</p>
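<p>For reference, the value table can be reproduced with a few lines of code. The closed form <em>2n(n+1) / (2n+1)</em> used here is derived from the linear prescaler definition <em>B<sub>n</sub> = f / n</em> and matches every row of the table; the names are hypothetical:</p>

```cpp
// d_half(n) = 2n(n+1) / (2n+1), kept as an exact fraction.
struct Fraction
{
    unsigned num;
    unsigned den;
};

constexpr Fraction half_point_divisor(unsigned n)
{
    return { 2 * n * (n + 1), 2 * n + 1 };
}

// Distance between (n + 1/2) and the half-point divisor,
// which works out to 1 / (2 * (2n + 1)) and shrinks as n grows.
constexpr double distance_to_n_plus_half(unsigned n)
{
    return (n + 0.5)
         - static_cast<double>(half_point_divisor(n).num) / half_point_divisor(n).den;
}

// first and last row of the table
static_assert(half_point_divisor(1).num == 4 && half_point_divisor(1).den == 3, "");
static_assert(half_point_divisor(9).num == 180 && half_point_divisor(9).den == 19, "");
// the divisor approaches (n + 1/2) for larger n
static_assert(distance_to_n_plus_half(9) < distance_to_n_plus_half(1), "");
```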
<h4 id="fractional-1">Fractional</h4>
<p>Just… no. It doesn’t get better.</p>
<h4 id="generic">Generic</h4>
<p>Okay, so even after this small binge into the underlying mathematics we still do not know how to choose a divisor for linear and fractional prescalers.</p>
<p>There is of course a generic solution where we just brute force this:</p>
<ol>
<li>compute the desired divisor for the desired baudrate,</li>
<li>get <em>n</em> and <em>n+1</em> using <em>floor(desired_div)</em> and <em>ceil(desired_div)</em>,</li>
<li>compute the corresponding baudrates <em>B<sub>n</sub></em> and <em>B<sub>n+1</sub></em>,</li>
<li>compare with the half-point baudrate and choose accordingly.</li>
</ol>
<p>Here is a code example of this algorithm (taken from the <a href="https://github.com/roboterclubaachen/xpcc/blob/develop/src/xpcc/architecture/platform/driver/uart/at90_tiny_mega/uart.hpp.in?ts=4#L66">AVR’s UART module</a>):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// calculate the fractional prescaler value</span>
<span class="kt">float</span> <span class="n">desired</span> <span class="o">=</span> <span class="kt">float</span><span class="p">(</span><span class="n">input_frequency</span><span class="p">)</span> <span class="o">/</span> <span class="n">desired_baudrate</span><span class="p">;</span>
<span class="c1">// respect the prescaler range of 1 to 4096</span>
<span class="kt">int</span> <span class="n">div_floor</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">floor</span><span class="p">(</span><span class="n">desired</span><span class="p">)</span> <span class="o"><</span> <span class="mi">1</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="n">std</span><span class="o">::</span><span class="n">floor</span><span class="p">(</span><span class="n">desired</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">div_ceil</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">ceil</span><span class="p">(</span><span class="n">desired</span><span class="p">)</span> <span class="o">></span> <span class="mi">4096</span> <span class="o">?</span> <span class="mi">4096</span> <span class="o">:</span> <span class="n">std</span><span class="o">::</span><span class="n">ceil</span><span class="p">(</span><span class="n">desired</span><span class="p">);</span>
<span class="c1">// calculate the baudrates above and below the requested baudrate</span>
<span class="kt">int</span> <span class="n">baud_lower</span> <span class="o">=</span> <span class="n">input_frequency</span> <span class="o">/</span> <span class="n">div_ceil</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">baud_upper</span> <span class="o">=</span> <span class="n">input_frequency</span> <span class="o">/</span> <span class="n">div_floor</span><span class="p">;</span>
<span class="c1">// calculate the half-point between the upper and lower baudrate</span>
<span class="kt">int</span> <span class="n">baud_middle</span> <span class="o">=</span> <span class="p">(</span><span class="n">baud_upper</span> <span class="o">+</span> <span class="n">baud_lower</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span>
<span class="c1">// decide which divisor is closer to a possible baudrate</span>
<span class="c1">// lower baudrate means higher divisor!</span>
<span class="kt">int</span> <span class="n">nearest</span> <span class="o">=</span> <span class="p">(</span><span class="n">desired_baudrate</span> <span class="o"><</span> <span class="n">baud_middle</span><span class="p">)</span> <span class="o">?</span> <span class="n">div_ceil</span> <span class="o">:</span> <span class="n">div_floor</span><span class="p">;</span>
<span class="c1">// map to correct range (0 is 1, 1 is 2, etc…)</span>
<span class="kt">int</span> <span class="n">prescaler</span> <span class="o">=</span> <span class="n">nearest</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
</code></pre></div></div>
<p>This algorithm can be adapted to work with non-continuous ranges (as in <em>2<sup>n</sup></em> for the power-of-two prescaler) and also with fractional prescalers using binary scaling.</p>
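<p>One possible way to handle a non-continuous range is to walk an ascending list of allowed divisors and compare the desired baudrate with the half-point between each pair of neighbours. This sketch assumes C++14 or newer; <code class="language-plaintext highlighter-rouge">nearest_divisor</code> and <code class="language-plaintext highlighter-rouge">spi_divisors</code> are illustrative names, not xpcc API:</p>

```cpp
#include <cstddef>
#include <cstdint>

// Picks the divisor whose baudrate is closest to the desired baudrate
// from a list of allowed divisors sorted in ascending order.
template <std::size_t N>
constexpr std::uint32_t nearest_divisor(const std::uint32_t (&divisors)[N],
                                        std::uint32_t input_frequency,
                                        std::uint32_t desired_baudrate)
{
    std::uint32_t best = divisors[0];
    for (std::size_t i = 1; i < N; ++i)
    {
        // smaller divisor means higher baudrate
        const double baud_upper = static_cast<double>(input_frequency) / best;
        const double baud_lower = static_cast<double>(input_frequency) / divisors[i];
        const double baud_middle = (baud_upper + baud_lower) / 2;
        // below the half-point the larger divisor is closer, otherwise we are done
        if (desired_baudrate < baud_middle)
            best = divisors[i];
        else
            break;
    }
    return best;
}

// the power-of-two divisors of the SPI example above
constexpr std::uint32_t spi_divisors[] = {2, 4, 8, 16, 32, 64, 128};

static_assert(nearest_divisor(spi_divisors, 16000000, 8000000) == 2, "");
static_assert(nearest_divisor(spi_divisors, 16000000, 4400000) == 4, "");
static_assert(nearest_divisor(spi_divisors, 16000000, 1000) == 128, "");
```

<p>For a continuous range such as 1 to 4096 the floor/ceil approach above is the better fit, since it does not need to iterate over every divisor.</p>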
<h2 id="choosing-tolerances">Choosing Tolerances</h2>
<p>Although these algorithms will choose the divisor with the least baudrate error, we want to have some form of control over <em>how much</em> error is still acceptable.</p>
<p>We are only looking at relative error which is defined as:</p>
<center>
<p><img invertible="" src="prescaler_relative_error.svg" /></p>
</center>
<p>A set of default tolerances should be chosen so that without any effort required from the programmer, they act as a useful guard against unreasonable baudrate errors.
So, how much error is still acceptable?</p>
<p>For internal peripherals like ADCs, which usually have a power-of-two prescaler and can operate in a wide range of frequencies, we chose a ±10% default tolerance.</p>
<p>For synchronous protocols such as SPI and I<sup>2</sup>C, the master clocks the bus and the exact baudrate does not really matter.
Of course, when operating the aforementioned SPI slave at 30 MHz you want to be reasonably certain that you aren’t clocking it at 35 MHz, which causes it to glitch.
So for xpcc we chose a ±5% default tolerance.</p>
<p>However, asynchronous protocols simply do not allow for much tolerance.
The relative baudrate error tolerance for UART with 8N1 configuration (8 databits, 1 startbit and 1 stopbit) as shown below is only ±5%. The sample point of the stop bit may only shift by at most ±<em>t<sub>Symbol</sub> /2</em> and with 10 bits to read, one <em>t<sub>Symbol</sub></em> equals one tenth of the symbol transmission time, hence a relative tolerance of ±5%. For example, the tolerance for 7-bit transfers (9 baudtimes) increases to ±5.56%.</p>
<center>
<p><img invertible="" src="prescaler_uart.svg" width="500" /></p>
</center>
<p>However, since both transmitter and receiver may not generate the exact baudrate, the error must not exceed <strong>±5% in total</strong>, which in the worst case (one too fast, one too slow) imposes a tight allowed deviation of +2.5% and -2.5% on the modules respectively.
In xpcc we therefore chose ±2% default tolerance for linear prescalers and ±1% for fractional prescalers.</p>
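<p>The arithmetic behind these numbers is short enough to write down. This sketch only restates the reasoning above in code (half a symbol of drift spread over all bits of a frame, then split between transmitter and receiver); the function names are invented for illustration:</p>

```cpp
// Total relative tolerance of a frame: the sample point of the stop bit
// may drift by at most half a symbol over the whole frame.
constexpr double uart_total_tolerance(unsigned bits_per_frame)
{
    return 0.5 / bits_per_frame;
}

// Worst case: one side is too fast and the other too slow,
// so each side may only use half of the total tolerance.
constexpr double uart_per_device_tolerance(unsigned bits_per_frame)
{
    return uart_total_tolerance(bits_per_frame) / 2;
}

// 8N1: 1 start bit + 8 data bits + 1 stop bit = 10 bits -> ±5% in total
static_assert(uart_total_tolerance(10) > 0.0499 && uart_total_tolerance(10) < 0.0501, "");
// 7-bit transfers: 9 bits -> about ±5.56% in total
static_assert(uart_total_tolerance(9) > 0.0555 && uart_total_tolerance(9) < 0.0556, "");
// per device for 8N1: ±2.5%
static_assert(uart_per_device_tolerance(10) > 0.0249 && uart_per_device_tolerance(10) < 0.0251, "");
```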
<p>If a generated baudrate is found to be outside of the default tolerance, this information must be conveyed to the programmer!
Of course, (s)he must be able to override the default tolerances to make them more or less restrictive, depending on the application.</p>
<h2 id="at-compile-time">At Compile Time</h2>
<p>Now, we could slack off and just implement all this at runtime.
There are a couple of issues with this on a microcontroller:</p>
<ol>
<li>It’s simply inefficient: How often do you set your baudrates? Once?</li>
<li>How do you communicate to the programmer that your generated baudrate is above your declared tolerance? Serial output?</li>
<li>What is the runtime supposed to do with a test failure? Automatically switch to another baudrate?</li>
</ol>
<p>Especially on AVRs the computational toll of using floating point and 32bit values to compute a one-time value is quite immense.
Even if you have multiple baudrates that you need to switch to at runtime, it is cheaper in both storage and execution time to use a lookup table!</p>
<p>However, the second and third points are the real culprits.
It would be plain stupid to even attempt to output an error string over UART that the generated (UART) baudrate is outside of the declared tolerance.
Automatically switching to another baudrate is even more stupid, as this defies the purpose of having chosen a particular baudrate.</p>
<p>No, this is a problem that can and must be solved at compile time.
Fortunately, with C++11 it has become possible to use constexpr functions and static assertions, which make compile-time computation and communication a lot easier.</p>
<h4 id="implementation">Implementation</h4>
<p>Here is the full compile-time implementation of the AVR’s SPI initialize method:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* Initializes the hardware and sets the baudrate.
*
* @tparam SystemClock
* the currently active system clock
* @tparam baudrate
* the desired baudrate in Hz
* @tparam tolerance
* the allowed relative tolerance for the resulting baudrate
*/</span>
<span class="k">template</span><span class="o"><</span> <span class="k">class</span> <span class="nc">SystemClock</span><span class="p">,</span> <span class="kt">uint32_t</span> <span class="n">baudrate</span><span class="p">,</span>
<span class="kt">uint16_t</span> <span class="n">tolerance</span> <span class="o">=</span> <span class="n">Tolerance</span><span class="o">::</span><span class="n">FivePercent</span> <span class="p">></span>
<span class="k">static</span> <span class="kt">void</span>
<span class="nf">initialize</span><span class="p">()</span>
<span class="p">{</span>
<span class="c1">// calculate the nearest prescaler from the baudrate</span>
<span class="k">constexpr</span> <span class="kt">float</span> <span class="n">desired</span> <span class="o">=</span> <span class="kt">float</span><span class="p">(</span><span class="n">SystemClock</span><span class="o">::</span><span class="n">Spi</span><span class="p">)</span> <span class="o">/</span> <span class="n">baudrate</span><span class="p">;</span>
<span class="k">constexpr</span> <span class="kt">uint8_t</span> <span class="n">nearest</span> <span class="o">=</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">64</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">128</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">32</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">64</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">16</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">32</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">8</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">16</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">4</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">8</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">desired</span> <span class="o">>=</span> <span class="p">(</span> <span class="mi">2</span> <span class="o">*</span> <span class="mf">4.</span><span class="n">f</span><span class="o">/</span><span class="mi">3</span><span class="p">))</span> <span class="o">?</span> <span class="mi">4</span> <span class="o">:</span>
<span class="mi">2</span>
<span class="p">))))));</span>
<span class="c1">// check if we found a prescaler which generates</span>
<span class="c1">// a baudrate within the declared tolerance</span>
<span class="n">assertBaudrateInTolerance</span><span class="o"><</span>
<span class="n">SystemClock</span><span class="o">::</span><span class="n">Spi</span> <span class="o">/</span> <span class="n">nearest</span><span class="p">,</span> <span class="c1">// available baudrate</span>
<span class="n">baudrate</span><span class="p">,</span> <span class="c1">// desired baudrate</span>
<span class="n">tolerance</span> <span class="o">></span><span class="p">();</span> <span class="c1">// allowed tolerance</span>
<span class="c1">// translate the prescaler into the bitmapping</span>
<span class="k">constexpr</span> <span class="n">Prescaler</span> <span class="n">prescaler</span> <span class="o">=</span> <span class="p">(</span>
<span class="p">(</span><span class="n">nearest</span> <span class="o">>=</span> <span class="mi">128</span><span class="p">)</span> <span class="o">?</span> <span class="n">Prescaler</span><span class="o">::</span><span class="n">Div128</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">nearest</span> <span class="o">>=</span> <span class="mi">64</span><span class="p">)</span> <span class="o">?</span> <span class="n">Prescaler</span><span class="o">::</span><span class="n">Div64</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">nearest</span> <span class="o">>=</span> <span class="mi">32</span><span class="p">)</span> <span class="o">?</span> <span class="n">Prescaler</span><span class="o">::</span><span class="n">Div32</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">nearest</span> <span class="o">>=</span> <span class="mi">16</span><span class="p">)</span> <span class="o">?</span> <span class="n">Prescaler</span><span class="o">::</span><span class="n">Div16</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">nearest</span> <span class="o">>=</span> <span class="mi">8</span><span class="p">)</span> <span class="o">?</span> <span class="n">Prescaler</span><span class="o">::</span><span class="n">Div8</span> <span class="o">:</span> <span class="p">(</span>
<span class="p">(</span><span class="n">nearest</span> <span class="o">>=</span> <span class="mi">4</span><span class="p">)</span> <span class="o">?</span> <span class="n">Prescaler</span><span class="o">::</span><span class="n">Div4</span> <span class="o">:</span>
<span class="n">Prescaler</span><span class="o">::</span><span class="n">Div2</span>
<span class="p">))))));</span>
<span class="c1">// do the actual initialization at runtime</span>
<span class="n">initialize</span><span class="p">(</span><span class="n">prescaler</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The primary algorithm has already been described before.
What’s new is <code class="language-plaintext highlighter-rouge">SystemClock</code> which is a static class that contains the current clock tree configuration (also computed at compile time using similar methods).
On the AVR this contains the compile-time constant <code class="language-plaintext highlighter-rouge">SystemClock::Spi</code> with the input clock frequency of the SPI peripheral.
This relieves the programmer of having to know which clock domain the peripheral is clocked from.</p>
<p>The <code class="language-plaintext highlighter-rouge">assertBaudrateInTolerance</code> is given <em>B<sub>available</sub></em>, <em>B<sub>desired</sub></em> and the allowed tolerance and raises a <code class="language-plaintext highlighter-rouge">static_assert</code> if the test fails.
The <code class="language-plaintext highlighter-rouge">nearest</code> divisor is then mapped onto the register bit representation and this is then used to initialize the prescaler and peripheral.</p>
<p>And all of this happens at compile time: at runtime there is only one 8bit program-space constant, which is simply copied into the prescaler register.</p>
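<p>The tolerance check itself can be sketched in a few lines. Note that this is not the xpcc implementation: the snake_case names and the choice of expressing the tolerance in per mille are assumptions made for this example, and it requires C++14 or newer:</p>

```cpp
#include <cstdint>

// Hypothetical helper: true if |available - requested| / requested
// does not exceed the tolerance, which is given in per mille here.
constexpr bool is_value_in_tolerance(std::uint32_t requested,
                                     std::uint32_t available,
                                     std::uint16_t tolerance_permille)
{
    const std::uint64_t diff =
        (available > requested) ? available - requested : requested - available;
    // integer-only comparison, 64 bit wide to avoid overflow
    return diff * 1000 <= static_cast<std::uint64_t>(requested) * tolerance_permille;
}

// Hypothetical counterpart to assertBaudrateInTolerance:
// instantiating it with values outside the tolerance fails to compile.
template <std::uint32_t available, std::uint32_t requested, std::uint16_t tolerance_permille>
void assert_baudrate_in_tolerance()
{
    static_assert(is_value_in_tolerance(requested, available, tolerance_permille),
        "The closest available baudrate exceeds the tolerance of the requested baudrate!");
}

// 4 MHz instead of the requested 4.4 MHz is about 9.1% off
static_assert(is_value_in_tolerance(4400000, 4000000, 100), "within ±10%");
static_assert(!is_value_in_tolerance(4400000, 4000000, 50), "but not within ±5%");
```

<p>Because the template parameters show up in the compiler's instantiation trace, a failing check also reports the computed baudrate.</p>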
<h4 id="usage">Usage</h4>
<p>All the programmer has to write is this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Spi</span><span class="o">::</span><span class="n">initialize</span><span class="o"><</span><span class="n">systemClock</span><span class="p">,</span> <span class="n">MHz8</span><span class="o">></span><span class="p">();</span> <span class="c1">// with ±5% tolerance</span>
</code></pre></div></div>
<p>Should you want to change the SPI baudrate at runtime, you can do that simply by re-initializing with a different baudrate:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// 4.4 MHz with explicit ±10% tolerance</span>
<span class="n">Spi</span><span class="o">::</span><span class="n">initialize</span><span class="o"><</span><span class="n">systemClock</span><span class="p">,</span> <span class="mi">4400000</span><span class="p">,</span> <span class="n">xpcc</span><span class="o">::</span><span class="n">Tolerance</span><span class="o">::</span><span class="n">TenPercent</span><span class="o">></span><span class="p">();</span>
</code></pre></div></div>
<p>Changing prescaler values often requires the peripheral to be switched off and then restarted.
Calling the <code class="language-plaintext highlighter-rouge">initialize</code> method again guarantees correct operation.
The overhead of this is only the loading of the compile-time constant which contains the prescaler value for 4 MHz and a call to the real initialize method of the peripheral.</p>
<p>If you have several baudrates that you need to choose at runtime, a switch-case “lookup table” is still more efficient than a computation at runtime (while guaranteeing tolerance compliance):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">switch</span><span class="p">(</span><span class="n">baudrate</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">case</span> <span class="mi">8000000</span><span class="p">:</span>
<span class="n">Spi</span><span class="o">::</span><span class="n">initialize</span><span class="o"><</span><span class="n">systemClock</span><span class="p">,</span> <span class="n">MHz8</span><span class="o">></span><span class="p">();</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="mi">4000000</span><span class="p">:</span>
<span class="n">Spi</span><span class="o">::</span><span class="n">initialize</span><span class="o"><</span><span class="n">systemClock</span><span class="p">,</span> <span class="n">MHz4</span><span class="o">></span><span class="p">();</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="mi">2000000</span><span class="p">:</span>
<span class="n">Spi</span><span class="o">::</span><span class="n">initialize</span><span class="o"><</span><span class="n">systemClock</span><span class="p">,</span> <span class="n">MHz2</span><span class="o">></span><span class="p">();</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="mi">1000000</span><span class="p">:</span>
<span class="n">Spi</span><span class="o">::</span><span class="n">initialize</span><span class="o"><</span><span class="n">systemClock</span><span class="p">,</span> <span class="n">MHz1</span><span class="o">></span><span class="p">();</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It is apparent that the usage is incredibly simple.</p>
<h4 id="on-failures">On Failures</h4>
<p>Should the tolerance check fail, then the compiler will show you the baudrate it computed.
Unfortunately the output is relatively unreadable, since there are templates involved. However, it’s still better than nothing, so stop complaining.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Here I want *exactly* 115.2kBaud! No problem.</span>
<span class="n">Uart</span><span class="o">::</span><span class="n">initialize</span><span class="o"><</span><span class="n">systemClock</span><span class="p">,</span> <span class="mi">115200</span><span class="p">,</span> <span class="n">xpcc</span><span class="o">::</span><span class="n">Tolerance</span><span class="o">::</span><span class="n">Exact</span><span class="o">></span><span class="p">();</span>
</code></pre></div></div>
<p>Compiling the above example on an AVR clocked with 16MHz will lead to a static assertion failure, since the desired baudrate of 115.2kBaud cannot be generated:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>interface.hpp: In instantiation of 'static void xpcc::Peripheral::assertBaudrateInTolerance() [with long unsigned int available = 111111ul; long unsigned int requested = 115200ul; unsigned int tolerance = 0u]':
...
interface.hpp:94:3: error: static assertion failed: The closest available baudrate exceeds the tolerance of the requested baudrate!
static_assert(xpcc::Tolerance::isValueInTolerance(requested, available, tolerance),
</code></pre></div></div>
<p>We can see that the closest available baudrate seems to be 111.1kBaud which has a full 3.5% relative error, which would not even have been allowed with the default tolerance.</p>
<p>Now you can just start trying different baudrates, for example 38.4kBaud, which has almost no error with the actual baudrate being 38.461kBaud.
Piece of cake, am I right?</p>
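<p>The two baudrates mentioned above can be reproduced with the generic algorithm. This sketch assumes the AVR UART in normal asynchronous mode, where the baudrate is the CPU clock divided by 16 and then by a linear divisor of 1 to 4096; the function names are made up, and C++14 or newer is required:</p>

```cpp
#include <cstdint>

// Nearest linear divisor (1 to 4096) for the desired baudrate,
// assuming baudrate = f_cpu / 16 / divisor.
constexpr std::uint32_t nearest_uart_divisor(std::uint32_t f_cpu,
                                             std::uint32_t desired_baudrate)
{
    const double input_frequency = f_cpu / 16.0;
    const double desired = input_frequency / desired_baudrate;
    // respect the prescaler range of 1 to 4096
    const std::uint32_t div_floor =
        (desired < 1) ? 1 : (desired > 4096) ? 4096 : static_cast<std::uint32_t>(desired);
    const std::uint32_t div_ceil = (div_floor < 4096) ? div_floor + 1 : 4096;
    // baudrates above and below the requested baudrate and their half-point
    const double baud_upper = input_frequency / div_floor;
    const double baud_lower = input_frequency / div_ceil;
    const double baud_middle = (baud_upper + baud_lower) / 2;
    // lower baudrate means higher divisor
    return (desired_baudrate < baud_middle) ? div_ceil : div_floor;
}

constexpr std::uint32_t uart_baudrate(std::uint32_t f_cpu, std::uint32_t divisor)
{
    return f_cpu / 16 / divisor;
}

constexpr double relative_error(std::uint32_t available, std::uint32_t requested)
{
    return ((available > requested) ? double(available - requested)
                                    : double(requested - available)) / requested;
}

// 115.2kBaud at 16 MHz: divisor 9, 111111 Baud, about 3.5% error
static_assert(nearest_uart_divisor(16000000, 115200) == 9, "");
static_assert(uart_baudrate(16000000, 9) == 111111, "");
static_assert(relative_error(111111, 115200) > 0.035 && relative_error(111111, 115200) < 0.036, "");

// 38.4kBaud at 16 MHz: divisor 26, 38461 Baud, less than 0.2% error
static_assert(nearest_uart_divisor(16000000, 38400) == 26, "");
static_assert(uart_baudrate(16000000, 26) == 38461, "");
static_assert(relative_error(38461, 38400) < 0.002, "");
```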
<h2 id="conclusions">Conclusions</h2>
<p>Apart from the technical elegance of computing these values at compile-time, there is a real improvement in the programmer’s experience of using prescalers:</p>
<ol>
<li>You are declaring <strong>what</strong> you want, not <strong>how</strong> to get it.</li>
<li>You can now specify and enforce baudrate <em>quality</em> directly in your code.</li>
<li>There is no need to read the datasheet anymore, trial and error suffices.</li>
<li>The compiler can give you an alternative baudrate with <em>zero</em> error!</li>
<li>Your code is your documentation, since tolerance compliance is enforced.</li>
</ol>
<p>Of course, the framework developers now have to do the grunt work of understanding how the prescaler works and implementing the algorithms accordingly.
However, the reward outweighs the effort many times over, and might save you a lot of time not having to debug your prescaler calculations.</p>
<p><em>This post was first published at blog.xpcc.io.</em></p>Niklas Hauserniklas@salkinium.comPrescaler and baudrate calculations are a tricky topic. I have had many situations where the baudrate turned out to be off by a couple of percent, which was enough to render my serial output streams unreadable. Sure, calculating the baudrate error beforehand would have saved me some hours of useless debugging, however, that would require understanding the often complicated mathematical formula hidden somewhere in the depths of the datasheet describing the prescaler vs. baudrate relationship. And that seemed to be more work than just using a logic analyzer to measure the resulting error. Of course, this felt like using a sledgehammer to crack a nut and it was neither a fast nor practical solution. I think there exists a better solution and I think it can be done using pure C++. This solution needs to be able to: compute the best possible prescaler settings for the desired baudrate, and notify me when the desired baudrate cannot be achieved without unreasonable error.Typesafe Register Access in C++2015-02-25T00:00:00+01:002015-02-25T00:00:00+01:00http://blog.salkinium.com/typesafe-register-access-in-c++<p>When you are writing software for microcontrollers, reading and writing hardware registers becomes second nature.
Registers and bit mappings are typically “modeled” using C preprocessor defines, and usually provided to you by your cross compiler toolchain in device specific header files.</p>
<p>Setting up and toggling PG13 on the STM32F4 this way looks rather… unreadable:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// set push-pull, output</span>
<span class="n">GPIOG</span><span class="o">-></span><span class="n">OSPEEDR</span> <span class="o">=</span> <span class="p">(</span><span class="n">GPIOG</span><span class="o">-></span><span class="n">OSPEEDR</span> <span class="o">&</span> <span class="o">~</span><span class="p">(</span><span class="mi">3</span> <span class="o"><<</span> <span class="mi">26</span><span class="p">))</span> <span class="o">|</span> <span class="p">(</span><span class="mi">3</span> <span class="o"><<</span> <span class="mi">26</span><span class="p">);</span>
<span class="n">GPIOG</span><span class="o">-></span><span class="n">MODER</span> <span class="o">=</span> <span class="p">(</span><span class="n">GPIOG</span><span class="o">-></span><span class="n">MODER</span> <span class="o">&</span> <span class="o">~</span><span class="p">(</span><span class="mi">3</span> <span class="o"><<</span> <span class="mi">26</span><span class="p">))</span> <span class="o">|</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="mi">26</span><span class="p">);</span>
<span class="n">GPIOG</span><span class="o">-></span><span class="n">OTYPER</span> <span class="o">&=</span> <span class="o">~</span><span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="mi">13</span><span class="p">);</span>
<span class="n">GPIOG</span><span class="o">-></span><span class="n">PUPDR</span> <span class="o">&=</span> <span class="o">~</span><span class="p">(</span><span class="mi">3</span> <span class="o"><<</span> <span class="mi">26</span><span class="p">);</span>
<span class="k">while</span><span class="p">(</span><span class="nb">true</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">GPIOG</span><span class="o">-></span><span class="n">ODR</span> <span class="o">^=</span> <span class="p">(</span><span class="mi">1</span> <span class="o"><<</span> <span class="mi">13</span><span class="p">);</span> <span class="c1">// toggle</span>
<span class="c1">// delay</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It did not really dawn on me how primitive this concept was until I was forced to model a memory map myself for one of our many device drivers.
Since I have never been a friend of using the C preprocessor in C++ unless absolutely necessary, it seemed like a good opportunity to research how best to implement this in pure C++.</p>
<p><strong>Update 2022: Note that this technique is outdated for C++20! Please consult the internet for the current state-of-the-art.</strong></p>
<!--more-->
<h2 id="existing-concepts">Existing Concepts</h2>
<p>Martin Moene has compiled <a href="http://www.eld.leidenuniv.nl/~moene/Home/papers/accu/overload95-register/">an excellent overview</a> of the relevant publications regarding C++ hardware register access.
Perhaps the most relevant of those is Ken Smith’s 2010 paper <a href="http://yogiken.files.wordpress.com/2010/02/c-register-access.pdf">“C++ Hardware Register Access Redux”</a>.</p>
<p>Smith’s policy based design is quite complete and there even exists a <a href="https://github.com/JinShil/memory_mapped_io">functioning implementation</a> by Jin Shil keyed to embedded systems — albeit in the D programming language.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
The author has implemented part of the STM32F4 memory map with these classes and uses them in a simple <a href="https://github.com/JinShil/stm32f42_discovery_demo/blob/0f355d63bd7823f593ef770db1703bc2cf3454a6/source/start.d#L253">D program for toggling a pin</a> (a C++ version would look similar):</p>
<div class="language-d highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// set push-pull, output</span>
<span class="n">GPIOG</span><span class="p">.</span><span class="n">OSPEEDR</span><span class="p">.</span><span class="n">OSPEEDR13</span><span class="p">.</span><span class="n">value</span> <span class="p">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="n">GPIOG</span><span class="p">.</span><span class="n">MODER</span><span class="p">.</span><span class="n">MODER13</span><span class="p">.</span><span class="n">value</span> <span class="p">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">GPIOG</span><span class="p">.</span><span class="n">OTYPER</span><span class="p">.</span><span class="n">OT13</span><span class="p">.</span><span class="n">value</span> <span class="p">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">GPIOG</span><span class="p">.</span><span class="n">PUPDR</span><span class="p">.</span><span class="n">PUPDR13</span><span class="p">.</span><span class="n">value</span> <span class="p">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span><span class="p">(</span><span class="kc">true</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">GPIOG</span><span class="p">.</span><span class="n">ODR</span><span class="p">.</span><span class="n">ODR13</span><span class="p">.</span><span class="n">value</span> <span class="p">=</span> <span class="p">!</span><span class="n">GPIOG</span><span class="p">.</span><span class="n">ODR</span><span class="p">.</span><span class="n">ODR13</span><span class="p">.</span><span class="n">value</span><span class="p">;</span> <span class="c1">// toggle</span>
<span class="c1">// delay</span>
<span class="p">}</span>
</code></pre></div></div>
<p><strong>4 Mar 2015 – Update:</strong> I seem to have completely missed <a href="https://github.com/kensmith/cppmmio">Ken Smith’s own implementation</a>. He also has a <a href="https://github.com/kensmith/cortex-from-scratch">Cortex-M example</a>.<br />
<strong>5 Mar 2015 – Update:</strong> For an excellent example of a similar implementation for the AVR using C++11, see <a href="https://github.com/chrism333/yalla">the yalla library</a>.<br />
<strong>8 Sep 2015 – Update:</strong> There also exists the <a href="https://github.com/kvasir-io/Kvasir/tree/master/Lib/Register">Kvasir register implementation</a> which has a few more tricks up its sleeve regarding atomicity and efficiency. Impressive.</p>
<p>Here are a few of my observations:</p>
<ol>
<li>How does one generate the C++ memory map? By hand for every device? How do you keep it up-to-date for new devices?</li>
<li>The above code is syntactically already a huge improvement. However, its semantics are just as cryptic. What does writing value <code class="language-plaintext highlighter-rouge">3</code> into register <code class="language-plaintext highlighter-rouge">GPIOG.OSPEEDR.OSPEEDR13</code> actually mean?</li>
<li>The papers compiled by Moene assume we want to access the device’s internal memory. For devices connected via an external bus, accessing a memory location can be an expensive operation, which can take a long time and even fail.</li>
</ol>
<h4 id="high-effort-solution">High Effort Solution</h4>
<p>Even though Smith’s policy based design does not carry a runtime penalty, implementing and maintaining it clearly comes with some overhead.
Every device family requires its own memory map implementation, which would be crazy to code by hand, not only because of the sheer effort required, but also because it would be incredibly error-prone. (<a href="https://github.com/JinShil/stm32_datasheet_to_d">Jin Shil seems to have realized this too</a>.)
If the manufacturer adds a new device to the family, you would have to update and perhaps extend this memory map implementation.</p>
<p>What is required is a generator that converts a machine-readable memory map into its C++ counterpart with as little additional effort as possible.
Notice that the existing memory files in your compiler toolchain (the ones with the defines) only provide you with names for memory addresses, bits and configuration, but lack the information required for generating the correct policies.</p>
<p>Therefore you would need to find an “annotated” memory map, which is probably only available directly from the manufacturer. Atmel has so-called “Part Description Files” hidden somewhere deep in AVR Studio, which are used by the avr-gcc developers to generate the <code class="language-plaintext highlighter-rouge">io.h</code> memory files for the AVRs. ST Microelectronics has similar files hidden in their STM32Cube initialization code generator.</p>
<p>I know this, because xpcc uses exactly these memory maps to generate its <a href="https://github.com/roboterclubaachen/xpcc/tree/develop/src/xpcc/architecture/platform/devices">own device files</a>. I can tell you that writing and maintaining a parser for these files is painful, since they are littered with inconsistencies. I cannot imagine maintaining this for the entire memory map. The errors in Atmel’s memory maps drive me crazy enough.</p>
<p>So unless the manufacturer goes the extra mile and publishes their device memory maps, preferably open-source on GitHub (unlikely), or directly provides the C++ implementations themselves (even less likely), the burden of generating these implementations and keeping them up-to-date is placed on the library maintainers.
That’s not really an option.</p>
<p><strong>17 Mar 2015 – Update:</strong> As part of <a href="http://cmsis.arm.com">CMSIS</a>, ARM has standardized <a href="http://www.keil.com/pack/doc/CMSIS/SVD/html/index.html">System View Description (SVD)</a> files which describe the memory map of vendor devices. These files could be the foundation for generation, however, a vendor-specific EULA needs to be agreed to before download.<br />
<strong>12 Sep 2015 – Update:</strong> Paul has created a <a href="https://github.com/posborne/cmsis-svd">GitHub repository containing most SVD files and a python parser</a>. This could be a gamechanger!</p>
<h4 id="semantics-matter">Semantics Matter</h4>
<p>Let me put forward this blunt theory:
Regardless of how elegantly register access is realized in a language, there is hardly any semantic advantage to it, since you are still writing magic numbers into a lot of magic registers (albeit in a more beautiful and type-safe way).</p>
<p>In xpcc the above code is reduced to these three equivalent <em>and</em> self-explanatory lines:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GpioG13</span><span class="o">::</span><span class="n">configure</span><span class="p">(</span><span class="n">Gpio</span><span class="o">::</span><span class="n">OutputType</span><span class="o">::</span><span class="n">PushPull</span><span class="p">);</span>
<span class="n">GpioG13</span><span class="o">::</span><span class="n">setOutput</span><span class="p">();</span>
<span class="k">while</span><span class="p">(</span><span class="nb">true</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">GpioG13</span><span class="o">::</span><span class="n">toggle</span><span class="p">();</span>
<span class="c1">// delay</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Notice how there are no registers in this code whatsoever and how clear the meaning of the code becomes.
It seems to be much more useful to implement a clean hardware abstraction layer, than a form of register access.</p>
<p>This does not mean that elegant register access is pointless, however, its level of abstraction might be too low for your library to benefit from.
It might be an enrichment for the library developers (Smith proposes unit testing register access), but usually a higher level of abstraction is required.</p>
<h4 id="missed-the-bus">Missed The Bus</h4>
<p>An external device is typically connected through a serial bus like UART, SPI or I<sup>2</sup>C, which is very slow compared to any internal bus.
Even the few devices using a parallel bus interface (like external RAM) most often multiplex their address and data lines to minimize the amount of required pins. It’s probably fair to say “Internal > Parallel > SPI > UART > I<sup>2</sup>C” in terms of transfer speed.</p>
<p>Since a typical read-modify-write means accessing the bus twice, a naive implementation of our example code would yield 8 bus accesses for setting up the port and another 2 for every pin toggle.
That only takes a few cycles using the internal bus<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>, but would easily stretch to micro- and milliseconds on an external serial bus, which makes it impractical to busy-wait during this time.</p>
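<p>The access count is easy to verify with a small model. The following is only a sketch under stated assumptions: <code class="language-plaintext highlighter-rouge">MockBus</code>, <code class="language-plaintext highlighter-rouge">updateBits</code> and <code class="language-plaintext highlighter-rouge">countAccesses</code> are hypothetical names that exist neither in xpcc nor in any of the libraries above, and every call to <code class="language-plaintext highlighter-rouge">read()</code> or <code class="language-plaintext highlighter-rouge">write()</code> stands for one complete bus transaction.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a device behind an external bus.
// Every read() or write() models one complete bus transaction.
struct MockBus
{
    uint8_t memory[256] = {};
    std::size_t transactions = 0;

    constexpr uint8_t read(uint8_t address)
    {
        ++transactions;
        return memory[address];
    }

    constexpr void write(uint8_t address, uint8_t value)
    {
        ++transactions;
        memory[address] = value;
    }
};

// Naive read-modify-write: one read plus one write per call.
constexpr void updateBits(MockBus &bus, uint8_t address, uint8_t clear, uint8_t set)
{
    const uint8_t value = bus.read(address);
    bus.write(address, static_cast<uint8_t>((value & ~clear) | set));
}

// Performs the given number of register updates and reports the bus cost.
constexpr std::size_t countAccesses(unsigned updates)
{
    MockBus bus{};
    for (unsigned i = 0; i < updates; ++i)
        updateBits(bus, static_cast<uint8_t>(i), 0x03, 0x01);
    return bus.transactions;
}

static_assert(countAccesses(4) == 8, "four setup registers cost 8 bus accesses");
static_assert(countAccesses(1) == 2, "every toggle costs another 2 bus accesses");
```

<p>On a real serial bus each of these transactions additionally carries addressing overhead, which the model ignores.</p>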
<p>There is another problem. Consider the following memory layout of an external accelerometer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rw: | Config1 | Config2 | Config3 | Config4 |
0x20 0x21 0x22 0x23
ro: | Status | XL | XH | YL | YH | ZL | ZH |
0x30 0x31 0x32 0x33 0x34 0x35 0x36
</code></pre></div></div>
<p>The configuration registers and the read-only registers each occupy one contiguous memory block.
Most serial interfaces of external devices auto-increment the register address during a transfer, so that we can efficiently write or read a contiguous block of memory.
This allows us to write the four configuration bytes in one bus access, instead of accessing the bus four times to write a single byte each time.
Similarly, we also do not want to access each of the read-only registers separately.</p>
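<p>A minimal sketch of such a block access, assuming a device that auto-increments its address within one transaction and stores each axis as a low byte followed by a high byte, as the register names suggest. <code class="language-plaintext highlighter-rouge">MockAccelerometer</code>, <code class="language-plaintext highlighter-rouge">Sample</code> and <code class="language-plaintext highlighter-rouge">readSample</code> are hypothetical names used for illustration only.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of the accelerometer above: one transaction sends a
// start address, then the device auto-increments it for every byte.
struct MockAccelerometer
{
    uint8_t memory[0x40] = {};
    std::size_t transactions = 0;

    // Burst read: a single transaction, regardless of its length.
    constexpr void read(uint8_t start, uint8_t *buffer, std::size_t length)
    {
        ++transactions;
        for (std::size_t i = 0; i < length; ++i)
            buffer[i] = memory[start + i];
    }
};

struct Sample { int16_t x, y, z; };

// Fetches XL..ZH (0x31 to 0x36) in one burst and joins low and high bytes.
constexpr Sample readSample(MockAccelerometer &device)
{
    uint8_t raw[6] = {};
    device.read(0x31, raw, 6);
    return Sample{
        static_cast<int16_t>(raw[0] | (raw[1] << 8)),
        static_cast<int16_t>(raw[2] | (raw[3] << 8)),
        static_cast<int16_t>(raw[4] | (raw[5] << 8)),
    };
}

// Returns the number of transactions needed for one complete sample,
// or 0 if the decoded X value is not the one stored in the mock.
constexpr std::size_t burstCost()
{
    MockAccelerometer device{};
    device.memory[0x31] = 0x34;  // XL
    device.memory[0x32] = 0x12;  // XH
    const Sample sample = readSample(device);
    return sample.x == 0x1234 ? device.transactions : 0;
}

static_assert(burstCost() == 1, "six registers, one bus transaction");
```

<p>Reading the same six registers one at a time would cost six transactions in this model.</p>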
<p>However, such register block access is considered neither in Smith’s design nor in Jin Shil’s implementation.
It is also “merely” an optimization when using a different bus type, but it still breaks with the existing interface.</p>
<p>Considering this and the ideas on semantics, I would argue that devices connected through an external bus require a different level of abstraction.</p>
<h2 id="what-now">What Now?</h2>
<p>If you have read this far, the situation seems a bit hopeless.
The existing solutions are difficult to implement and maintain, provide little semantic advantage and do not work well over external buses.
And I still have no idea how to model the memory map of my external devices.</p>
<p><em>So let us ignore the internal memory.</em> We already have a way of using it through the defines, and with a good hardware abstraction layer there should be no need to access it directly.</p>
<h4 id="modelling-registers">Modelling Registers</h4>
<p>Instead, let’s focus on how to model register content of external devices and ignore the bus for the moment.</p>
<p>Registers can be made up of three things:</p>
<ul>
<li>Bits: a single bit (position <em>N</em>),</li>
<li>Configurations: a combination of bits where the meaning does not correspond to its numeric value (position <em>[N, M]</em>)</li>
<li>Values: a numeric value (position <em>[N, M]</em>)</li>
</ul>
<p>Example of an 8bit register: Control</p>
<center>
<img invertible="" src="control_register.svg" />
</center>
<ul>
<li>Bit <em>7</em>: Enable</li>
<li>Bit <em>6</em>: Full Scale</li>
<li>Configuration <em>[5, 4]</em>: Prescaler
<ul>
<li>00: Divide by 1</li>
<li>01: Divide by 2</li>
<li>10: Divide by 4</li>
<li>11: Divide by 8</li>
</ul>
</li>
<li>Value <em>[3, 1]</em>: Start-Up Delay in ms</li>
</ul>
<p>There should be an easy way to access all of this information in the register.</p>
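<p>For reference, this is roughly what accessing the example register looks like with nothing but masks and shifts. It is a sketch of the baseline, not of the solution: all constant and function names are made up for illustration.</p>

```cpp
#include <cstdint>

// Plain masks and shifts for the example Control register.
constexpr uint8_t EnableMask     = 1u << 7;                  // bit 7
constexpr uint8_t FullScaleMask  = 1u << 6;                  // bit 6
constexpr uint8_t PrescalerShift = 4;                        // bits [5, 4]
constexpr uint8_t PrescalerMask  = 0x3u << PrescalerShift;
constexpr uint8_t DelayShift     = 1;                        // bits [3, 1]
constexpr uint8_t DelayMask      = 0x7u << DelayShift;

constexpr bool isEnabled(uint8_t reg) { return (reg & EnableMask) != 0; }
constexpr bool isFullScale(uint8_t reg) { return (reg & FullScaleMask) != 0; }

// Raw two-bit prescaler selection (0b10 means "divide by 4").
constexpr uint8_t prescalerBits(uint8_t reg)
{
    return static_cast<uint8_t>((reg & PrescalerMask) >> PrescalerShift);
}

// Start-up delay in milliseconds, a plain numeric value.
constexpr uint8_t delayMs(uint8_t reg)
{
    return static_cast<uint8_t>((reg & DelayMask) >> DelayShift);
}

// 0b10101010: enabled, not full scale, prescaler 0b10, delay 5 ms
static_assert(isEnabled(0xAA), "bit 7 is set");
static_assert(!isFullScale(0xAA), "bit 6 is clear");
static_assert(prescalerBits(0xAA) == 2, "bits [5, 4] are 0b10");
static_assert(delayMs(0xAA) == 5, "bits [3, 1] are 0b101");
```

<p>Everything here is a bare <code class="language-plaintext highlighter-rouge">uint8_t</code>, so nothing stops a mask of one register from being applied to another.</p>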
<h4 id="static-constexpr">static constexpr</h4>
<p>The first idea and implementation was a bit messy.
I wanted to make every bit a static constant expression of class <code class="language-plaintext highlighter-rouge">Bit</code>.
Similar constructs are possible for configurations and values.
Using operator and constructor overloading these constant expressions could be converted and assigned and OR’ed in a type-safe way.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Control</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Register8</span>
<span class="p">{</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">Bit</span> <span class="n">EN</span> <span class="o">=</span> <span class="n">Bit7</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">Bit</span> <span class="n">FS</span> <span class="o">=</span> <span class="n">Bit6</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">Prescaler</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Group</span>
<span class="p">{</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">Type</span> <span class="n">BitPosition</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">Config</span> <span class="n">Mask</span> <span class="o">=</span> <span class="mb">0b11</span> <span class="o"><<</span> <span class="n">BitPosition</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">Config</span> <span class="n">DivideBy1</span> <span class="o">=</span> <span class="mi">0</span> <span class="o"><<</span> <span class="n">BitPosition</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">Config</span> <span class="n">DivideBy2</span> <span class="o">=</span> <span class="mh">0x01</span> <span class="o"><<</span> <span class="n">BitPosition</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">Config</span> <span class="n">DivideBy4</span> <span class="o">=</span> <span class="mh">0x02</span> <span class="o"><<</span> <span class="n">BitPosition</span><span class="p">;</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="n">Config</span> <span class="n">DivideBy8</span> <span class="o">=</span> <span class="mh">0x03</span> <span class="o"><<</span> <span class="n">BitPosition</span><span class="p">;</span>
<span class="p">};</span>
<span class="p">...</span>
<span class="p">};</span>
</code></pre></div></div>
<p>I actually <a href="https://github.com/roboterclubaachen/xpcc/commit/ef55cb32a57b129af8a068f5b6c043eac2512312#diff-ba8846bac2db804c7b7c4a5d477002a0R159">implemented most of this</a> (with a bunch of ugly macros to reduce the verbosity of it).
Then I realized that <code class="language-plaintext highlighter-rouge">static constexpr</code> members require an external instantiation for the linker, which would place them somewhere in memory.
This is because the C++11 standard permits taking the address of a static constexpr member, and only instantiated members actually have an address.</p>
<p>What a dealbreaker.</p>
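<p>The linkage issue can be reduced to a few lines. This is a sketch with a simplified <code class="language-plaintext highlighter-rouge">Control</code> struct that holds a plain <code class="language-plaintext highlighter-rouge">uint8_t</code> instead of the <code class="language-plaintext highlighter-rouge">Bit</code> class, and <code class="language-plaintext highlighter-rouge">addressOfEnable</code> is a hypothetical helper. Note that since C++17 static constexpr data members are implicitly inline, so the out-of-class definition is only needed for older standards.</p>

```cpp
#include <cstdint>

struct Control
{
    // In-class declaration with initializer, usable in constant expressions.
    static constexpr uint8_t EN = 0x80;
};

#if __cplusplus < 201703L
// Out-of-class definition: required in C++11 and C++14 as soon as EN is
// odr-used, for example by taking its address. This definition is what the
// linker needs and what may end up occupying memory.
constexpr uint8_t Control::EN;
#endif

// Taking the address odr-uses the member.
constexpr const uint8_t *addressOfEnable() { return &Control::EN; }

static_assert(*addressOfEnable() == 0x80, "the member has an address and a value");
```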
<p><strong>30 Aug 2015 – Update:</strong> Using a better approach, C. Biffle has implemented <code class="language-plaintext highlighter-rouge">Bitfields</code>, which models <a href="https://github.com/cbiffle/etl/blob/master/biffield/README.mkdn">memory-mapped register banks for his ETL library</a>.</p>
<h2 id="strongly-typed-enumerations">Strongly-Typed Enumerations</h2>
<p>Which C++ type does not need to be instantiated to be used? Yes, enums.
C++03 enums implicitly convert to integers all too readily, but thankfully C++11 gives us strongly-typed enums, which don’t.</p>
<h4 id="register-bits">Register Bits</h4>
<p>Using strongly-typed enums we can describe the bits of the example register as such:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="k">class</span> <span class="nc">Control</span> <span class="o">:</span> <span class="kt">uint8_t</span>
<span class="p">{</span>
<span class="n">EN</span> <span class="o">=</span> <span class="n">Bit7</span><span class="p">,</span> <span class="c1">///< bit documentation</span>
<span class="n">FS</span> <span class="o">=</span> <span class="n">Bit6</span><span class="p">,</span>
<span class="n">PRE1</span> <span class="o">=</span> <span class="n">Bit5</span><span class="p">,</span>
<span class="n">PRE0</span> <span class="o">=</span> <span class="n">Bit4</span><span class="p">,</span>
<span class="n">DEL2</span> <span class="o">=</span> <span class="n">Bit3</span><span class="p">,</span>
<span class="n">DEL1</span> <span class="o">=</span> <span class="n">Bit2</span><span class="p">,</span>
<span class="n">DEL0</span> <span class="o">=</span> <span class="n">Bit1</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">typedef</span> <span class="n">Flags8</span><span class="o"><</span> <span class="n">Control</span> <span class="o">></span> <span class="n">Control_t</span><span class="p">;</span>
</code></pre></div></div>
<p>Since strongly-typed enums do not have any predefined operators, they are wrapped into the <code class="language-plaintext highlighter-rouge">Flags8</code> <a href="https://modm.io/reference/module/modm-architecture-register/">template class</a><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, which adds the necessary constructors and bitwise operator overloading to them and returns them as a <code class="language-plaintext highlighter-rouge">Flags8</code> type.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
<p>This means, you can handle all its register bits as you would expect:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Control_t</span> <span class="n">control</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">EN</span><span class="p">;</span>
<span class="n">control</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">EN</span> <span class="o">|</span> <span class="n">Control</span><span class="o">::</span><span class="n">FS</span><span class="p">;</span>
<span class="n">control</span> <span class="o">&=</span> <span class="o">~</span><span class="n">Control</span><span class="o">::</span><span class="n">FS</span><span class="p">;</span>
<span class="n">control</span> <span class="o">|=</span> <span class="n">Control</span><span class="o">::</span><span class="n">FS</span><span class="p">;</span>
<span class="n">control</span> <span class="o">^=</span> <span class="n">Control</span><span class="o">::</span><span class="n">PRE1</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">isSet</span> <span class="o">=</span> <span class="n">control</span> <span class="o">&</span> <span class="n">Control</span><span class="o">::</span><span class="n">FS</span><span class="p">;</span>
<span class="n">control</span><span class="p">.</span><span class="n">reset</span><span class="p">(</span><span class="n">Control</span><span class="o">::</span><span class="n">PRE1</span> <span class="o">|</span> <span class="n">Control</span><span class="o">::</span><span class="n">PRE0</span><span class="p">);</span>
<span class="n">control</span><span class="p">.</span><span class="n">set</span><span class="p">(</span><span class="n">Control</span><span class="o">::</span><span class="n">DEL0</span><span class="p">);</span>
<span class="kt">bool</span> <span class="n">noneSet</span> <span class="o">=</span> <span class="n">control</span><span class="p">.</span><span class="n">none</span><span class="p">(</span><span class="n">Control</span><span class="o">::</span><span class="n">PRE1</span> <span class="o">|</span> <span class="n">Control</span><span class="o">::</span><span class="n">PRE0</span><span class="p">);</span>
<span class="kt">bool</span> <span class="n">allSet</span> <span class="o">=</span> <span class="n">control</span><span class="p">.</span><span class="n">all</span><span class="p">(</span><span class="n">Control</span><span class="o">::</span><span class="n">EN</span> <span class="o">|</span> <span class="n">Control</span><span class="o">::</span><span class="n">FS</span><span class="p">);</span>
</code></pre></div></div>
<p>You still get raw access if you really need it:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint8_t</span> <span class="n">raw</span> <span class="o">=</span> <span class="n">control</span><span class="p">.</span><span class="n">value</span><span class="p">;</span> <span class="c1">// the underlying type</span>
<span class="n">control</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="mh">0x24</span><span class="p">;</span>
</code></pre></div></div>
<p>And the access is type-safe, you cannot use bits from two different registers:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="k">class</span> <span class="nc">Control2</span> <span class="o">:</span> <span class="kt">uint8_t</span>
<span class="p">{</span>
<span class="n">DIS</span> <span class="o">=</span> <span class="n">Bit4</span><span class="p">,</span>
<span class="n">HS</span> <span class="o">=</span> <span class="n">Bit3</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">typedef</span> <span class="n">Flags8</span><span class="o"><</span> <span class="n">Control2</span> <span class="o">></span> <span class="n">Control2_t</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">control</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">EN</span> <span class="o">|</span> <span class="n">Control2</span><span class="o">::</span><span class="n">HS</span><span class="p">;</span> <span class="c1">// compile error</span>
</code></pre></div></div>
<p>You can even overload functions on argument type now:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">write</span><span class="p">(</span><span class="n">Control_t</span> <span class="n">control</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">write</span><span class="p">(</span><span class="n">Control2_t</span> <span class="n">control</span><span class="p">);</span>
<span class="n">write</span><span class="p">(</span><span class="n">Control</span><span class="o">::</span><span class="n">EN</span> <span class="o">|</span> <span class="n">Control</span><span class="o">::</span><span class="n">FS</span><span class="p">);</span> <span class="c1">// calls #1</span>
<span class="n">write</span><span class="p">(</span><span class="n">Control2</span><span class="o">::</span><span class="n">DIS</span><span class="p">);</span> <span class="c1">// calls #2</span>
</code></pre></div></div>
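<p>To give an idea of how such a wrapper can work, here is a heavily simplified sketch. It is an assumption-laden illustration and not the actual <code class="language-plaintext highlighter-rouge">Flags8</code> implementation: the <code class="language-plaintext highlighter-rouge">Flags</code> template below only provides a handful of operators, and the enumerator values are written out as hex literals instead of <code class="language-plaintext highlighter-rouge">Bit7</code> and friends.</p>

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical, simplified wrapper around a strongly-typed enum.
template <typename Enum>
struct Flags
{
    using Underlying = typename std::underlying_type<Enum>::type;
    Underlying value;

    constexpr Flags() : value(0) {}
    constexpr Flags(Enum flag) : value(static_cast<Underlying>(flag)) {}
    constexpr explicit Flags(Underlying raw) : value(raw) {}

    constexpr Flags operator|(Flags other) const
    { return Flags(static_cast<Underlying>(value | other.value)); }
    constexpr Flags operator&(Flags other) const
    { return Flags(static_cast<Underlying>(value & other.value)); }
    constexpr Flags operator~() const
    { return Flags(static_cast<Underlying>(~value)); }

    constexpr bool any(Flags mask) const { return (value & mask.value) != 0; }
    constexpr bool all(Flags mask) const { return (value & mask.value) == mask.value; }
};

// Combining two enumerators of the same scoped enum yields the wrapper type.
// Both operands must have the same type, so bits of different registers
// cannot be mixed.
template <typename Enum,
          typename = typename std::enable_if<std::is_enum<Enum>::value &&
                                             !std::is_convertible<Enum, int>::value>::type>
constexpr Flags<Enum> operator|(Enum a, Enum b)
{
    return Flags<Enum>(a) | Flags<Enum>(b);
}

enum class Control : uint8_t { EN = 0x80, FS = 0x40 };
enum class Control2 : uint8_t { DIS = 0x10, HS = 0x08 };

constexpr Flags<Control> control = Control::EN | Control::FS;
static_assert(control.value == 0xC0, "both bits are set");
static_assert(control.all(Control::EN), "EN is set");
// Control::EN | Control2::HS fails to compile: no operator| accepts mixed enums.
```

<p>Because <code class="language-plaintext highlighter-rouge">Flags&lt;Control&gt;</code> and <code class="language-plaintext highlighter-rouge">Flags&lt;Control2&gt;</code> are distinct types, overloading functions on them works as shown above.</p>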
<h4 id="register-configurations">Register Configurations</h4>
<p>Configurations are also described as a strongly-typed enum and then wrapped into the <code class="language-plaintext highlighter-rouge">Configuration</code> <a href="https://modm.io/reference/module/modm-architecture-register/#register-configurations">template class</a>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="k">class</span> <span class="nc">Prescaler</span> <span class="o">:</span> <span class="kt">uint8_t</span>
<span class="p">{</span>
<span class="n">Div1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="c1">///< configuration documentation</span>
<span class="n">Div2</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">PRE0</span><span class="p">,</span>
<span class="n">Div4</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">PRE1</span><span class="p">,</span>
<span class="n">Div8</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">PRE1</span> <span class="o">|</span> <span class="n">Control</span><span class="o">::</span><span class="n">PRE0</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">typedef</span> <span class="n">Configuration</span><span class="o"><</span> <span class="n">Control_t</span><span class="p">,</span> <span class="n">Prescaler</span><span class="p">,</span> <span class="p">(</span><span class="n">Bit5</span> <span class="o">|</span> <span class="n">Bit4</span><span class="p">)</span> <span class="o">></span> <span class="n">Prescaler_t</span><span class="p">;</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">Prescaler</code> enum values are already shifted in this example (hence the <code class="language-plaintext highlighter-rouge">(Bit5 | Bit4)</code> mask), however you can also declare the prescaler values non-shifted and let the wrapper shift it:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="k">class</span> <span class="nc">Prescaler</span> <span class="o">:</span> <span class="kt">uint8_t</span>
<span class="p">{</span>
<span class="n">Div1</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
<span class="n">Div2</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span>
<span class="n">Div4</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span>
<span class="n">Div8</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">typedef</span> <span class="n">Configuration</span><span class="o"><</span><span class="n">Control_t</span><span class="p">,</span> <span class="n">Prescaler</span><span class="p">,</span> <span class="mb">0b11</span><span class="p">,</span> <span class="mi">4</span><span class="o">></span> <span class="n">Prescaler_t</span><span class="p">;</span>
</code></pre></div></div>
<p>Why? If you have two or more configurations with the same selections in the same register, you can simply add another one:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="n">Configuration</span><span class="o"><</span> <span class="n">Control_t</span><span class="p">,</span> <span class="n">Prescaler</span><span class="p">,</span> <span class="mb">0b11</span><span class="p">,</span> <span class="mi">6</span> <span class="o">></span> <span class="n">Prescaler2_t</span><span class="p">;</span>
</code></pre></div></div>
<p>Configurations can be used inline:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Control_t</span> <span class="n">control</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">EN</span> <span class="o">|</span> <span class="n">Prescaler_t</span><span class="p">(</span><span class="n">Prescaler</span><span class="o">::</span><span class="n">Div2</span><span class="p">);</span>
<span class="n">control</span> <span class="o">&=</span> <span class="o">~</span><span class="n">Prescaler_t</span><span class="o">::</span><span class="n">mask</span><span class="p">();</span>
</code></pre></div></div>
<p>But do not have to:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Prescaler_t</span><span class="o">::</span><span class="n">set</span><span class="p">(</span><span class="n">control</span><span class="p">,</span> <span class="n">Prescaler</span><span class="o">::</span><span class="n">Div2</span><span class="p">);</span>
<span class="n">Prescaler_t</span><span class="o">::</span><span class="n">reset</span><span class="p">(</span><span class="n">control</span><span class="p">);</span>
<span class="n">Prescaler</span> <span class="n">prescaler</span> <span class="o">=</span> <span class="n">Prescaler_t</span><span class="o">::</span><span class="n">get</span><span class="p">(</span><span class="n">control</span><span class="p">);</span>
</code></pre></div></div>
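<p>The masking and shifting behind such a configuration wrapper can be sketched in a few lines. This is not the library’s <code class="language-plaintext highlighter-rouge">Configuration</code> class: to keep the example short, the hypothetical template below works on a raw <code class="language-plaintext highlighter-rouge">uint8_t</code> and returns the modified register value instead of changing a <code class="language-plaintext highlighter-rouge">Control_t</code> in place.</p>

```cpp
#include <cstdint>

// Hypothetical, simplified helper operating on a raw register value.
template <typename Enum, uint8_t Mask, uint8_t Position = 0>
struct Configuration
{
    // Mask moved to the field's position inside the register.
    static constexpr uint8_t mask()
    { return static_cast<uint8_t>(Mask << Position); }

    // Returns the register with this field cleared.
    static constexpr uint8_t reset(uint8_t reg)
    { return static_cast<uint8_t>(reg & ~mask()); }

    // Returns the register with this field replaced by the given selection.
    static constexpr uint8_t set(uint8_t reg, Enum config)
    {
        return static_cast<uint8_t>(
            reset(reg) | ((static_cast<uint8_t>(config) << Position) & mask()));
    }

    // Extracts the selection from the register.
    static constexpr Enum get(uint8_t reg)
    { return static_cast<Enum>((reg & mask()) >> Position); }
};

// Non-shifted selections, as in the second variant above.
enum class Prescaler : uint8_t { Div1 = 0, Div2 = 1, Div4 = 2, Div8 = 3 };

using Prescaler_t  = Configuration<Prescaler, 0b11, 4>;  // field at bits [5, 4]
using Prescaler2_t = Configuration<Prescaler, 0b11, 6>;  // same selections, other position

static_assert(Prescaler_t::mask() == 0x30, "two bits, shifted by four");
static_assert(Prescaler_t::set(0x80, Prescaler::Div4) == 0xA0, "other bits untouched");
static_assert(Prescaler_t::get(0xA0) == Prescaler::Div4, "round trip");
static_assert(Prescaler2_t::set(0x00, Prescaler::Div8) == 0xC0, "reused at position 6");
```

<p>Keeping the enum values non-shifted is what allows the same <code class="language-plaintext highlighter-rouge">Prescaler</code> enum to be reused at a second position.</p>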
<h4 id="register-values">Register Values</h4>
<p>Values are described using the <code class="language-plaintext highlighter-rouge">Value</code> <a href="https://modm.io/reference/module/modm-architecture-register/#register-values">template class</a> which masks and shifts the value as required.
In our example the value has a width of 3 bits and needs to be shifted 1 bit:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">typedef</span> <span class="n">Value</span><span class="o"><</span> <span class="n">Control_t</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span> <span class="o">></span> <span class="n">Delay_t</span><span class="p">;</span>
</code></pre></div></div>
<p>This can be used the same way as the Configuration:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Control_t</span> <span class="n">control</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">EN</span> <span class="o">|</span> <span class="n">Prescaler_t</span><span class="p">(</span><span class="n">Prescaler</span><span class="o">::</span><span class="n">Div2</span><span class="p">)</span> <span class="o">|</span> <span class="n">Delay_t</span><span class="p">(</span><span class="mi">4</span><span class="p">);</span>
<span class="n">control</span> <span class="o">&=</span> <span class="o">~</span><span class="n">Delay_t</span><span class="o">::</span><span class="n">mask</span><span class="p">();</span>
<span class="n">Delay_t</span><span class="o">::</span><span class="n">set</span><span class="p">(</span><span class="n">control</span><span class="p">,</span> <span class="mi">7</span><span class="p">);</span>
<span class="n">Delay_t</span><span class="o">::</span><span class="n">reset</span><span class="p">(</span><span class="n">control</span><span class="p">);</span>
<span class="kt">uint8_t</span> <span class="n">delay</span> <span class="o">=</span> <span class="n">Delay_t</span><span class="o">::</span><span class="n">get</span><span class="p">(</span><span class="n">control</span><span class="p">);</span>
</code></pre></div></div>
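<p>To make the masking and shifting concrete, here is a minimal sketch of a value accessor. It is not the modm implementation: <code class="language-plaintext highlighter-rouge">FieldSketch</code>, <code class="language-plaintext highlighter-rouge">DelaySketch</code> and <code class="language-plaintext highlighter-rouge">delayExample</code> are hypothetical names, and it works on a plain integer instead of the typed <code class="language-plaintext highlighter-rouge">Control_t</code> to stay self-contained.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a Value-style accessor, NOT modm's implementation.
// Width is the field size in bits, Position the index of its lowest bit.
template <typename T, unsigned Width, unsigned Position>
struct FieldSketch
{
    static_assert(Width > 0 && Width < 32 && Width + Position <= sizeof(T) * 8,
                  "field must fit into the register");

    // Mask covering the field at its position inside the register.
    static constexpr T mask()
    { return static_cast<T>(((1u << Width) - 1u) << Position); }

    // Clear the field, then insert the value (truncated to the field width).
    static constexpr T set(T reg, T value)
    { return static_cast<T>((reg & ~mask()) | ((value << Position) & mask())); }

    // Clear the field and leave all other bits untouched.
    static constexpr T reset(T reg)
    { return static_cast<T>(reg & ~mask()); }

    // Extract the field and shift it down to bit 0.
    static constexpr T get(T reg)
    { return static_cast<T>((reg & mask()) >> Position); }
};

// Same geometry as the example above: 3 bits wide, shifted by 1 bit.
using DelaySketch = FieldSketch<std::uint8_t, 3, 1>;
static_assert(DelaySketch::mask() == 0x0E, "bits 3..1");
static_assert(DelaySketch::set(0x80, 4) == 0x88, "4 << 1 == 0x08");
static_assert(DelaySketch::get(0x88) == 4, "round trip");

inline std::uint8_t delayExample()
{
    std::uint8_t control = 0x80;             // some unrelated bit is already set
    control = DelaySketch::set(control, 7);  // writes 0b111 into bits 3..1
    assert(control == 0x8E);
    return DelaySketch::get(control);        // reads the 7 back
}
```

<p>The sketch returns the modified register instead of changing it through a reference, which keeps every function a single-expression <code class="language-plaintext highlighter-rouge">constexpr</code>.</p>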
<h4 id="efficiency">Efficiency</h4>
<p>These classes use <code class="language-plaintext highlighter-rouge">constexpr</code> wherever possible: constexpr constructors, constexpr operator overloads and constexpr methods.
This means that whatever can be computed at compile time will be computed at compile time.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Control_t</span> <span class="n">control</span> <span class="o">=</span> <span class="n">Control</span><span class="o">::</span><span class="n">EN</span> <span class="o">|</span> <span class="n">Prescaler_t</span><span class="p">(</span><span class="n">Prescaler</span><span class="o">::</span><span class="n">Div2</span><span class="p">)</span> <span class="o">|</span> <span class="n">Delay_t</span><span class="p">(</span><span class="mi">4</span><span class="p">);</span>
<span class="c1">// is just fancy syntax sugar coating for</span>
<span class="kt">uint8_t</span> <span class="n">control</span> <span class="o">=</span> <span class="mh">0xA4</span><span class="p">;</span>
</code></pre></div></div>
<p>Of course if your Configuration or Value class has to extract a value at runtime, the masking and shifting will happen at runtime. Not all that surprising.</p>
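<p>The difference between the two cases can be checked with <code class="language-plaintext highlighter-rouge">static_assert</code>. The following is only a sketch under assumptions: <code class="language-plaintext highlighter-rouge">DemoControl</code>, its bit values, <code class="language-plaintext highlighter-rouge">DemoControl_t</code> and <code class="language-plaintext highlighter-rouge">withEnable</code> are made up for illustration and are much simpler than the real flags classes.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register layout, chosen only for this sketch.
enum class DemoControl : std::uint8_t
{
    EN = 0x80,
    FS = 0x40,
};

// Minimal typed flags wrapper (an assumption, not the modm class).
struct DemoControl_t
{
    std::uint8_t value;

    constexpr DemoControl_t(DemoControl flag)
    :   value(static_cast<std::uint8_t>(flag)) {}
    constexpr explicit DemoControl_t(std::uint8_t raw)
    :   value(raw) {}
};

constexpr DemoControl_t operator|(DemoControl_t a, DemoControl_t b)
{ return DemoControl_t(static_cast<std::uint8_t>(a.value | b.value)); }

constexpr DemoControl_t operator|(DemoControl a, DemoControl b)
{ return DemoControl_t(a) | DemoControl_t(b); }

// Built only from constants: the static_assert proves that the whole
// expression is evaluated by the compiler.
constexpr DemoControl_t kInit = DemoControl::EN | DemoControl::FS;
static_assert(kInit.value == 0xC0, "evaluated at compile time");

// Built from a value that is only known at runtime: the OR happens at runtime.
inline DemoControl_t withEnable(std::uint8_t raw)
{ return DemoControl_t(raw) | DemoControl::EN; }

inline void flagsExample()
{
    assert(withEnable(0x01).value == 0x81);
}
```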
<h2 id="what-about-the-bus">What About The Bus?</h2>
<p>The above code works on a copy of the register content in the host’s RAM.
To understand why this makes a lot of sense for external devices, consider the accelerometer memory map from earlier:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rw: | Config1 | Config2 | Config3 | Config4 |
0x20 0x21 0x22 0x23
ro: | Status | XL | XH | YL | YH | ZL | ZH |
0x30 0x31 0x32 0x33 0x34 0x35 0x36
</code></pre></div></div>
<p>In our device driver we would reserve 4 bytes for buffering the configuration registers, 1 byte for the status register and 6 bytes for the data.</p>
<p>Usually, configuration registers are not changed by the external hardware itself, so you can modify the local copy of the configuration register and then only need to write the result once to the external hardware.
During device driver initialization you can also prepare all configuration registers and then write all 4 at once.</p>
<p>Similarly the status and data bytes can be read in one bus access and buffered locally for further computations.</p>
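<p>The buffering described above could look roughly like this. It is a sketch, not a real driver: <code class="language-plaintext highlighter-rouge">MockBus</code> and <code class="language-plaintext highlighter-rouge">AccelerometerSketch</code> are hypothetical, the bus is emulated in RAM, and the byte order of the X axis (low byte first, two's complement) is an assumption.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical bus stand-in: emulates the device memory and counts transfers.
struct MockBus
{
    std::array<std::uint8_t, 0x40> memory{};
    unsigned transfers = 0;

    bool write(std::uint8_t address, const std::uint8_t *data, std::size_t length)
    {
        if (address + length > memory.size()) return false;
        std::memcpy(&memory[address], data, length);
        ++transfers;
        return true;
    }

    bool read(std::uint8_t address, std::uint8_t *data, std::size_t length)
    {
        if (address + length > memory.size()) return false;
        std::memcpy(data, &memory[address], length);
        ++transfers;
        return true;
    }
};

struct AccelerometerSketch
{
    explicit AccelerometerSketch(MockBus &bus) : bus(bus) {}

    std::uint8_t config[4] = {};   // local copy of 0x20..0x23
    std::uint8_t readout[7] = {};  // local copy of 0x30..0x36: Status, XL..ZH

    // Push all four configuration registers in one burst.
    bool writeConfiguration()
    { return bus.write(0x20, config, sizeof(config)); }

    // Fetch the status and all six data bytes in one burst.
    bool readStatusAndData()
    { return bus.read(0x30, readout, sizeof(readout)); }

    // Combine the buffered XL and XH bytes; no bus access is needed.
    std::int16_t x() const
    { return static_cast<std::int16_t>(readout[1] | (readout[2] << 8)); }

private:
    MockBus &bus;
};

inline void burstExample()
{
    MockBus bus;
    AccelerometerSketch sensor(bus);
    sensor.config[0] = 0x80;  // prepare locally, no bus traffic yet
    sensor.config[1] = 0x10;
    const bool ok = sensor.writeConfiguration() && sensor.readStatusAndData();
    assert(ok && bus.transfers == 2);  // two bursts instead of eleven accesses
    (void) ok;
}
```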
<p>At the very basic level, the driver needs to provide functions to update the register’s content:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span> <span class="c1">// bool because bus access can fail</span>
<span class="n">updateControl</span><span class="p">(</span><span class="n">Control_t</span> <span class="n">setMask</span><span class="p">,</span> <span class="n">Control_t</span> <span class="n">clearMask</span> <span class="o">=</span> <span class="n">Control_t</span><span class="p">(</span><span class="mh">0xff</span><span class="p">));</span>
<span class="n">Control_t</span>
<span class="nf">getControl</span><span class="p">()</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">Control_t</span><span class="p">(</span><span class="n">rawBuffer</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span> <span class="p">}</span>
<span class="n">updateControl</span><span class="p">(</span><span class="n">Control</span><span class="o">::</span><span class="n">FS</span><span class="p">,</span> <span class="n">Control</span><span class="o">::</span><span class="n">EN</span><span class="p">);</span>
<span class="c1">// is equivalent to</span>
<span class="n">Control_t</span> <span class="n">control</span> <span class="o">=</span> <span class="n">getControl</span><span class="p">();</span>
<span class="n">control</span> <span class="o">&=</span> <span class="o">~</span><span class="n">Control</span><span class="o">::</span><span class="n">EN</span><span class="p">;</span>
<span class="n">control</span> <span class="o">|=</span> <span class="n">Control</span><span class="o">::</span><span class="n">FS</span><span class="p">;</span>
<span class="n">updateControl</span><span class="p">(</span><span class="n">control</span><span class="p">);</span>
</code></pre></div></div>
<p>However, providing meaningful setters and getters makes your code much more usable:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">bool</span>
<span class="nf">enable</span><span class="p">()</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">updateControl</span><span class="p">(</span><span class="n">Control</span><span class="o">::</span><span class="n">EN</span><span class="p">,</span> <span class="n">Control_t</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span> <span class="p">}</span>
<span class="kt">bool</span>
<span class="nf">disable</span><span class="p">()</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">updateControl</span><span class="p">(</span><span class="n">Control_t</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">Control</span><span class="o">::</span><span class="n">EN</span><span class="p">);</span> <span class="p">}</span>
<span class="kt">bool</span>
<span class="nf">setPrescaler</span><span class="p">(</span><span class="n">Prescaler</span> <span class="n">prescaler</span><span class="p">)</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">updateControl</span><span class="p">(</span><span class="n">Prescaler_t</span><span class="p">(</span><span class="n">prescaler</span><span class="p">),</span> <span class="n">Prescaler_t</span><span class="o">::</span><span class="n">mask</span><span class="p">());</span> <span class="p">}</span>
<span class="n">Prescaler</span> <span class="nf">getPrescaler</span><span class="p">()</span>
<span class="p">{</span> <span class="k">return</span> <span class="n">Prescaler_t</span><span class="o">::</span><span class="n">get</span><span class="p">(</span><span class="n">getControl</span><span class="p">());</span> <span class="p">}</span>
</code></pre></div></div>
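<p>All of these setters rely on an update function that changes the local copy and writes it to the device. One possible shape of such a function is sketched below. The names <code class="language-plaintext highlighter-rouge">RegisterBusSketch</code> and <code class="language-plaintext highlighter-rouge">DriverSketch</code> are hypothetical, plain <code class="language-plaintext highlighter-rouge">uint8_t</code> replaces the typed <code class="language-plaintext highlighter-rouge">Control_t</code>, and updating the local copy only after a successful transfer is a design assumption of this sketch.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single-register bus; `fail` lets a test simulate a bus error.
struct RegisterBusSketch
{
    std::uint8_t deviceRegister = 0;
    bool fail = false;

    bool write(std::uint8_t value)
    {
        if (fail) return false;
        deviceRegister = value;
        return true;
    }
};

struct DriverSketch
{
    explicit DriverSketch(RegisterBusSketch &bus) : bus(bus) {}

    // Clear the bits in clearMask, set the bits in setMask, then write once.
    // The local copy only changes if the bus transfer succeeded, so it never
    // drifts away from what the device actually holds.
    bool updateControl(std::uint8_t setMask, std::uint8_t clearMask = 0xff)
    {
        const std::uint8_t next =
            static_cast<std::uint8_t>((rawBuffer[0] & ~clearMask) | setMask);
        if (!bus.write(next)) return false;
        rawBuffer[0] = next;
        return true;
    }

    // Served from the local copy, without any bus access.
    std::uint8_t getControl() const
    { return rawBuffer[0]; }

private:
    RegisterBusSketch &bus;
    std::uint8_t rawBuffer[1] = {};
};

inline void updateExample()
{
    RegisterBusSketch bus;
    DriverSketch driver(bus);
    const bool ok = driver.updateControl(0x40, 0x80);  // set bit 6, clear bit 7
    assert(ok && driver.getControl() == 0x40);
    (void) ok;
}
```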
<p>For working examples of this concept have a look at the <a href="https://github.com/modm-io/modm/tree/develop/src/modm/driver">xpcc device drivers</a>.</p>
<h2 id="conclusions">Conclusions</h2>
<ol>
<li>Don’t bother with a pure C++ model of your internal memory.</li>
<li>Better invest the time in a useful hardware abstraction layer.</li>
<li>Buffer frequently accessed registers of external devices locally.</li>
<li>Use the type-safe C++ access classes for these registers as presented.</li>
<li>Be aware of the overhead of using an external bus.</li>
</ol>
<p><em>This post was first published at blog.xpcc.io.</em>
<em>The links have been updated to point to the successor project modm.io.</em></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>D seems to be a lot better suited for compile time evaluations than C++. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>This is different from desktop-class CPUs, where even the internal bus is orders of magnitude slower than the CPU. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>You actually need to use <code class="language-plaintext highlighter-rouge">XPCC_FLAGS8(Control)</code>, which expands to <code class="language-plaintext highlighter-rouge">typedef Flags8<Control> Control_t;</code> and some magic enum operator overloading. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>While researching for this post I discovered an almost identical <a href="https://github.com/grisumbras/enum-flags"><code class="language-plaintext highlighter-rouge">flags</code> class on GitHub</a>. However, it is not written for embedded targets and has a slightly different field of application. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Niklas Hauser niklas@salkinium.com