Dennis Cecic, P. Eng. (firstname.lastname@example.org)
Senior Member, IEEE Toronto Section
In this article, we’ll review key CPU and architectural features to identify potential compute performance bottlenecks. We’ll compare the CPUs in our Hero MCUs against these parameters:
- System Clock Frequency
- Addressable & Available Memory
- ALU Data Path
- Hardware Support for Arithmetic Operations and Extended Precision
Simple arithmetic and extended precision benchmark results from hardware simulation of our Hero MCUs will be discussed to provide a practical perspective on CPU performance.
System Clock Frequency
The duration of a CPU instruction cycle is set by the maximum system clock frequency that the MCU can generate. The following chart summarizes the maximum system clock frequency (Fsys, in MHz) and the resulting maximum CPU instruction frequency (Fcyc, in millions of instructions per second, or MIPS) for our 3 hero MCUs:
- The “200 MIPS” instruction throughput of the PIC32MZ CPU depends on code and data being fetched from the CPU’s instruction and data cache memories. For PIC32MZ, you should treat the “Fcyc” specification as the “average” instruction throughput.
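To make the relationship between Fcyc and execution time concrete, here is a minimal sketch that converts an instruction-cycle count into nanoseconds. The function name cycles_to_ns is hypothetical, and the calculation assumes one instruction per cycle with no stalls, which on a cached device such as PIC32MZ holds only on average.

```c
#include <stdint.h>

/* Convert an instruction-cycle count into nanoseconds.
 * fcyc_hz is the instruction frequency (Fcyc) in Hz, for example
 * 8000000 for 8 MIPS. Assumes every cycle has the same duration
 * (no cache misses or wait states). */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t fcyc_hz)
{
    return ((uint64_t)cycles * 1000000000ULL) / fcyc_hz;
}
```

For example, cycles_to_ns(1, 8000000) gives 125, so one instruction cycle at 8 MIPS lasts 125 ns.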
Addressable & Available Memory
Instruction set architecture design, cost, and power consumption all limit how much program and data memory a CPU can address, directly or indirectly, and how much is actually available on a given device. You should assess this early, because it is a key constraint for your algorithms. The following table summarizes the maximum addressable and available user RAM and Flash program memory for our hero PIC MCU families:
- PIC16F19197 contains 4 kB RAM and 56 kB Flash
- PIC24FJ1024GA606 contains 32 kB RAM and 1024 kB Flash
- PIC32MZ1024EFH064 contains 512 kB RAM and 1024 kB Flash
- Large Pin-Count PIC32MZ devices contain an External Bus Interface, which can be used to map external memory into the memory space of the CPU.
ALU Data Path
What does it mean when we say that a CPU is 8-, 16-, or 32-bit? This refers to the native size of the operands that are processed by the CPU’s integer Arithmetic-Logic Unit (ALU). For signed integers:
- 8-bit CPUs work natively with 8-bit integers between -128 and +127
- 16-bit CPUs work natively with 16-bit integers between -32,768 and +32,767
- 32-bit CPUs work natively with 32-bit integers between -2,147,483,648 and +2,147,483,647
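The practical effect of these ranges shows up as soon as a result no longer fits the native width. The following sketch, with hypothetical helper names, widens two 8-bit operands to 16 bits before adding them and checks whether the sum would still fit in 8 bits. On an 8-bit CPU the widened addition costs more than one native add.

```c
#include <stdint.h>

/* Add two signed 8-bit values in 16-bit arithmetic so the sum
 * cannot overflow. */
int16_t add8_widened(int8_t a, int8_t b)
{
    return (int16_t)((int16_t)a + (int16_t)b);
}

/* Return 1 if a + b fits in the native signed 8-bit range, else 0. */
int sum_fits_int8(int8_t a, int8_t b)
{
    int16_t s = add8_widened(a, b);
    return (s >= INT8_MIN) && (s <= INT8_MAX);
}
```

Here 100 + 27 still fits in an int8_t, while 100 + 28 does not and needs the wider type.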
Where do the Operands Come From?
Instruction Set Architectures (ISAs) can be classified according to where the operands reside for arithmetic operations. There are 3 major classifications: Accumulator, Register-Memory, and Load/Store:
A major benefit of Load/Store architectures is that they decouple CPU speed from memory access speed. This allows the use of cache memory and enables the CPU core to run at hundreds of MHz. A consequence of this architecture is that digital I/O port bit-set, bit-clear, and bit-toggle operations need additional hardware so that they can be performed atomically using store operations only.
The following table classifies the ISA type for the 3 CPU cores in our hero MCUs:
- Decouple CPU speed from memory speed (requires memory hierarchy)
- Some indeterminacy in code execution time, due to caching of instructions and data
- Special hardware needed to support atomic port pin manipulation using store operations only
- Large CPU register file increases function call and interrupt latencies, due to extra work in saving/restoring CPU context.
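To illustrate the point about atomic port manipulation, here is a small software model of the idea. It assumes the approach used on PIC32 devices, where each port register has companion SET, CLR, and INV addresses: the CPU issues a single store of a bit mask and the port hardware applies the OR, AND-NOT, or XOR. The type and function names below are hypothetical, and the functions only mimic in software what the hardware would do.

```c
#include <stdint.h>

/* Software model of one port latch. On real hardware "lat" would be a
 * memory-mapped register and the three operations below would each be
 * a single store to a SET, CLR, or INV alias address. */
typedef struct {
    uint32_t lat; /* current latch value */
} port_model_t;

/* Store to the SET alias: bits set in mask become 1. */
void port_write_set(port_model_t *p, uint32_t mask) { p->lat |= mask; }

/* Store to the CLR alias: bits set in mask become 0. */
void port_write_clr(port_model_t *p, uint32_t mask) { p->lat &= ~mask; }

/* Store to the INV alias: bits set in mask are toggled. */
void port_write_inv(port_model_t *p, uint32_t mask) { p->lat ^= mask; }
```

Because the CPU never has to load, modify, and store the port value itself, an interrupt cannot slip in between a read and a write of the same port.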
Hardware Support for Arithmetic and Extended Precision
The availability of hardware-assisted arithmetic operations such as multiply and divide greatly enhances the throughput of compute algorithms that use these operations. The following table indicates which MCU CPU cores contain special hardware support for enhanced arithmetic operations:
Digital Signal Processing (DSP) operations include multiply-accumulate (MAC), which dedicated hardware can apply to streams of data with minimal overhead. These operations are further enhanced by additional hardware which provides the following features:
- Single-cycle MACs
- Circular memory addressing
- Accumulators with guard bits
- Fractional and saturating arithmetic
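To show what several of these features buy you, here is a software sketch of a fractional dot product. It assumes the common Q15 format (16-bit signed values representing -1.0 up to just under +1.0); the function name dot_q15 is hypothetical. The 64-bit accumulator plays the role of guard bits, and the final clamp stands in for saturating arithmetic. Circular addressing is not shown. DSP hardware would typically do each loop iteration in a single cycle.

```c
#include <stdint.h>

/* Dot product of n Q15 samples with a wide accumulator and a
 * saturated Q15 result. */
int16_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    int64_t acc = 0; /* extra headroom acts like accumulator guard bits */

    for (int i = 0; i < n; i++) {
        /* Q15 x Q15 gives a Q30 product */
        acc += (int32_t)a[i] * (int32_t)b[i];
    }

    acc /= 32768; /* scale Q30 back to Q15 */

    /* Saturate instead of wrapping around */
    if (acc > INT16_MAX) {
        return INT16_MAX;
    }
    if (acc < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)acc;
}
```

With a = {16384, 16384} and b = {16384, 16384} (all 0.5 in Q15), the result is 16384, that is 0.25 + 0.25 = 0.5. Without the saturation step, sums beyond the Q15 range would wrap to the opposite sign.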
Note that C compilers provide libraries to emulate arithmetic operations on C data types that are wider than the native ALU data path. The emulated operations take more time and code space; however, this may be acceptable in your application.
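As a rough picture of what such emulation involves, the sketch below adds two 32-bit values using only 8-bit additions and a carry, the way an 8-bit ALU has to. This is an illustration under our own assumptions, not the actual code generated by any XC compiler, and the name add32_bytewise is hypothetical.

```c
#include <stdint.h>

/* Add two 32-bit values one byte at a time, least significant byte
 * first, propagating the carry between bytes. */
uint32_t add32_bytewise(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    uint8_t carry = 0;

    for (int i = 0; i < 4; i++) {
        uint8_t a_byte = (uint8_t)(a >> (8 * i));
        uint8_t b_byte = (uint8_t)(b >> (8 * i));
        uint16_t sum = (uint16_t)(a_byte + b_byte + carry);

        result |= (uint32_t)(sum & 0xFFu) << (8 * i); /* keep low 8 bits */
        carry = (uint8_t)(sum >> 8);                  /* carry into next byte */
    }
    return result; /* carry out of the top byte is discarded, as in C */
}
```

Four dependent byte additions replace what a 32-bit ALU does in one instruction, and multiplication or floating-point emulation grows much faster than this.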
Benchmarking CPU Performance
The most accurate way to evaluate an MCU is to create some benchmarking code and run it on hardware. Our 3 hero MCUs were wired up on a simple breadboard for this test (basic connections only: Power, Clock, Debug, 1-LED, 1-Switch, and a UART):
We will be using the hardware debugger stopwatch feature to perform the measurement. It requires 2 hardware breakpoint resources from the target MCU.
Note that not all MCUs support this feature. You can confirm debugger support for your MCU by opening the MPLAB X Hardware Tools Debug Features file, located at the following MPLAB X installation path:
\Program Files (x86)\Microchip\MPLABX\v5.40\docs\FeaturesSupport\HWToolDebugFeatures.html
Opening the file, the following snapshot informs us that the PIC16F19197 MCU does not support the hardware debug stopwatch feature for any of the hardware debuggers:
For the PIC16F19197 MCU, we will need to configure one of its 16-bit hardware timers (clocked at Fcyc), along with some software routines, to perform the measurement.
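The software side of a timer-based measurement can be quite small. The sketch below assumes a free-running 16-bit timer that increments once per instruction cycle and is read before and after the code under test; the function names and the overhead parameter are hypothetical, and reading the actual timer registers is device-specific and not shown.

```c
#include <stdint.h>

/* Ticks elapsed between two reads of a free-running 16-bit timer.
 * Unsigned 16-bit subtraction gives the correct answer even if the
 * counter rolled over once between the two reads. */
uint16_t elapsed_ticks(uint16_t start, uint16_t stop)
{
    return (uint16_t)(stop - start);
}

/* Subtract the fixed cost of the measurement itself (for example the
 * instructions that read the timer), found by timing an empty region. */
uint16_t net_cycles(uint16_t start, uint16_t stop, uint16_t overhead)
{
    return (uint16_t)(elapsed_ticks(start, stop) - overhead);
}
```

This only works if the measured region takes fewer than 65,536 timer ticks; longer regions need a prescaler or an overflow counter.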
One additional piece of information gleaned from the table is the number of program and data breakpoints available in a specific MCU:
- PIC16F19197: 1 Program BP, 1 Data BP
- PIC24FJ1024GA606: 6 Program BP, 6 Data BP
- PIC32MZ1024EFH064: 8 Program BP, 2 Data BP
Consider PIC24F and PIC32MZ if you anticipate debugging large/complex programs.
The Benchmark Code
The following code will be used to benchmark the basic arithmetic performance of the CPU as well as function-call overhead. The execution time of the highlighted functions/operations will be measured:
- All MCUs running at Fcyc = 8 MIPS
- PIC16F1 MCU measured using a 16-bit timer resource. PIC24F/PIC32MZ measured using the hardware debugger stopwatch feature
- All XC compilers configured with no optimization. “double” type set to 64-bit for XC16, XC32 and 32-bit for XC8
- PIC32MZ configuration: CPU Cache and Prefetch enabled. Flash wait states set to 0. Benchmark code/data is loaded into the CPU’s cache.
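As an illustration of the kind of kernel being timed under these conditions, a minimal sketch might look like the following. Only sum8() is named in this article; mul32(), mulf(), and muld() are hypothetical names chosen for this example, and the exact signatures are assumptions.

```c
#include <stdint.h>

/* 8-bit addition: mostly measures function-call overhead. */
int8_t sum8(int8_t a, int8_t b)
{
    return (int8_t)(a + b);
}

/* 32-bit integer multiply (hypothetical name). */
int32_t mul32(int32_t a, int32_t b)
{
    return a * b;
}

/* Single-precision multiply (hypothetical name). */
float mulf(float a, float b)
{
    return a * b;
}

/* Double-precision multiply (hypothetical name). Whether "double" is
 * 32 or 64 bits depends on the compiler settings listed above. */
double muld(double a, double b)
{
    return a * b;
}
```

Each call would be bracketed by the stopwatch breakpoints or timer reads, with the result stored to a volatile variable so that the call is not removed if optimization is later enabled.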
Discussion – sum8() Results
Each of the 3 MCUs can perform an 8-bit integer addition in a single instruction cycle, so why does the sum8() function take so many cycles to execute? The answer lies in the overhead of the C function call. If we examine the disassembly listing for PIC16F19197, we see that 6 instructions are required to prepare the stack frame before calling the routine, while 5 more instructions are required after the call to store the result:
For the PIC32MZ device, the following disassembly shows further overhead inside sum8() itself, where instructions are needed to create and destroy the function's local stack frame:
Discussion – Multiplication Results
As expected, the lack of dedicated MUL/DIV hardware on the PIC16F1 CPU seriously impacts the cycle count for multiply operations, especially with int32_t and float/double types.
However, consider that while running at 8 MIPS, a PIC16F1 can still perform ~5000 floating-point operations per second! How much do you really need for your application?
Hopefully, this discussion provided some clues as to which CPU to consider for your next design, and how you should measure CPU performance in an MCU.
Components of a high-performance CPU include clock speed (instruction throughput), ALU width, and the availability of dedicated hardware for performing extended precision operations.
Some insights regarding our 3 hero MCUs:
- PIC16F1 can perform extended precision operations using libraries available in the XC8 compiler
- PIC24F provides a nice step-up in execution of extended precision arithmetic, compared to PIC16F1
- PIC32MZ should be considered for high-memory, high-performance floating-point intensive algorithms
- PIC32MZ must be configured with cache and prefetch enabled to achieve maximum performance
- Function call overhead should be considered for high performance algorithms. Unroll loops and execute inline code for higher performance
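The last point can be sketched in a few lines. The names below are hypothetical. Note that inline is only a request to the compiler; with optimization disabled, as in the benchmark above, a compiler may still emit a real call, so the gain should be confirmed in the disassembly.

```c
#include <stdint.h>

/* Candidate for inlining: no call, no stack frame, if the compiler
 * honors the request. */
static inline int8_t sum8_inline(int8_t a, int8_t b)
{
    return (int8_t)(a + b);
}

/* Loop version: pays for the counter, compare, and branch on every
 * iteration. */
int16_t sum4_loop(const int8_t *x)
{
    int16_t total = 0;
    for (int i = 0; i < 4; i++) {
        total = (int16_t)(total + x[i]);
    }
    return total;
}

/* Unrolled version: same result with no loop overhead, at the cost
 * of more code space. */
int16_t sum4_unrolled(const int8_t *x)
{
    return (int16_t)(x[0] + x[1] + x[2] + x[3]);
}
```

Both sum4 functions return the same value; the trade-off is code size against cycles, which matters most inside inner loops.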