Speed and Performance Comparison of Harvard Bus vs System Bus on ARM Cortex-M Processors
ARM, one of the most widely used RISC processor families in the world, uses a Harvard bus architecture for the code (.text) region: instructions and data are fetched on separate buses, the I-Code and D-Code buses. Internal SRAM is connected to the core through a shared bus called the System bus, which is shared between SRAM and the peripherals. The goal of this study is to analyze the speed difference when code is executed from flash versus internal RAM, and the performance of the Harvard bus relative to the System bus.
A bus is a pathway for digital signals to rapidly move data. Buses are the fundamental means of data exchange inside the core. Three internal buses are associated with a processor: the data bus, the address bus, and the control bus. These buses move data into and out of the processor core and also carry inter-register data transfers.
ARM Cortex-M cores are 32-bit cores belonging to the RISC family of processors. The Cortex-M profile is intended for deeply embedded devices with very low power consumption, since such devices are mostly battery powered. These cores are highly energy efficient and occupy a much smaller silicon area.
ARM Cortex-M cores contain a number of internal buses for interfacing memory and various low- and high-speed peripherals. The code memory region is interfaced through a Harvard bus architecture with separate buses for instructions (I-Code) and data (D-Code). The rest of the memory map is interfaced on the shared System bus. On STM32, the volatile memory, i.e. internal RAM (IRAM), sits on the shared System bus, while internal flash memory sits on the Harvard bus.
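The mapping from address to bus described above can be sketched in a few lines of C. This is only an illustration, assuming the Cortex-M3/M4 memory map (code region below 0x20000000, private peripheral bus at 0xE0000000 to 0xE00FFFFF); the enum and the helper `bus_for_access` are hypothetical names introduced here, and the model ignores details such as the fact that instruction fetches from the private peripheral region are not permitted.

```c
#include <stdint.h>

/* Illustrative labels for the core's bus interfaces (not an ARM API). */
typedef enum { BUS_ICODE, BUS_DCODE, BUS_SYSTEM, BUS_PPB } bus_t;

/* Hypothetical helper: which bus serves an access to `addr`?
 * Assumes the Cortex-M3/M4 memory map; simplified for illustration. */
static bus_t bus_for_access(uint32_t addr, int is_instruction_fetch)
{
    if (addr < 0x20000000u) {
        /* Code region: split (Harvard) instruction and data buses. */
        return is_instruction_fetch ? BUS_ICODE : BUS_DCODE;
    }
    if (addr >= 0xE0000000u && addr <= 0xE00FFFFFu) {
        /* Private peripheral bus (core debug and system control). */
        return BUS_PPB;
    }
    /* SRAM, peripherals and external memory share the System bus. */
    return BUS_SYSTEM;
}
```

On an STM32, internal flash is typically mapped at 0x08000000 and internal SRAM at 0x20000000, so a constant read from flash maps to the D-Code bus while any access to IRAM, instruction or data, maps to the System bus.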
SRAM is usually expected to outperform hybrid memories such as flash, but what happens when flash is accessed over a high-bandwidth bus and SRAM over a shared system bus? On one side we have a slow memory on a high-bandwidth bus (flash on the Harvard bus); on the other side we have a very fast memory on a comparatively slow shared bus (SRAM on the System bus). The goal of this paper is to analyze the effect of bus bandwidth on memory latency.
As mentioned in the introduction, the main purpose of the project is to compare the performance of the two buses on ARM cores, i.e. the Harvard bus (I-Code + D-Code) and the von Neumann style shared System bus. The main idea is to compare performance when a slow memory such as flash is accessed via a high-bandwidth bus (Harvard) while a fast memory such as SRAM is accessed via a lower-bandwidth bus (shared System bus).
The performance will be evaluated on the following parameters.
1. Data Access
A large array containing dummy data will be accessed repeatedly via the System bus (variable data held in IRAM) and via the D-Code bus (constant data held in flash), and the average access time will be recorded and compared.
2. Instruction Fetch
While data is being read, instructions are fetched through the pipeline. To compare performance, we will execute instructions via both the I-Code bus and the System bus and record the average execution time. To fetch instructions via the I-Code bus, a sample code (Fast Fourier Transform) will be executed from internal flash memory on an STM32 MCU. Similarly, to execute code via the System bus, the same sample code will be copied into internal RAM (IRAM) and then called from the main routine.
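One possible way to arrange the instruction-fetch test is sketched below. It is not the code used in this study: the section name `.ramfunc` is an assumption that must match the linker script or scatter file (with startup code copying the section into IRAM), the small arithmetic loop is only a placeholder for the FFT routine, and `host_ticks` is a stand-in so the sketch also compiles off-target.

```c
#include <stdint.h>
#include <time.h>

/* On an ARM target, place the function in a RAM-resident section.
 * ".ramfunc" is a hypothetical name; it must exist in the linker script. */
#if defined(__arm__)
#define RAM_FUNC __attribute__((section(".ramfunc"), noinline))
#else
#define RAM_FUNC /* hosted build: no special placement */
#endif

/* Emit the same workload twice so neither copy calls into the other.
 * The loop is a placeholder, not the FFT used in the experiment. */
#define DEFINE_KERNEL(name, attrs)                      \
    attrs uint32_t name(uint32_t n)                     \
    {                                                   \
        uint32_t acc = 1u;                              \
        for (uint32_t i = 0; i < n; i++)                \
            acc = acc * 1664525u + 1013904223u;         \
        return acc;                                     \
    }

DEFINE_KERNEL(kernel_from_flash, )          /* default placement: flash */
DEFINE_KERNEL(kernel_from_ram, RAM_FUNC)    /* fetched over the System bus */

typedef uint32_t (*kernel_fn)(uint32_t);
typedef uint32_t (*tick_fn)(void);

/* Host stand-in for a timer; on target, return a hardware counter instead. */
static uint32_t host_ticks(void) { return (uint32_t)clock(); }

/* Keeps the result observable so the call is not optimized away. */
static volatile uint32_t kernel_sink;

/* Ticks spent in one call of `f`; unsigned subtraction handles wraparound. */
static uint32_t elapsed_ticks(kernel_fn f, uint32_t n, tick_fn now)
{
    uint32_t t0 = now();
    kernel_sink = f(n);
    return (uint32_t)(now() - t0);
}
```

Calling `elapsed_ticks(kernel_from_flash, n, now)` and `elapsed_ticks(kernel_from_ram, n, now)` with the same `n` then gives one timing sample per bus; both copies return the same value, which is a quick check that the RAM copy is intact.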
1. Instruction Fetch
A sample program containing Fast Fourier Transform code was executed from flash and then from IRAM. The average time of the operation was noted for both scenarios and is shown in Fig-1.
2. Data Access
A large array of dummy data was placed in internal flash (as constant data) and accessed via the D-Code bus. The same data was then placed in IRAM and accessed via the System bus. The average access time for the whole array (around 1000 elements) is shown in Fig-2.
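A measurement of this kind could be set up roughly as follows. This is a minimal sketch, not the code used in the experiment: it assumes a typical embedded toolchain that places `const` data in flash and non-`const` data in IRAM, the names `flash_data`, `ram_data`, and `avg_read_ticks` are hypothetical, and `host_ticks` is a stand-in timer. On a Cortex-M4 target the timer function could instead return the CMSIS cycle counter `DWT->CYCCNT`, after enabling it through `CoreDebug->DEMCR` and `DWT->CTRL`.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define N_ELEMS 1000u   /* "around 1000 elements", as in the experiment */

/* const: an embedded linker typically keeps this in flash (D-Code bus). */
static const uint32_t flash_data[N_ELEMS] = { 1u, 2u, 3u };
/* non-const: placed in internal RAM (System bus). */
static uint32_t ram_data[N_ELEMS];

typedef uint32_t (*tick_fn)(void);

/* Host stand-in for a timer; assumption: replaced by a cycle counter on target. */
static uint32_t host_ticks(void) { return (uint32_t)clock(); }

/* Keeps the checksum observable so the reads are not optimized away. */
static volatile uint32_t sink;

/* Read every element once; volatile forces one real load per element. */
static uint32_t read_all(const volatile uint32_t *buf, size_t n)
{
    uint32_t sum = 0u;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Average ticks per full pass over `buf`, taken over `runs` passes. */
static uint32_t avg_read_ticks(const volatile uint32_t *buf, size_t n,
                               uint32_t runs, tick_fn now)
{
    uint64_t total = 0u;
    if (runs == 0u)
        return 0u;                       /* avoid division by zero */
    for (uint32_t r = 0; r < runs; r++) {
        uint32_t t0 = now();
        sink = read_all(buf, n);
        total += (uint32_t)(now() - t0); /* wraparound-safe difference */
    }
    return (uint32_t)(total / runs);
}
```

Comparing `avg_read_ticks(flash_data, N_ELEMS, runs, now)` with `avg_read_ticks(ram_data, N_ELEMS, runs, now)` gives the two averages to be plotted. Averaging over several passes smooths out one-off effects such as the first access after reset.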
It is normally expected that flash memory is slow and that SRAM is very fast. Fig-1 and Fig-2 show that although flash memory is comparatively slow, its performance is almost comparable to SRAM when it is interfaced on a high-speed, high-bandwidth bus. The experiment indicates that on the STM32F4 (ARM Cortex-M) the performance of the two buses is almost the same, and that the Harvard bus effectively nullifies flash latency.