Inquire: Call 0086-755-23203480, or reach out via the form below/your sales contact to discuss our design, manufacturing, and assembly capabilities.
Quote: Email your PCB files to Sales@pcbsync.com (Preferred for large files) or submit online. We will contact you promptly. Please ensure your email is correct.
Notes: For PCB fabrication, we require PCB design file in Gerber RS-274X format (most preferred), *.PCB/DDB (Protel, inform your program version) format or *.BRD (Eagle) format. For PCB assembly, we require PCB design file in above mentioned format, drilling file and BOM. Click to download BOM template To avoid file missing, please include all files into one folder and compress it into .zip or .rar format.
Understanding the Zynq UltraScale+ ARM architecture requires digging deeper than marketing datasheets. After spending considerable time bringing up custom boards and debugging PS-PL interactions, I’ve learned that getting the most out of this platform depends on understanding how its heterogeneous processing elements work together, and when each should handle a given workload.
This technical guide examines the Xilinx UltraScale+ FPGA architecture in detail, covering the ARM Cortex-A53 application processor, Cortex-R5 real-time processor, and programmable logic fabric. Whether you’re architecting a new system or optimizing an existing design, this deep dive provides the technical foundation needed for effective hardware-software partitioning.
The Zynq UltraScale+ MPSoC fundamentally differs from both traditional processors and standalone FPGAs through its heterogeneous multiprocessing approach. Rather than forcing all workloads onto a single processing element, the architecture provides multiple specialized engines optimized for different tasks.
The Zynq UltraScale+ ARM processing system contains:
Application Processing Unit (APU): ARM Cortex-A53 cores for Linux and complex applications
Real-time Processing Unit (RPU): ARM Cortex-R5F cores for deterministic processing
Graphics Processing Unit (GPU): ARM Mali-400 MP2 for display and OpenGL ES (EG/EV variants)
Platform Management Unit (PMU): Dedicated MicroBlaze for system management
The Programmable Logic (PL) provides Xilinx UltraScale+ FPGA fabric for custom hardware acceleration and I/O interfaces. The combination enables system designers to assign workloads to the most appropriate processing element rather than compromising with a one-size-fits-all approach.
ARM Cortex-A53 Application Processing Unit
The Application Processing Unit (APU) centers on ARM Cortex-A53 cores implementing the ARMv8-A architecture. This represents a fundamental advancement over the Zynq-7000’s Cortex-A9 cores—64-bit addressing, improved efficiency, and modern instruction set extensions.
Cortex-A53 Core Architecture
The Cortex-A53 is a mid-range ARMv8-A processor optimized for power efficiency while delivering strong single-threaded performance. Key architectural features include:
| Feature | Specification |
| --- | --- |
| Architecture | ARMv8-A (AArch64 and AArch32) |
| Pipeline | 8-stage, in-order |
| Issue Width | Dual-issue |
| Clock Frequency | Up to 1.5 GHz (EG/EV), 1.3 GHz (CG) |
| L1 I-Cache | 32 KB per core |
| L1 D-Cache | 32 KB per core |
| L2 Cache | 1 MB shared (configurable) |
| SIMD | NEON, 128-bit |
| Floating Point | VFPv4, double precision |
The in-order pipeline may seem like a limitation compared to out-of-order designs like the Cortex-A72, but it provides predictable execution timing crucial for embedded systems. The dual-issue capability means two instructions can execute per cycle when dependencies allow.
APU Configuration Options
Depending on the device variant, the APU provides different core counts:
CG devices: Dual-core Cortex-A53 at up to 1.3 GHz
EG devices: Quad-core Cortex-A53 at up to 1.5 GHz
EV devices: Quad-core Cortex-A53 at up to 1.5 GHz
All configurations support both Symmetric Multiprocessing (SMP), where Linux manages all cores as a unified pool, and Asymmetric Multiprocessing (AMP), where individual cores or core pairs run separate operating systems or bare-metal code.
Memory System and Cache Hierarchy
The APU memory system significantly impacts application performance. Understanding its structure helps optimize software for this platform.
| Memory Level | Size | Characteristics |
| --- | --- | --- |
| L1 I-Cache | 32 KB/core | 2-way set associative |
| L1 D-Cache | 32 KB/core | 4-way set associative, write-back |
| L2 Cache | 1 MB shared | 16-way set associative, unified |
| Snoop Control Unit | N/A | Maintains coherency across cores |
The Snoop Control Unit (SCU) maintains cache coherency between cores, ensuring that when one core modifies shared data, other cores see the updated values. This hardware coherency simplifies software development but adds complexity when interfacing with DMA-capable peripherals in the programmable logic.
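To make the cache geometry concrete, the sketch below derives the number of sets in each cache from the sizes and associativities listed above. The 64-byte line size is an assumption not stated in this article, and the function name is purely illustrative.

```python
KIB = 1024

def cache_sets(size_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Number of sets = capacity / (associativity * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity must be a multiple of ways * line size")
    return sets

# Sizes and associativity come from the table above.
# The 64-byte line size (the default argument) is an assumption.
CACHES = {
    "L1 I-Cache": (32 * KIB, 2),
    "L1 D-Cache": (32 * KIB, 4),
    "L2 Cache": (1024 * KIB, 16),
}

geometry = {name: cache_sets(size, ways) for name, (size, ways) in CACHES.items()}

for name, sets in geometry.items():
    print(f"{name}: {sets} sets")
```

Under these assumptions, addresses that differ by a multiple of (sets × line size) compete for the same set, which is worth keeping in mind when laying out large buffers that are accessed together.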
TrustZone Security Extensions
The Cortex-A53 implements ARM TrustZone technology, partitioning system resources into Secure and Non-Secure worlds. This hardware-enforced isolation enables:
Secure boot verification
Protected key storage
Isolated security processing
Trusted execution environments
The TrustZone Secure Monitor runs at Exception Level 3 (EL3), the most privileged level, above the hypervisor level (EL2) and operating system level (EL1). The ARM Trusted Firmware (ATF) typically occupies EL3, managing secure world operations and transitions between security states.
The Real-time Processing Unit (RPU) addresses deterministic processing requirements that the Linux-running APU cannot satisfy. The dual Cortex-R5F cores provide sub-microsecond interrupt response times essential for motor control, safety monitoring, and real-time I/O handling.
Cortex-R5F Architecture Details
The Cortex-R5F is a 32-bit processor from ARM’s real-time family, optimized for low-latency, predictable execution:
| Feature | Specification |
| --- | --- |
| Architecture | ARMv7-R |
| Pipeline | 8-stage, dual-issue |
| Clock Frequency | Up to 600 MHz (EG/EV), 533 MHz (CG) |
| TCM (Tightly Coupled Memory) | 128 KB per core (ATCM + BTCM) |
| L1 I-Cache | 32 KB |
| L1 D-Cache | 32 KB |
| MPU Regions | 16 |
| ECC Support | Full ECC on TCM and caches |
The “F” suffix indicates floating-point support via the VFPv3 extension, enabling efficient signal processing without software emulation.
Operating Modes: Split vs. Lockstep
The RPU supports two distinct operating configurations that fundamentally change its behavior:
Split Mode: Both Cortex-R5F cores operate independently, each running its own code. This doubles processing throughput and enables parallel real-time tasks. In split mode:
Core 0 and Core 1 have separate TCMs
Each core handles independent interrupt vectors
No redundancy; single-core failure affects only that core’s functions
Lockstep Mode: Both cores execute identical instructions simultaneously, with hardware comparing results on every cycle. Any mismatch indicates a fault and triggers an error response. In lockstep mode:
Single logical processor with hardware redundancy
Supports functional safety targets; the achievable ASIL/SIL level depends on the device and the overall system design
Half the processing throughput of split mode
Automatic fault detection without software overhead
Lockstep mode proves essential for functional safety applications where processor failures could cause system hazards. The ISO 26262 ASIL-C certification for automotive XA devices relies heavily on this capability.
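The lockstep idea can be illustrated with a small software model: two "cores" receive the same inputs, and a checker compares their outputs at every step. This is only a conceptual sketch; the core functions, the fault injection, and all names are hypothetical and do not reflect how the hardware comparator is implemented.

```python
from typing import Callable, Iterable, Optional

Core = Callable[[int, int], int]  # (accumulator, operand) -> new accumulator

def healthy_core(acc: int, operand: int) -> int:
    """A trivial 32-bit accumulate operation standing in for real instructions."""
    return (acc + operand) & 0xFFFFFFFF

def make_faulty_core(fail_at_step: int) -> Core:
    """Return a core that flips one result bit at a chosen step (a simulated upset)."""
    step = 0

    def core(acc: int, operand: int) -> int:
        nonlocal step
        result = healthy_core(acc, operand)
        if step == fail_at_step:
            result ^= 0x1
        step += 1
        return result

    return core

def run_lockstep(core_a: Core, core_b: Core, operands: Iterable[int]) -> Optional[int]:
    """Feed both cores identical inputs; return the first step where outputs differ, else None."""
    acc_a = acc_b = 0
    for step, operand in enumerate(operands):
        acc_a = core_a(acc_a, operand)
        acc_b = core_b(acc_b, operand)
        if acc_a != acc_b:
            return step  # a real system would raise a fault response here
    return None

print(run_lockstep(healthy_core, healthy_core, range(10)))
print(run_lockstep(healthy_core, make_faulty_core(3), range(10)))
```

The model also shows the trade-off noted above: both cores do the same work, so redundancy is bought with throughput.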
TCM and Memory Architecture
The Tightly Coupled Memory (TCM) provides deterministic, single-cycle access for time-critical code and data:
| Memory | Size | Purpose |
| --- | --- | --- |
| ATCM (Core 0) | 64 KB | Instruction memory, deterministic fetch |
| BTCM (Core 0) | 64 KB | Data memory, deterministic access |
| ATCM (Core 1) | 64 KB | Available in split mode only |
| BTCM (Core 1) | 64 KB | Available in split mode only |
Unlike cached memory where access times vary based on cache hits/misses, TCM provides consistent timing essential for hard real-time systems. Critical interrupt handlers and control loops should execute from TCM whenever possible.
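Because TCM is small, it helps to check early whether the time-critical code and data actually fit. The following sketch assumes the common arrangement described in the table (code in ATCM, data in BTCM, 64 KB each); the function name and the example sizes are hypothetical, and real numbers would come from the linker map.

```python
TCM_BANK_BYTES = 64 * 1024  # ATCM and BTCM are 64 KB each per core

def tcm_fit(code_bytes: int, data_bytes: int) -> dict:
    """Report headroom when critical code goes to ATCM and critical data to BTCM."""
    return {
        "atcm_free": TCM_BANK_BYTES - code_bytes,
        "btcm_free": TCM_BANK_BYTES - data_bytes,
        "fits": code_bytes <= TCM_BANK_BYTES and data_bytes <= TCM_BANK_BYTES,
    }

# Hypothetical section sizes for an interrupt handler plus control loop.
report = tcm_fit(code_bytes=48 * 1024, data_bytes=20 * 1024)
print(report)
```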
Memory Protection Unit
The Cortex-R5F uses a Memory Protection Unit (MPU) rather than a full Memory Management Unit (MMU). The MPU provides:
16 configurable protection regions
Access permission control (read/write/execute)
Memory type attributes (cacheable, bufferable, shareable)
No virtual-to-physical address translation
This approach suits real-time applications where address translation latency would be unacceptable, but it means the RPU cannot run Linux or other operating systems requiring virtual memory.
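One practical consequence of using an MPU is that regions have placement rules. In the ARMv7-R protected memory model, a region size must be a power of two of at least 32 bytes, and the base address must be aligned to that size. The sketch below validates a candidate region and returns the size encoding; the function name is illustrative and this is not a driver API.

```python
def mpu_size_field(base: int, size: int) -> int:
    """Return the region-size encoding for an ARMv7-R style MPU region, or raise if invalid.

    The encoding N represents a region of 2**(N + 1) bytes.
    """
    if size < 32 or size & (size - 1):
        raise ValueError("region size must be a power of two, at least 32 bytes")
    if base % size:
        raise ValueError("region base must be aligned to the region size")
    return size.bit_length() - 2

# A 64 KB region at address 0 is valid.
print(mpu_size_field(0x0000_0000, 64 * 1024))

# A 48 KB region is not a power of two, so it must be split or rounded up to 64 KB.
try:
    mpu_size_field(0x0000_0000, 48 * 1024)
except ValueError as err:
    print("rejected:", err)
```

With only 16 regions available, rounding sizes up and merging neighbouring areas with identical attributes is usually part of the memory map design.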
Xilinx UltraScale+ FPGA Programmable Logic
The Programmable Logic (PL) region contains Xilinx UltraScale+ FPGA fabric based on the UltraScale architecture—not the 7-series fabric found in Zynq-7000. This architectural advancement provides improved timing characteristics, enhanced DSP capabilities, and the addition of UltraRAM blocks.
CLB Architecture
The Configurable Logic Block (CLB) is the fundamental building block of the Xilinx UltraScale+ FPGA fabric. Each CLB contains one slice with eight 6-input LUTs, sixteen flip-flops, an 8-bit carry chain, and wide multiplexers.
The 6-input LUTs can implement any Boolean function of up to 6 variables, or be configured as dual 5-input LUTs sharing common inputs. This flexibility enables efficient mapping of complex logic functions.
In slices that support memory functions (SLICEM), each LUT can alternatively function as:
64×1 distributed RAM (single-port)
32×2 distributed RAM (dual-port)
32-bit shift register (SRL32)
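A 6-input LUT is in effect a 64-entry truth table, which is why it can implement any Boolean function of six variables. The sketch below builds that 64-bit table for two functions, in the style of a LUT initialization value where bit i holds the output for input pattern i. The helper name is hypothetical.

```python
from typing import Callable

def lut6_init(func: Callable[..., int]) -> int:
    """Build the 64-bit truth table of a 6-input function; bit i is the output for pattern i."""
    init = 0
    for pattern in range(64):
        # inputs[0] is the least significant input bit of the pattern
        inputs = [(pattern >> bit) & 1 for bit in range(6)]
        if func(*inputs):
            init |= 1 << pattern
    return init

and6 = lut6_init(lambda a, b, c, d, e, f: a & b & c & d & e & f)
xor6 = lut6_init(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)

print(f"AND6 truth table = 64'h{and6:016X}")
print(f"XOR6 truth table = 64'h{xor6:016X}")
```

A 6-input AND and a 6-input XOR cost exactly the same resource, one LUT, because only the table contents differ.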
DSP Slice Capabilities
The DSP48E2 slice in the UltraScale architecture provides significant digital signal processing capability:
| Feature | Specification |
| --- | --- |
| Pre-adder | 27-bit |
| Multiplier | 27 × 18 bits |
| Accumulator | 48-bit |
| Pattern Detector | 48-bit |
| XOR Function | 96-bit |
A single DSP slice can perform a 27×18 multiply-accumulate operation in a single clock cycle at frequencies exceeding 700 MHz. Cascading multiple slices enables efficient FIR filters, matrix operations, and floating-point implementations.
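The arithmetic of a multiply-accumulate step can be modelled in a few lines, which is useful for checking fixed-point word lengths before writing HDL. This is a behavioural sketch using the operand widths from the table, not a model of the DSP48E2 pipeline or its control ports; all function names are illustrative.

```python
A_BITS, B_BITS, P_BITS = 27, 18, 48  # multiplier operand and accumulator widths

def _check_signed(value: int, bits: int, name: str) -> None:
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{name}={value} does not fit in {bits} signed bits")

def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to two's-complement range, as a fixed-width register would."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac(acc: int, a: int, b: int) -> int:
    """One multiply-accumulate step: acc + a * b, wrapped to the 48-bit accumulator."""
    _check_signed(a, A_BITS, "a")
    _check_signed(b, B_BITS, "b")
    return wrap_signed(acc + a * b, P_BITS)

def fir(samples, coeffs) -> int:
    """Dot product of samples and coefficients, one MAC per tap."""
    acc = 0
    for sample, coeff in zip(samples, coeffs):
        acc = mac(acc, sample, coeff)
    return acc

print(fir([1, 2, 3], [4, 5, 6]))
```

A signed 27 × 18 product needs at most 45 bits, so the 48-bit accumulator leaves only a few guard bits; long filters with full-scale data need scaling or wider accumulation across cascaded slices.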
Block RAM and UltraRAM
The Xilinx UltraScale+ FPGA provides two types of dedicated memory blocks:
Block RAM (BRAM):
36 Kb capacity per block (configurable as 2×18 Kb)
True dual-port operation
Built-in FIFO logic
ECC support
Synchronous operation
UltraRAM:
288 Kb capacity per block
True dual-port, 72-bit wide
Single-cycle access at full speed
Can be cascaded for deeper memories
Independent power-down capability
UltraRAM represents a significant advancement for designs requiring large on-chip buffers. A single UltraRAM block replaces eight BRAM blocks while consuming less power and providing better timing characteristics.
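A quick capacity estimate shows how the two memory types compare for a given buffer. The sketch below counts blocks by capacity alone, using the per-block sizes quoted above; it ignores width and depth constraints, so the actual count reported by the tools may be higher. The function name is illustrative.

```python
BRAM_BITS = 36 * 1024    # 36 Kb per Block RAM
URAM_BITS = 288 * 1024   # 288 Kb per UltraRAM

def blocks_needed(buffer_bytes: int) -> dict:
    """Capacity-only estimate of the blocks a buffer occupies in each memory type."""
    bits = buffer_bytes * 8
    return {
        "bram": -(-bits // BRAM_BITS),  # ceiling division
        "uram": -(-bits // URAM_BITS),
    }

# Example: a 1 MiB frame buffer.
print(blocks_needed(1024 * 1024))
```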
High-Speed Transceivers
The PL includes multiple transceiver types for high-speed serial communication:
| Transceiver | Data Rate | Typical Applications |
| --- | --- | --- |
| GTH | 0.5–16.3 Gb/s | 10G Ethernet, PCIe Gen3, Aurora |
| GTY | 0.5–32.75 Gb/s | 25G/100G Ethernet, PCIe Gen4 |
Each transceiver includes programmable equalizers, clock recovery, and protocol-specific features. The transceivers operate independently of the PS transceivers, enabling the PL to implement custom high-speed interfaces.
PS-PL Interface Architecture
The interface between Processing System and Programmable Logic defines system performance and determines which designs are feasible. The Zynq UltraScale+ ARM platform provides multiple interface types optimized for different traffic patterns.
AXI Interface Types
| Interface | Width | Purpose | Bandwidth |
| --- | --- | --- | --- |
| HPM (High Performance Master) | 32/64/128-bit | PS master, PL slave | ~5 GB/s each |
| HPC (High Performance Coherent) | 32/64/128-bit | PL master with cache coherency | ~5 GB/s each |
| HP (High Performance) | 32/64/128-bit | PL master to DDR, non-coherent | ~5 GB/s each |
| LPD (Low Power Domain) | 32/64/128-bit | LPD peripherals to PL | ~2 GB/s |
| ACP (Accelerator Coherency Port) | 128-bit | PL coherent access to APU caches | ~5 GB/s |
The aggregate PS-PL bandwidth exceeds 150 GB/s when all interfaces are utilized, though practical designs rarely approach this theoretical maximum.
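The per-port figures in the table follow from simple arithmetic: an AXI data channel moves at most one beat of the configured width per clock. The sketch below computes that peak for each supported width. The 333 MHz interface clock is an assumed example value, not a figure from this article, and sustained throughput will be lower once arbitration, burst length, and DDR efficiency are taken into account.

```python
def axi_peak_gbytes_per_s(width_bits: int, clock_mhz: float) -> float:
    """Peak one-direction throughput of an AXI data channel: one beat per clock."""
    return width_bits / 8 * clock_mhz * 1e6 / 1e9

ASSUMED_CLOCK_MHZ = 333.0  # hypothetical interface clock, chosen for illustration

for width in (32, 64, 128):
    peak = axi_peak_gbytes_per_s(width, ASSUMED_CLOCK_MHZ)
    print(f"{width:>3}-bit @ {ASSUMED_CLOCK_MHZ:.0f} MHz: {peak:.2f} GB/s")
```

Under this assumption only the 128-bit configuration reaches the roughly 5 GB/s class, so narrowing a port to save fabric resources scales its bandwidth down proportionally.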
Choosing the Right Interface
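Selecting appropriate interfaces significantly impacts system performance. Going by the purposes listed in the table above: use HPM ports when PS software needs to reach registers or memories in the PL, HPC ports or the ACP when a PL master needs cache-coherent access, and HP ports for non-coherent bulk transfers to DDR.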
The PS and PL operate in separate clock domains, requiring careful synchronization at interfaces. The PS generates several PL reference clocks (PL0-PL3) configurable from 100 MHz to over 300 MHz, but PL designs may use independent clocking when required.
Clock domain crossing between PS and PL occurs automatically within the AXI infrastructure, but designers must understand that:
AXI transactions include handshaking that accommodates clock differences
Maximum interface frequency depends on both PS and PL clock rates
Asynchronous clock domains add latency to transactions
What is the difference between Cortex-A53 and Cortex-R5 in Zynq UltraScale+?
The Zynq UltraScale+ ARM Cortex-A53 is a 64-bit application processor designed for running operating systems like Linux. It features virtual memory (MMU), multi-level caches, and TrustZone security. The Cortex-R5 is a 32-bit real-time processor optimized for deterministic, low-latency tasks. It uses an MPU instead of MMU, provides tightly coupled memory for guaranteed access timing, and supports lockstep operation for safety-critical applications. Use the A53 for complex software; use the R5 for time-critical control loops.
Can the Cortex-A53 and Cortex-R5 run simultaneously?
Yes, the APU and RPU operate independently and can run simultaneously. This enables powerful system architectures where Linux handles networking, user interface, and complex algorithms on the A53 cores while real-time control loops execute on the R5 cores with guaranteed timing. Inter-processor communication uses shared memory regions, hardware mailboxes, or software-defined protocols. The OpenAMP framework provides standard mechanisms for AMP systems.
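As an illustration of the shared-memory approach, the sketch below implements a single-producer, single-consumer ring of fixed-size messages inside a flat byte buffer. It is a conceptual model only: the class, layout, and message format are hypothetical, it is not the OpenAMP API, and on real hardware the shared region would also need memory barriers and either non-cached mapping or explicit cache maintenance.

```python
import struct

class SharedRing:
    """Single-producer, single-consumer ring of fixed-size messages in a byte buffer.

    Layout: [head: u32][tail: u32][slot 0][slot 1]...
    The producer only writes head and the consumer only writes tail,
    so neither side needs a lock.
    """
    HEADER = struct.Struct("<II")

    def __init__(self, memory: bytearray, slot_size: int):
        self.mem = memory
        self.slot_size = slot_size
        self.slots = (len(memory) - self.HEADER.size) // slot_size
        if self.slots < 2:
            raise ValueError("buffer too small for a ring")

    def send(self, payload: bytes) -> bool:
        if len(payload) > self.slot_size:
            raise ValueError("payload larger than slot")
        head, tail = self.HEADER.unpack_from(self.mem, 0)
        next_head = (head + 1) % self.slots
        if next_head == tail:  # full: one slot stays empty to tell full from empty
            return False
        offset = self.HEADER.size + head * self.slot_size
        self.mem[offset:offset + self.slot_size] = payload.ljust(self.slot_size, b"\0")
        struct.pack_into("<I", self.mem, 0, next_head)  # publish after the data is written
        return True

    def receive(self):
        head, tail = self.HEADER.unpack_from(self.mem, 0)
        if head == tail:  # empty
            return None
        offset = self.HEADER.size + tail * self.slot_size
        data = bytes(self.mem[offset:offset + self.slot_size])
        struct.pack_into("<I", self.mem, 4, (tail + 1) % self.slots)
        return data

# Two views of the same memory stand in for the two processors.
shared = bytearray(8 + 4 * 16)
producer = SharedRing(shared, slot_size=16)
consumer = SharedRing(shared, slot_size=16)
producer.send(b"setpoint=42")
print(consumer.receive().rstrip(b"\0"))
```

In practice a mailbox or inter-processor interrupt would notify the receiver instead of polling, with the ring carrying the payload.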
How does UltraRAM differ from Block RAM in the Xilinx UltraScale+ FPGA?
UltraRAM provides 288 Kb per block versus 36 Kb for Block RAM—eight times the density. UltraRAM is optimized for large buffers and can be independently powered down for energy savings. Block RAM offers more flexible configurations (aspect ratios, ECC options, FIFO modes) and is distributed throughout the fabric for better timing to nearby logic. Use UltraRAM for large memories where a single wide interface suffices; use Block RAM for smaller, distributed memories requiring specific features.
What software can run on each processor in the Zynq UltraScale+?
The Cortex-A53 APU supports Linux, FreeRTOS, VxWorks, QNX, bare-metal applications, and hypervisors like Xen. The Cortex-R5 RPU supports FreeRTOS, SafeRTOS, bare-metal applications, and other RTOSes that don’t require virtual memory. The Mali-400 GPU supports OpenGL ES 1.1 and 2.0 graphics APIs. Typical production systems run Linux on the APU for application software and FreeRTOS or bare-metal on the RPU for real-time functions.
How do I decide what functionality belongs in the FPGA versus the ARM processors?
Place functionality in the Xilinx UltraScale+ FPGA programmable logic when you need: parallel processing beyond what ARM cores provide, precise timing control, custom interfaces not available in the PS, or hardware acceleration of compute-intensive algorithms. Place functionality in the ARM processors when you need: complex decision logic, operating system services, networking stacks, file systems, or rapid development without HDL. The PS-PL interface bandwidth supports moving data between domains, so the decision centers on which processing element best handles each workload rather than data locality constraints.
Architecting Effective Zynq UltraScale+ Systems
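The Zynq UltraScale+ ARM architecture provides remarkable flexibility but demands thoughtful system partitioning. Success requires understanding each processing element’s strengths: the Cortex-A53 APU for operating systems and complex software, the Cortex-R5F RPU for deterministic real-time control, and the programmable logic for parallel acceleration and custom interfaces.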
The PS-PL interfaces enable efficient data movement between domains, making the partitioning decision about capability rather than connectivity. Start with clear requirements for timing, throughput, and functionality, then map each function to the most appropriate processing element.
Interrupt Architecture and Management
The Zynq UltraScale+ ARM interrupt system deserves careful attention as it directly impacts real-time performance and system responsiveness.
Generic Interrupt Controller (GIC)
The APU uses ARM’s GIC-400, a GICv2 implementation supporting:
| Feature | Specification |
| --- | --- |
| Shared Peripheral Interrupts (SPI) | 160 |
| Private Peripheral Interrupts (PPI) | 16 per core |
| Software Generated Interrupts (SGI) | 16 |
| Priority Levels | 32 |
| Security States | Secure and Non-Secure |
The GIC provides interrupt prioritization, routing to specific cores, and security partitioning. Properly configuring interrupt affinities—which core handles which interrupt—significantly impacts system performance and real-time response.
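The interrupt types in the table map onto fixed ID ranges in the GICv2 architecture: SGIs use IDs 0 to 15, PPIs use 16 to 31, and SPIs start at 32. The sketch below classifies an interrupt ID accordingly and shows how 32 priority levels sit in the upper five bits of an 8-bit priority field. The helper names are illustrative, and the SPI count is taken from the table above.

```python
SPI_COUNT = 160  # from the table above

def classify_interrupt(intid: int) -> str:
    """Map a GICv2 interrupt ID to its type using the architectural ID ranges."""
    if 0 <= intid <= 15:
        return "SGI"
    if 16 <= intid <= 31:
        return "PPI"
    if 32 <= intid < 32 + SPI_COUNT:
        return "SPI"
    raise ValueError(f"interrupt ID {intid} is outside the implemented range")

def priority_field(level: int) -> int:
    """Encode one of 32 priority levels into an 8-bit field (upper 5 bits implemented).

    Lower values mean higher priority.
    """
    if not 0 <= level < 32:
        raise ValueError("level must be in 0..31")
    return level << 3

for intid in (3, 27, 32, 191):
    print(intid, classify_interrupt(intid))
print(hex(priority_field(31)))
```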
RPU Interrupt Handling
The RPU has its own interrupt controller, an ARM PL390 GIC (GICv1 architecture), which offers similar capabilities to the APU’s GIC-400 but is configured independently. This isolation ensures that APU interrupt load doesn’t affect RPU response times.
Key RPU interrupt characteristics:
Vectored interrupt controller with configurable priority
Fast interrupt (FIQ) path for lowest-latency handlers
Nested interrupt support for priority preemption
Interrupt latency under 20 cycles from assertion to handler entry
PL-to-PS Interrupt Routing
The programmable logic can generate interrupts to both the APU and RPU through dedicated interrupt lines:
| Interrupt Group | Destination | Count |
| --- | --- | --- |
| PL-PS Group 0 | APU (IRQ) | 8 |
| PL-PS Group 1 | APU (IRQ) | 8 |
| PL-RPU | RPU | 2 |
PL-generated interrupts enable hardware accelerators to signal completion, custom peripherals to request service, and external events to trigger software responses. Proper interrupt design minimizes latency between PL event occurrence and software handler execution.
Power Domain Architecture
The Zynq UltraScale+ ARM architecture implements sophisticated power management through independent power domains:
Power Domain Organization
| Domain | Contents | Typical Power State |
| --- | --- | --- |
| Full Power Domain (FPD) | APU, GPU, DisplayPort, SATA, PCIe | Active during application processing |
| Low Power Domain (LPD) | RPU, USB, Ethernet, IOU | Can remain active when FPD sleeps |
| PL Power Domain | Entire FPGA fabric | Independent control |
| Battery Power Domain | RTC, minimal logic | Always on (nanoamps) |
This partitioning enables sophisticated power management strategies:
FPD Off Mode: Disable APU and high-speed peripherals while RPU maintains real-time functions
Deep Sleep: Only battery domain active, microseconds wake time
PL Power Gating: Disable unused programmable logic regions
Platform Management Unit
The Platform Management Unit (PMU) orchestrates power management using a dedicated MicroBlaze processor with:
32 KB ROM for boot code
128 KB RAM for runtime firmware
Access to all power control registers
System monitoring capabilities
The PMU firmware handles power state transitions, monitors system health, and manages the boot sequence. Customizing PMU firmware enables application-specific power optimization.
Understanding this architecture deeply enables designs that leverage all processing resources effectively—achieving performance and determinism impossible with any single processor type alone.