Zynq UltraScale+ Architecture: Cortex-A53, R5 & FPGA Explained

Understanding the Zynq UltraScale+ ARM architecture requires digging deeper than marketing datasheets. After spending considerable time bringing up custom boards and debugging PS-PL interactions, I’ve learned that this platform’s power comes from understanding how its heterogeneous processing elements work together—and when each should handle specific workloads.

This technical guide examines the Xilinx UltraScale+ FPGA architecture in detail, covering the ARM Cortex-A53 application processor, Cortex-R5 real-time processor, and programmable logic fabric. Whether you’re architecting a new system or optimizing an existing design, this deep dive provides the technical foundation needed for effective hardware-software partitioning.

The Heterogeneous Processing Architecture

The Zynq UltraScale+ MPSoC fundamentally differs from both traditional processors and standalone FPGAs through its heterogeneous multiprocessing approach. Rather than forcing all workloads onto a single processing element, the architecture provides multiple specialized engines optimized for different tasks.

The Zynq UltraScale+ ARM processing system contains:

Application Processing Unit (APU): ARM Cortex-A53 cores for Linux and complex applications
Real-time Processing Unit (RPU): ARM Cortex-R5F cores for deterministic processing
Graphics Processing Unit (GPU): ARM Mali-400 MP2 for display and OpenGL ES (EG/EV variants)
Platform Management Unit (PMU): Dedicated MicroBlaze for system management

The Programmable Logic (PL) provides Xilinx UltraScale+ FPGA fabric for custom hardware acceleration and I/O interfaces. The combination enables system designers to assign workloads to the most appropriate processing element rather than compromising with a one-size-fits-all approach.

ARM Cortex-A53 Application Processing Unit

The Application Processing Unit (APU) centers on ARM Cortex-A53 cores implementing the ARMv8-A architecture. This represents a fundamental advancement over the Zynq-7000’s Cortex-A9 cores—64-bit addressing, improved efficiency, and modern instruction set extensions.

Cortex-A53 Core Architecture

The Cortex-A53 is a mid-range ARMv8-A processor optimized for power efficiency while delivering strong single-threaded performance. Key architectural features include:

Feature	Specification
Architecture	ARMv8-A (AArch64 and AArch32)
Pipeline	8-stage, in-order
Issue Width	Dual-issue
Clock Frequency	Up to 1.5 GHz (EG/EV), 1.3 GHz (CG)
L1 I-Cache	32 KB per core
L1 D-Cache	32 KB per core
L2 Cache	1 MB shared (configurable)
SIMD	NEON, 128-bit
Floating Point	VFPv4, double precision

The in-order pipeline may seem like a limitation compared to out-of-order designs like the Cortex-A72, but it provides predictable execution timing crucial for embedded systems. The dual-issue capability means two instructions can execute per cycle when dependencies allow.
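
The pairing rule is easier to see in a toy model. The sketch below is not the real Cortex-A53 issue logic. It assumes single-cycle latency for every instruction and treats a dependency on the instruction issued alongside as the only pairing restriction. The tuple format and the function name are invented for illustration.

```python
def count_issue_cycles(instructions):
    """Count cycles for a toy in-order, dual-issue pipeline.

    Each instruction is (dest_register, [source_registers]).
    Illustrative assumptions: single-cycle latency, and the second
    instruction of a pair may neither read nor overwrite the first
    instruction's destination register.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1
        first_dest, _ = instructions[i]
        i += 1
        if i < len(instructions):
            second_dest, second_srcs = instructions[i]
            independent = (first_dest not in second_srcs
                           and first_dest != second_dest)
            if independent:
                i += 1  # second instruction issues in the same cycle
    return cycles


dependent_chain = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
independent_ops = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r0"]), ("r4", ["r0"])]

print(count_issue_cycles(dependent_chain))  # 4: nothing can pair
print(count_issue_cycles(independent_ops))  # 2: every pair dual-issues
```

Even in this simplified form, the model shows why instruction scheduling by the compiler matters more on an in-order core than on an out-of-order one.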

APU Configuration Options

Depending on the device variant, the APU provides different core counts:

CG devices: Dual-core Cortex-A53 at up to 1.3 GHz
EG devices: Quad-core Cortex-A53 at up to 1.5 GHz
EV devices: Quad-core Cortex-A53 at up to 1.5 GHz

All configurations support both Symmetric Multiprocessing (SMP) where Linux manages all cores as a unified pool, and Asymmetric Multiprocessing (AMP) where individual cores or core pairs run separate operating systems or bare-metal code.

Memory System and Cache Hierarchy

The APU memory system significantly impacts application performance. Understanding its structure helps optimize software for this platform.

Memory Level	Size	Characteristics
L1 I-Cache	32 KB/core	2-way set associative
L1 D-Cache	32 KB/core	4-way set associative, write-back
L2 Cache	1 MB shared	16-way set associative, unified
Snoop Control Unit	N/A	Maintains coherency across cores

The Snoop Control Unit (SCU) maintains cache coherency between cores, ensuring that when one core modifies shared data, other cores see the updated values. This hardware coherency simplifies software development but adds complexity when interfacing with DMA-capable peripherals in the programmable logic.
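
To make the associativity figures concrete, the sketch below derives the number of sets in each cache and splits an address into tag, set index, and byte offset. It assumes the 64-byte line size of the Cortex-A53 and ignores virtual versus physical indexing, so treat it as a mental model rather than a description of the hardware lookup. The function names are mine.

```python
LINE_BYTES = 64  # Cortex-A53 cache line size in bytes


def cache_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def split_address(addr, size_bytes, ways, line_bytes=LINE_BYTES):
    """Split an address into (tag, set index, byte offset)."""
    sets = cache_sets(size_bytes, ways, line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


KB = 1024
print(cache_sets(32 * KB, 2))     # L1 I-cache: 256 sets
print(cache_sets(32 * KB, 4))     # L1 D-cache: 128 sets
print(cache_sets(1024 * KB, 16))  # L2 cache: 1024 sets
print(split_address(0x12345, 32 * KB, 4))  # (9, 13, 5)
```

One practical consequence: in the L1 D-cache, addresses that differ by a multiple of 8 KB (128 sets × 64 bytes) land in the same set, so more than four such hot lines will evict each other.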

TrustZone Security Extensions

The Cortex-A53 implements ARM TrustZone technology, partitioning system resources into Secure and Non-Secure worlds. This hardware-enforced isolation enables:

Secure boot verification
Protected key storage
Isolated security processing
Trusted execution environments

The TrustZone Secure Monitor runs at Exception Level 3 (EL3), the most privileged level, above the hypervisor (EL2) and the operating system (EL1). The ARM Trusted Firmware (ATF) typically provides this monitor and manages secure world operations and transitions between security states.

ARM Cortex-R5F Real-Time Processing Unit

The Real-time Processing Unit (RPU) addresses deterministic processing requirements that the Linux-running APU cannot satisfy. The dual Cortex-R5F cores provide sub-microsecond interrupt response times essential for motor control, safety monitoring, and real-time I/O handling.

Cortex-R5F Architecture Details

The Cortex-R5F is a 32-bit processor from ARM’s real-time family, optimized for low-latency, predictable execution:

Feature	Specification
Architecture	ARMv7-R
Pipeline	8-stage, dual-issue
Clock Frequency	Up to 600 MHz (EG/EV), 533 MHz (CG)
TCM (Tightly Coupled Memory)	128 KB per core (ATCM + BTCM)
L1 I-Cache	32 KB
L1 D-Cache	32 KB
MPU Regions	16
ECC Support	Full ECC on TCM and caches

The “F” suffix indicates floating-point support via the VFPv3 extension, enabling efficient signal processing without software emulation.

Operating Modes: Split vs. Lockstep

The RPU supports two distinct operating configurations that fundamentally change its behavior:

Split Mode: Both Cortex-R5F cores operate independently, each running its own code. This doubles processing throughput and enables parallel real-time tasks. In split mode:

Core 0 and Core 1 have separate TCMs
Each core handles independent interrupt vectors
No redundancy; single-core failure affects only that core’s functions

Lockstep Mode: Both cores execute identical instructions simultaneously, with hardware comparing results on every cycle. Any mismatch indicates a fault and triggers an error response. In lockstep mode:

Single logical processor with hardware redundancy
Supports functional safety designs; the achievable integrity level depends on the device family and the system-level safety concept
Half the processing throughput of split mode
Automatic fault detection without software overhead

Lockstep mode proves essential for functional safety applications where processor failures could cause system hazards. The ISO 26262 ASIL-C certification for automotive XA devices relies heavily on this capability.
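
The compare-every-result idea behind lockstep can be sketched in software. This is a conceptual model only: the real comparison happens in hardware on core outputs, and the names `run_lockstep`, `LockstepFault`, and the `fault_at` injection parameter are hypothetical.

```python
class LockstepFault(Exception):
    """Raised when the two redundant results disagree."""


def run_lockstep(step, inputs, fault_at=None):
    """Run `step` on two redundant 'cores' and compare every result.

    `fault_at` optionally flips one bit in the second core's result at
    that step index, to imitate a transient hardware upset.
    """
    outputs = []
    for index, value in enumerate(inputs):
        result_a = step(value)
        result_b = step(value)
        if index == fault_at:
            result_b ^= 1  # inject a single-bit error
        if result_a != result_b:
            raise LockstepFault(f"mismatch at step {index}")
        outputs.append(result_a)
    return outputs


print(run_lockstep(lambda x: x * 2, [1, 2, 3]))  # [2, 4, 6]
try:
    run_lockstep(lambda x: x * 2, [1, 2, 3], fault_at=1)
except LockstepFault as fault:
    print("fault detected:", fault)
```

The model also shows the cost: two execution resources produce one stream of results, which is the throughput trade-off noted in the list above.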

TCM and Memory Architecture

The Tightly Coupled Memory (TCM) provides deterministic, single-cycle access for time-critical code and data:

Memory	Size	Purpose
ATCM (Core 0)	64 KB	Instruction memory, deterministic fetch
BTCM (Core 0)	64 KB	Data memory, deterministic access
ATCM (Core 1)	64 KB	Private to Core 1 in split mode; combined with Core 0's TCM in lockstep mode
BTCM (Core 1)	64 KB	Private to Core 1 in split mode; combined with Core 0's TCM in lockstep mode

Unlike cached memory where access times vary based on cache hits/misses, TCM provides consistent timing essential for hard real-time systems. Critical interrupt handlers and control loops should execute from TCM whenever possible.

Memory Protection Unit

The Cortex-R5F uses a Memory Protection Unit (MPU) rather than a full Memory Management Unit (MMU). The MPU provides:

16 configurable protection regions
Access permission control (read/write/execute)
Memory type attributes (cacheable, bufferable, shareable)
No virtual-to-physical address translation

This approach suits real-time applications where address translation latency would be unacceptable, but it means the RPU cannot run Linux or other operating systems requiring virtual memory.
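
Because the MPU has no translation tables, region planning comes down to a few arithmetic rules. The sketch below checks a region against the ARMv7-R constraints as I understand them (size a power of two, at least 32 bytes, base aligned to the size) and returns the size-field encoding, where the region size equals 2^(field + 1). The function name is mine, and the example addresses are arbitrary.

```python
def mpu_region_size_field(base, size):
    """Validate an ARMv7-R style MPU region and return its size encoding.

    Assumed rules: size is a power of two between 32 bytes and 4 GB,
    base is aligned to size, and size == 2 ** (field + 1).
    """
    if size < 32 or size > 1 << 32 or size & (size - 1):
        raise ValueError("size must be a power of two from 32 bytes to 4 GB")
    if base % size:
        raise ValueError("base must be aligned to the region size")
    return size.bit_length() - 2


print(mpu_region_size_field(0x0000_0000, 32))         # 4
print(mpu_region_size_field(0x0002_0000, 64 * 1024))  # 15
```

With only 16 regions available, a buffer of awkward size usually has to be rounded up to the next power of two or covered by overlapping regions, so it pays to plan the memory map before writing the linker script.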

Xilinx UltraScale+ FPGA Programmable Logic

The Programmable Logic (PL) region contains Xilinx UltraScale+ FPGA fabric based on the UltraScale architecture—not the 7-series fabric found in Zynq-7000. This architectural advancement provides improved timing characteristics, enhanced DSP capabilities, and the addition of UltraRAM blocks.

CLB Architecture

The Configurable Logic Block (CLB) is the fundamental building block of the Xilinx UltraScale+ FPGA fabric. Each CLB contains one slice with:

Resource	Count per CLB	Capability
6-Input LUTs	8	Combinatorial logic, distributed RAM, shift registers
Flip-Flops	16	Storage registers with clock enable and reset
Carry Chain	8 bits	Fast arithmetic operations
Wide Muxes	Variable	Efficient multiplexer implementation

The 6-input LUTs can implement any Boolean function of up to 6 variables, or be configured as dual 5-input LUTs sharing common inputs. This flexibility enables efficient mapping of complex logic functions.
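
A 6-input LUT is simply a 64-entry truth table addressed by its inputs. The sketch below builds that table as a 64-bit integer, assuming the usual convention that I0 is the least significant address bit, which matches the INIT attribute format used for LUT6 primitives. The helper names are hypothetical.

```python
def lut6_init(func):
    """Build the 64-bit truth table for a 6-input Boolean function."""
    init = 0
    for address in range(64):
        inputs = [(address >> bit) & 1 for bit in range(6)]  # I0..I5
        if func(*inputs):
            init |= 1 << address
    return init


def lut6_eval(init, i0, i1, i2, i3, i4, i5):
    """Look up one output bit, exactly as the LUT hardware would."""
    address = i0 | i1 << 1 | i2 << 2 | i3 << 3 | i4 << 4 | i5 << 5
    return (init >> address) & 1


and6 = lut6_init(lambda a, b, c, d, e, f: a & b & c & d & e & f)
xor6 = lut6_init(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
print(f"{and6:016X}")  # 8000000000000000
print(f"{xor6:016X}")  # 6996966996696996
print(lut6_eval(xor6, 1, 0, 0, 0, 0, 0))  # 1
```

Any function of six variables costs the same single LUT, which is why logic depth in LUT levels is a better predictor of timing than gate count.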

LUTs in SLICEM-type CLBs can alternatively function as:

64×1 distributed RAM (single-port)
32×2 distributed RAM (dual-port)
32-bit shift register (SRL32)

DSP Slice Capabilities

The DSP48E2 slice in the UltraScale architecture provides significant digital signal processing capability:

Feature	Specification
Pre-adder	27-bit
Multiplier	27 × 18 bits
Accumulator	48-bit
Pattern Detector	48-bit
XOR Function	96-bit

A single DSP slice can perform a 27×18 multiply-accumulate operation in a single clock cycle at frequencies exceeding 700 MHz. Cascading multiple slices enables efficient FIR filters, matrix operations, and floating-point implementations.
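
For sizing fixed-point datapaths, it helps to model the multiply-accumulate at bit level. The sketch below is a behavioral approximation, not a model of the DSP48E2 primitive: it assumes signed two's-complement operands of 27 and 18 bits and a 48-bit accumulator that wraps on overflow. The function names are mine.

```python
def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b):
    """One multiply-accumulate step: 27-bit x 18-bit into 48 bits."""
    product = to_signed(a, 27) * to_signed(b, 18)
    return to_signed(acc + product, 48)


def dot(samples, coeffs):
    """Accumulate sample*coefficient pairs, as one FIR output would."""
    acc = 0
    for sample, coeff in zip(samples, coeffs):
        acc = mac(acc, sample, coeff)
    return acc


print(mac(0, 3, 4))                  # 12
print(dot([1, 2, 3], [4, 5, 6]))     # 32
print(mac(2**47 - 1, 1, 1))          # -140737488355328 (wraps)
```

The last line is the case to watch in long accumulations: a 27×18 product needs 45 bits, which leaves only a few guard bits before the 48-bit accumulator wraps.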

Block RAM and UltraRAM

The Xilinx UltraScale+ FPGA provides two types of dedicated memory blocks:

Block RAM (BRAM):

36 Kb capacity per block (configurable as 2×18 Kb)
True dual-port operation
Built-in FIFO logic
ECC support
Synchronous operation

UltraRAM:

288 Kb capacity per block
Dual-port with a single shared clock, 72-bit wide (4K × 72)
Single-cycle access at full speed
Can be cascaded for deeper memories
Independent power-down capability

UltraRAM represents a significant advancement for designs requiring large on-chip buffers. A single UltraRAM block replaces eight BRAM blocks while consuming less power and providing better timing characteristics.
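
A quick capacity estimate shows where the eight-to-one ratio matters. The sketch below counts blocks by raw bit capacity only; it ignores width and depth constraints, so real utilization reported by the tools will differ. The frame-buffer example is an illustration of mine, not a figure from a specific design.

```python
BRAM_BITS = 36 * 1024   # one 36 Kb block RAM
URAM_BITS = 288 * 1024  # one 288 Kb UltraRAM


def blocks_needed(total_bits, block_bits):
    """Round up to whole blocks (capacity only, no aspect-ratio limits)."""
    return -(-total_bits // block_bits)


frame_bits = 1920 * 1080 * 8  # one 1080p frame at 8 bits per pixel
print(blocks_needed(frame_bits, BRAM_BITS))  # 450
print(blocks_needed(frame_bits, URAM_BITS))  # 57
```

For a buffer this size, the BRAM option would consume a large share of the block RAM on a mid-range device, while the UltraRAM option leaves the BRAM free for FIFOs and small local memories.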

High-Speed Transceivers

The PL includes multiple transceiver types for high-speed serial communication:

Transceiver	Data Rate	Typical Applications
GTH	0.5–16.3 Gb/s	10G Ethernet, PCIe Gen3, Aurora
GTY	0.5–32.75 Gb/s	25G/100G Ethernet, PCIe Gen4

Each transceiver includes programmable equalizers, clock recovery, and protocol-specific features. The transceivers operate independently of the PS transceivers, enabling the PL to implement custom high-speed interfaces.

PS-PL Interface Architecture

The interface between Processing System and Programmable Logic defines system performance and determines which designs are feasible. The Zynq UltraScale+ ARM platform provides multiple interface types optimized for different traffic patterns.

AXI Interface Types

Interface	Width	Purpose	Bandwidth
HPM (High Performance Master)	32/64/128-bit	PS master, PL slave	~5 GB/s each
HPC (High Performance Coherent)	32/64/128-bit	PL master with cache coherency	~5 GB/s each
HP (High Performance)	32/64/128-bit	PL master to DDR, non-coherent	~5 GB/s each
LPD (Low Power Domain)	32/64/128-bit	LPD peripherals to PL	~2 GB/s
ACP (Accelerator Coherency Port)	128-bit	PL coherent access to APU caches	~5 GB/s

The theoretical aggregate PS-PL bandwidth is the sum of these per-port figures across every port the device provides, though practical designs rarely approach this maximum.
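
The per-port figures in the table follow from bus width multiplied by interface clock. The sketch below reproduces the roughly 5 GB/s number, assuming a 128-bit port clocked at 333 MHz. That clock rate is my assumption; check the data sheet for your speed grade. The result is a per-direction peak with no protocol overhead.

```python
ASSUMED_CLOCK_HZ = 333_000_000  # assumption: AXI interface clock


def axi_peak_bytes_per_s(width_bits, clock_hz=ASSUMED_CLOCK_HZ):
    """Peak transfer rate of one AXI data channel, one beat per cycle."""
    return width_bits // 8 * clock_hz


def aggregate_bytes_per_s(port_widths, clock_hz=ASSUMED_CLOCK_HZ):
    """Sum the peak rates of the ports a design actually uses."""
    return sum(axi_peak_bytes_per_s(width, clock_hz) for width in port_widths)


print(axi_peak_bytes_per_s(128) / 1e9)           # 5.328 GB/s per port
print(aggregate_bytes_per_s([128] * 4) / 1e9)    # 21.312 GB/s for four ports
```

Comparing the aggregate for the ports you plan to use against the bandwidth of your DDR configuration is a quick way to see which side will be the bottleneck.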

Choosing the Right Interface

Selecting appropriate interfaces significantly impacts system performance:

Use HPM when:

PS software initiates data transfers
PL contains register-based peripherals
The design can tolerate the latency of software-driven register access

Use HPC when:

PL accelerator operates on cached data structures
Software and hardware share memory regions
Cache coherency eliminates explicit cache maintenance

Use HP when:

PL requires direct DDR access
Maximum bandwidth is the priority
Cache coherency overhead is unacceptable

Use ACP when:

PL needs coherent cache access
Data fits in APU caches
Latency is more important than bandwidth
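
The selection rules above can be condensed into a small decision helper. This is only a restatement of the guidance in this section, with hypothetical parameter names, and real designs often mix several interface types.

```python
def choose_interface(pl_is_master, needs_coherency=False,
                     fits_in_apu_cache=False, latency_critical=False):
    """Suggest a PS-PL AXI interface from the criteria in this section."""
    if not pl_is_master:
        return "HPM"  # PS software drives the transfer
    if needs_coherency:
        if fits_in_apu_cache and latency_critical:
            return "ACP"  # small, latency-sensitive coherent accesses
        return "HPC"      # shared data structures, coherent with caches
    return "HP"           # bulk DDR traffic, software manages caches


print(choose_interface(pl_is_master=False))                        # HPM
print(choose_interface(pl_is_master=True))                         # HP
print(choose_interface(pl_is_master=True, needs_coherency=True))   # HPC
print(choose_interface(pl_is_master=True, needs_coherency=True,
                       fits_in_apu_cache=True, latency_critical=True))  # ACP
```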

Clock Domain Considerations

The PS and PL operate in separate clock domains, requiring careful synchronization at interfaces. The PS generates several PL reference clocks (PL0-PL3) configurable from 100 MHz to over 300 MHz, but PL designs may use independent clocking when required.

Clock domain crossing between PS and PL occurs automatically within the AXI infrastructure, but designers must understand that:

AXI transactions include handshaking that accommodates clock differences
Maximum interface frequency depends on both PS and PL clock rates
Asynchronous clock domains add latency to transactions

Essential Resources for Zynq UltraScale+ ARM Development

These resources support development on the Zynq UltraScale+ ARM platform.

Official AMD Documentation

Document	Number	Description
Technical Reference Manual	UG1085	Comprehensive architecture reference (1800+ pages)
Software Developer Guide	UG1137	Software development information
Register Reference	UG1087	Complete register definitions
Datasheet Overview	DS891	Device specifications
PCB Design Guide	UG583	Board design guidelines

ARM Architecture References

Document	Description
ARM Cortex-A53 TRM	Core architecture and features
ARM Cortex-R5 TRM	Real-time processor details
ARMv8-A Architecture Reference	Complete instruction set reference
ARM AMBA AXI Protocol	Interface specifications

Download Links

Resource	URL
Vivado Design Suite	https://www.xilinx.com/support/download.html
Vitis Platform	https://www.xilinx.com/products/design-tools/vitis.html
PetaLinux Tools	https://www.xilinx.com/products/design-tools/embedded-software/petalinux-sdk.html
ARM Documentation	https://developer.arm.com/documentation

Frequently Asked Questions

What is the difference between Cortex-A53 and Cortex-R5 in Zynq UltraScale+?

The Zynq UltraScale+ ARM Cortex-A53 is a 64-bit application processor designed for running operating systems like Linux. It features virtual memory (MMU), multi-level caches, and TrustZone security. The Cortex-R5 is a 32-bit real-time processor optimized for deterministic, low-latency tasks. It uses an MPU instead of MMU, provides tightly coupled memory for guaranteed access timing, and supports lockstep operation for safety-critical applications. Use the A53 for complex software; use the R5 for time-critical control loops.

Can the Cortex-A53 and Cortex-R5 run simultaneously?

Yes, the APU and RPU operate independently and can run simultaneously. This enables powerful system architectures where Linux handles networking, user interface, and complex algorithms on the A53 cores while real-time control loops execute on the R5 cores with guaranteed timing. Inter-processor communication uses shared memory regions, hardware mailboxes, or software-defined protocols. The OpenAMP framework provides standard mechanisms for AMP systems.

How does UltraRAM differ from Block RAM in the Xilinx UltraScale+ FPGA?

UltraRAM provides 288 Kb per block versus 36 Kb for Block RAM—eight times the density. UltraRAM is optimized for large buffers and can be independently powered down for energy savings. Block RAM offers more flexible configurations (aspect ratios, ECC options, FIFO modes) and is distributed throughout the fabric for better timing to nearby logic. Use UltraRAM for large memories where a single wide interface suffices; use Block RAM for smaller, distributed memories requiring specific features.

What software can run on each processor in the Zynq UltraScale+?

The Cortex-A53 APU supports Linux, FreeRTOS, VxWorks, QNX, bare-metal applications, and hypervisors like Xen. The Cortex-R5 RPU supports FreeRTOS, SafeRTOS, bare-metal applications, and other RTOSes that don’t require virtual memory. The Mali-400 GPU supports OpenGL ES 1.1 and 2.0 graphics APIs. Typical production systems run Linux on the APU for application software and FreeRTOS or bare-metal on the RPU for real-time functions.

How do I decide what functionality belongs in the FPGA versus the ARM processors?

Place functionality in the Xilinx UltraScale+ FPGA programmable logic when you need: parallel processing beyond what ARM cores provide, precise timing control, custom interfaces not available in the PS, or hardware acceleration of compute-intensive algorithms. Place functionality in the ARM processors when you need: complex decision logic, operating system services, networking stacks, file systems, or rapid development without HDL. The PS-PL interface bandwidth supports moving data between domains, so the decision centers on which processing element best handles each workload rather than data locality constraints.

Architecting Effective Zynq UltraScale+ Systems

The Zynq UltraScale+ ARM architecture provides remarkable flexibility but demands thoughtful system partitioning. Success requires understanding each processing element’s strengths:

Use the Cortex-A53 APU for:

Operating system services
Complex algorithms and decision logic
Network protocol stacks
User interface and display management
File system and storage management

Use the Cortex-R5 RPU for:

Motor control loops
Safety monitoring functions
Interrupt-driven I/O handling
Real-time protocol processing
Functions requiring lockstep redundancy

Use the Xilinx UltraScale+ FPGA PL for:

Custom interface implementations
Hardware acceleration of parallel algorithms
High-speed signal processing
Precise timing generation
Functions requiring determinism beyond software capability

The PS-PL interfaces enable efficient data movement between domains, making the partitioning decision about capability rather than connectivity. Start with clear requirements for timing, throughput, and functionality, then map each function to the most appropriate processing element.

Interrupt Architecture and Management

The Zynq UltraScale+ ARM interrupt system deserves careful attention as it directly impacts real-time performance and system responsiveness.

Generic Interrupt Controller (GIC)

The APU uses ARM’s GIC-400, a GICv2 implementation supporting:

Feature	Specification
Shared Peripheral Interrupts (SPI)	160
Private Peripheral Interrupts (PPI)	16 per core
Software Generated Interrupts (SGI)	16
Priority Levels	32
Security States	Secure and Non-Secure

The GIC provides interrupt prioritization, routing to specific cores, and security partitioning. Properly configuring interrupt affinities—which core handles which interrupt—significantly impacts system performance and real-time response.
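
Interrupt affinity in a GICv2 controller comes down to one byte per interrupt in the distributor's target registers. The sketch below classifies an interrupt ID and computes where its target byte lives, using the GICv2 layout as I understand it (SGIs 0 to 15, PPIs 16 to 31, SPIs from 32, target registers starting at distributor offset 0x800). The helper names are mine, and ID 121 is just an example SPI.

```python
GICD_ITARGETSR_OFFSET = 0x800  # first target register in the distributor


def classify_interrupt(intid, num_spi=160):
    """Return the GICv2 interrupt type for an interrupt ID."""
    if 0 <= intid <= 15:
        return "SGI"
    if 16 <= intid <= 31:
        return "PPI"
    if 32 <= intid < 32 + num_spi:
        return "SPI"
    raise ValueError(f"interrupt ID {intid} out of range")


def itargetsr_location(intid):
    """Return (register offset, bit shift) of the 8-bit target field."""
    return GICD_ITARGETSR_OFFSET + 4 * (intid // 4), 8 * (intid % 4)


def target_mask(cores):
    """Build the CPU target bitmask: bit n set routes to core n."""
    return sum(1 << core for core in set(cores))


offset, shift = itargetsr_location(121)
print(classify_interrupt(121), hex(offset), shift)  # SPI 0x878 8
print(bin(target_mask([0, 2])))                     # 0b101
```

Under Linux the same routing is normally set through the interrupt affinity interface rather than by writing these registers directly; the arithmetic is mainly useful for bare-metal code and for reading register dumps.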

RPU Interrupt Handling

The Cortex-R5 cores use a separate interrupt controller, an ARM PL390 (GICv1) dedicated to the RPU, with similar capabilities but independent configuration. This isolation ensures that APU interrupt load doesn’t affect RPU response times.

Key RPU interrupt characteristics:

Vectored interrupt controller with configurable priority
Fast interrupt (FIQ) path for lowest-latency handlers
Nested interrupt support for priority preemption
Interrupt latency under 20 cycles from assertion to handler entry
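
Converting the cycle count to time puts the latency figure in context. The sketch below uses the 20-cycle figure and the RPU clock rates quoted in this article; it covers only handler entry and says nothing about the time your handler code then takes.

```python
def cycles_to_ns(cycles, clock_hz):
    """Convert a cycle count to nanoseconds at a given clock rate."""
    return cycles * 1e9 / clock_hz


print(round(cycles_to_ns(20, 600_000_000), 1))  # 33.3 ns at 600 MHz
print(round(cycles_to_ns(20, 533_000_000), 1))  # 37.5 ns at 533 MHz
```

Both results sit well under a microsecond, which leaves most of a typical control-loop budget for the handler itself.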

PL-to-PS Interrupt Routing

The programmable logic can generate interrupts to both the APU and RPU through dedicated interrupt lines:

Interrupt Group	Destination	Count
PL-PS Group 0	APU (IRQ)	8
PL-PS Group 1	APU (IRQ)	8
PL-RPU	RPU	2

PL-generated interrupts enable hardware accelerators to signal completion, custom peripherals to request service, and external events to trigger software responses. Proper interrupt design minimizes latency between PL event occurrence and software handler execution.

Power Domain Architecture

The Zynq UltraScale+ ARM architecture implements sophisticated power management through independent power domains:

Power Domain Organization

Domain	Contents	Typical Power State
Full Power Domain (FPD)	APU, GPU, DisplayPort, SATA, PCIe	Active during application processing
Low Power Domain (LPD)	RPU, USB, Ethernet, IOU	Can remain active when FPD sleeps
PL Power Domain	Entire FPGA fabric	Independent control
Battery Power Domain	RTC, minimal logic	Always on (nanoamps)

This partitioning enables sophisticated power management strategies:

FPD Off Mode: Disable APU and high-speed peripherals while RPU maintains real-time functions
Deep Sleep: Only battery domain active, microseconds wake time
PL Power Gating: Power down the programmable logic when the fabric is not needed

Platform Management Unit

The Platform Management Unit (PMU) orchestrates power management using a dedicated MicroBlaze processor with:

32 KB ROM for boot code
128 KB RAM for runtime firmware
Access to all power control registers
System monitoring capabilities

The PMU firmware handles power state transitions, monitors system health, and manages the boot sequence. Customizing PMU firmware enables application-specific power optimization.

Understanding this architecture deeply enables designs that leverage all processing resources effectively—achieving performance and determinism impossible with any single processor type alone.
