Contact Sales & After-Sales Service

Contact & Quotation

  • Inquire: Call 0086-755-23203480, or reach out via the form below/your sales contact to discuss our design, manufacturing, and assembly capabilities.
  • Quote: Email your PCB files to Sales@pcbsync.com (Preferred for large files) or submit online. We will contact you promptly. Please ensure your email is correct.

Notes:
For PCB fabrication, we require PCB design file in Gerber RS-274X format (most preferred), *.PCB/DDB (Protel, inform your program version) format or *.BRD (Eagle) format. For PCB assembly, we require PCB design file in above mentioned format, drilling file and BOM. Click to download BOM template To avoid file missing, please include all files into one folder and compress it into .zip or .rar format.

Zynq UltraScale+ Architecture: Cortex-A53, R5 & FPGA Explained

Understanding the Zynq UltraScale+ ARM architecture requires digging deeper than marketing datasheets. After spending considerable time bringing up custom boards and debugging PS-PL interactions, I’ve learned that this platform’s power comes from understanding how its heterogeneous processing elements work together—and when each should handle specific workloads.

This technical guide examines the Xilinx UltraScale+ FPGA architecture in detail, covering the ARM Cortex-A53 application processor, Cortex-R5 real-time processor, and programmable logic fabric. Whether you’re architecting a new system or optimizing an existing design, this deep dive provides the technical foundation needed for effective hardware-software partitioning.

The Heterogeneous Processing Architecture

The Zynq UltraScale+ MPSoC fundamentally differs from both traditional processors and standalone FPGAs through its heterogeneous multiprocessing approach. Rather than forcing all workloads onto a single processing element, the architecture provides multiple specialized engines optimized for different tasks.

The Zynq UltraScale+ ARM processing system contains:

  • Application Processing Unit (APU): ARM Cortex-A53 cores for Linux and complex applications
  • Real-time Processing Unit (RPU): ARM Cortex-R5F cores for deterministic processing
  • Graphics Processing Unit (GPU): ARM Mali-400 MP2 for display and OpenGL ES (EG/EV variants)
  • Platform Management Unit (PMU): Dedicated MicroBlaze for system management

The Programmable Logic (PL) provides Xilinx UltraScale+ FPGA fabric for custom hardware acceleration and I/O interfaces. The combination enables system designers to assign workloads to the most appropriate processing element rather than compromising with a one-size-fits-all approach.

ARM Cortex-A53 Application Processing Unit

The Application Processing Unit (APU) centers on ARM Cortex-A53 cores implementing the ARMv8-A architecture. This represents a fundamental advancement over the Zynq-7000’s Cortex-A9 cores—64-bit addressing, improved efficiency, and modern instruction set extensions.

Cortex-A53 Core Architecture

The Cortex-A53 is a mid-range ARMv8-A processor optimized for power efficiency while delivering strong single-threaded performance. Key architectural features include:

| Feature | Specification |
| --- | --- |
| Architecture | ARMv8-A (AArch64 and AArch32) |
| Pipeline | 8-stage, in-order |
| Issue Width | Dual-issue |
| Clock Frequency | Up to 1.5 GHz (EG/EV), 1.3 GHz (CG) |
| L1 I-Cache | 32 KB per core |
| L1 D-Cache | 32 KB per core |
| L2 Cache | 1 MB shared (configurable) |
| SIMD | NEON, 128-bit |
| Floating Point | VFPv4, double precision |

The in-order pipeline may seem like a limitation compared to out-of-order designs like the Cortex-A72, but it provides predictable execution timing crucial for embedded systems. The dual-issue capability means two instructions can execute per cycle when dependencies allow.

APU Configuration Options

Depending on the device variant, the APU provides different core counts:

  • CG devices: Dual-core Cortex-A53 at up to 1.3 GHz
  • EG devices: Quad-core Cortex-A53 at up to 1.5 GHz
  • EV devices: Quad-core Cortex-A53 at up to 1.5 GHz

All configurations support both Symmetric Multiprocessing (SMP) where Linux manages all cores as a unified pool, and Asymmetric Multiprocessing (AMP) where individual cores or core pairs run separate operating systems or bare-metal code.

Memory System and Cache Hierarchy

The APU memory system significantly impacts application performance. Understanding its structure helps optimize software for this platform.

| Memory Level | Size | Characteristics |
| --- | --- | --- |
| L1 I-Cache | 32 KB/core | 2-way set associative |
| L1 D-Cache | 32 KB/core | 4-way set associative, write-back |
| L2 Cache | 1 MB shared | 16-way set associative, unified |
| Snoop Control Unit | N/A | Maintains coherency across cores |

The Snoop Control Unit (SCU) maintains cache coherency between cores, ensuring that when one core modifies shared data, other cores see the updated values. This hardware coherency simplifies software development but adds complexity when interfacing with DMA-capable peripherals in the programmable logic.
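
To make the cache geometry in the table concrete, the short Python sketch below derives the number of sets in each cache. It assumes the Cortex-A53's 64-byte cache line and treats every cache as physically indexed, which is a simplification. The `Cache` class and its names are illustrative only.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # Cortex-A53 cache line size (assumed for all levels here)


@dataclass(frozen=True)
class Cache:
    name: str
    size_bytes: int
    ways: int

    @property
    def sets(self) -> int:
        # sets = capacity / (associativity * line size)
        return self.size_bytes // (self.ways * LINE_BYTES)

    def set_index(self, addr: int) -> int:
        # Drop the line-offset bits, then wrap to the number of sets.
        return (addr // LINE_BYTES) % self.sets


L1I = Cache("L1 I-Cache", 32 * 1024, 2)
L1D = Cache("L1 D-Cache", 32 * 1024, 4)
L2 = Cache("L2 Cache", 1024 * 1024, 16)

if __name__ == "__main__":
    for cache in (L1I, L1D, L2):
        print(f"{cache.name}: {cache.sets} sets x {cache.ways} ways")
```

Under these assumptions the L1 D-cache has 128 sets, so buffers whose addresses differ by a multiple of 8 KB compete for the same set. That is worth keeping in mind when laying out data for hot loops.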

TrustZone Security Extensions

The Cortex-A53 implements ARM TrustZone technology, partitioning system resources into Secure and Non-Secure worlds. This hardware-enforced isolation enables:

  • Secure boot verification
  • Protected key storage
  • Isolated security processing
  • Trusted execution environments

TrustZone divides execution into Secure and Non-Secure states. The Secure Monitor runs at Exception Level 3 (EL3), above the hypervisor level (EL2) and operating system level (EL1), and controls every switch between the two states. The ARM Trusted Firmware (ATF) typically provides this EL3 monitor and manages secure world operations.


ARM Cortex-R5F Real-Time Processing Unit

The Real-time Processing Unit (RPU) addresses deterministic processing requirements that the Linux-running APU cannot satisfy. The dual Cortex-R5F cores provide sub-microsecond interrupt response times essential for motor control, safety monitoring, and real-time I/O handling.

Cortex-R5F Architecture Details

The Cortex-R5F is a 32-bit processor from ARM’s real-time family, optimized for low-latency, predictable execution:

| Feature | Specification |
| --- | --- |
| Architecture | ARMv7-R |
| Pipeline | 8-stage, dual-issue |
| Clock Frequency | Up to 600 MHz (EG/EV), 533 MHz (CG) |
| TCM (Tightly Coupled Memory) | 128 KB per core (ATCM + BTCM) |
| L1 I-Cache | 32 KB |
| L1 D-Cache | 32 KB |
| MPU Regions | 16 |
| ECC Support | Full ECC on TCM and caches |

The “F” suffix indicates floating-point support via the VFPv3 extension, enabling efficient signal processing without software emulation.

Operating Modes: Split vs. Lockstep

The RPU supports two distinct operating configurations that fundamentally change its behavior:

Split Mode: Both Cortex-R5F cores operate independently, each running its own code. This doubles processing throughput and enables parallel real-time tasks. In split mode:

  • Core 0 and Core 1 have separate TCMs
  • Each core handles independent interrupt vectors
  • No redundancy; single-core failure affects only that core’s functions

Lockstep Mode: Both cores execute identical instructions simultaneously, with hardware comparing results on every cycle. Any mismatch indicates a fault and triggers an error response. In lockstep mode:

  • Single logical processor with hardware redundancy
  • Supports functional-safety targets such as ISO 26262 ASIL-C on automotive XA devices
  • Half the processing throughput of split mode
  • Automatic fault detection without software overhead

Lockstep mode proves essential for functional safety applications where processor failures could cause system hazards. The ISO 26262 ASIL-C certification for automotive XA devices relies heavily on this capability.
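
The lockstep idea can be illustrated with a small behavioral model in Python. This is only a sketch: real hardware compares the cores' outputs in dedicated logic, whereas here a hypothetical `run_lockstep` helper simply evaluates the same step function twice and compares the results. The `inject_fault_at` parameter is an invented test hook that flips one bit in the checker result.

```python
from typing import Callable, Iterable, List, Optional


class LockstepFault(Exception):
    """Raised when the primary and checker results disagree."""


def run_lockstep(
    step: Callable[[int], int],
    inputs: Iterable[int],
    inject_fault_at: Optional[int] = None,
) -> List[int]:
    outputs = []
    for cycle, value in enumerate(inputs):
        primary = step(value)   # result from the primary core
        checker = step(value)   # result from the redundant core
        if cycle == inject_fault_at:
            checker ^= 1        # emulate a single-bit transient fault
        if primary != checker:
            raise LockstepFault(f"mismatch at step {cycle}")
        outputs.append(primary)
    return outputs


if __name__ == "__main__":
    print(run_lockstep(lambda x: x * 2, [1, 2, 3]))
```

The model also shows the throughput cost: every step is computed twice, yet only one stream of results is produced.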

TCM and Memory Architecture

The Tightly Coupled Memory (TCM) provides deterministic, single-cycle access for time-critical code and data:

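| Memory | Size | Purpose |
| --- | --- | --- |
| ATCM (Core 0) | 64 KB | Typically instruction memory, deterministic fetch |
| BTCM (Core 0) | 64 KB | Typically data memory, deterministic access |
| ATCM (Core 1) | 64 KB | Private to Core 1 in split mode; combined with the Core 0 TCM in lockstep mode |
| BTCM (Core 1) | 64 KB | Private to Core 1 in split mode; combined with the Core 0 TCM in lockstep mode |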

Unlike cached memory where access times vary based on cache hits/misses, TCM provides consistent timing essential for hard real-time systems. Critical interrupt handlers and control loops should execute from TCM whenever possible.
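
One way to apply this advice is to budget TCM space before linking. The Python sketch below walks a priority-ordered list of code sections and reports which ones fit in a single 64 KB bank. The section names and sizes are hypothetical, and a real project would enforce the placement in the linker script rather than in a script like this.

```python
from typing import Dict, List, Tuple

TCM_BANK_BYTES = 64 * 1024  # size of one ATCM or BTCM bank


def place_in_tcm(
    sections: Dict[str, int], capacity: int = TCM_BANK_BYTES
) -> Tuple[List[str], List[str]]:
    """Place sections in the given (priority) order; return (placed, spilled)."""
    placed, spilled = [], []
    remaining = capacity
    for name, size in sections.items():
        if size <= remaining:
            placed.append(name)
            remaining -= size
        else:
            spilled.append(name)  # must live in cached DDR or OCM instead
    return placed, spilled


# Hypothetical section sizes, most time-critical first.
EXAMPLE_SECTIONS = {
    "irq_vectors": 1 * 1024,
    "control_loop": 24 * 1024,
    "adc_isr": 8 * 1024,
    "logging": 48 * 1024,
}

if __name__ == "__main__":
    print(place_in_tcm(EXAMPLE_SECTIONS))
```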

Memory Protection Unit

The Cortex-R5F uses a Memory Protection Unit (MPU) rather than a full Memory Management Unit (MMU). The MPU provides:

  • 16 configurable protection regions
  • Access permission control (read/write/execute)
  • Memory type attributes (cacheable, bufferable, shareable)
  • No virtual-to-physical address translation

This approach suits real-time applications where address translation latency would be unacceptable, but it means the RPU cannot run Linux or other operating systems requiring virtual memory.
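
The following Python sketch models how an ARMv7-R style MPU resolves an access, as a rough illustration of the list above. It relies on three properties of that MPU: region sizes are powers of two, each base address is aligned to the region size, and a higher-numbered region takes priority where regions overlap. The `Region` class, the `lookup` helper, and the example addresses are illustrative and not taken from any Xilinx API.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_REGIONS = 16
MIN_REGION_BYTES = 32


@dataclass(frozen=True)
class Region:
    base: int
    size: int
    writable: bool
    executable: bool

    def __post_init__(self) -> None:
        if self.size < MIN_REGION_BYTES or self.size & (self.size - 1):
            raise ValueError("region size must be a power of two >= 32 bytes")
        if self.base % self.size:
            raise ValueError("region base must be aligned to its size")

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


def lookup(regions: List[Region], addr: int) -> Optional[Region]:
    """Return the matching region; the highest-numbered match wins."""
    if len(regions) > MAX_REGIONS:
        raise ValueError("the MPU provides only 16 regions")
    for region in reversed(regions):
        if region.contains(addr):
            return region
    return None  # no region matches: the access would fault
```

Note that the lookup returns attributes for the address as given. There is no translation step, which is the property the text above describes.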

Xilinx UltraScale+ FPGA Programmable Logic

The Programmable Logic (PL) region contains Xilinx UltraScale+ FPGA fabric based on the UltraScale architecture—not the 7-series fabric found in Zynq-7000. This architectural advancement provides improved timing characteristics, enhanced DSP capabilities, and the addition of UltraRAM blocks.

CLB Architecture

The Configurable Logic Block (CLB) is the fundamental building block of the Xilinx UltraScale+ FPGA fabric. Each CLB contains one slice with:

| Resource | Count per CLB | Capability |
| --- | --- | --- |
| 6-Input LUTs | 8 | Combinatorial logic, distributed RAM, shift registers |
| Flip-Flops | 16 | Storage registers with clock enable and reset |
| Carry Chain | 8 bits | Fast arithmetic operations |
| Wide Muxes | Variable | Efficient multiplexer implementation |

The 6-input LUTs can implement any Boolean function of up to 6 variables, or be configured as dual 5-input LUTs sharing common inputs. This flexibility enables efficient mapping of complex logic functions.

LUTs in SLICEM-type CLBs can alternatively function as:

  • 64×1 distributed RAM (single-port)
  • 32×2 distributed RAM (dual-port)
  • 32-bit shift register (SRL32)
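
A 6-input LUT is simply a 64-entry truth table, which a few lines of Python can model. In this sketch `lut6_init` builds the 64-bit initialization value for an arbitrary Boolean function and `lut6_eval` looks up one entry. Both helper names are made up for illustration, and input 0 is assumed to be the least significant address bit.

```python
from typing import Callable


def lut6_init(func: Callable[..., int]) -> int:
    """Build the 64-bit truth table for a function of six 1-bit inputs."""
    init = 0
    for index in range(64):
        bits = [(index >> i) & 1 for i in range(6)]
        if func(*bits):
            init |= 1 << index
    return init


def lut6_eval(init: int, *inputs: int) -> int:
    """Look up one truth-table entry; inputs[0] is the least significant bit."""
    if len(inputs) != 6:
        raise ValueError("a LUT6 takes exactly six inputs")
    index = sum((bit & 1) << i for i, bit in enumerate(inputs))
    return (init >> index) & 1


def parity(a: int, b: int, c: int, d: int, e: int, f: int) -> int:
    return a ^ b ^ c ^ d ^ e ^ f


if __name__ == "__main__":
    print(hex(lut6_init(parity)))
```

Any function of six inputs costs the same single LUT, whether it is a 6-input AND or a 6-input parity tree. That is why logic depth in LUT levels matters more for timing than gate count does.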

DSP Slice Capabilities

The DSP48E2 slice in the UltraScale architecture provides significant digital signal processing capability:

| Feature | Specification |
| --- | --- |
| Pre-adder | 27-bit |
| Multiplier | 27 × 18 bits |
| Accumulator | 48-bit |
| Pattern Detector | 48-bit |
| XOR Function | 96-bit |

When fully pipelined, a single DSP slice can complete one 27×18 multiply-accumulate operation per clock cycle at frequencies exceeding 700 MHz. Cascading multiple slices enables efficient FIR filters, matrix operations, and floating-point implementations.
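
As a rough software model of what a chain of DSP slices computes, the Python sketch below implements a multiply-accumulate with the operand widths from the table (27-bit and 18-bit signed inputs, 48-bit wrapping accumulator) and uses it in a direct-form FIR filter. The function names are illustrative. The model ignores pipelining and the pre-adder, so it shows the arithmetic only, not the timing.

```python
from typing import List

A_BITS, B_BITS, ACC_BITS = 27, 18, 48


def check_signed(value: int, bits: int) -> None:
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {bits} signed bits")


def wrap_signed(value: int, bits: int) -> int:
    """Wrap to two's complement, as a fixed-width accumulator would."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc: int, a: int, b: int) -> int:
    check_signed(a, A_BITS)
    check_signed(b, B_BITS)
    return wrap_signed(acc + a * b, ACC_BITS)


def fir(samples: List[int], coeffs: List[int]) -> List[int]:
    history = [0] * len(coeffs)
    out = []
    for sample in samples:
        history = [sample] + history[:-1]   # shift in the newest sample
        acc = 0
        for x, c in zip(history, coeffs):   # one MAC per tap
            acc = mac(acc, x, c)
        out.append(acc)
    return out
```

In hardware each tap would map to its own slice, so all the MACs in the inner loop run in parallel rather than sequentially.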

Block RAM and UltraRAM

The Xilinx UltraScale+ FPGA provides two types of dedicated memory blocks:

Block RAM (BRAM):

  • 36 Kb capacity per block (configurable as 2×18 Kb)
  • True dual-port operation
  • Built-in FIFO logic
  • ECC support
  • Synchronous operation

UltraRAM:

  • 288 Kb capacity per block
  • Dual-port with a single shared clock, 72 bits wide
  • Single-cycle access at full speed
  • Can be cascaded for deeper memories
  • Independent power-down capability

UltraRAM represents a significant advancement for designs requiring large on-chip buffers. A single UltraRAM block replaces eight BRAM blocks while consuming less power and providing better timing characteristics.
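
A quick capacity estimate shows how the two memory types compare for a large buffer. The Python sketch below counts the blocks needed to hold one 1920×1080 frame of 8-bit pixels, a buffer size chosen purely as an example. It is a capacity-only estimate that ignores width and depth configuration limits and treats every bit in a block as usable data.

```python
BRAM_BITS = 36 * 1024    # 36 Kb Block RAM
URAM_BITS = 288 * 1024   # 288 Kb UltraRAM


def blocks_needed(buffer_bits: int, block_bits: int) -> int:
    """Ceiling division: whole blocks required to hold the buffer."""
    return -(-buffer_bits // block_bits)


FRAME_BITS = 1920 * 1080 * 8  # example buffer: one 8-bit 1080p frame

if __name__ == "__main__":
    print("BRAM blocks:", blocks_needed(FRAME_BITS, BRAM_BITS))
    print("UltraRAM blocks:", blocks_needed(FRAME_BITS, URAM_BITS))
```

For this example the estimate is 450 Block RAMs versus 57 UltraRAM blocks, which matches the eight-to-one density ratio.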

High-Speed Transceivers

The PL includes multiple transceiver types for high-speed serial communication:

| Transceiver | Data Rate | Typical Applications |
| --- | --- | --- |
| GTH | 0.5–16.3 Gb/s | 10G Ethernet, PCIe Gen3, Aurora |
| GTY | 0.5–32.75 Gb/s | 25G/100G Ethernet, PCIe Gen4 |

Each transceiver includes programmable equalizers, clock recovery, and protocol-specific features. The transceivers operate independently of the PS transceivers, enabling the PL to implement custom high-speed interfaces.
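
When sizing a link, keep in mind that the line rate a transceiver runs at is higher than the payload rate, because of line encoding. The Python sketch below applies the standard encoding ratios to two protocols from the table: 10G Ethernet at its 10.3125 Gb/s line rate with 64b/66b encoding, and PCIe Gen3 at 8 GT/s per lane with 128b/130b encoding. The helper name and dictionary are illustrative.

```python
from fractions import Fraction

ENCODING_EFFICIENCY = {
    "8b/10b": Fraction(8, 10),
    "64b/66b": Fraction(64, 66),
    "128b/130b": Fraction(128, 130),
}


def payload_rate_gbps(line_rate_gbps: float, encoding: str) -> float:
    """Payload bit rate left after removing line-encoding overhead."""
    return float(Fraction(line_rate_gbps) * ENCODING_EFFICIENCY[encoding])


if __name__ == "__main__":
    print(payload_rate_gbps(10.3125, "64b/66b"))   # 10G Ethernet lane
    print(payload_rate_gbps(8.0, "128b/130b"))     # PCIe Gen3 lane
```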

PS-PL Interface Architecture

The interface between Processing System and Programmable Logic defines system performance and determines which designs are feasible. The Zynq UltraScale+ ARM platform provides multiple interface types optimized for different traffic patterns.

AXI Interface Types

| Interface | Width | Purpose | Bandwidth |
| --- | --- | --- | --- |
| HPM (High Performance Master) | 32/64/128-bit | PS master, PL slave | ~5 GB/s each |
| HPC (High Performance Coherent) | 32/64/128-bit | PL master with cache coherency | ~5 GB/s each |
| HP (High Performance) | 32/64/128-bit | PL master to DDR, non-coherent | ~5 GB/s each |
| LPD (Low Power Domain) | 32/64/128-bit | PL master into the LPD, for low-latency access to LPD memory and peripherals | ~2 GB/s |
| ACP (Accelerator Coherency Port) | 128-bit | PL coherent access to APU caches | ~5 GB/s |

The aggregate PS-PL bandwidth exceeds 150 GB/s when all interfaces are utilized, though practical designs rarely approach this theoretical maximum.
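
The per-port figures in the table follow from simple arithmetic: bus width multiplied by clock rate. The Python sketch below works this out for one direction of one port. The 333 MHz interface clock is an assumption chosen for illustration, since the achievable clock depends on speed grade and on the PL design.

```python
def axi_peak_gbytes_per_s(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak for one direction of one AXI port, in GB/s."""
    bytes_per_beat = width_bits / 8
    return bytes_per_beat * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    # Assumed 333 MHz interface clock; substitute the clock of your design.
    for width in (32, 64, 128):
        print(width, "bit:", round(axi_peak_gbytes_per_s(width, 333), 3), "GB/s")
```

With these assumptions a 128-bit port peaks at about 5.3 GB/s per direction. Sustained throughput is lower once burst length, arbitration, and DDR efficiency are taken into account.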

Choosing the Right Interface

Selecting appropriate interfaces significantly impacts system performance:

Use HPM when:

  • PS software initiates data transfers
  • PL contains register-based peripherals
  • Latency tolerance exists for software polling

Use HPC when:

  • PL accelerator operates on cached data structures
  • Software and hardware share memory regions
  • Cache coherency eliminates explicit cache maintenance

Use HP when:

  • PL requires direct DDR access
  • Maximum bandwidth is priority
  • Cache coherency overhead is unacceptable

Use ACP when:

  • PL needs coherent cache access
  • Data fits in APU caches
  • Latency is more important than bandwidth
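
The selection rules above can be condensed into a small decision function. The Python sketch below is one possible encoding of those bullet lists, not an official selection procedure. The `choose_port` name and its four Boolean parameters are invented for illustration, and real designs often combine several ports.

```python
from enum import Enum


class Port(Enum):
    HPM = "HPM"
    HPC = "HPC"
    HP = "HP"
    ACP = "ACP"


def choose_port(
    ps_initiates: bool,
    needs_coherency: bool,
    fits_in_cache: bool = False,
    latency_over_bandwidth: bool = False,
) -> Port:
    if ps_initiates:
        return Port.HPM   # software drives register-style accesses into the PL
    if not needs_coherency:
        return Port.HP    # PL masters DDR directly for maximum bandwidth
    if fits_in_cache and latency_over_bandwidth:
        return Port.ACP   # small, latency-sensitive coherent transfers
    return Port.HPC       # coherent sharing of larger memory regions


if __name__ == "__main__":
    print(choose_port(ps_initiates=False, needs_coherency=True))
```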

Clock Domain Considerations

The PS and PL operate in separate clock domains, requiring careful synchronization at interfaces. The PS generates several PL reference clocks (PL0-PL3) configurable from 100 MHz to over 300 MHz, but PL designs may use independent clocking when required.

Clock domain crossing between PS and PL occurs automatically within the AXI infrastructure, but designers must understand that:

  • AXI transactions include handshaking that accommodates clock differences
  • Maximum interface frequency depends on both PS and PL clock rates
  • Asynchronous clock domains add latency to transactions


Essential Resources for Zynq UltraScale+ ARM Development

These resources support development on the Zynq UltraScale+ ARM platform.

Official AMD Documentation

| Document | Number | Description |
| --- | --- | --- |
| Technical Reference Manual | UG1085 | Comprehensive architecture reference (1800+ pages) |
| Software Developer Guide | UG1137 | Software development information |
| Register Reference | UG1087 | Complete register definitions |
| Datasheet Overview | DS891 | Device specifications |
| PCB Design Guide | UG583 | Board design guidelines |

ARM Architecture References

| Document | Description |
| --- | --- |
| ARM Cortex-A53 TRM | Core architecture and features |
| ARM Cortex-R5 TRM | Real-time processor details |
| ARMv8-A Architecture Reference | Complete instruction set reference |
| ARM AMBA AXI Protocol | Interface specifications |

Download Links

| Resource | URL |
| --- | --- |
| Vivado Design Suite | https://www.xilinx.com/support/download.html |
| Vitis Platform | https://www.xilinx.com/products/design-tools/vitis.html |
| PetaLinux Tools | https://www.xilinx.com/products/design-tools/embedded-software/petalinux-sdk.html |
| ARM Documentation | https://developer.arm.com/documentation |

Frequently Asked Questions

What is the difference between Cortex-A53 and Cortex-R5 in Zynq UltraScale+?

The Zynq UltraScale+ ARM Cortex-A53 is a 64-bit application processor designed for running operating systems like Linux. It features virtual memory (MMU), multi-level caches, and TrustZone security. The Cortex-R5 is a 32-bit real-time processor optimized for deterministic, low-latency tasks. It uses an MPU instead of MMU, provides tightly coupled memory for guaranteed access timing, and supports lockstep operation for safety-critical applications. Use the A53 for complex software; use the R5 for time-critical control loops.

Can the Cortex-A53 and Cortex-R5 run simultaneously?

Yes, the APU and RPU operate independently and can run simultaneously. This enables powerful system architectures where Linux handles networking, user interface, and complex algorithms on the A53 cores while real-time control loops execute on the R5 cores with guaranteed timing. Inter-processor communication uses shared memory regions, hardware mailboxes, or software-defined protocols. The OpenAMP framework provides standard mechanisms for AMP systems.

How does UltraRAM differ from Block RAM in the Xilinx UltraScale+ FPGA?

UltraRAM provides 288 Kb per block versus 36 Kb for Block RAM—eight times the density. UltraRAM is optimized for large buffers and can be independently powered down for energy savings. Block RAM offers more flexible configurations (aspect ratios, ECC options, FIFO modes) and is distributed throughout the fabric for better timing to nearby logic. Use UltraRAM for large memories where a single wide interface suffices; use Block RAM for smaller, distributed memories requiring specific features.

What software can run on each processor in the Zynq UltraScale+?

The Cortex-A53 APU supports Linux, FreeRTOS, VxWorks, QNX, bare-metal applications, and hypervisors like Xen. The Cortex-R5 RPU supports FreeRTOS, SafeRTOS, bare-metal applications, and other RTOSes that don’t require virtual memory. The Mali-400 GPU supports OpenGL ES 1.1 and 2.0 graphics APIs. Typical production systems run Linux on the APU for application software and FreeRTOS or bare-metal on the RPU for real-time functions.

How do I decide what functionality belongs in the FPGA versus the ARM processors?

Place functionality in the Xilinx UltraScale+ FPGA programmable logic when you need: parallel processing beyond what ARM cores provide, precise timing control, custom interfaces not available in the PS, or hardware acceleration of compute-intensive algorithms. Place functionality in the ARM processors when you need: complex decision logic, operating system services, networking stacks, file systems, or rapid development without HDL. The PS-PL interface bandwidth supports moving data between domains, so the decision centers on which processing element best handles each workload rather than data locality constraints.

Architecting Effective Zynq UltraScale+ Systems

The Zynq UltraScale+ ARM architecture provides remarkable flexibility but demands thoughtful system partitioning. Success requires understanding each processing element’s strengths:

Use the Cortex-A53 APU for:

  • Operating system services
  • Complex algorithms and decision logic
  • Network protocol stacks
  • User interface and display management
  • File system and storage management

Use the Cortex-R5 RPU for:

  • Motor control loops
  • Safety monitoring functions
  • Interrupt-driven I/O handling
  • Real-time protocol processing
  • Functions requiring lockstep redundancy

Use the Xilinx UltraScale+ FPGA PL for:

  • Custom interface implementations
  • Hardware acceleration of parallel algorithms
  • High-speed signal processing
  • Precise timing generation
  • Functions requiring determinism beyond software capability

The PS-PL interfaces enable efficient data movement between domains, making the partitioning decision about capability rather than connectivity. Start with clear requirements for timing, throughput, and functionality, then map each function to the most appropriate processing element.

Interrupt Architecture and Management

The Zynq UltraScale+ ARM interrupt system deserves careful attention as it directly impacts real-time performance and system responsiveness.

Generic Interrupt Controller (GIC)

The APU uses ARM’s GIC-400, a GICv2 implementation supporting:

| Feature | Specification |
| --- | --- |
| Shared Peripheral Interrupts (SPI) | 160 |
| Private Peripheral Interrupts (PPI) | 16 per core |
| Software Generated Interrupts (SGI) | 16 |
| Priority Levels | 32 |
| Security States | Secure and Non-Secure |

The GIC provides interrupt prioritization, routing to specific cores, and security partitioning. Properly configuring interrupt affinities—which core handles which interrupt—significantly impacts system performance and real-time response.
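
The prioritization and routing behavior can be sketched in a few lines of Python. The model below assumes GICv2 conventions: a lower priority value means a more urgent interrupt, ties go to the lowest interrupt ID, and an interrupt is signaled only when its priority value is below the core's priority mask. The `Irq` class, the `next_irq` helper, and the interrupt IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Irq:
    irq_id: int
    priority: int      # lower value = higher priority
    target_core: int   # affinity: the core this interrupt is routed to


def next_irq(
    pending: Iterable[Irq], core: int, priority_mask: int = 0xFF
) -> Optional[Irq]:
    """Pick the interrupt that would be signaled to `core`, if any."""
    candidates = [
        irq for irq in pending
        if irq.target_core == core and irq.priority < priority_mask
    ]
    return min(candidates, key=lambda irq: (irq.priority, irq.irq_id), default=None)
```

Routing a latency-sensitive interrupt to a core that handles little else keeps it from queuing behind unrelated handlers. That is the practical meaning of tuning interrupt affinity.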

RPU Interrupt Handling

The RPU has its own interrupt controller, an ARM PL390 (GICv1), which is configured independently of the APU's GIC-400. This isolation ensures that APU interrupt load doesn’t affect RPU response times.

Key RPU interrupt characteristics:

  • Vectored interrupt controller with configurable priority
  • Fast interrupt (FIQ) path for lowest-latency handlers
  • Nested interrupt support for priority preemption
  • Interrupt latency under 20 cycles from assertion to handler entry
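
Converting the cycle count into time makes the latency figure easier to compare against a control-loop budget. The Python sketch below does that conversion for the RPU clock rates quoted earlier. The 10 µs loop period in the usage line is a made-up example, not a platform requirement.

```python
def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Duration of `cycles` clock cycles at `clock_mhz`, in nanoseconds."""
    return cycles * 1000.0 / clock_mhz


def latency_share(cycles: int, clock_mhz: float, loop_period_us: float) -> float:
    """Fraction of a control-loop period consumed by interrupt entry."""
    return cycles_to_ns(cycles, clock_mhz) / (loop_period_us * 1000.0)


if __name__ == "__main__":
    print(round(cycles_to_ns(20, 600), 1), "ns at 600 MHz")
    print(round(cycles_to_ns(20, 533), 1), "ns at 533 MHz")
    print(f"{latency_share(20, 600, 10.0):.2%} of a 10 us loop")
```

Twenty cycles at 600 MHz is roughly 33 ns, a small fraction of a microsecond-scale control period.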

PL-to-PS Interrupt Routing

The programmable logic can generate interrupts to both the APU and RPU through dedicated interrupt lines:

| Interrupt Group | Destination | Count |
| --- | --- | --- |
| PL-PS Group 0 | APU (IRQ) | 8 |
| PL-PS Group 1 | APU (IRQ) | 8 |
| PL-RPU | RPU | 2 |

PL-generated interrupts enable hardware accelerators to signal completion, custom peripherals to request service, and external events to trigger software responses. Proper interrupt design minimizes latency between PL event occurrence and software handler execution.

Power Domain Architecture

The Zynq UltraScale+ ARM architecture implements sophisticated power management through independent power domains:

Power Domain Organization

| Domain | Contents | Typical Power State |
| --- | --- | --- |
| Full Power Domain (FPD) | APU, GPU, DisplayPort, SATA, PCIe | Active during application processing |
| Low Power Domain (LPD) | RPU, USB, Ethernet, IOU | Can remain active when FPD sleeps |
| PL Power Domain | Entire FPGA fabric | Independent control |
| Battery Power Domain | RTC, minimal logic | Always on (nanoamps) |

This partitioning enables sophisticated power management strategies:

  • FPD Off Mode: Disable APU and high-speed peripherals while RPU maintains real-time functions
  • Deep Sleep: FPD and PL powered down, with only minimal wake-up logic and the battery domain kept alive
  • PL Power Gating: Power down the PL domain when the programmable logic is not needed

Platform Management Unit

The Platform Management Unit (PMU) orchestrates power management using a dedicated MicroBlaze processor with:

  • 32 KB ROM for boot code
  • 128 KB RAM for runtime firmware
  • Access to all power control registers
  • System monitoring capabilities

The PMU firmware handles power state transitions, monitors system health, and manages the boot sequence. Customizing PMU firmware enables application-specific power optimization.

Understanding this architecture deeply enables designs that leverage all processing resources effectively—achieving performance and determinism impossible with any single processor type alone.
