Fast Fourier Transform IP

Core Features

  • Transform Size: Up to 64K-point
  • Processing Architectures: Base-Radix (Burst I/O), RSDF (Pipelined SDF), and SSR (Super Sample Rate)
  • Algorithm Topologies: Decimation-in-Time (DIT) & Decimation-in-Frequency (DIF)
  • Zero-Overhead Convolution: Completely eliminate digit-reversal memory overhead in FFT/IFFT systems
  • Datapath: Native Fixed-Point with seamless Single-Precision Floating-Point support
  • Flexibility: Dynamic runtime reconfiguration (Forward/Inverse transform, Point size)
  • Interfaces: Industry-standard AXI4-Stream (Data with robust backpressure) & AXI4-Lite (Control)
  • Design Focus: Exceptional PPA efficiency (A lean, "muscle and bone" optimized RTL)
  • Codebase: 100% Vendor-Independent, pure VHDL-2008 synthesizable source code
  • Cyclic Prefix: Optional support for all architectures
  • Complex Multiplier: Selectable 3-way or 4-way implementation

Core Overview

The FFT IP Core is a high-performance, highly configurable implementation of the Discrete Fourier Transform (DFT) based on the Cooley-Tukey algorithm, engineered to deliver exceptional Power, Performance, and Area (PPA) results. Supporting transform sizes up to 65,536 points, the core natively handles fixed-point arithmetic while providing a seamless integration path for Single-Precision Floating-Point operations.

It fully supports both forward and inverse transforms and a custom per-stage scaling schedule, in both Decimation-in-Frequency (DIF) and Decimation-in-Time (DIT) implementations. To ensure robust system integration, the IP features dynamic runtime reconfiguration, either through a basic RTL reconfiguration interface or through a standardized AXI4-Lite register interface with optional Clock Domain Crossing (CDC). The core provides comprehensive real-time status monitoring for arithmetic overflows and underflows and for invalid packets, and it can optionally be built with a debug facility for diagnosing dynamic-range issues in fixed-point processing.
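As an illustration of a per-stage scaling schedule, the following Python sketch packs one scale value per butterfly stage into a single configuration word, in the spirit of the CFG_SCALES port and the DSP_FI_SCALE_WIDTH generic listed in the port definitions. The LSB-first stage ordering, the reading of each field as a right-shift amount, and the helper names are assumptions made for illustration only; the actual field layout is defined by the IP documentation.

```python
def pack_scales(shifts, field_width=2):
    """Pack per-stage shift amounts into one integer.

    Stage 0 is placed in the least significant bits (assumed ordering).
    """
    limit = 1 << field_width
    word = 0
    for stage, shift in enumerate(shifts):
        if not 0 <= shift < limit:
            raise ValueError(
                f"stage {stage}: shift {shift} does not fit in {field_width} bits"
            )
        word |= shift << (stage * field_width)
    return word


def total_scaling(shifts):
    """Overall attenuation factor 2**sum(shifts) across all stages."""
    return 1 << sum(shifts)


# Hypothetical schedule for a 1024-point Radix-2 transform (10 stages):
# shift right by one bit after every stage, i.e. divide by N overall.
schedule = [1] * 10
print(hex(pack_scales(schedule)), total_scaling(schedule))  # 0x55555 1024
```

A schedule that shifts by one bit per Radix-2 stage trades precision for guaranteed overflow safety; lighter schedules keep more precision but rely on the overflow status flags to catch clipping.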

Rigorously verified in an industry-standard EDA simulator across all major configurations, this core is built to answer modern DSP challenges. Engineered with uncompromising digital design best practices and exceptional code quality, the IP is well positioned to challenge generic, vendor-provided solutions.

Architecture Support

  • Base-Radix (Iterative / Burst I/O): Maximizes resource efficiency by looping data through a single, powerful hardware processing stage. It is ideal for applications where low footprint (BRAM/DSP) is prioritized over continuous streaming.

  • RSDF (Radix-2 Single-Path Delay Feedback): A fully unrolled, pipelined architecture that uses delay lines to provide continuous throughput of one sample per clock cycle with an extremely lean memory footprint.

  • Pipelined SSR (Super Sample Rate): The top performance tier. This fully pipelined and highly parallelized architecture processes multiple samples per clock cycle (e.g., up to 8 samples per cycle for a Radix-8 configuration), enabling the processing of multi-gigabit data streams.

Advanced System Optimization: Zero-Overhead Convolution

For DSP systems requiring heavy frequency-domain filtering or OFDM modulation, the IP enables a highly optimized "DIT/DIF Sandwich" topology. A DIF transform fed with natural-order data produces digit-reversed output, while a DIT transform accepts digit-reversed input and produces natural-order output. By configuring the forward FFT as Decimation-in-Frequency (DIF) and the inverse FFT (IFFT) as Decimation-in-Time (DIT), the core therefore bypasses the traditional digit-reversal permutation step entirely. Natural-order time-domain data passes through the frequency domain and back without reordering, which saves the internal RAM and logic of the reorder buffers, reduces end-to-end system latency, and improves FMAX at the same time.
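The idea can be checked with a small floating-point model. The Python sketch below is a textbook Radix-2 illustration, not the core's RTL: a DIF forward transform leaves its result in bit-reversed order, the spectra are multiplied element by element in that same order, and a DIT inverse transform consumes the bit-reversed data and returns natural-order samples. No reordering pass appears anywhere. All function names are hypothetical, and the filter spectrum is assumed to be supplied in the same bit-reversed order.

```python
import cmath


def dif_fft(x):
    """Radix-2 DIF FFT: natural-order input, bit-reversed-order output."""
    a = list(x)
    n = len(a)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                u, v = a[start + k], a[start + k + span]
                a[start + k] = u + v
                a[start + k + span] = (u - v) * w
        span //= 2
    return a


def dit_ifft(spectrum):
    """Radix-2 DIT inverse FFT: bit-reversed-order input, natural-order output."""
    a = list(spectrum)
    n = len(a)
    span = 1
    while span < n:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(2j * cmath.pi * k / (2 * span))
                u, v = a[start + k], a[start + k + span] * w
                a[start + k] = u + v
                a[start + k + span] = u - v
        span *= 2
    return [value / n for value in a]


def fast_circular_convolution(x, h):
    """Circular convolution via DIF FFT -> pointwise multiply -> DIT IFFT."""
    product = [p * q for p, q in zip(dif_fft(x), dif_fft(h))]
    return dit_ifft(product)


def direct_circular_convolution(x, h):
    """Reference O(N^2) circular convolution."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]


x = [1, 2, 3, 4, 0, 0, 0, 0]
h = [1, 1, 0, 0, 0, 0, 0, 0]
y = fast_circular_convolution(x, h)
print([round(v.real) for v in y])  # [1, 3, 5, 7, 4, 0, 0, 0]
```

The pointwise product is order-agnostic as long as both operands use the same ordering, which is why the permutation can be dropped on both sides of the frequency-domain operation.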

Throughput, Latency and Radix Variants

The core supports Radix-2, Radix-4, and Radix-8 butterfly engines, including mixed-radix capabilities. Higher base-radix configurations reduce the number of processing stages, which significantly cuts overall latency and increases processing speed. Furthermore, the IP employs robust AXI4-Stream backpressure handling and guarantees 100% gapless, continuous throughput under the following conditions:

  • Pipeline SDF (Radix-2): For N >= 8

  • Pipeline SSR (Radix-2): For N >= 64

  • Pipeline SSR (Radix-4): For N >= 128

  • Pipeline SSR (Radix-8): For N >= 256
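To make the relationship between base radix and stage count concrete, the Python sketch below splits a power-of-two transform size into base-radix stages and looks up the gapless-throughput thresholds from the list above. The decomposition rule (full base-radix stages plus one smaller power-of-two stage for the remainder) and the helper names are assumptions for illustration; the core's actual mixed-radix stage ordering is not described here.

```python
# Minimum transform sizes for gapless streaming, copied from the list above.
GAPLESS_MIN_SIZE = {
    ("SDF", 2): 8,
    ("SSR", 2): 64,
    ("SSR", 4): 128,
    ("SSR", 8): 256,
}


def stage_radices(n, base_radix):
    """Return the radix of each stage for an N-point transform.

    Assumed rule: use as many base-radix stages as possible, then one
    smaller power-of-two stage if log2(N) is not a multiple of log2(radix).
    """
    if base_radix not in (2, 4, 8):
        raise ValueError("base radix must be 2, 4 or 8")
    bits = n.bit_length() - 1
    if n < 2 or (1 << bits) != n:
        raise ValueError("N must be a power of two >= 2")
    step = base_radix.bit_length() - 1
    full, rest = divmod(bits, step)
    return [base_radix] * full + ([1 << rest] if rest else [])


def is_gapless(arch, radix, n):
    """True if the size meets the documented gapless-throughput threshold."""
    return n >= GAPLESS_MIN_SIZE[(arch, radix)]


for radix in (2, 4, 8):
    print(radix, len(stage_radices(4096, radix)))  # 12, 6 and 4 stages
print(stage_radices(2048, 8))                      # [8, 8, 8, 4]
print(is_gapless("SSR", 8, 128))                   # False
```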

Q2 2023

Foundation & BASE Architecture

  • Initial release with a DIF-based Cooley-Tukey iterative FFT processing engine.

  • Implementation of industry-standard AXI4-Stream data handling and baseline resource configurations.

Q4 2023

Algorithmic Expansion & Floating-Point

  • Introduction of Decimation-in-Time (DIT) support, laying the groundwork for Zero-Overhead Convolution topologies.

  • Integration of external IEEE-754 Single-Precision Floating-Point libraries alongside fixed-point arithmetic.

Q3 2024

Integration & Reliability

  • Focus on system-level robustness: Introduced Clock Domain Crossing (CDC) for the AXI4-Lite control interface.

  • Added advanced packet padding support, diagnostic counters, and timing/DSP optimizations.

  • Baseline for the Pipelined SSR architecture.

Q3 2025

SDF Architecture Support

  • A massive architectural improvement with the introduction of the continuous Radix-2 Single-Path Delay Feedback (RSDF) topology.

  • Achieved drastic footprint reductions in RAM, ROM, and DSP utilization for mixed-radix transforms.

  • Introduced a proprietary, hardware-accelerated diagnostic bitmask engine for streamlined fixed-point dynamic range evaluation.

Q2 2026

SSR Architecture Support

  • Finalization of the ultra-broadband PIPE (SSR) architecture for multi-gigabit parallel data streams.

  • Integration of high-end telecom features including dynamic Cyclic Prefix (CP) insertion and native RTL-based runtime reconfiguration.

  • Introduced selectable 3-way or 4-way Complex Multiplier architectures for extreme DSP slice fine-tuning.
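For readers unfamiliar with Cyclic Prefix (CP) insertion, the following Python sketch shows the operation at the sample level: the last cp_len samples of a time-domain symbol are copied in front of it. This is the generic OFDM definition, not a description of the core's datapath. The function names are hypothetical, and how the prefix length is encoded on the CFG_PREFIX port is not assumed here.

```python
def add_cyclic_prefix(symbol, cp_len):
    """Prepend the last cp_len samples of the symbol to the symbol itself."""
    if not 0 <= cp_len <= len(symbol):
        raise ValueError("prefix length must be between 0 and the symbol length")
    return symbol[len(symbol) - cp_len:] + symbol


def remove_cyclic_prefix(frame, cp_len):
    """Drop the first cp_len samples (receiver side)."""
    if not 0 <= cp_len <= len(frame):
        raise ValueError("prefix length must be between 0 and the frame length")
    return frame[cp_len:]


symbol = [0, 1, 2, 3, 4, 5, 6, 7]
frame = add_cyclic_prefix(symbol, 2)
print(frame)  # [6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
```

With the prefix in place, each output frame carries N + cp_len samples, which a downstream interface has to account for when sizing buffers.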

Documentation (V10)

IP PORT DEFINITIONS
entity FFT_TOP_BB is
  generic(
    ----------------------------------------------------------------------------------------------------------------
    -------- Core Architecture & Capabilities                                                               --------
    ----------------------------------------------------------------------------------------------------------------
    ARCH_TYPE             : string            := "BASE";          -- Architecture Type
    IMPL_TYPE             : string            := "DIF";           -- Implementation Type
    BASE_RADIX            : natural           := 2;               -- Default Base Radix
    MIN_SIZE              : natural           := 2;               -- Minimum Supported Size
    MAX_SIZE              : natural           := 2048;            -- Maximum Supported Size
    CYCLIC_PREFIX_EN      : natural           := 0;               -- Includes Cyclic Prefix
    ----------------------------------------------------------------------------------------------------------------
    -------- DSP Arithmetic & Datapath Precision                                                            --------
    ----------------------------------------------------------------------------------------------------------------
    DSP_DATA_WIDTH        : natural           := 16;              -- Data Width (Real and Imaginary)
    DSP_TWIDDLE_WIDTH     : natural           := 16;              -- Twiddle Data Width (Real And Imaginary)
    DSP_PRECISION         : string            := "FIXED";         -- Fixed / Floating Point
    DSP_FI_SCALE_WIDTH    : natural           := 2;               -- Fixed Point Scaler Configuration Width
    DSP_FI_ROUND_MODE     : Rounding_t        := ROUND_HALF_UP;   -- Fixed Point Rounding mode
    DSP_FI_MULT_ARCH      : string            := "CMUL_X4";       -- Fixed Point Multiplier Optimization  X3 / X4
    DSP_FI_DR_MASK_EN     : natural           := 0;               -- Generate Fixed-Point Data Mask Evaluation
    DSP_FP_LAT_MUL        : natural           := 8;               -- Latency of Floating Point Multiplier
    DSP_FP_LAT_ADDSUB     : natural           := 10;              -- Latency of Floating Point Adder/Subtracter
    ----------------------------------------------------------------------------------------------------------------
    -------- AXI4-Stream Interfaces                                                                         --------
    ----------------------------------------------------------------------------------------------------------------
    AXIS_SAMPLES_S        : natural           := 1;               -- S - AXIS Parallel Samples
    AXIS_SAMPLES_M        : natural           := 1;               -- M - AXIS Parallel Samples
    ----------------------------------------------------------------------------------------------------------------
    -------- AXI4-Lite Control & Configuration Interface                                                    --------
    ----------------------------------------------------------------------------------------------------------------
    AXIL_INIT_MODE        : sl                := FFT_CNTRL_AXI;   -- Initial Control Mode
    AXIL_ADDR_WIDTH       : natural           := 32;              -- AXI4-Lite Addressing Width
    AXIL_DATA_WIDTH       : natural           := 32;              -- AXI4-Lite Data Width 
    AXIL_CDC_EN           : natural           := 0;               -- Enable CDC on the AXI4-Lite Interface
    AXIL_CDC_STAGES       : natural           := 3;               -- Number of CDC Stages for AXIL Crossing
    ----------------------------------------------------------------------------------------------------------------
    -------- Hardware Synthesis Attributes                                                                  --------
    ----------------------------------------------------------------------------------------------------------------
    ATTR_RAM_TYPE         : string            := "auto";          -- Synthesis directive for RAM Implementation Type
    ATTR_ROM_TYPE         : string            := "auto";          -- Synthesis directive for ROM Implementation Type
    ATTR_USE_DSP          : string            := "yes"            -- Synthesis directive for DSP Slice Inference
  );
  port (

    ACLK                : in sl; -- AXI4-Lite Clock
    FACLK               : in sl; -- AXIS / FFT Core Clock
    ARESETn             : in sl; -- AXI4-Lite Reset
    FRESETn             : in sl; -- AXIS / FFT Core Reset

    -------------------------------------------------
    ------- AXI4-Lite Configuration Interface -------
    S_AXIL_ARADDR       : in  slv ( AXIL_ADDR_WIDTH - 1 downto 0 );
    S_AXIL_ARREADY      : out sl;
    S_AXIL_ARVALID      : in  sl;

    S_AXIL_AWADDR       : in  slv ( AXIL_ADDR_WIDTH - 1 downto 0 );
    S_AXIL_AWREADY      : out sl;
    S_AXIL_AWVALID      : in  sl;

    S_AXIL_BREADY       : in  sl;
    S_AXIL_BRESP        : out slv ( 1  downto 0 );
    S_AXIL_BVALID       : out sl;
    
    S_AXIL_RDATA        : out slv ( AXIL_DATA_WIDTH - 1 downto 0 );
    S_AXIL_RRESP        : out slv ( 1 downto 0 );
    S_AXIL_RVALID       : out sl;
    S_AXIL_RREADY       : in  sl;

    S_AXIL_WDATA        : in  slv ( AXIL_DATA_WIDTH - 1 downto 0 );
    S_AXIL_WSTRB        : in  slv ( AXIL_DATA_WIDTH / 8 - 1 downto 0 );
    S_AXIL_WVALID       : in  sl;
    S_AXIL_WREADY       : out sl;

    ----------------------------------------------------
    ------- Status and Reconfiguration Interface -------
    CFG_VALID           : in  sl;
    CFG_READY           : out sl;
    CFG_SIZE            : in  slv(FFT_LOG_ADDR_WIDTH - 1 downto 0);
    CFG_SCALES          : in  slv(127 downto 0);
    CFG_PREFIX          : in  slv(31  downto 0);
    CFG_DIRECTION       : in  sl;

    CFG_PADDING_EN      : in  sl;
    CFG_SATURATION_EN   : in  sl;
    ST_OVF_REAL         : out sl;
    ST_OVF_IMAG         : out sl;
    ST_TLAST_MISSING    : out sl;
    ST_TLAST_UNEXPECTED : out sl;

    --------------------------------------
    ------- Data Receive Interface -------
    S_AXI_TLast         : in  sl;
    S_AXI_TReady        : out sl;
    S_AXI_TVld          : in  sl;
    S_AXI_TStrb         : in  slv(AXIS_SAMPLES_S - 1 downto 0);
    S_AXI_TData_Real    : in  Slv_1D_Array_t(0 to AXIS_SAMPLES_S - 1)(DSP_DATA_WIDTH - 1 downto 0);
    S_AXI_TData_Imag    : in  Slv_1D_Array_t(0 to AXIS_SAMPLES_S - 1)(DSP_DATA_WIDTH - 1 downto 0);

    --------------------------------------
    ------ Data Transmit Interface -------
    M_AXI_TReady        : in  sl;
    M_AXI_TVld          : out sl;
    M_AXI_TLast         : out sl;
    M_AXI_TStrb         : out slv(AXIS_SAMPLES_M - 1 downto 0);
    M_AXI_TData_Real    : out Slv_1D_Array_t(0 to AXIS_SAMPLES_M - 1)(DSP_DATA_WIDTH - 1 downto 0);
    M_AXI_TData_Imag    : out Slv_1D_Array_t(0 to AXIS_SAMPLES_M - 1)(DSP_DATA_WIDTH - 1 downto 0)

  );
end FFT_TOP_BB;
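The DSP_FI_MULT_ARCH generic above selects between the X3 and X4 multiplier options. Assuming these correspond to the classic 3-multiplier and 4-multiplier forms of a complex product (an interpretation, not something the port list states), the trade-off can be sketched in Python as follows. The function names are hypothetical.

```python
def cmul_x4(a, b, c, d):
    """(a + jb) * (c + jd) using four real multiplications and two additions."""
    return a * c - b * d, a * d + b * c


def cmul_x3(a, b, c, d):
    """Same product using three real multiplications and five additions.

    When (c + jd) is a constant twiddle factor, (d - c) and (c + d) can be
    precomputed, leaving three additions at run time.
    """
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2


print(cmul_x4(3, 4, 5, -2))  # (23, 14)
print(cmul_x3(3, 4, 5, -2))  # (23, 14)
```

The 3-multiplier form saves one hardware multiplier per butterfly at the cost of extra adders and one additional bit of operand width in the pre-additions, which is the kind of DSP-slice versus logic trade-off such a generic typically exposes.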
RESOURCES & PERFORMANCE

The resource and performance estimates provided below represent a typical IP baseline. All metrics were obtained from post-place-and-route reports with the primary processing clock (FACLK) constrained to 500 MHz, using the respective EDA tool's default optimization strategies. The datapath precision (DSP_DATA_WIDTH and DSP_TWIDDLE_WIDTH) is uniformly configured to 16 bits. The targeted devices represent the current high-end tiers: AMD UltraScale+™, Intel® Agilex™, and Efinix® Titanium series. If you need a utilization report for a different set of IP configuration parameters, please don't hesitate to contact me via the contact form at the bottom of this page.

SUMMARY

RAW RESULTS

LATENCY SPECIFICATIONS

The latency metrics provided below are typical cycle counts obtained from cycle-accurate RTL simulations using Single-Precision Floating-Point arithmetic, with the internal pipeline latency configured to 10 clock cycles for both multipliers and adders. Real-world use cases with native Fixed-Point arithmetic are expected to yield lower latency. Latency is measured from the consumption of TLAST on S_AXIS to its transmission on M_AXIS. Each table lists the transform size N (in points) followed by the latency in clock cycles. The benchmarked latencies exclude any stalls on the M_AXIS bus: if the downstream slave deasserts TREADY on M_AXIS, the core stalls, freezing the internal pipelines and extending the total cycle count.
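Because the tables report clock cycles, the wall-clock latency depends on the FACLK frequency actually achieved in a given design. A minimal Python sketch of the conversion, using the 1024-point BASE_DIF_R2 entry and assuming, purely for illustration, the 500 MHz constraint quoted for the resource reports:

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a latency in clock cycles to microseconds."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_mhz


# 6576 cycles is the 1024-point BASE_DIF_R2 entry; 500 MHz is an assumed clock.
print(f"{cycles_to_microseconds(6576, 500.0):.3f} us")  # 13.152 us
```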

 

BASE_DIF_R2

      2,     56
      4,    162
      8,    158
     16,    228
     32,    334
     64,    520
    128,    882
    256,   1628
    512,   3206
   1024,   6576
   2048,  13786
   4096,  29188
   8192,  61998
  16384, 131672
  32768, 279170
  65536, 590508

BASE_DIF_R4


      4,     68
      8,    120
     16,    142
     32,    224
     64,    284
    128,    502
    256,    738
    512,   1548
   1024,   2584
   2048,   5954
   4096,  10574
   8192,  24952
  16384,  45444
  32768, 106926
  65536, 197050

BASE_DIF_R8



      8,    102
     16,    162
     32,    192
     64,    262
    128,    448
    256,    620
    512,    974
   1024,   2232
   2048,   3652
   4096,   6502
   8192,  16784
  16384,  29084
  32768,  53694
  65536, 139752

BASE_DIT_R2

      2,     56
      4,    162
      8,    158
     16,    228
     32,    334
     64,    520
    128,    882
    256,   1628
    512,   3206
   1024,   6576
   2048,  13786
   4096,  29188
   8192,  61998
  16384, 131672
  32768, 279170
  65536, 590508

BASE_DIT_R4


      4,     68
      8,    120
     16,    142
     32,    224
     64,    284
    128,    502
    256,    738
    512,   1548
   1024,   2584
   2048,   5954
   4096,  10574
   8192,  24952
  16384,  45444
  32768, 106926
  65536, 197050

BASE_DIT_R8



      8,    102
     16,    162
     32,    192
     64,    262
    128,    448
    256,    620
    512,    974
   1024,   2232
   2048,   3652
   4096,   6502
   8192,  16784
  16384,  29084
  32768,  53694
  65536, 139752

RSDF_DIF

      2,     61
      4,    108
      8,    158
     16,    216
     32,    290
     64,    400
    128,    574
    256,    876
    512,   1434
   1024,   2504
   2048,   4598
   4096,   8740
   8192,  16978
  16384,  33408
  32768,  66222
  65536, 131804

RSDF_DIT

      2,     61
      4,    109
      8,    159
     16,    216
     32,    290
     64,    400
    128,    574
    256,    876
    512,   1434
   1024,   2504
   2048,   4598
   4096,   8740
   8192,  16978
  16384,  33408
  32768,  66222
  65536, 131804

PIPE_DIF_R2

      2,     29
      4,     82
      8,    135
     16,    194
     32,    269
     64,    365
    128,    501
    256,    719
    512,   1097
   1024,   1795
   2048,   3133
   4096,   5751
   8192,  10929
  16384,  21227
  32768,  41765
  65536,  82783

PIPE_DIF_R4


      4,     39
      8,     95
     16,    107
     32,    176
     64,    200
    128,    334
    256,    402
    512,    762
   1024,   1000
   2048,   2276
   4096,   3188
   8192,   8112
  16384,  11712
  32768,  31228
  65536,  45580

PIPE_DIF_R8



      8,     69
     16,    132
     32,    145
     64,    180
    128,    305
    256,    357
    512,    449
   1024,   1078
   2048,   1333
   4096,   1852
   8192,   6459
  16384,   8393
  32768,  12271
  65536,  48689

PIPE_DIT_R2

      2,     55
      4,    110
      8,    163
     16,    224
     32,    298
     64,    394
    128,    530
    256,    748
    512,   1126
   1024,   1824
   2048,   3162
   4096,   5780
   8192,  10957
  16384,  21254
  32768,  41791
  65536,  82787

PIPE_DIT_R4


      4,     65
      8,    126
     16,    134
     32,    208
     64,    232
    128,    362
    256,    434
    512,    766
   1024,   1032
   2048,   2184
   4096,   3220
   8192,   7635
  16384,  11743
  32768,  29193
  65536,  45589

PIPE_DIT_R8



      8,     95
     16,    167
     32,    178
     64,    206
    128,    342
    256,    396
    512,    488
   1024,   1081
   2048,   1344
   4096,   2017
   8192,   6126
  16384,   8180
  32768,  13444
  65536,  45646

Digit-Reversal Bypass

RSDF_DIF

      2,     59
      4,    103
      8,    147
     16,    194
     32,    250
     64,    322
    128,    426
    256,    594
    512,    890
   1024,   1442
   2048,   2506
   4096,   4594
   8192,   8730
  16384,  16962
  32768,  33386
  65536,  66194

RSDF_DIT

      2,     59
      4,    103
      8,    147
     16,    194
     32,    250
     64,    322
    128,    426
    256,    594
    512,    890
   1024,   1442
   2048,   2506
   4096,   4594
   8192,   8730
  16384,  16962
  32768,  33386
  65536,  66194

PIPE_DIF_R2

      2,     27
      4,     73
      8,    118
     16,    167
     32,    223
     64,    291
    128,    383
    256,    523
    512,    759
   1024,   1187
   2048,   1999
   4096,   3579
   8192,   6695
  16384,  12883
  32768,  25215
  65536,  49835

PIPE_DIF_R4


      4,     37
      8,     85
     16,     96
     32,    155
     64,    171
    128,    275
    256,    309
    512,    593
   1024,    699
   2048,   1703
   4096,   2097
   8192,   5981
  16384,   7527
  32768,  22931
  65536,  29085

PIPE_DIF_R8



      8,     67
     16,    120
     32,    131
     64,    165
    128,    272
    256,    296
    512,    354
   1024,    902
   2048,   1024
   4096,   1278
   8192,   5354
  16384,   6260
  32768,   8082
  65536,  40382

PIPE_DIT_R2

      2,     52
      4,     99
      8,    145
     16,    196
     32,    250
     64,    318
    128,    410
    256,    550
    512,    786
   1024,   1214
   2048,   2026
   4096,   3606
   8192,   6721
  16384,  12908
  32768,  25239
  65536,  49837

PIPE_DIT_R4


      4,     62
      8,    111
     16,    122
     32,    179
     64,    201
    128,    295
    256,    339
    512,    589
   1024,    729
   2048,   1603
   4096,   2127
   8192,   5496
  16384,   7556
  32768,  20888
  65536,  29092

PIPE_DIT_R8



      8,     92
     16,    147
     32,    159
     64,    190
    128,    289
    256,    327
    512,    391
   1024,    885
   2048,   1027
   4096,   1441
   8192,   5001
  16384,   6039
  32768,   9253
  65536,  37319
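The effect of the digit-reversal bypass can be read directly from the tables. The Python sketch below compares a few RSDF_DIF entries with and without the bypass; the numbers are copied from the tables above, and the helper name is hypothetical.

```python
# N -> (cycles with digit reversal, cycles with bypass), from the RSDF_DIF tables.
RSDF_DIF = {
    256: (876, 594),
    1024: (2504, 1442),
    4096: (8740, 4594),
    65536: (131804, 66194),
}


def bypass_saving(with_reversal, with_bypass):
    """Return (cycles saved, saving as a percentage of the original latency)."""
    saved = with_reversal - with_bypass
    return saved, 100.0 * saved / with_reversal


for n, (full, bypass) in RSDF_DIF.items():
    saved, percent = bypass_saving(full, bypass)
    print(f"N={n:6d}: {saved:6d} cycles saved ({percent:.1f} %)")
```

For these entries the saving is slightly more than N cycles, which is consistent with removing a reorder buffer that must be filled with a complete frame before output can begin.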

Seamless System Integration

Designed for maximum architectural flexibility, the FFT IP can be deployed as a standalone, inline processor within a high-speed DSP datapath, or as a memory-mapped hardware accelerator. For complete SoC offload solutions, the core pairs with my custom high-throughput DMA Engine (available in the portfolio), delivering a ready-to-use, system-level acceleration subsystem.

Deep Domain Expertise: From Algorithm to Silicon

This core is the culmination of extensive DSP research and development, originating from my work at the Czech Technical University in Prague. My expertise with the Cooley-Tukey algorithm spans the entire computational stack: starting from initial MATLAB mathematical models and massively parallel NVIDIA CUDA applications, progressing through optimized C++ implementations, and finally culminating in this uncompromising, resource-optimized ASIC/FPGA RTL architecture. When you license this IP, you are not just getting HDL code; you are partnering with a full-stack FFT expert.

Ready for Evaluation

I guarantee a robust, thoroughly verified, and feature-rich IP core engineered for peak performance and gapless throughput. Whether you require highly detailed PPA (Power, Performance, Area) reports, exact latency calculations for your specific target device, or an encrypted RTL model for your simulation environment, please do not hesitate to contact me.

Contact Form