
Core Features
- Transform Size: Up to 64K-point
- Processing Architectures: Base-Radix (Burst I/O), RSDF (Pipelined SDF), and SSR (Super Sample Rate)
- Algorithm Topologies: Decimation-in-Time (DIT) & Decimation-in-Frequency (DIF)
- Zero-Overhead Convolution: Completely eliminate digit-reversal memory overhead in FFT/IFFT systems
- Datapath: Native Fixed-Point with seamless Single-Precision Floating-Point support
- Flexibility: Dynamic runtime reconfiguration (Forward/Inverse transform, Point size)
- Interfaces: Industry-standard AXI4-Stream (Data with robust backpressure) & AXI4-Lite (Control)
- Design Focus: Exceptional PPA efficiency (A lean, "muscle and bone" optimized RTL)
- Codebase: 100% Vendor-Independent, pure VHDL-2008 synthesizable source code
- Optional Cyclic Prefix Support for All Architectures
- 3-way or 4-way Complex Multiplier
Core Overview
The FFT IP Core is a high-performance, highly configurable implementation of the Discrete Fourier Transform (DFT) based on the Cooley-Tukey algorithm, engineered to deliver exceptional Power, Performance, and Area (PPA) results. Supporting transform sizes up to 65,536 points, the core natively handles fixed-point arithmetic while providing a seamless integration path for Single-Precision Floating-Point operations.
It fully supports both forward and inverse transforms and a custom per-stage scaling schedule, in both Decimation-in-Frequency (DIF) and Decimation-in-Time (DIT) implementations. To ease system integration, the IP supports dynamic runtime reconfiguration through either a basic RTL reconfiguration interface or a standardized AXI4-Lite register interface with optional Clock Domain Crossing (CDC). The core provides real-time status monitoring of arithmetic overflows and underflows and of invalid packets, along with an optional facility for debugging dynamic-range issues in fixed-point processing.
Rigorously verified in an industry-standard EDA simulator across all major configurations, this core is built to meet modern DSP challenges. Engineered with strict digital design best practices and high code quality, the IP is positioned to challenge generic, vendor-provided solutions.
Architecture Support
- Base-Radix (Iterative / Burst I/O): Maximizes resource efficiency by looping data through a single hardware processing stage. It is ideal for applications where a low footprint (BRAM/DSP) is prioritized over continuous streaming.
- RSDF (Radix Single Delay Feedback): A fully unrolled, pipelined architecture that uses delay lines to provide continuous throughput of one sample per clock cycle with a very lean memory footprint.
- Pipelined SSR (Super Sample Rate): The top performance tier. This fully pipelined and highly parallelized architecture processes multiple samples per clock cycle (e.g., up to 8 samples per cycle for a Radix-8 configuration), enabling processing of multi-gigabit data streams.
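To put the throughput figures above in perspective, here is a minimal Python sketch of the raw data rate of a gapless stream. It is a back-of-the-envelope model, not part of the IP. The function name `stream_rate_gbps` is hypothetical, and the 500 MHz clock and 16-bit component width are assumptions borrowed from the benchmark configuration in the Resources & Performance section.

```python
def stream_rate_gbps(samples_per_cycle, clock_mhz, bits_per_component):
    """Raw data rate in Gbit/s of a gapless stream of complex samples."""
    bits_per_sample = 2 * bits_per_component  # real + imaginary part
    return samples_per_cycle * clock_mhz * 1e6 * bits_per_sample / 1e9


# Assumed operating point: 500 MHz clock, 16-bit real and imaginary parts.
sdf_rate = stream_rate_gbps(1, 500, 16)  # one sample per clock (RSDF)
ssr_rate = stream_rate_gbps(8, 500, 16)  # eight samples per clock (SSR, Radix-8)
print(sdf_rate, ssr_rate)
```

Under these assumptions the single-sample stream carries 16 Gbit/s and the eight-sample stream 128 Gbit/s. The achievable clock on a given device may differ.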
Advanced System Optimization
For DSP systems requiring heavy frequency-domain filtering or OFDM modulation, the IP uniquely enables a highly optimized "DIT/DIF Sandwich" topology. By configuring the forward FFT with Decimation-in-Frequency (DIF) and the inverse IFFT with Decimation-in-Time (DIT), the core completely bypasses the traditional digit-reversal permutation step. This architectural synergy allows natural-order time-domain data to seamlessly pass through the frequency domain and back, saving massive amounts of internal RAM and logic resources and drastically reducing end-to-end system latency while improving FMAX at the same time.
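The idea behind the sandwich can be checked with a small floating-point reference model. The sketch below is a textbook recursive radix-2 formulation written in Python for illustration, not the core's RTL. The function names are hypothetical. The DIF forward transform takes natural-order input and emits bit-reversed output, and the DIT inverse transform takes bit-reversed input and emits natural-order output, so no reordering step is needed between them.

```python
import cmath


def fft_dif(x):
    """Radix-2 DIF FFT: natural-order input, bit-reversed-order output."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterflies first, then two half-size transforms (even bins, odd bins).
    top = [x[k] + x[k + half] for k in range(half)]
    bot = [(x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
           for k in range(half)]
    return fft_dif(top) + fft_dif(bot)


def _ifft_dit_unscaled(y):
    """Radix-2 DIT inverse FFT core: bit-reversed input, natural output."""
    n = len(y)
    if n == 1:
        return list(y)
    half = n // 2
    # In bit-reversed order the first half holds the even-indexed bins.
    even = _ifft_dit_unscaled(y[:half])
    odd = _ifft_dit_unscaled(y[half:])
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def ifft_dit(y):
    """Inverse transform including the 1/N normalization."""
    n = len(y)
    return [v / n for v in _ifft_dit_unscaled(y)]


x = [complex(k, -k) for k in range(8)]
spectrum_br = fft_dif(x)       # spectrum in bit-reversed order
y = ifft_dit(spectrum_br)      # back in natural order, no permutation applied
round_trip_error = max(abs(a - b) for a, b in zip(x, y))
print(round_trip_error)
```

For frequency-domain filtering, the pointwise multiplication happens on `spectrum_br`, so the filter coefficients would have to be stored in the same bit-reversed order.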
Throughput, Latency and Radix Variants
The core supports Radix-2, Radix-4, and Radix-8 butterfly engines, including mixed-radix capabilities. Higher base-radix configurations dynamically reduce the number of necessary processing stages which significantly cuts down overall latency and increases processing speed. Furthermore, the IP employs robust AXI-Stream backpressure handling and strictly guarantees 100% gapless continuous throughput for the following boundary conditions:
- Pipelined SDF (Radix-2): For N >= 8
- Pipelined SSR (Radix-2): For N >= 64
- Pipelined SSR (Radix-4): For N >= 128
- Pipelined SSR (Radix-8): For N >= 256
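The effect of the base radix on the number of processing stages can be illustrated with a short Python sketch. It is an assumption-laden model: the helper `stage_radices` is hypothetical, and the greedy "largest radix first" split is just one plausible way to handle sizes that are not a power of the base radix. The core's actual mixed-radix decomposition is not documented here.

```python
def stage_radices(n, base_radix):
    """Split an n-point transform into butterfly stages (greedy, assumed)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    if base_radix not in (2, 4, 8):
        raise ValueError("base_radix must be 2, 4 or 8")
    stages = []
    remaining = n
    while remaining > 1:
        radix = base_radix
        # Fall back to a smaller radix when the base radix no longer divides.
        while remaining % radix:
            radix //= 2
        stages.append(radix)
        remaining //= radix
    return stages


r2 = stage_radices(2048, 2)  # eleven radix-2 stages
r4 = stage_radices(2048, 4)  # five radix-4 stages plus one radix-2 stage
r8 = stage_radices(2048, 8)  # three radix-8 stages plus one radix-4 stage
print(len(r2), len(r4), len(r8))
```

In this model a 2048-point transform needs 11, 6, or 4 stages for base radix 2, 4, or 8 respectively.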
Foundation & BASE Architecture
- Initial release with a DIF-based Cooley-Tukey iterative FFT processing engine.
- Implementation of industry-standard AXI4-Stream data handling and baseline resource configurations.
Algorithmic Expansion & Floating-Point
- Introduction of Decimation-in-Time (DIT) support, laying the groundwork for Zero-Overhead Convolution topologies.
- Integration of external IEEE-754 Single-Precision Floating-Point libraries alongside fixed-point arithmetic.
Integration & Reliability
- Focus on system-level robustness: Introduced Clock Domain Crossing (CDC) for the AXI4-Lite control interface.
- Added advanced packet padding support, diagnostic counters, and timing/DSP optimizations.
- Baseline for the Pipelined SSR architecture.
SDF Architecture Support
- A major architectural improvement with the introduction of the continuous Radix-2 Single-Path Delay Feedback (RSDF) topology.
- Achieved drastic footprint reductions in RAM, ROM, and DSP utilization for mixed-radix transforms.
- Introduced a proprietary, hardware-accelerated diagnostic bitmask engine for streamlined fixed-point dynamic range evaluation.
SSR Architecture Support
- Finalization of the ultra-broadband PIPE (SSR) architecture for multi-gigabit parallel data streams.
- Integration of high-end telecom features including dynamic Cyclic Prefix (CP) insertion and native RTL-based runtime reconfiguration.
- Introduced selectable 3-way or 4-way Complex Multiplier architectures for fine-tuning DSP slice usage.
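The trade-off between a 3-way and a 4-way complex multiplier can be shown with the classic three-multiplication identity. The Python sketch below is a generic arithmetic illustration with hypothetical function names. Whether the core uses exactly this factorization is an assumption.

```python
def cmul_x4(a, b, c, d):
    """(a + jb) * (c + jd) using 4 multiplications and 2 additions."""
    return a * c - b * d, a * d + b * c


def cmul_x3(a, b, c, d):
    """Same product using 3 multiplications and 5 additions."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    # k1 - k3 = ac - bd (real part), k1 + k2 = ad + bc (imaginary part)
    return k1 - k3, k1 + k2


sample = (1234, -567, 890, 321)  # a, b, c, d as integers (fixed-point style)
product_x4 = cmul_x4(*sample)
product_x3 = cmul_x3(*sample)
print(product_x4, product_x3)
```

The 3-way form trades one multiplier for three extra adders, and its pre-additions widen the multiplier operands by one bit, which is why the choice is worth exposing as a per-design option.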
IP PORT DEFINITIONS
entity FFT_TOP_BB is
    generic(
        ----------------------------------------------------------------------------------------------------------------
        -------- Core Architecture & Capabilities --------
        ----------------------------------------------------------------------------------------------------------------
        ARCH_TYPE : string := "BASE"; -- Architecture Type
        IMPL_TYPE : string := "DIF"; -- Implementation Type
        BASE_RADIX : natural := 2; -- Default Base Radix
        MIN_SIZE : natural := 2; -- Minimum Supported Size
        MAX_SIZE : natural := 2048; -- Maximum Supported Size
        CYCLIC_PREFIX_EN : natural := 0; -- Includes Cyclic Prefix
        ----------------------------------------------------------------------------------------------------------------
        -------- DSP Arithmetic & Datapath Precision --------
        ----------------------------------------------------------------------------------------------------------------
        DSP_DATA_WIDTH : natural := 16; -- Data Width (Real and Imaginary)
        DSP_TWIDDLE_WIDTH : natural := 16; -- Twiddle Data Width (Real And Imaginary)
        DSP_PRECISION : string := "FIXED"; -- Fixed / Floating Point
        DSP_FI_SCALE_WIDTH : natural := 2; -- Fixed Point Scaler Configuration Width
        DSP_FI_ROUND_MODE : Rounding_t := ROUND_HALF_UP; -- Fixed Point Rounding mode
        DSP_FI_MULT_ARCH : string := "CMUL_X4"; -- Fixed Point Multiplier Optimization X3 / X4
        DSP_FI_DR_MASK_EN : natural := 0; -- Generate Fixed-Point Data Mask Evaluation
        DSP_FP_LAT_MUL : natural := 8; -- Latency of Floating Point Multiplier
        DSP_FP_LAT_ADDSUB : natural := 10; -- Latency of Floating Point Adder/Subtracter
        ----------------------------------------------------------------------------------------------------------------
        -------- AXI4-Stream Interfaces --------
        ----------------------------------------------------------------------------------------------------------------
        AXIS_SAMPLES_S : natural := 1; -- S - AXIS Parallel Samples
        AXIS_SAMPLES_M : natural := 1; -- M - AXIS Parallel Samples
        ----------------------------------------------------------------------------------------------------------------
        -------- AXI4-Lite Control & Configuration Interface --------
        ----------------------------------------------------------------------------------------------------------------
        AXIL_INIT_MODE : sl := FFT_CNTRL_AXI; -- Initial Control Mode
        AXIL_ADDR_WIDTH : natural := 32; -- AXI4-Lite Addressing Width
        AXIL_DATA_WIDTH : natural := 32; -- AXI4-Lite Data Width
        AXIL_CDC_EN : natural := 0; -- Enable AXIL to CDC
        AXIL_CDC_STAGES : natural := 3; -- Amount of CDC Stages for AXIL Crossing
        ----------------------------------------------------------------------------------------------------------------
        -------- Hardware Synthesis Attributes --------
        ----------------------------------------------------------------------------------------------------------------
        ATTR_RAM_TYPE : string := "auto"; -- Synthesis directive for RAM Implementation Type
        ATTR_ROM_TYPE : string := "auto"; -- Synthesis directive for ROM Implementation Type
        ATTR_USE_DSP : string := "yes" -- Synthesis directive for DSP Slice Inference
    );
    port (
        ACLK : in sl; -- AXI4-Lite Clock
        FACLK : in sl; -- AXIS / FFT Core Clock
        ARESETn : in sl; -- AXI4-Lite Reset
        FRESETn : in sl; -- AXIS / FFT Core Reset
        -------------------------------------------------
        ------- AXI4-Lite Configuration Interface -------
        S_AXIL_ARADDR : in slv ( AXIL_ADDR_WIDTH - 1 downto 0 );
        S_AXIL_ARREADY : out sl;
        S_AXIL_ARVALID : in sl;
        S_AXIL_AWADDR : in slv ( AXIL_ADDR_WIDTH - 1 downto 0 );
        S_AXIL_AWREADY : out sl;
        S_AXIL_AWVALID : in sl;
        S_AXIL_BREADY : in sl;
        S_AXIL_BRESP : out slv ( 1 downto 0 );
        S_AXIL_BVALID : out sl;
        S_AXIL_RDATA : out slv ( AXIL_DATA_WIDTH - 1 downto 0 );
        S_AXIL_RRESP : out slv ( 1 downto 0 );
        S_AXIL_RVALID : out sl;
        S_AXIL_RREADY : in sl;
        S_AXIL_WDATA : in slv ( AXIL_DATA_WIDTH - 1 downto 0 );
        S_AXIL_WSTRB : in slv ( AXIL_DATA_WIDTH / 8 - 1 downto 0 );
        S_AXIL_WVALID : in sl;
        S_AXIL_WREADY : out sl;
        ----------------------------------------------------
        ------- Status and Reconfiguration Interface -------
        CFG_VALID : in sl;
        CFG_READY : out sl;
        CFG_SIZE : in slv(FFT_LOG_ADDR_WIDTH - 1 downto 0);
        CFG_SCALES : in slv(127 downto 0);
        CFG_PREFIX : in slv(31 downto 0);
        CFG_DIRECTION : in sl;
        CFG_PADDING_EN : in sl;
        CFG_SATURATION_EN : in sl;
        ST_OVF_REAL : out sl;
        ST_OVF_IMAG : out sl;
        ST_TLAST_MISSING : out sl;
        ST_TLAST_UNEXPECTED : out sl;
        --------------------------------------
        ------- Data Receive Interface -------
        S_AXI_TLast : in sl;
        S_AXI_TReady : out sl;
        S_AXI_TVld : in sl;
        S_AXI_TStrb : in slv(AXIS_SAMPLES_S - 1 downto 0);
        S_AXI_TData_Real : in Slv_1D_Array_t(0 to AXIS_SAMPLES_S - 1)(DSP_DATA_WIDTH - 1 downto 0);
        S_AXI_TData_Imag : in Slv_1D_Array_t(0 to AXIS_SAMPLES_S - 1)(DSP_DATA_WIDTH - 1 downto 0);
        --------------------------------------
        ------ Data Transmit Interface -------
        M_AXI_TReady : in sl;
        M_AXI_TVld : out sl;
        M_AXI_TLast : out sl;
        M_AXI_TStrb : out slv(AXIS_SAMPLES_M - 1 downto 0);
        M_AXI_TData_Real : out Slv_1D_Array_t(0 to AXIS_SAMPLES_M - 1)(DSP_DATA_WIDTH - 1 downto 0);
        M_AXI_TData_Imag : out Slv_1D_Array_t(0 to AXIS_SAMPLES_M - 1)(DSP_DATA_WIDTH - 1 downto 0)
    );
end FFT_TOP_BB;
RESOURCES & PERFORMANCE
The resource and performance estimates provided below represent a typical IP baseline. All metrics were obtained from post-place-and-route reports with the primary processing clock (FACLK) constrained to 500 MHz, using each EDA tool's default optimization strategy. The datapath precision (DSP_DATA_WIDTH and DSP_TWIDDLE_WIDTH) is uniformly configured to 16 bits. The targeted devices represent current high-end tiers: AMD UltraScale+™, Intel® Agilex™, and Efinix® Titanium series. If you need a utilization report for a different set of IP configuration parameters, please don't hesitate to contact me via the contact form at the bottom of this page.
SUMMARY
RAW RESULTS
LATENCY SPECIFICATIONS
The latency metrics provided below represent typical cycle counts obtained from cycle-accurate RTL simulations of Single-Precision Floating-Point arithmetic, with the internal pipeline latency configured to 10 clock cycles for both multipliers and adders. Real-world use cases with native Fixed-Point arithmetic are expected to yield lower latency. Latency is measured from the consumption of TLAST on S_AXIS to its transmission on M_AXIS. The benchmarked latencies exclude any stalls on the M_AXIS bus. If the downstream slave deasserts M_AXI_TReady, the core stalls, freezing the internal pipelines and extending the total cycle count. Each list below is headed by its configuration name, and each row gives the transform size in points followed by the latency in clock cycles.
BASE_DIF_R2
2, 56
4, 162
8, 158
16, 228
32, 334
64, 520
128, 882
256, 1628
512, 3206
1024, 6576
2048, 13786
4096, 29188
8192, 61998
16384, 131672
32768, 279170
65536, 590508
BASE_DIF_R4
4, 68
8, 120
16, 142
32, 224
64, 284
128, 502
256, 738
512, 1548
1024, 2584
2048, 5954
4096, 10574
8192, 24952
16384, 45444
32768, 106926
65536, 197050
BASE_DIF_R8
8, 102
16, 162
32, 192
64, 262
128, 448
256, 620
512, 974
1024, 2232
2048, 3652
4096, 6502
8192, 16784
16384, 29084
32768, 53694
65536, 139752
BASE_DIT_R2
2, 56
4, 162
8, 158
16, 228
32, 334
64, 520
128, 882
256, 1628
512, 3206
1024, 6576
2048, 13786
4096, 29188
8192, 61998
16384, 131672
32768, 279170
65536, 590508
BASE_DIT_R4
4, 68
8, 120
16, 142
32, 224
64, 284
128, 502
256, 738
512, 1548
1024, 2584
2048, 5954
4096, 10574
8192, 24952
16384, 45444
32768, 106926
65536, 197050
BASE_DIT_R8
8, 102
16, 162
32, 192
64, 262
128, 448
256, 620
512, 974
1024, 2232
2048, 3652
4096, 6502
8192, 16784
16384, 29084
32768, 53694
65536, 139752
RSDF_DIF
2, 61
4, 108
8, 158
16, 216
32, 290
64, 400
128, 574
256, 876
512, 1434
1024, 2504
2048, 4598
4096, 8740
8192, 16978
16384, 33408
32768, 66222
65536, 131804
RSDF_DIT
2, 61
4, 109
8, 159
16, 216
32, 290
64, 400
128, 574
256, 876
512, 1434
1024, 2504
2048, 4598
4096, 8740
8192, 16978
16384, 33408
32768, 66222
65536, 131804
PIPE_DIF_R2
2, 29
4, 82
8, 135
16, 194
32, 269
64, 365
128, 501
256, 719
512, 1097
1024, 1795
2048, 3133
4096, 5751
8192, 10929
16384, 21227
32768, 41765
65536, 82783
PIPE_DIF_R4
4, 39
8, 95
16, 107
32, 176
64, 200
128, 334
256, 402
512, 762
1024, 1000
2048, 2276
4096, 3188
8192, 8112
16384, 11712
32768, 31228
65536, 45580
PIPE_DIF_R8
8, 69
16, 132
32, 145
64, 180
128, 305
256, 357
512, 449
1024, 1078
2048, 1333
4096, 1852
8192, 6459
16384, 8393
32768, 12271
65536, 48689
PIPE_DIT_R2
2, 55
4, 110
8, 163
16, 224
32, 298
64, 394
128, 530
256, 748
512, 1126
1024, 1824
2048, 3162
4096, 5780
8192, 10957
16384, 21254
32768, 41791
65536, 82787
PIPE_DIT_R4
4, 65
8, 126
16, 134
32, 208
64, 232
128, 362
256, 434
512, 766
1024, 1032
2048, 2184
4096, 3220
8192, 7635
16384, 11743
32768, 29193
65536, 45589
PIPE_DIT_R8
8, 95
16, 167
32, 178
64, 206
128, 342
256, 396
512, 488
1024, 1081
2048, 1344
4096, 2017
8192, 6126
16384, 8180
32768, 13444
65536, 45646
Digit-Reversal Bypass
RSDF_DIF
2, 59
4, 103
8, 147
16, 194
32, 250
64, 322
128, 426
256, 594
512, 890
1024, 1442
2048, 2506
4096, 4594
8192, 8730
16384, 16962
32768, 33386
65536, 66194
RSDF_DIT
2, 59
4, 103
8, 147
16, 194
32, 250
64, 322
128, 426
256, 594
512, 890
1024, 1442
2048, 2506
4096, 4594
8192, 8730
16384, 16962
32768, 33386
65536, 66194
PIPE_DIF_R2
2, 27
4, 73
8, 118
16, 167
32, 223
64, 291
128, 383
256, 523
512, 759
1024, 1187
2048, 1999
4096, 3579
8192, 6695
16384, 12883
32768, 25215
65536, 49835
PIPE_DIF_R4
4, 37
8, 85
16, 96
32, 155
64, 171
128, 275
256, 309
512, 593
1024, 699
2048, 1703
4096, 2097
8192, 5981
16384, 7527
32768, 22931
65536, 29085
PIPE_DIF_R8
8, 67
16, 120
32, 131
64, 165
128, 272
256, 296
512, 354
1024, 902
2048, 1024
4096, 1278
8192, 5354
16384, 6260
32768, 8082
65536, 40382
PIPE_DIT_R2
2, 52
4, 99
8, 145
16, 196
32, 250
64, 318
128, 410
256, 550
512, 786
1024, 1214
2048, 2026
4096, 3606
8192, 6721
16384, 12908
32768, 25239
65536, 49837
PIPE_DIT_R4
4, 62
8, 111
16, 122
32, 179
64, 201
128, 295
256, 339
512, 589
1024, 729
2048, 1603
4096, 2127
8192, 5496
16384, 7556
32768, 20888
65536, 29092
PIPE_DIT_R8
8, 92
16, 147
32, 159
64, 190
128, 289
256, 327
512, 391
1024, 885
2048, 1027
4096, 1441
8192, 5001
16384, 6039
32768, 9253
65536, 37319
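The cycle counts above can be turned into wall-clock figures with a few lines of Python. The sketch below copies four 1024-point entries from the lists above and makes two assumptions: that the first group of lists includes the digit-reversal stage (it carries no heading of its own), and that the core runs at the 500 MHz clock used for the resource estimates. The names `LATENCY_CYCLES_1024` and `cycles_to_us` are hypothetical.

```python
# 1024-point latencies in clock cycles, copied from the lists above.
LATENCY_CYCLES_1024 = {
    "RSDF_DIF": {"with_reversal": 2504, "bypass": 1442},
    "PIPE_DIF_R8": {"with_reversal": 1078, "bypass": 902},
}

ASSUMED_CLOCK_MHZ = 500  # assumption: same clock as the resource benchmarks


def cycles_to_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_mhz


report = {}
for name, entry in LATENCY_CYCLES_1024.items():
    saved = entry["with_reversal"] - entry["bypass"]
    report[name] = {
        "with_reversal_us": cycles_to_us(entry["with_reversal"], ASSUMED_CLOCK_MHZ),
        "bypass_us": cycles_to_us(entry["bypass"], ASSUMED_CLOCK_MHZ),
        "saved_cycles": saved,
        "saved_fraction": saved / entry["with_reversal"],
    }

for name, row in report.items():
    print(name, row)
```

For the 1024-point RSDF_DIF entry, this works out to 1062 cycles saved by the bypass, roughly 42% of the measured latency.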
Seamless System Integration
Designed for maximum architectural flexibility, the FFT IP can be deployed as a standalone, inline processor within a high-speed DSP datapath, or as a memory-mapped hardware accelerator. For complete SoC offload solutions, the core pairs with my custom high-throughput DMA Engine (available in my portfolio), delivering a ready-to-use, system-level acceleration subsystem.
Deep Domain Expertise: From Algorithm to Silicon
This core is the culmination of extensive DSP research and development, originating from my work at the Czech Technical University in Prague. My expertise with the Cooley-Tukey algorithm spans the entire computational stack: starting from initial MATLAB mathematical models and massively parallel NVIDIA CUDA applications, progressing through optimized C++ implementations, and finally culminating in this uncompromising, resource-optimized ASIC/FPGA RTL architecture. When you license this IP, you are not just getting HDL code, you are partnering with a full-stack FFT expert.
Ready for Evaluation
I guarantee a robust, thoroughly verified, and feature-rich IP core engineered for peak performance and gapless throughput. Whether you require highly detailed PPA (Power, Performance, Area) reports, exact latency calculations for your specific target device, or an encrypted RTL model for your simulation environment, please do not hesitate to contact me.