FFT IP

by IrisCores | Sep 7, 2024

Overview

The FFT IP implements by default the Cooley-Tukey Decimation in Frequency (DIF) FFT algorithm, an efficient and performance-optimized implementation of the DFT (Discrete Fourier Transform), with sizes of up to 65536 samples. The core supports both fixed-point and floating-point (single precision) implementations; floating-point math, however, requires external IPs. Both forward and inverse transforms are supported. The core can also be reconfigured at runtime and provides various status signals, including invalid FFT packet detection and math overflow. The IP supports both Base-Radix and Pipelined architectures. Base-Radix computes 1 FFT butterfly (Radix 2, Radix 4 or Radix 8) per clock cycle. The Pipelined architecture computes all FFT stages in parallel at the same rate (i.e. fully pipelined, at the cost of extra resources). Mixed-radix transforms are also supported. The core has a standardized AXI4-Lite status and control interface with optional CDC to support asynchronous control and processing clock domains.
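
For readers unfamiliar with the DIF variant of Cooley-Tukey named above, a minimal software sketch of a radix-2 DIF FFT may help. This is floating-point Python for illustration only and is not the core's RTL; the function name `fft_dif_radix2` and the final reordering step are assumptions made for this example.

```python
import cmath


def fft_dif_radix2(x):
    """Iterative radix-2 decimation-in-frequency FFT (illustrative sketch)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]

    span = n
    while span > 1:                      # one pass of this loop = one FFT stage
        half = span // 2
        for start in range(0, n, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)  # twiddle factor
                top = a[start + k]
                bot = a[start + k + half]
                # DIF butterfly: add/subtract first, then apply the twiddle
                a[start + k] = top + bot
                a[start + k + half] = (top - bot) * w
        span = half

    # DIF leaves the spectrum in bit-reversed order; reorder to natural order
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = a[i]
    return out
```

For example, `fft_dif_radix2([1, 0, 0, 0])` returns four values equal to 1, the flat spectrum of a unit impulse. A radix-4 or radix-8 stage merges two or three of these radix-2 passes into one butterfly.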

The base radix IP configuration (R2, R4 or R8) implements a single FFT stage. To process the required transform, multiple passes through this stage are required. The number of FFT stages is equal to ceil(log_Radix(FFT size)), i.e. the base-Radix logarithm of the FFT size rounded up; for a 4096-point transform and a base radix of 4, 6 stages need to be processed. The number of stages required to compute the transform directly determines the core’s latency. That is why higher Base-R transforms reduce latency and increase the overall performance of the core. The difference between the Base-R and pipelined architectures is that within the pipelined architecture, multiple transforms are processed in parallel, although both will still have the same latency (provided the base radix is equal). The core also supports both Decimation in Frequency (DIF) and Decimation in Time (DIT). There are, however, no benefits of using DIT over DIF: the results are equal in both cases except for some rounding differences due to the different order of the math operations.
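
The stage-count formula above can be sketched in a few lines of Python. The helper name `fft_stage_count` is hypothetical and not part of the IP; the loop uses integer arithmetic instead of a floating-point logarithm to avoid rounding surprises at exact powers of the radix.

```python
def fft_stage_count(fft_size, radix):
    """Return ceil(log_radix(fft_size)) using integer arithmetic only."""
    if fft_size < 2 or radix < 2:
        raise ValueError("fft_size and radix must both be at least 2")
    stages = 0
    covered = 1
    # Multiply by the radix until the covered size reaches the FFT size
    while covered < fft_size:
        covered *= radix
        stages += 1
    return stages


# 4096-point transform with the three base radixes
for radix in (2, 4, 8):
    print(radix, fft_stage_count(4096, radix))
```

For a 4096-point transform this gives 12 stages for radix 2, 6 for radix 4 and 4 for radix 8, which illustrates why a higher base radix shortens the latency.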

Features

FFT Sizes from 2 to 65536
Dynamic reconfiguration
AXI-Stream & AXI4-Lite Compliant
Base-2 / Base-4 / Base-8 / Pipelined*** architectures
Mixed Radix (2 / 4 / 8)
Fixed point and Floating point
Variable data and twiddle factor widths
Natural output order by default
Both forward and inverse transforms supported
Full RTL-based VHDL2008 Source code without 3rd party IPs and/or vendor dependencies

*** The PIPELINE architecture is currently under rework and its usage is not recommended until further notice due to excessive memory utilization. As part of this rework, the FFT IP core is also undergoing additional enhancements.

NOTE: The IP’s resource estimates and FMAX timing performance were evaluated with the Vivado 2024.2 tool targeting the xcku5p-ffvb676-2-e device (KCU116, Kintex UltraScale+) or the xcu50-fsvh2104-2-e device (Alveo U50, Virtex UltraScale+). All evaluation was done with the default tool configuration for Synthesis / Optimization / Place & Route. Due to the large number of possible IP configurations, only the assumed “worst case” was processed: 32-bit data width and 24-bit twiddle width with the fixed-point DIF architecture. Mixed radix implementations would require additional DSPs. The optional timing optimizations available within the core were enabled in all cases.

Resource Estimates - R2

FFT_SIZE	FFs [K]	LUTs [K]	BRAM [36Kbit]	URAM [288Kbit]	DSP	~FMAX [MHz]
16	4.9	2.9	0	0	8	400
32	5.0	3.0	0	0	8	400
64	5.1	3.0	1	0	8	400
128	5.2	3.2	1	0	8	400
256	5.0	3.2	3	0	8	400
512	5.1	3.2	3	0	8	400
1024	5.3	3.3	4	0	8	400
2048	5.3	3.4	7	0	8	400
4096	5.5	3.4	6	2	8	400
8192	5.5	3.5	12	2	8	400
16384	5.7	3.6	56	0	8	400
32768	5.7	3.7	112	0	8	400
65536	6.5	3.8	224	0	8	400

Resource Estimates - R4

FFT_SIZE	FFs [K]	LUTs [K]	BRAM [36Kbit]	URAM [288Kbit]	DSP	~FMAX [MHz]
16	10.0	7.7	0	0	24	400
64	10.5	8	3	0	24	400
256	10.8	8.5	3	0	24	400
1024	10.8	8.7	8	0	24	400
4096	11.3	9	20	0	25	400
16384	11.7	9.3	48	4	25	400
65536	12.8	10.1	320	0	25	400

Resource Estimates - R8

FFT_SIZE	FFs [K]	LUTs [K]	BRAM [36Kbit]	URAM [288Kbit]	DSP	~FMAX [MHz]
64	27.5	35.6	7	0	56	320
512	30.0	43.0	7	0	56	320
4096	30.1	47.0	32	0	60	320
32768	33.3	53.3	192	0	60	320

Resource Estimates - Pipelined

WARNING: Even though the pipelined FFT architecture is fully functional and able to meet timing requirements, its resource usage is currently NOT optimized. Using it is not recommended, as the memory requirements grow rapidly with the FFT size. This option is currently under rework.

FFT_SIZE	FFs [K]	LUTs [K]	BRAM [36Kbit]	URAM [288Kbit]	DSP	~FMAX [MHz]
16	18.1	9.5	0	0	32	350
32	22.2	12.0	0	0	40	350
64	25.8	14.0	6	0	48	350
128	29.6	16.8	7	0	56	350
256	30.8	17.6	30	0	64	350
512	36.3	20.7	33	0	72	350
1024	40.7	23.3	46	0	80	350
2048	45.0	26.0	89	0	88	350
4096	49.9	28.8	72	30	96	350
8192	54.8	31.5	156	32	104	350
16384	60.3	35.6	880	0	112	350

Latest documentation for the FFT IP Core



FAQ

How is the floating point math supported?
When the IP is configured to implement floating-point math, it expects specific 3rd-party components to be available during synthesis. There should be 3 components: floating-point addition, subtraction and multiplication. The exact interface description is given in the documentation, which is currently available on request.
What are the benefits of using higher-order radixes?
For the Base-R implementation, choosing a higher-order radix always leads to a performance boost at the cost of extra resource usage. It is therefore possible to balance performance against resource usage.
How does the base radix choice affect the pipelined version?
For the pipelined architecture, the radix should always be left at the default of 2. Even though higher-order radixes are supported, they bring no extra benefit, because the core can receive/transmit only one sample at a time. The default Base-2 pipelined architecture is able to keep up with this processing speed.
Is the AXI4-Lite interface required?
The core can be set to process a pre-defined transform at instantiation, so the AXI4-Lite interface is generally not required. It does, however, add features that may help when using the IP, such as math overflow status or invalid packet detection.
Are there any benefits of using the DIT over DIF?
No, there are no benefits; it is recommended to keep the default DIF algorithm. DIT is, however, supported because upcoming releases of other IPs will use the DIT algorithm.
How are the timing and resource estimates measured?
They were measured in Vivado 2024.1 on Windows after design routing with a target clock frequency of 500 MHz. If timing failed (most of the cases), the magnitude of the negative slack was added to the clock period to obtain the maximum frequency. E.g. for 500 MHz the period is 2 ns; with a negative slack of 0.5 ns, FMAX was calculated as 1 / (2.0 ns + 0.5 ns) = 400 MHz. The design implementation used all default values.
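
The FMAX arithmetic described in this answer can be reproduced with a small Python sketch. The function name `fmax_mhz` and the convention of passing the slack as a positive magnitude are assumptions made for this example, not part of any tool flow.

```python
def fmax_mhz(target_mhz, negative_slack_ns):
    """Estimate FMAX from the target clock and the magnitude of the negative slack.

    negative_slack_ns is passed as a positive number (e.g. 0.5 for a reported
    slack of -0.5 ns); pass 0.0 when timing was met.
    """
    if target_mhz <= 0 or negative_slack_ns < 0:
        raise ValueError("expected a positive clock and a non-negative slack magnitude")
    target_period_ns = 1000.0 / target_mhz            # 500 MHz -> 2.0 ns
    achievable_period_ns = target_period_ns + negative_slack_ns
    return 1000.0 / achievable_period_ns


print(fmax_mhz(500.0, 0.5))  # the 500 MHz / 0.5 ns example from the text
```

With the numbers from the text (500 MHz target, 0.5 ns negative slack) the estimate is 400 MHz. Note that this is an extrapolation from a single run, so a design re-implemented at the estimated frequency may close timing slightly differently.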

Please contact me for more details on documentation and additional features. The core can be used as a stand-alone IP inside a DSP processing chain or as an offload accelerator, in which case it could, for example, be used together with a DMA component. The DMA is also in the portfolio HERE. The image above is taken from an FFT demonstration application, where the core works in the “Accelerator” version along with DMA as shown below:

I initially started to develop the FFT algorithm at the Czech Technical University in Prague, CZ, where I implemented the algorithm in Matlab and NVIDIA CUDA. Later on, I reworked it in C++ (not officially released, though), and now I can proudly present a version intended for ASIC/FPGA platforms. If you are looking for an FFT expert, look no further ☄️!

I always guarantee feature-rich and bug-free functionality with the highest performance. For more information, such as latency/performance estimates and/or resource usage, please don’t hesitate to contact me.

Vojtech Ters

IrisCores.com