FFT IP

by | Sep 7, 2024

Update:

A comprehensive architectural overhaul is in the final stages of validation. This update introduces a heavily optimized pipeline architecture designed to minimize resource utilization without compromising throughput or maximum operating frequency (F_max). Although the internal architecture has been re-engineered for efficiency, the design still adheres to the same code quality and verification standards. The core logic is now baselined, with a full release scheduled for the near future.

Features

  • Transform Size: Up to 64K-point
  • Processing Architectures: Base-Radix (Burst I/O), RSDF (Pipelined SDF), and SSR (Super Sample Rate)
  • Algorithm Topologies: Decimation-in-Time (DIT) & Decimation-in-Frequency (DIF)
  • Datapath: Native Fixed-Point with seamless Single-Precision Floating-Point support
  • Flexibility: Dynamic runtime reconfiguration (Forward/Inverse transform, Point size)
  • Interfaces: Industry-standard AXI4-Stream (Data with robust backpressure) & AXI4-Lite (Control)
  • Design Focus: Exceptional PPA efficiency (A lean, “muscle and bone” optimized RTL)
  • Codebase: 100% Vendor-Independent, pure VHDL-2008 synthesizable source code

The IP is in the final stages of development and testing.

Core Overview

The FFT IP Core is a high-performance, highly configurable implementation of the Discrete Fourier Transform (DFT) based on the Cooley-Tukey algorithm, engineered to deliver exceptional Power, Performance, and Area (PPA) results. Supporting transform sizes up to 65,536 points, the core natively handles fixed-point arithmetic while providing a seamless integration path for Single-Precision Floating-Point operations.
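As a point of reference for the algorithm named above, here is a minimal floating-point sketch of a radix-2 decimation-in-time Cooley-Tukey FFT, checked against a direct DFT. It is a textbook software model, not the core's RTL; the function names `dft`, `fft_radix2`, and `max_error` are illustrative assumptions.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a golden reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 DIT Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # N/2-point FFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # N/2-point FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle multiply
        out[k] = even[k] + t            # butterfly, upper output
        out[k + n // 2] = even[k] - t   # butterfly, lower output
    return out


def max_error(a, b):
    """Largest absolute difference between two equal-length sequences."""
    return max(abs(p - q) for p, q in zip(a, b))


# Self-check on an arbitrary 64-point test vector.
samples = [complex(i % 5, -(i % 3)) for i in range(64)]
assert max_error(dft(samples), fft_radix2(samples)) < 1e-9
```

A model of this kind is typically useful as a golden reference when comparing simulation output of a hardware FFT against expected results.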

It supports both forward and inverse transforms and a custom per-stage scaling schedule, in both Decimation-in-Frequency (DIF) and Decimation-in-Time (DIT) implementations. For system integration, the IP can be reconfigured at runtime either through a basic RTL reconfiguration interface or through a standardized AXI4-Lite register interface with optional Clock Domain Crossing (CDC). The core provides real-time status monitoring for arithmetic overflow and underflow and for invalid packets, plus an optional debug facility for diagnosing dynamic-range issues in fixed-point processing.
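To illustrate what a per-stage scaling schedule and an overflow flag mean in fixed-point terms, the sketch below models a radix-2 DIF FFT on integers. Everything here is an assumption made for illustration: the 16-bit data width, the 15-bit twiddle scaling, saturation on overflow, and the names `saturate` and `fixed_fft_dif` are not taken from the core.

```python
import cmath


def saturate(v, bits):
    """Clamp v to a signed `bits`-wide range; return (value, overflowed)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if v < lo:
        return lo, True
    if v > hi:
        return hi, True
    return v, False


def fixed_fft_dif(re, im, schedule, bits=16, tw_bits=15):
    """Radix-2 DIF FFT on integers (hypothetical model).

    schedule[s] is the right shift applied to the outputs of stage s.
    Outputs are in bit-reversed order. Returns (re, im, overflow_flag).
    """
    n = len(re)
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n or len(schedule) != stages:
        raise ValueError("need power-of-two length and one shift per stage")
    re, im = list(re), list(im)
    overflow = False
    half = n // 2
    for s in range(stages):
        for start in range(0, n, 2 * half):
            for j in range(half):
                a, b = start + j, start + j + half
                w = cmath.exp(-2j * cmath.pi * j / (2 * half))
                # Python ints hold 1.0 * 2**tw_bits exactly; hardware would
                # need an extra bit or a clamp for that one twiddle value.
                wr = round(w.real * (1 << tw_bits))
                wi = round(w.imag * (1 << tw_bits))
                sr, si = re[a] + re[b], im[a] + im[b]   # sum branch
                dr, di = re[a] - re[b], im[a] - im[b]   # difference branch
                tr = (dr * wr - di * wi) >> tw_bits     # twiddle multiply
                ti = (dr * wi + di * wr) >> tw_bits
                for idx, vr, vi in ((a, sr, si), (b, tr, ti)):
                    vr, o1 = saturate(vr >> schedule[s], bits)
                    vi, o2 = saturate(vi >> schedule[s], bits)
                    re[idx], im[idx] = vr, vi
                    overflow = overflow or o1 or o2
        half //= 2
    return re, im, overflow


# A DC input of 20000 overflows 16 bits after the first unscaled butterfly,
# while dividing by 2 in every stage keeps the data in range.
dc_re, dc_im = [20000] * 8, [0] * 8
_, _, ovf_unscaled = fixed_fft_dif(dc_re, dc_im, [0, 0, 0])
_, _, ovf_scaled = fixed_fft_dif(dc_re, dc_im, [1, 1, 1])
assert ovf_unscaled and not ovf_scaled
```

The trade-off the schedule controls is the usual one: more shifting protects against overflow at the cost of low-order precision, which is why an overflow flag is useful for tuning the schedule to the actual signal statistics.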

Rigorously verified in an industry-standard EDA simulator across all major configurations, this core is built to meet modern DSP challenges. Engineered with digital design best practices and high code quality, the IP is positioned to challenge generic, vendor-provided solutions.

Architecture Support

  • Base-Radix (Iterative / Burst I/O): Maximizes resource efficiency by looping data through a single hardware processing stage. It is ideal for applications where a low footprint (BRAM/DSP) is prioritized over continuous streaming.

  • RSDF (Radix Single Delay Feedback): A fully unrolled, pipelined architecture utilizing delay lines to provide continuous, 1-sample-per-clock-cycle throughput with an extremely lean memory footprint.

  • Pipelined SSR (Super Sample Rate): The highest-performance option. This fully pipelined and highly parallelized architecture processes multiple samples per clock cycle (e.g., up to 8 samples per cycle for a Radix-8 configuration), enabling processing of multi-gigabit data streams.
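To put rough numbers on the pipelined options above, the sketch below uses two textbook relationships: a classic radix-2 SDF pipeline has feedback delay lines of N/2, N/4, ..., 1 samples (N-1 in total), and an SSR datapath's sample rate is samples per cycle times clock frequency. These are generic estimates, not figures for this core; the function names and the 500 MHz clock in the example are assumptions.

```python
def sdf_delay_lengths(n):
    """Feedback delay-line depth per stage of a textbook radix-2 SDF pipeline."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    lengths = []
    depth = n // 2
    while depth >= 1:
        lengths.append(depth)  # stage k buffers N / 2**(k+1) samples
        depth //= 2
    return lengths


def sample_rate_msps(samples_per_cycle, clock_mhz):
    """Aggregate sample rate in MS/s for a datapath accepting P samples per clock."""
    return samples_per_cycle * clock_mhz


# 1024-point radix-2 SDF: 10 stages, 1023 samples of delay memory in total.
delays = sdf_delay_lengths(1024)
assert len(delays) == 10 and sum(delays) == 1023

# 8 samples per cycle at a hypothetical 500 MHz clock gives 4000 MS/s.
assert sample_rate_msps(8, 500) == 4000
```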

Throughput, Latency and Radix Variants

The core supports Radix-2, Radix-4, and Radix-8 butterfly engines, including mixed-radix capabilities. Higher-radix configurations reduce the number of processing stages (an N-point transform needs log_r N stages of radix r), which cuts overall latency and increases processing speed. Furthermore, the IP handles AXI4-Stream backpressure and guarantees gapless, continuous throughput under the following conditions:

  • Pipelined SDF (Radix-2): For N >= 8

  • Pipelined SSR (Radix-2): For N >= 64

  • Pipelined SSR (Radix-4): For N >= 128

  • Pipelined SSR (Radix-8): For N >= 256
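The effect of the radix on stage count can be sketched as follows. The function splits a power-of-two transform size into butterfly stages, filling in with a smaller radix when log2(N) is not a multiple of the chosen radix's bit width. The greedy largest-radix-first ordering and the name `stage_plan` are assumptions for illustration; the core's actual mixed-radix ordering is not specified here.

```python
def stage_plan(n, max_radix=8):
    """Greedy split of a power-of-two N into stages of radix <= max_radix."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    if max_radix not in (2, 4, 8):
        raise ValueError("max_radix must be 2, 4 or 8")
    bits_left = n.bit_length() - 1              # log2(N)
    bits_per_stage = max_radix.bit_length() - 1  # log2(radix)
    plan = []
    while bits_left > 0:
        step = min(bits_per_stage, bits_left)   # last stage may be smaller
        plan.append(1 << step)
        bits_left -= step
    return plan


# 1024 points: 10 radix-2 stages, 5 radix-4 stages,
# or 4 stages in a mixed radix-8/radix-2 plan.
assert stage_plan(1024, 2) == [2] * 10
assert stage_plan(1024, 4) == [4] * 5
assert stage_plan(1024, 8) == [8, 8, 8, 2]
```

For the largest supported size, `stage_plan(65536, 8)` gives six stages against sixteen for pure radix-2, which is where the latency reduction comes from.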

Documentation (Under development)

Seamless System Integration

Designed for maximum architectural flexibility, the FFT IP can be deployed as a standalone, inline processor within a high-speed DSP datapath, or as a memory-mapped hardware accelerator. For complete SoC offload solutions, the core pairs perfectly with my custom high-throughput DMA Engine (available in the portfolio [ HERE ]), delivering a ready-to-use, system-level acceleration subsystem, as demonstrated in the application diagram below.

Deep Domain Expertise: From Algorithm to Silicon

This core is the culmination of extensive DSP research and development, originating from my work at the Czech Technical University in Prague. My expertise with the Cooley-Tukey algorithm spans the entire computational stack: starting from initial MATLAB mathematical models and massively parallel NVIDIA CUDA applications, progressing through optimized C++ implementations, and ending in this resource-optimized ASIC/FPGA RTL architecture. When you license this IP, you are not just getting HDL code; you are partnering with a full-stack FFT expert.

Ready for Evaluation

I guarantee a robust, rigorously verified, and feature-rich IP core engineered for peak performance and gapless throughput. Whether you require highly detailed PPA (Power, Performance, Area) reports, exact latency calculations for your specific target device, or an encrypted RTL model for your simulation environment, please do not hesitate to contact me.

Vojtech Ters

IrisCores.com

Contact Form