DOI: 10.3724/SP.J.1089.2010.11147

Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报) 2010/22:9 PP.1435-1442

High Speed FIR Digital Filtering on GPU

This paper proposes a massively parallel FIR filter algorithm with its GPU implementation to improve the efficiency and scalability of GPU based FIR. By the algorithm proposed, the problem is formulated as a matrix multiplication operation that offers sufficient data level parallelism for parallel filtering on modern GPUs. In addition, the GPU implementation guarantees that each thread could complete a two-signal operation within a single instruction cycle. Efficient and effective strategies for load balancing and memory mapping are also introduced to further improve the performance. The proposed algorithm and the corresponding GPU implementation could achieve an efficient multi-channel signal filtering. Experimental results on a GTX260+ graphics card prove that the FIR filter algorithm could be used to deliver on average a speed-up of 203X and an efficiency increase over 40%.

Key words:finite impulse response (FIR),digital filter,parallel computing,CUDA,GPU

ReleaseDate:2014-07-21 15:25:55

Funds:NVidia Professor Partnership Award; CUDA Center of Excellence.

[1] Sophocles J O. Introduction to signal processing (photocopy version)[M]. Beijing: Tsinghua University Press, 1998: 125-567

[2] Park J, Muhammad K, Roy K. High-performance FIR filter design based on sharing multiplication[J]. IEEE Transactions on Very Large Scale Integration(VLSI) Systems, 2003, 11(2): 244-253

[3] Voronenko Y, Püschel M. Multiplierless multiple constant multiplication[J]. ACM Transactions on Algorithms, 2007, 3(2): 11-20

[4] Mahesh R, Vinod A P. A new common subexpression elimination algorithm for realizing low-complexity higher order digital filters[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2008, 27(2): 217-229

[5] Li Ying, Lu Weijun, Yu Dunshan,et al. A resource optimizing algorithm in FPGA based high speed FIR digital filters[J]. Aeta Seientiarum Naturalium Universitatis Pekinensis, 2009, 45(2): 222-226 (in Chinese) (李 莹, 路卫军, 于敦山, 等. 一种在FPGA上实现FIR数字滤波器的资源优化算法[J]. 北京大学学报: 自然科学版, 2009, 45(2): 222-226)

[6] Mou Z J, Duhamel P. Fast FIR filtering: algorithms and implementations[J]. Signal Processing, 1987, 13(4): 377-384

[7] Cheng C, Parhi K K. Hardware efficient fast parallel FIR filter structures based on iterated short convolution[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2004, 51(8): 1492-1500

[8] Conway R. Efficient residue arithmetic based parallel fixed coefficient FIR filters[C]// Proceedings of IEEE International Symposium on Circuits and Systems, Seattle: The Printing House, 2008: 1484-1487

[9] Cheng C, Parhi K K. Low- cost parallel FIR filter structures with 2-stage parallelism[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2007, 54(2): 280-290

[10] Wu Enhua, Liu Youquan. General purpose computing on GPU[J]. Journal of Computer-Aided Design & Computer Graphics, 2004, 16(5): 601-612 (in Chinese) (吴恩华, 柳有权. 基于图形处理器(GPU)的通用计算[J]. 计算机辅助设计与图形学学报, 2004, 16(5): 601-612)

[11] Smirnov A, Chiueh T C. An Implementation of a FIR filter on a GPU[R]. New York: State University of New York. Experimental Computer Systems Lab, 2005

[12] Koon L. CUDA real FIR crossover on GPU[OL]. [2009-11-03].

[13] NVIDIA CUDA C programming best practices guide[M]. Santa Clara: NVIDIA Corporation, 2009

[14] Volkv V, Demmel J W. Benchmarking GPUs to tune dense linear algebra[C]// Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, Austin: IEEE Press, 2008: 1-11

[15] Grama A, Gupta A, Karypis G,et al. Introduction to parallel computing[M]. 2nd ed. Upper Saddle River: Addison Wesley, 2003: 143-272

[16] Gongáleg P, Cabaleiro J C, Pena T F. On parallel solvers for sparse triangular systems[J]. Journal of Systems Architecture, 2000, 46(8): 675-685

[17] Kung S Y, Hu Y H. A highly concurrent algorithm and pipelined architecture for solving Toeplitz systems[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing 1983, ASSP-31(1): 66-76

[18] Sundaram N, Raghunatnan A, Chakradhar S T. A framework for efficient and scalable execution of domain-specific templates on GPUs[C]// Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium. New York: IEEE Press, 2009: 1-12

[19] Cormen T H, Leiserson C E, Rivest R L,et al. Introduction to algorithms[M]. 2nd ed. Boston: The MIT Press, 2001: 323-331

[20] Bakhoda A, Yuan G L, Fung W W L,et al. Analyzing CUDA workloads using a detailed GPU simulator[C]// Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software. New York: IEEE Press, 2009: 163-174