Thumbnail Image

A QHD-capable parallel H.264 decoder

Chi, Chi Ching; Juurlink, Ben

Video coding follows the trend of demanding higher performance every new generation, and therefore could utilize many-cores. A complete parallelization of H.264, which is the most advanced video coding standard, was found to be difficult due to the complexity of the standard. In this paper a parallel implementation of a complete H.264 decoder is presented. Our parallelization strategy exploits function-level as well as data-level parallelism. Function-level parallelism is used to pipeline the H.264 decoding stages. Data-level parallelism is exploited within the two most time consuming stages, the entropy decoding stage and the macroblock decoding stage. The parallelization strategy has been implemented and optimized on three platforms with very different memory architectures, namely an 8-core SMP, a 64-core cc-NUMA, and an 18-core Cell platform. Evaluations have been performed using 4kx2k QHD sequences. On the SMP platform a maximum speedup of 4.5x is achieved. The SMP-implementation is reasonably performance portable as it achieves a speedup of 26.6x on the cc-NUMA system. However, to obtain the highest performance (speedup of 33.4x and throughput of 200 QHD frames per second), several cc-NUMA specific optimizations are necessary such as optimizing the page placement and statically assigning threads to cores. Finally, on the Cell platform a near ideal speedup of 16.5x is achieved by completely hiding the communication latency.
Published in: Proceedings of the International Conference on Supercomputing, 10.1145/1995896.1995945, Association for Computing Machinery (ACM)