FG Architektur eingebetteter Systeme

92 Items

Recent Submissions
Enabling GPU software developers to optimize their applications – The LPGPU2 approach

Juurlink, Ben ; Lucas, Jan ; Mammeri, Nadjib ; Keramidas, Georgios ; Pontzolkova, Katerina ; Aransay, Ignacio ; Kokkala, Chrysa ; Bliss, Martyn ; Richards, Andrew (2017)

Low-power GPUs have become ubiquitous; they can be found in domains ranging from wearable and mobile computing to automotive systems. With this ubiquity has come a wider range of applications exploiting low-power GPUs, placing ever-increasing demands on the expected performance and power efficiency of the devices. The LPGPU2 project is an EU-funded, 30-month Innovation Action project targetin...

Highly parallel HEVC decoding for heterogeneous systems with CPU and GPU

Wang, Biao ; de Souza, Diego F. ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben ; Ilic, Aleksandar ; Roma, Nuno ; Sousa, Leonel (2017)

The High Efficiency Video Coding (HEVC) standard provides a higher compression efficiency than other video coding standards but at the cost of an increased computational load, which makes it hard to achieve real-time encoding/decoding for ultra-high-resolution and high-quality video sequences. Graphics Processing Units (GPUs) are known to provide massive processing capability for highly parallel and re...

Highly Parallel HEVC Decoding for Heterogeneous Systems with CPU and GPU - Research Data

Wang, Biao ; Felix de Souza, Diego ; Alvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben ; Ilic, Aleksandar ; Roma, Nuno ; Sousa, Leonel (2017)

The High Efficiency Video Coding (HEVC) standard provides a higher compression efficiency than other video coding standards but at the cost of an increased computational load, which makes it hard to achieve real-time encoding/decoding for ultra-high-resolution and high-quality video sequences. Graphics Processing Units (GPUs) are known to provide massive processing capability for highly parallel a...

Optimal DC/AC Data Bus Inversion Coding

Lucas, Jan ; Lal, Sohan ; Juurlink, Ben (2018-10-29)

GDDR5 and DDR4 memories use data bus inversion (DBI) coding to reduce termination power and decrease the number of output transitions. Two main strategies exist for encoding data using DBI: DBI DC minimizes the number of outputs transmitting a zero, while DBI AC minimizes the number of signal transitions. We show that neither of these strategies is optimal and reduction of interface power of up...

VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs

Mammeri, Nadjib (2018-09-30)

GPUs have become immensely important computational units on embedded and mobile devices. However, GPGPU developers are often not able to exploit the compute power offered by GPUs on these devices mainly due to the lack of support of traditional programming models such as CUDA and OpenCL. The recent introduction of the Vulkan API provides a new programming model that could be explored for GPGPU ...

VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs

Mammeri, Nadjib ; Juurlink, Ben (2018)

GPUs have become immensely important computational units on embedded and mobile devices. However, GPGPU developers are often not able to exploit the compute power offered by GPUs on these devices mainly due to the lack of support of traditional programming models such as CUDA and OpenCL. The recent introduction of the Vulkan API provides a new programming model that could be explored for GPGPU ...

Application-Specific Cache and Prefetching for HEVC CABAC Decoding

Habermann, Philipp ; Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2017)

Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the HEVC/H.265 video coding standard. As in its predecessor, H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Besides other optimizations, the replacement of the context model memory by a smaller cache has been proposed for hardware decoders, resulting in an improve...

E²MC: Entropy Encoding Based Memory Compression for GPUs

Lal, Sohan ; Lucas, Jan ; Juurlink, Ben (2017)

Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Unfortunately, off-chip memory bandwidth is growing slower than the number of cores and has become a performance bottleneck. Thus, optimizations of effective memory bandwidth play a significant role for scaling the performance of GPUs....

ALUPower: Data Dependent Power Consumption in GPUs - Research Data

Lucas, Jan ; Juurlink, Ben (2016)

Existing architectural power models for GPUs count activities such as executing floating point or integer instructions, but do not consider the data values processed. While data value dependent power consumption can often be neglected when performing architectural simulations of high performance Out-of-Order (OoO) CPUs, in our related paper we show that this approach is invalid for estimating t...

Real-Time Vision System for License Plate Detection and Recognition on FPGA

Rosli, Faird ; Elhossini, Ahmed ; Juurlink, Ben (2015)

Rapid development of the Field Programmable Gate Array (FPGA) offers an alternative way to provide acceleration for computationally intensive tasks such as digital signal and image processing. Its ability to perform parallel processing shows the potential in implementing a high speed vision system. Out of numerous applications of computer vision, this paper focuses on the hardware implementatio...

High performance CCSDS image data compression using GPGPUs for space applications

Ramanarayanan, Sunil Chokkanathapuram ; Manthey, Kristian ; Juurlink, Ben (2015)

Graphics processing units (GPUs) are now widely used as computing architectures for inherently data-parallel signal processing applications. In principle, GPUs can achieve significant speed-ups over central processing units (CPUs), especially for data-parallel applications that demand high throughput. The paper investigates ...

Proximity Scheme for Instruction Caches in Tiled CMP Architectures

Alawneh, Tareq ; Chi, Chi Ching ; Elhossini, Ahmed ; Juurlink, Ben (2015)

Recent research results show that there is a high degree of code sharing between cores in multi-core architectures. In this paper we propose a proximity scheme for the instruction caches, a scheme in which the shared code blocks among the neighbouring L2 caches in tiled multi-core architectures are exploited to reduce the average cache miss penalty and the on-chip network traffic. We evaluate t...

A Benchmark Suite for Evaluating Parallel Programming Models

Andersch, Michael ; Juurlink, Ben ; Chi, Chi Ching (2011)

The transition to multi-core processors forces software developers to explicitly exploit thread-level parallelism to increase performance. The associated programmability problem has led to the introduction of a plethora of parallel programming models that aim at simplifying software development by raising the abstraction level. Since industry has not settled on a single model, however, multi...

Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators

Hunger, Lars ; Cosenza, Biagio ; Kimeswenger, Stefan ; Fahringer, Thomas (2015)

A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of 3D fast Fourier transform (FFT), which does not scale well for RF bigger than the available memory; t...

An evaluation of current SIMD programming models for C++

Pohl, Angela ; Cosenza, Biagio ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben (2016)

SIMD extensions were added to microprocessors in the mid '90s to speed up data-parallel code by vectorization. Unfortunately, the SIMD programming model has barely evolved and the most efficient utilization is still obtained with elaborate intrinsics coding. As a consequence, several approaches to write efficient and portable SIMD code have been proposed. In this work, we evaluate current progr...

A Quantitative Analysis of the Memory Architecture of FPGA-SoCs

Göbel, Matthias ; Elhossini, Ahmed ; Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2017)

In recent years, so-called FPGA-SoCs have been introduced by Intel (formerly Altera) and Xilinx. These devices combine multi-core processors with programmable logic. This paper analyzes the various memory and communication interconnects found in actual devices, particularly the Zynq-7020 and Zynq-7045 from Xilinx and the Cyclone V SE SoC from Intel. Issues such as different access patterns, cac...

The LPGPU2 Project: Low-Power Parallel Computing on GPUs

Juurlink, Ben ; Lucas, Jan ; Mammeri, Nadjib ; Bliss, Martyn ; Keramidas, Georgios ; Kokkala, Chrysa ; Richards, Andrew (2017)

The LPGPU2 project is a 30-month project (Innovation Action) funded by the European Union. Its overall goal is to develop an analysis and visualization framework that enables GPU application developers to improve the performance and power consumption of their applications. To achieve this overall goal, several key objectives need to be achieved. First, several applications (use cases) need to b...

GPU Parallelization of HEVC In-Loop Filters

Wang, Biao ; de Souza, Diego F. ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben ; Ilic, Aleksandar ; Roma, Nuno ; Sousa, Leonel (2017-01)

In the High Efficiency Video Coding (HEVC) standard, multiple decoding modules have been designed to take advantage of parallel processing. In particular, the HEVC in-loop filters (i.e., the deblocking filter and sample adaptive offset) were conceived to be exploited by parallel architectures. However, the type of the offered parallelism mostly suits the capabilities of multi-core CPUs, thus ma...

Efficient HEVC decoder for heterogeneous CPU with GPU systems

Wang, Biao ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben ; de Souza, Diego F. ; Ilic, Aleksandar ; Roma, Nuno ; Sousa, Leonel (2016)

The High Efficiency Video Coding (HEVC) standard provides higher compression efficiency than other video coding standards but at the cost of increased computational load, which makes it hard to achieve real-time encoding/decoding of high-resolution, high-quality video sequences. In this paper, we investigate how Graphics Processing Units (GPUs) can be employed to accelerate HEVC decoding. GPUs ...

Syntax Element Partitioning for high-throughput HEVC CABAC decoding

Habermann, Philipp ; Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2017)

Encoder and decoder implementations of the High Efficiency Video Coding (HEVC) standard have been subject to many optimization approaches since its release in 2013. However, the real-time decoding of high-quality, ultra-high-resolution videos is still a very challenging task. Entropy decoding (CABAC) in particular is most often the throughput bottleneck for very high bitrates. Syntax Element Pa...