FG Architektur eingebetteter Systeme (Embedded Systems Architecture research group)

83 Items

Recent Submissions
Real-Time Vision System for License Plate Detection and Recognition on FPGA

Rosli, Faird ; Elhossini, Ahmed ; Juurlink, Ben (2015)

The rapid development of Field Programmable Gate Arrays (FPGAs) offers an alternative way to accelerate computationally intensive tasks such as digital signal and image processing. Their ability to perform parallel processing shows potential for implementing a high-speed vision system. Out of numerous applications of computer vision, this paper focuses on the hardware implementatio...

High performance CCSDS image data compression using GPGPUs for space applications

Ramanarayanan, Sunil Chokkanathapuram ; Manthey, Kristian ; Juurlink, Ben (2015)

Graphics processing units (GPUs) are now widely used as computing architectures for inherently data-parallel signal processing applications. In principle, GPUs can achieve significant speed-up over central processing units (CPUs), especially for data-parallel applications that demand high throughput. The paper investigates ...

Proximity Scheme for Instruction Caches in Tiled CMP Architectures

Alawneh, Tareq ; Chi, Chi Ching ; Elhossini, Ahmed ; Juurlink, Ben (2015)

Recent research results show that there is a high degree of code sharing between cores in multi-core architectures. In this paper we propose a proximity scheme for the instruction caches, a scheme in which the shared code blocks among the neighbouring L2 caches in tiled multi-core architectures are exploited to reduce the average cache miss penalty and the on-chip network traffic. We evaluate t...

A Benchmark Suite for Evaluating Parallel Programming Models

Andersch, Michael ; Juurlink, Ben ; Chi, Chi Ching (2011)

The transition to multi-core processors forces software developers to explicitly exploit thread-level parallelism to increase performance. The associated programmability problem has led to the introduction of a plethora of parallel programming models that aim at simplifying software development by raising the abstraction level. Since industry has not settled on a single model, however, multi...

Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators

Hunger, Lars ; Cosenza, Biagio ; Kimeswenger, Stefan ; Fahringer, Thomas (2015)

A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of 3D fast Fourier transform (FFT), which does not scale well for RF bigger than the available memory; t...

An evaluation of current SIMD programming models for C++

Pohl, Angela ; Cosenza, Biagio ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben (2016)

SIMD extensions were added to microprocessors in the mid '90s to speed up data-parallel code by vectorization. Unfortunately, the SIMD programming model has barely evolved and the most efficient utilization is still obtained with elaborate intrinsics coding. As a consequence, several approaches to write efficient and portable SIMD code have been proposed. In this work, we evaluate current progr...

A Quantitative Analysis of the Memory Architecture of FPGA-SoCs

Göbel, Matthias ; Elhossini, Ahmed ; Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2017)

In recent years, so-called FPGA-SoCs have been introduced by Intel (formerly Altera) and Xilinx. These devices combine multi-core processors with programmable logic. This paper analyzes the various memory and communication interconnects found in actual devices, particularly the Zynq-7020 and Zynq-7045 from Xilinx and the Cyclone V SE SoC from Intel. Issues such as different access patterns, cac...

The LPGPU2 Project: Low-Power Parallel Computing on GPUs

Juurlink, Ben ; Lucas, Jan ; Mammeri, Nadjib ; Bliss, Martyn ; Keramidas, Georgios ; Kokkala, Chrysa ; Richards, Andrew (2017)

The LPGPU2 project is a 30-month project (Innovation Action) funded by the European Union. Its overall goal is to develop an analysis and visualization framework that enables GPU application developers to improve the performance and power consumption of their applications. To reach this goal, several key objectives must be met. First, several applications (use cases) need to b...

GPU Parallelization of HEVC In-Loop Filters

Wang, Biao ; de Souza, Diego F. ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben ; Ilic, Aleksandar ; Roma, Nuno ; Sousa, Leonel (2017-01)

In the High Efficiency Video Coding (HEVC) standard, multiple decoding modules have been designed to take advantage of parallel processing. In particular, the HEVC in-loop filters (i.e., the deblocking filter and sample adaptive offset) were conceived to be exploited by parallel architectures. However, the type of parallelism offered mostly suits the capabilities of multi-core CPUs, thus ma...

Efficient HEVC decoder for heterogeneous CPU with GPU systems

Wang, Biao ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben ; de Souza, Diego F. ; Ilic, Aleksandar ; Roma, Nuno ; Sousa, Leonel (2016)

The High Efficiency Video Coding (HEVC) standard provides higher compression efficiency than other video coding standards but at the cost of increased computational load, which makes it hard to achieve real-time encoding/decoding of high-resolution, high-quality video sequences. In this paper, we investigate how Graphics Processing Units (GPUs) can be employed to accelerate HEVC decoding. GPUs ...

Syntax Element Partitioning for high-throughput HEVC CABAC decoding

Habermann, Philipp ; Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2017)

Encoder and decoder implementations of the High Efficiency Video Coding (HEVC) standard have been subject to many optimization approaches since its release in 2013. However, the real-time decoding of high-quality and ultra-high-resolution videos is still a very challenging task. Entropy decoding (CABAC) in particular is most often the throughput bottleneck for very high bitrates. Syntax Element Pa...

Static optimization in PHP 7

Popov, Nikita ; Cosenza, Biagio ; Juurlink, Ben ; Stogov, Dmitry (2017)

PHP is a dynamically typed programming language commonly used for the server-side implementation of web applications. Approachability and ease of deployment have made PHP one of the most widely used scripting languages for the web, powering important web applications such as WordPress, Wikipedia, and Facebook. PHP's highly dynamic nature, while providing useful language features, also makes it ...

A Methodology for Predicting Application-Specific Achievable Memory Bandwidth for HW/SW-Codesign

Göbel, Matthias ; Elhossini, Ahmed ; Juurlink, Ben (2017)

The trend of using heterogeneous computing and HW/SW-Codesign approaches allows increasing performance significantly while reducing power consumption. One of the main challenges when combining multiple processing devices is the communication, as an inefficient communication configuration can pose a bottleneck to the overall system performance. To address this problem, we present a methodology t...

E²MC: Entropy Encoding Based Memory Compression for GPUs

Lal, Sohan ; Lucas, Jan ; Juurlink, Ben (2017)

Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Unfortunately, off-chip memory bandwidth is growing slower than the number of cores and has become a performance bottleneck. Thus, optimizations of effective memory bandwidth play a significant role for scaling the performance of GPUs....

ALUPower: Data Dependent Power Consumption in GPUs

Lucas, Jan ; Juurlink, Ben (2016)

Existing architectural power models for GPUs count activities such as executing floating point or integer instructions, but do not consider the data values processed. While data value dependent power consumption can often be neglected when performing architectural simulations of high performance Out-of-Order (OoO) CPUs, we show that this approach is invalid for estimating the power consumption ...

Autotuning Stencil Computations with Structural Ordinal Regression Learning

Cosenza, Biagio ; Durillo, Juan J. ; Ermon, Stefano ; Juurlink, Ben (2017)

Stencil computations expose a large and complex space of equivalent implementations. These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task that may be unaffordable in many scenarios. Meanwhile, traditional ML autotuning approaches expl...

Local memory-aware kernel perforation

Maier, Daniel ; Cosenza, Biagio ; Juurlink, Ben (2018)

Many applications provide inherent resilience to some amount of error and can potentially trade accuracy for performance by using approximate computing. Applications running on GPUs often use local memory to minimize the number of global memory accesses and to speed up execution. Local memory can also be very useful to improve the way approximate computation is performed, e.g., by improving the...

Automatic Data Layout Optimizations for GPUs

Kofler, Klaus ; Cosenza, Biagio ; Fahringer, Thomas (2015)

Memory optimizations have become increasingly important in order to fully exploit the computational power of modern GPUs. The data arrangement has a big impact on the performance, and it is very hard for GPU programmers to identify a well-suited data layout. Classical data layout transformations include grouping together data fields that have similar access patterns, or transforming Array-of-St...

Behavioral Spherical Harmonics for Long-Range Agents’ Interaction

Cosenza, Biagio (2015)

We introduce behavioral spherical harmonics (BSH), a novel approach to efficiently and compactly represent the direction-dependent behavior of agents. BSH uses spherical harmonics to project the directional information of a group of multiple agents onto a vector of a few coefficients; thus, BSH drastically reduces the complexity of the directional evaluation, as it requires only a few agent-gr...

The SARC Architecture

Ramirez, Alex ; Cabarcas, Felipe ; Juurlink, Ben ; Álvarez-Mesa, Mauricio ; Sanchez, Friman ; Azevedo, Arnaldo ; Meenderinck, Cor ; Ciobanu, Cătălin ; Isaza, Sebastian ; Gaydadjiev, Georgi (2010)

The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications...