Inst. Technische Informatik und Mikroelektronik

132 Items

Recent Submissions
Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators

Hunger, Lars ; Cosenza, Biagio ; Kimeswenger, Stefan ; Fahringer, Thomas (2015)

A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, and computer graphics. Current approaches commonly make use of the 3D fast Fourier transform (FFT), which does not scale well for RFs bigger than the available memory; t...
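
To make the definition concrete, here is a minimal sketch of sampling a small 1D Gaussian random field. It is not the spectral turning bands method from the paper, nor the FFT approach the abstract mentions: it uses plain Cholesky factorization of a covariance matrix, which is only practical for a handful of points. The exponential covariance model, the `length_scale` parameter, and all function names are assumptions chosen for illustration.

```python
import math
import random


def exponential_covariance(points, length_scale):
    """Covariance matrix C[i][j] = exp(-|x_i - x_j| / length_scale) (assumed model)."""
    return [[math.exp(-abs(p - q) / length_scale) for q in points] for p in points]


def cholesky(matrix):
    """Lower-triangular factor L with L * L^T == matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_random_field(points, length_scale, rng):
    """Draw one realization: correlated values = L * (independent standard normals)."""
    lower = cholesky(exponential_covariance(points, length_scale))
    white = [rng.gauss(0.0, 1.0) for _ in points]
    return [
        sum(lower[i][k] * white[k] for k in range(i + 1))
        for i in range(len(points))
    ]


if __name__ == "__main__":
    locations = [0.0, 0.5, 1.0, 1.5, 2.0]  # hypothetical spatial locations
    field = sample_random_field(locations, length_scale=1.0, rng=random.Random(42))
    for x, value in zip(locations, field):
        print(f"x={x:.1f}  value={value:+.3f}")
```

Nearby locations receive similar values because their covariance is close to 1. The dense factorization costs cubic time and quadratic memory in the number of points, which is why large fields call for more scalable generators.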

An evaluation of current SIMD programming models for C++

Pohl, Angela ; Cosenza, Biagio ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben (2016)

SIMD extensions were added to microprocessors in the mid '90s to speed up data-parallel code by vectorization. Unfortunately, the SIMD programming model has barely evolved and the most efficient utilization is still obtained with elaborate intrinsics coding. As a consequence, several approaches to write efficient and portable SIMD code have been proposed. In this work, we evaluate current progr...

A Quantitative Analysis of the Memory Architecture of FPGA-SoCs

Göbel, Matthias ; Elhossini, Ahmed ; Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2017)

In recent years, so-called FPGA-SoCs have been introduced by Intel (formerly Altera) and Xilinx. These devices combine multi-core processors with programmable logic. This paper analyzes the various memory and communication interconnects found in actual devices, particularly the Zynq-7020 and Zynq-7045 from Xilinx and the Cyclone V SE SoC from Intel. Issues such as different access patterns, cac...

The LPGPU2 Project: Low-Power Parallel Computing on GPUs

Juurlink, Ben ; Lucas, Jan ; Mammeri, Nadjib ; Bliss, Martyn ; Keramidas, Georgios ; Kokkala, Chrysa ; Richards, Andrew (2017)

The LPGPU2 project is a 30-month project (Innovation Action) funded by the European Union. Its overall goal is to develop an analysis and visualization framework that enables GPU application developers to improve the performance and power consumption of their applications. To reach this goal, several key objectives need to be achieved. First, several applications (use cases) need to b...

GPU Parallelization of HEVC In-Loop Filters

Wang, Biao ; de Souza, Diego F. ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben ; Ilic, Aleksandar ; Roma, Nuno ; Sousa, Leonel (2017-01)

In the High Efficiency Video Coding (HEVC) standard, multiple decoding modules have been designed to take advantage of parallel processing. In particular, the HEVC in-loop filters (i.e., the deblocking filter and sample adaptive offset) were conceived to be exploited by parallel architectures. However, the type of the offered parallelism mostly suits the capabilities of multi-core CPUs, thus ma...

Efficient HEVC decoder for heterogeneous CPU with GPU systems

Wang, Biao ; Álvarez-Mesa, Mauricio ; Chi, Chi Ching ; Juurlink, Ben ; de Souza, Diego F. ; Ilic, Aleksandar ; Roma, Nuno ; Sousa, Leonel (2016)

The High Efficiency Video Coding (HEVC) standard provides higher compression efficiency than other video coding standards but at the cost of increased computational load, which makes it hard to achieve real-time encoding/decoding of high-resolution, high-quality video sequences. In this paper, we investigate how Graphics Processing Units (GPUs) can be employed to accelerate HEVC decoding. GPUs ...

Syntax Element Partitioning for high-throughput HEVC CABAC decoding

Habermann, Philipp ; Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2017)

Encoder and decoder implementations of the High Efficiency Video Coding (HEVC) standard have been subject to many optimization approaches since its release in 2013. However, the real-time decoding of high-quality and ultra-high-resolution videos is still a very challenging task. Entropy decoding (CABAC) in particular is most often the throughput bottleneck at very high bitrates. Syntax Element Pa...

Static optimization in PHP 7

Popov, Nikita ; Cosenza, Biagio ; Juurlink, Ben ; Stogov, Dmitry (2017)

PHP is a dynamically typed programming language commonly used for the server-side implementation of web applications. Approachability and ease of deployment have made PHP one of the most widely used scripting languages for the web, powering important web applications such as WordPress, Wikipedia, and Facebook. PHP's highly dynamic nature, while providing useful language features, also makes it ...

A Methodology for Predicting Application-Specific Achievable Memory Bandwidth for HW/SW-Codesign

Göbel, Matthias ; Elhossini, Ahmed ; Juurlink, Ben (2017)

The trend of using heterogeneous computing and HW/SW-Codesign approaches allows increasing performance significantly while reducing power consumption. One of the main challenges when combining multiple processing devices is the communication, as an inefficient communication configuration can pose a bottleneck to the overall system performance. To address this problem, we present a methodology t...

E²MC: Entropy Encoding Based Memory Compression for GPUs

Lal, Sohan ; Lucas, Jan ; Juurlink, Ben (2017)

Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Unfortunately, off-chip memory bandwidth is growing slower than the number of cores and has become a performance bottleneck. Thus, optimizations of effective memory bandwidth play a significant role for scaling the performance of GPUs....

ALUPower: Data Dependent Power Consumption in GPUs

Lucas, Jan ; Juurlink, Ben (2016)

Existing architectural power models for GPUs count activities such as executing floating point or integer instructions, but do not consider the data values processed. While data value dependent power consumption can often be neglected when performing architectural simulations of high performance Out-of-Order (OoO) CPUs, we show that this approach is invalid for estimating the power consumption ...

Autotuning Stencil Computations with Structural Ordinal Regression Learning

Cosenza, Biagio ; Durillo, Juan J. ; Ermon, Stefano ; Juurlink, Ben (2017)

Stencil computations expose a large and complex space of equivalent implementations. These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task that may be unaffordable in many scenarios. Meanwhile, traditional ML autotuning approaches expl...

Local memory-aware kernel perforation

Maier, Daniel ; Cosenza, Biagio ; Juurlink, Ben (2018)

Many applications provide inherent resilience to some amount of error and can potentially trade accuracy for performance by using approximate computing. Applications running on GPUs often use local memory to minimize the number of global memory accesses and to speed up execution. Local memory can also be very useful to improve the way approximate computation is performed, e.g., by improving the...

Automatic Data Layout Optimizations for GPUs

Kofler, Klaus ; Cosenza, Biagio ; Fahringer, Thomas (2015)

Memory optimizations have become increasingly important in order to fully exploit the computational power of modern GPUs. The data arrangement has a big impact on the performance, and it is very hard for GPU programmers to identify a well-suited data layout. Classical data layout transformations include grouping together data fields that have similar access patterns, or transforming Array-of-St...

Behavioral Spherical Harmonics for Long-Range Agents’ Interaction

Cosenza, Biagio (2015)

We introduce behavioral spherical harmonics (BSH), a novel approach to efficiently and compactly represent the direction-dependent behavior of agents. BSH is based on spherical harmonics to project the directional information of a group of multiple agents to a vector of a few coefficients; thus, BSH drastically reduces the complexity of the directional evaluation, as it requires only a few agent-gr...

The SARC Architecture

Ramirez, Alex ; Cabarcas, Felipe ; Juurlink, Ben ; Álvarez-Mesa, Mauricio ; Sanchez, Friman ; Azevedo, Arnaldo ; Meenderinck, Cor ; Ciobanu, Cătălin ; Isaza, Sebastian ; Gaydadjiev, Georgi (2010)

The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications...

Learning robotic perception through prior knowledge

Jonschkowski, Rico (2018)

Intelligent robots must be able to learn; they must be able to adapt their behavior based on experience. But generalization from past experience is only possible based on assumptions or prior knowledge (priors for short) about how the world works. I study the role of these priors for learning perception. Although priors play a central role in machine learning, they are often hidden in the deta...

Optimal DC/AC Data Bus Inversion Coding

Lucas, Jan ; Lal, Sohan ; Juurlink, Ben (2018)

GDDR5 and DDR4 memories use data bus inversion (DBI) coding to reduce termination power and decrease the number of output transitions. Two main strategies exist for encoding data using DBI: DBI DC minimizes the number of outputs transmitting a zero, while DBI AC minimizes the number of signal transitions. We show that neither of these strategies is optimal and reduction of interface power of up...
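
The two baseline strategies named in the abstract can be sketched for a single 8-bit lane as follows. This is an illustrative sketch, not the optimal coding proposed in the paper. The function names, the convention that a `True` flag means "inverted", the tie-breaking rule, and the choice to count the flag line among the AC transitions are all assumptions.

```python
def popcount(value):
    """Number of one bits in the low 8 bits of value."""
    return bin(value & 0xFF).count("1")


def dbi_dc_encode(byte):
    """DBI DC: invert the byte when it contains more than four zeros.

    Returns (wire_byte, inverted), so the wire never carries more than four zeros.
    """
    zeros = 8 - popcount(byte)
    if zeros > 4:
        return (~byte) & 0xFF, True
    return byte & 0xFF, False


def dbi_ac_encode(byte, previous_wire, previous_flag):
    """DBI AC: choose the polarity that toggles fewer lines than the previous transfer.

    Assumption: the flag line itself counts as one of the toggling lines.
    Ties keep the byte uninverted.
    """
    inverted_byte = (~byte) & 0xFF
    plain_toggles = popcount(byte ^ previous_wire) + (1 if previous_flag else 0)
    inverted_toggles = popcount(inverted_byte ^ previous_wire) + (0 if previous_flag else 1)
    if inverted_toggles < plain_toggles:
        return inverted_byte, True
    return byte & 0xFF, False


def dbi_decode(wire_byte, inverted):
    """Undo the inversion on the receiver side."""
    return (~wire_byte) & 0xFF if inverted else wire_byte & 0xFF


if __name__ == "__main__":
    data = [0x00, 0xFF, 0x01, 0xF0]  # hypothetical burst
    wire, flag = 0xFF, False  # assumed idle state of the bus
    for byte in data:
        dc = dbi_dc_encode(byte)
        wire, flag = dbi_ac_encode(byte, wire, flag)
        print(f"data={byte:#04x}  DC={dc}  AC={(wire, flag)}")
```

Each scheme optimizes only one quantity (zeros on the wire or line toggles), which matches the abstract's point that neither strategy is optimal when both contribute to interface power.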

No Free Lunch in Ball Catching: A Comparison of Cartesian and Angular Representations for Control

Höfer, Sebastian ; Raisch, Jörg ; Toussaint, Marc ; Brock, Oliver (2018)

How should one run to most effectively catch a projectile, such as a baseball, that is flying in the air for a long period of time? The question about the best solution to the ball catching problem has been the subject of intense scientific debate for almost 50 years. It turns out that this scientific debate is not focused on the ball catching problem alone, but revolves around the research question what c...

On latency in GPU throughput microarchitectures

Andersch, Michael ; Lucas, Jan ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2015)

Modern GPUs provide massive processing power (arithmetic throughput) as well as memory throughput. Presently, while it appears to be well understood how performance can be improved by increasing throughput, it is less clear what the effects of micro-architectural latencies are on the performance of throughput-oriented GPU architectures. In fact, little is publicly known about the values, behavi...