Embedded Systems Architecture (Fachgebiet Architektur eingebetteter Systeme)

46 Items

Recent Submissions
A reconfigurable architecture for real-time image compression on-board satellites

Manthey, Kristian (2017)

Data products of optical remote sensing systems are increasingly used in many areas of our everyday life. The spatial as well as the spectral resolution of satellite image data increases steadily with new missions resulting in a higher precision of known procedures and new application scenarios. While the memory capacity requirements can still be fulfilled, the transmission capacity becomes inc...

libWater: heterogeneous distributed computing made easy

Grasso, Ivan ; Pellegrini, Simone ; Cosenza, Biagio ; Fahringer, Thomas (2013)

Clusters of heterogeneous nodes composed of multi-core CPUs and GPUs are increasingly being used for High Performance Computing (HPC) due to the benefits in peak performance and energy efficiency. In order to fully harvest the computational capabilities of such architectures, application developers often employ a combination of different parallel programming paradigms (e.g. OpenCL, CUDA, MPI an...

Low-power high-efficiency video decoding using general purpose processors

Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2015)

In this article, we investigate how code optimization techniques and low-power states of general-purpose processors improve the power efficiency of HEVC decoding. The power and performance efficiency of the use of SIMD instructions, multicore architectures, and low-power active and idle states are analyzed in detail for offline video decoding. In addition, the power efficiency of techniques suc...

A generic implementation of a quantified predictor on FPGAs

Thomas, Gervin ; Elhossini, Ahmed ; Juurlink, Ben (2014)

Predictors are used in many fields of computer architectures to enhance performance. With good estimations of future system behaviour, policies can be developed to improve system performance or reduce power consumption. These policies become more effective if the predictors are implemented in hardware and can provide quantified forecasts and not only binary ones. In this paper, we present and e...

An automatic input-sensitive approach for heterogeneous task partitioning

Kofler, Klaus ; Grasso, Ivan ; Cosenza, Biagio ; Fahringer, Thomas (2013)

Unleashing the full potential of heterogeneous systems, consisting of multi-core CPUs and GPUs, is a challenging task due to the difference in processing capabilities, memory availability, and communication latencies of different computational resources. In this paper we propose a novel approach that automatically optimizes task partitioning for different (input) problem sizes and different h...

A QHD-capable parallel H.264 decoder

Chi, Chi Ching ; Juurlink, Ben (2011)

Video coding follows the trend of demanding higher performance every new generation, and therefore could utilize many-cores. A complete parallelization of H.264, which is the most advanced video coding standard, was found to be difficult due to the complexity of the standard. In this paper a parallel implementation of a complete H.264 decoder is presented. Our parallelization strategy exploits ...

Composable local memory organisation for streaming applications on embedded MPSoCs

Ambrose, Jude ; Molnos, Anca ; Nelson, Andrew ; Cotofana, Sorin ; Goossens, Kees ; Juurlink, Ben (2011)

Multi-Processor Systems on a Chip (MPSoCs) are suitable platforms for the implementation of complex embedded applications. An MPSoC is composable if the functional and temporal behaviour of each application is independent of the absence or presence of other applications. Composability is required for application design and analysis in isolation, and integration with linear effort. In this paper...

Poster: implications of merging phases on scalability of multi-core architectures

Manivannan, Madhavan ; Juurlink, Ben ; Stenstrom, Per (2011)

Amdahl's Law estimates parallel applications with negligible serial sections to potentially scale to many cores. However, due to merging phases in data mining applications, the serial sections do not remain constant. We extend Amdahl's model to accommodate this and establish that Amdahl's Law can overestimate the scalability offered by symmetric and asymmetric architectures for such application...

Automatic problem size sensitive task partitioning on heterogeneous parallel systems

Grasso, Ivan ; Kofler, Klaus ; Cosenza, Biagio ; Fahringer, Thomas (2013)

In this paper we propose a novel approach which automates task partitioning in heterogeneous systems. Our framework is based on the Insieme Compiler and Runtime infrastructure. The compiler translates a single-device OpenCL program into a multi-device OpenCL program. The runtime system then performs dynamic task partitioning based on an offline-generated prediction model. In order to derive t...

Programming parallel embedded and consumer applications in OpenMP superscalar

Andersch, Michael ; Chi, Chi Ching ; Juurlink, Ben (2012)

In this paper, we evaluate the performance and usability of the parallel programming model OpenMP Superscalar (OmpSs), apply it to 10 different benchmarks and compare its performance with corresponding POSIX threads implementations.

Amdahl's law for predicting the future of multicores considered harmful

Juurlink, Ben ; Meenderinck, Cor (2012)

Several recent works predict the future of multicore systems or identify scalability bottlenecks based on Amdahl's law. Amdahl's law implicitly assumes, however, that the problem size stays constant, but in most cases more cores are used to solve larger and more complex problems. There is a related law known as Gustafson's law which assumes that runtime, not the problem size, is constant. In ot...
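The contrast between the two laws named in this abstract can be made concrete with their standard textbook formulas. The following is a minimal sketch, not code from the paper: the function names and the 95% parallel fraction are illustrative assumptions. Note that in Gustafson's law the parallel fraction is measured on the parallel system, so the two values of p are not strictly the same quantity.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law (fixed problem size): S = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def gustafson_speedup(parallel_fraction: float, cores: int) -> float:
    """Gustafson's law (fixed runtime): S = (1 - p) + p * n.

    Here p is the fraction of runtime spent in parallel code
    as measured on the n-core system.
    """
    return (1.0 - parallel_fraction) + parallel_fraction * cores


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, chosen only for illustration
    for n in (1, 16, 256):
        print(n, round(amdahl_speedup(p, n), 2), round(gustafson_speedup(p, n), 2))
```

Under Amdahl's law the speedup for p = 0.95 can never exceed 1 / 0.05 = 20 regardless of core count, whereas Gustafson's scaled speedup keeps growing linearly with the number of cores.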

Evaluation of parallel H.264 decoding strategies for the Cell Broadband Engine

Chi, Chi Ching ; Juurlink, Ben ; Meenderinck, Cor (2010)

How to develop efficient and scalable parallel applications is the key challenge for emerging many-core architectures. We investigate this question by implementing and comparing two parallel H.264 decoders on the Cell architecture. It is expected that future many-cores will use a Cell-like local store memory hierarchy, rather than a non-scalable shared memory. The two implemented parallel algor...

Spatio-temporal SIMT and scalarization for improving GPU efficiency

Lucas, Jan ; Andersch, Michael ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2015)

Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it was not evaluated. We present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal and spatial SIMT, named Spatiotemporal SIMT (STS...

GPGPU workload characteristics and performance analysis

Lal, Sohan ; Lucas, Jan ; Andersch, Michael ; Álvarez-Mesa, Mauricio ; Elhossini, Ahmed ; Juurlink, Ben (2014)

GPUs are much more power-efficient devices compared to CPUs, but due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. To sustain and continue high performance computing growth, new architectural and application techniques are required to create power-efficient computing systems. To find such techniques, however, ...

Nexus#: a distributed hardware task manager for task-based programming models

Dallou, Tamer ; Engelhardt, Nina ; Elhossini, Ahmed ; Juurlink, Ben (2015)

In the era of multicore systems, it is expected that the number of cores that can be integrated on a single chip will reach three digits. The key to utilizing such huge computational power is to extract the very fine-grained parallelism in the user program. This is non-trivial for the average programmer, and becomes very hard as the number of potential parallel instances increases. Task-based programming model...

Low-complexity VBR controller for spatial-CGS and temporal scalable video coding

Sanz-Rodríguez, Sergio ; Díaz-de-María, Fernando ; Rezaei, Mehdi (2009)

This paper presents a rate control (RC) algorithm for the scalable extension of the H.264/AVC video coding standard. The proposed rate controller is designed for real-time video streaming with buffer constraint. Since a large buffer delay and bit rate variation are allowed in this kind of application, our proposal reduces the quantization parameter (QP) fluctuation to provide consistent visual...

Improving the parallelization efficiency of HEVC decoding

Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben ; George, Valeri ; Schierl, Thomas (2012)

In this paper we present a new parallelization approach for HEVC decoding called Overlapped Wavefront (OWF). It is based on wavefront processing and improves its parallelization efficiency by allowing overlapped execution of consecutive pictures. Furthermore, in this strategy the decoding steps are performed on a CTB basis rather than on a picture basis, which improves data locality. Our imp...

How a single chip causes massive power bills

Lucas, Jan ; Lal, Sohan ; Andersch, Michael ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2013)

Modern GPUs are true power houses in every meaning of the word: While they offer general-purpose (GPGPU) compute performance an order of magnitude higher than that of conventional CPUs, they have also been rapidly approaching the infamous “power wall”, as a single chip sometimes consumes more than 300W. Thus, the design space of GPGPU microarchitecture has been extended by another dimension: po...

High performance memory accesses on FPGA-SoCs

Göbel, Matthias ; Chi, Chi Ching ; Álvarez-Mesa, Mauricio ; Juurlink, Ben (2015)

FPGA-SoCs like Xilinx's Zynq-7000 and Altera's Generation 10 SoCs provide an integrated platform for HW/SW co-design applications. Computationally complex tasks can be implemented in the programmable logic part while control logic is implemented on the CPU. A potential bottleneck in such approaches is the interface latency and the data transfer throughput. Especially the data transfer to and fr...

Low-complexity motion-based saliency map estimation for perceptual video coding

Mejía-Ocaña, Ana Belén ; de-Frutos-López, Manuel ; Sanz-Rodríguez, Sergio ; del-Ama-Esteban, Óscar ; Peláez-Moreno, Carmen ; Díaz-de-María, Fernando (2011)

In this paper, a low-complexity motion-based saliency map estimation method for perceptual video coding is proposed. The method employs a camera motion compensated vector map computed by means of a hierarchical motion estimation (HME) procedure and a Restricted Affine Transformation (RAT)-based modeling of the camera motion. To allow for a computationally efficient solution, the number of layer...