
Sep 27th, 2010

NVIDIA Research Summit GTC 2010 Posters Available

GPU Technology Conference

Algorithms & Numerical Techniques

A02 – Accelerating Symbolic Computations on NVIDIA Fermi
We present the first implementation of a complete modular resultant algorithm on graphics hardware. Our recent developments, which take advantage of the new NVIDIA Fermi GPU architecture and instruction set, allowed us to achieve a speed-up of about 150x over a modular resultant algorithm from Maple 13.
Author: Pavel Emeliyanenko (Max-Planck Institute for Informatics)

A03 – Particle-In-Cell Simulations on the GPU
Particle-In-Cell simulations represent an important technique in the field of kinetic plasma simulations. 2D particle pushing and conserved current aggregation have been implemented in CUDA. On a Tesla C1060, the CUDA code is 4 times faster than SSE2-optimized code on a quad-core Intel Xeon processor.
Author: Hartmut Ruhl (Ludwig-Maximilians-University)

A04 – Parallel Ant Colony Optimization with CUDA
The Ant Colony Optimization (ACO) algorithm is a metaheuristic that is used to find shortest paths in graphs. By using CUDA to implement an ACO algorithm, we achieved a significant improvement in performance over a highly-tuned sequential CPU implementation. The construction step of the ACO algorithm consists of each ant creating an independent solution, and this step is where most of the computation is spent. Since the construction step is the same for most ACO variations, parallelizing this step will also allow for easy adaptation to different pheromone updating functions. Currently, our research tests this hypothesis on the travelling salesman problem.
Author: Octavian Nitica (University of Delaware)

A05 – High Performance and Scalable Radix Sorting for GPU Stream Architectures
The need to rank and order data is pervasive, and sorting operations are fundamental to many algorithms. This poster presents a very efficient method for sorting large sequences of fixed-length keys (and values) using GPU stream processors. Compared to the state-of-the-art, our implementation demonstrates multiple factors of speedup (up to 3.8x) for all NVIDIA GPGPUs. For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).
Author: Duane Merrill (University of Virginia)
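
For readers unfamiliar with the technique, the following is a minimal sequential sketch of least-significant-digit radix sort on 32-bit keys. It is written in Python for illustration only and is not the poster's GPU implementation; the function name `radix_sort_u32` and the 4-bit digit size are assumptions made for this example.

```python
def radix_sort_u32(keys, bits_per_pass=4):
    """Sort non-negative 32-bit integer keys with LSD radix sort.

    Illustrative sketch only: each pass histograms one digit, turns the
    histogram into starting offsets with an exclusive prefix sum, and then
    scatters the keys stably into their buckets.
    """
    radix = 1 << bits_per_pass
    digit_mask = radix - 1
    keys = list(keys)
    for shift in range(0, 32, bits_per_pass):
        # 1. Histogram of the current digit.
        counts = [0] * radix
        for key in keys:
            counts[(key >> shift) & digit_mask] += 1
        # 2. Exclusive prefix sum gives the first output slot of each bucket.
        offsets = [0] * radix
        total = 0
        for digit in range(radix):
            offsets[digit] = total
            total += counts[digit]
        # 3. Stable scatter: keys with equal digits keep their relative order.
        out = [0] * len(keys)
        for key in keys:
            digit = (key >> shift) & digit_mask
            out[offsets[digit]] = key
            offsets[digit] += 1
        keys = out
    return keys
```

The histogram, prefix sum and scatter steps are the parts that GPU radix sorts typically parallelize; how the poster organizes them is not described in the abstract.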

A06 – Task Management for Irregular Workloads on the GPU
We explore software mechanisms for managing irregular tasks on graphics processing units. Traditional GPU programming guidelines teach us how to efficiently program the GPU for data-parallel pipelines with regular input and output. We present a strategy for solving task-parallel pipelines which can handle irregular workloads on the GPU. We demonstrate that dynamic scheduling and efficient memory management are critical problems in achieving high efficiency on irregular workloads. We showcase our results on a real-time Reyes rendering pipeline.
Author: Stanley Tzeng (University of California, Davis)

A07 – A Hybrid Method for Solving Tridiagonal Systems on GPU
Tridiagonal linear systems are of importance to many problems in numerical analysis and computational fluid dynamics, as well as to computer graphics applications in video games and computer-animated films. This poster presents our study on the performance of multiple tridiagonal algorithms on a GPU. We design a novel hybrid algorithm that combines a work-efficient algorithm with a step-efficient algorithm in a way well-suited for a GPU architecture. Our hybrid solver achieves 8x and 2x speedup respectively in single precision and double precision over a multi-threaded highly-optimized CPU solver and a 2x speedup over a basic GPU solver.
Author: Yao Zhang (University of California, Davis)
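
As background, a tridiagonal system has non-zero entries only on the main diagonal and the two diagonals next to it. The sketch below shows the classic sequential Thomas algorithm in Python, the kind of solver a CPU baseline might use. It is not the hybrid GPU algorithm from the poster; the function name `thomas_solve` and the argument layout are assumptions made for this example, and no pivoting is performed, so it assumes a well-conditioned (for example, diagonally dominant) system.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the sequential Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    diag[i]  multiplies x[i],
    upper[i] multiplies x[i+1] in row i (upper[n-1] is unused).
    Illustrative sketch: assumes no zero pivots arise.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    # Forward elimination: remove the sub-diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    # Back substitution from the last unknown upwards.
    x = [0.0] * n
    x[n - 1] = d_prime[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x
```

Both loops carry a dependency from one row to the next, which is why GPU solvers turn to reformulations such as the work-efficient and step-efficient algorithms the abstract mentions.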

A08 – Development of Desktop Computing Applications and Engineering Tools on GPUs
A GPU competence center and laboratory for research and collaboration within academia and with partners in industry was established in 2008 at the Section for Scientific Computing, DTU Informatics, Technical University of Denmark. In GPULab we focus on the utilization of GPUs for high-performance computing applications and software tools in science and engineering, inverse problems, visualization, imaging and dynamic optimization. This poster illustrates the latest and most interesting projects that have been developed at our center.
Author: Hans Henrik B. Soerensen (Technical University of Denmark)

A09 – Ballot Counting for Optimal Binary Prefix Sum
This poster describes a new technique for performing binary prefix sums using Fermi’s new __ballot() and __popc() functions. These instructions greatly increase intra-warp communication, allowing for an 80% speedup over standard GPU methods in applications like Radix Sort. It also points to future research that will enable suffix array construction, Burrows-Wheeler Transform, and the BZIP algorithm to take advantage of these instructions for efficient GPU compression.
Author: David Whittaker (University of Alabama at Birmingham)
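
The idea behind a ballot-based binary prefix sum can be sketched as follows. This Python code only emulates a single 32-lane warp on the CPU: the helper names `ballot`, `popc` and `warp_binary_exclusive_scan` are assumptions made for this example, standing in for the `__ballot()` and `__popc()` intrinsics named above, and the poster's actual kernel is not reproduced here.

```python
WARP_SIZE = 32


def ballot(predicates):
    """Emulate a warp vote: bit i of the result is set if lane i's flag is true."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def popc(value):
    """Emulate a population count: the number of set bits in value."""
    return bin(value).count("1")


def warp_binary_exclusive_scan(predicates):
    """Exclusive prefix sum of 0/1 flags across one emulated warp.

    Every lane sees the same vote mask, keeps only the bits of lower-numbered
    lanes, and counts them. On a GPU each lane would do this independently.
    """
    assert len(predicates) == WARP_SIZE
    votes = ballot(predicates)
    result = []
    for lane in range(WARP_SIZE):
        lanes_below = (1 << lane) - 1  # bits 0 .. lane-1
        result.append(popc(votes & lanes_below))
    return result
```

Because each lane needs only one vote and one bit count, the scan of binary flags requires no shared-memory passes, which is plausibly where the reported gain for radix sort comes from.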

A10 – Deriving Parallelism and GPU Acceleration of Algorithms with Inter-Dependent Data Fields
This poster presents an approach to deriving parallelism in algorithms that involve building a sparse matrix that represents relationships between inter-dependent data fields, and to enhancing its performance on the GPU. This work compares the algorithm's performance on the GPU to that of its CPU variant, which employs the traditional sparse matrix-vector multiplication (SpMV) approach. We have also compared our algorithm's performance with CUSP SpMV on the GPU. The software used in this work is MATLAB and Jacket, a GPU engine for MATLAB.
Author: James Malcolm (AccelerEyes)

A11 – Parallelizing the Particle Level Set Method
The particle level set is widely used as an accurate interface tracking tool in simulation, computer vision and other related fields. However, high computation cost prevents applying this method to real-time and interactive scenarios.
This work makes intensive use of parallel design patterns implemented in the Thrust library, such as compaction, reduction and scattering, to parallelize the particle level set method and attain real-time performance.
Author: Wen Zheng (Stanford University)
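
Of the patterns named in this abstract, compaction is perhaps the least self-explanatory: it keeps only the elements that pass a test while preserving their order. A common way to build it is flag, scan, scatter, sketched below in sequential Python. The function names `exclusive_scan` and `compact` are assumptions made for this example; this is not the Thrust API and not the poster's code.

```python
def exclusive_scan(values):
    """Return (prefix sums excluding the current element, grand total)."""
    sums = []
    total = 0
    for value in values:
        sums.append(total)
        total += value
    return sums, total


def compact(items, keep):
    """Stream compaction built from flag, scan and scatter steps.

    Illustrative sketch: on a GPU each of the three steps maps to a
    data-parallel primitive, because the scan tells every surviving element
    its output slot without any coordination between elements.
    """
    flags = [1 if keep(item) else 0 for item in items]   # flag
    offsets, count = exclusive_scan(flags)                # scan
    out = [None] * count
    for item, flag, offset in zip(items, flags, offsets):  # scatter
        if flag:
            out[offset] = item
    return out
```

For example, `compact([3, -1, 4, -1, 5, -9, 2], lambda x: x > 0)` returns `[3, 4, 5, 2]`. In a particle method, the same pattern could be used to drop particles that are no longer needed, though the abstract does not say where the authors apply it.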

A12 – Accelerating CUDA Graph Algorithms at Maximum Warp
Graphs are powerful data representations favored in many computational domains. GPUs have shown promising results in this domain, but their performance suffers when the graph is highly irregular. In this study, we propose three general schemes to accelerate graph algorithms on a modern GPU architecture: (i) deferred processing of outliers, (ii) efficient dynamic workload balancing and (iii) warp-based execution exploiting threads in a SIMD-like manner. Our evaluation reveals that our schemes exhibit up to 9x speedup over previous GPU algorithms and 23x over single CPU execution on irregular graphs. They also yield up to 30% improvement, even for regular graphs.
Author: Sungpack Hong (Stanford University)

A13 – Implementation of Adaptive Cross Approximation on NVIDIA GPUs
The Method of Moments is a popular computational method for solving integral equations in electromagnetics. However, it suffers from high computational and memory costs since it requires the solution of a dense linear system. The Adaptive Cross Approximation (ACA) is an effective technique for compressing the system matrix thereby reducing the necessary storage as well as the number of operations required to solve the system. Acceleration of the ACA MoM with NVIDIA GPUs can finally enable the solution of “real world” scattering problems on a personal workstation in a practical timeframe.
Author: Daniel Faircloth (Georgia Tech Research Institute)

A14 – A GPU Accelerated Continuum-based Discrete Element Method for Elastodynamics Analysis
The Continuum-based Distinct Element Method (CDEM) is a combination of the Finite Element Method (FEM) and the Discrete Element Method (DEM), and is mainly used in general structural analyses, as well as in landslide stability evaluations and coal and gas outburst analyses. Using CUDA and a GTX 285 graphics card, the GPU version achieves a speedup of hundreds of times.
Author: Zhaosong Ma (Institute of Mechanics, Chinese Academy of Sciences)

A15 – GPU Algorithms for NURBS Minimum Distance and Clearance Computations
We present GPU algorithms and strategies for accelerating distance queries and clearance computations on models made of trimmed NURBS surfaces. We provide a generalized framework for using GPUs as co-processors in accelerating CAD operations. The accuracy of our algorithm is based on the model space precision, unlike earlier graphics algorithms that were based only on image space precision. Our algorithms are at least an order of magnitude faster and about two orders of magnitude more accurate than the commercial solid modeling kernel ACIS.
Author: Adarsh Krishnamurthy (University of California, Berkeley)

A16 – Gate-Level Simulation with GP-GPUs
This poster describes my research on how to leverage GP-GPU execution parallelism to achieve high performance in the time-consuming problem of gate-level simulation of digital hardware designs.
Author: Debapriya Chatterjee (University of Michigan)

A17 – CUDA Implementation of Barrier Option Valuation using Jump-Diffusion Model and Brownian Bridge
Impressive speedups of up to 100x using GPUs compared to CPUs are achieved by taking advantage of data parallelism, increased bandwidth and the ability to hide latency. We have implemented a Monte Carlo valuation of a barrier option modeled by a standard diffusion process with a jump diffusion term obeying an underlying Poisson process to account for rare events. In addition, a Brownian Bridge is incorporated to account for barrier crossings in between diffusion trajectories and to reduce bias. This option is representative of exotic options which lack a closed-form solution and are amenable to Monte Carlo type methods for valuation.
Author: Vincent Natoli (Stone Ridge Technology)

Astronomy & Astrophysics

B01 – Black Holes in Galactic Nuclei Simulated with Large GPU Clusters in CAS
Many, if not all, galaxies harbour supermassive black holes. If galaxies merge, which is quite common in the process of hierarchical structure formation in the universe, their black holes sink to the centre of the merger remnant and form a tight binary. Depending on initial conditions and time, supermassive black hole binaries are prominent gravitational wave sources if they ultimately come close together and coalesce. We model such systems as gravitating N-body systems (stars) with two or more massive bodies (black holes), including, if necessary, relativistic corrections to the classical Newtonian gravitational forces (Kupi et al. 2006, Berentzen et al. 2009).
Author: Rainer Spurzem (National Astronomical Observatories, Chinese Academy of Sciences)

Audio Processing

C01 – Exploring Recognition Network Representations for Efficient Speech Inference on the GPU
We explore two contending recognition network representations for speech inference engines: the linear lexical model (LLM) and the weighted finite state transducer (WFST) on NVIDIA GTX285 and GTX480 GPUs. We demonstrate that while an inference engine using the simpler LLM representation evaluates 22x more transitions per second than the advanced WFST representation, the simple structure of the LLM representation allows 4.7-6.4x faster evaluation and 53-65x faster operand gathering for each state transition. We illustrate that the performance of a speech inference engine based on the LLM representation is competitive with the WFST representation on highly parallel GPUs.
Author: Jike Chong (Parasians, LLC)

C02 – Efficient Automatic Speech Recognition on the GPU
Automatic speech recognition (ASR) technology is emerging as a critical component in data analytics for the wealth of media data being generated every day. ASR-based applications contain fine-grained concurrency that has great potential to be exploited on the GPU. However, the state-of-the-art ASR algorithm involves a highly parallel graph traversal on an irregular graph with millions of states and arcs, making efficient parallel implementations highly challenging. We present four generalizable techniques: dynamic data-gather buffer, find-unique, lock-free data structures using atomics, and hybrid global/local task queues. When used together, these techniques can effectively resolve ASR implementation challenges on an NVIDIA GPU.
Author: Jike Chong (Parasians, LLC)

Computational Fluid Dynamics

D01 – High-Order Unstructured Compressible Flow Solver on the GPU
The objective of this project is to develop a scalable and efficient high-order unstructured compressible flow solver for GPUs. The solver allows the achievement of arbitrary order of accuracy for flows over complex geometries. High-order solvers require more operations per degree of freedom, thus making them highly suitable for massively parallel processors. Preliminary results indicate speed-ups up to 70x with the Tesla C1060 compared to the Intel i7 CPU. Memory access was optimized using shared and texture memory.
Author: Patrice Castonguay (Stanford University)

D02 – Parallel 3D Geometric Multigrid Solver on GPU Clusters
An investigation of the performance and scalability of a multigrid pressure Poisson equation solver running on a GPU cluster.
Author: Dana Jacobsen (Boise State University)

D03 – Acceleration of mesh-free CFD using CUDA
In this work, the acceleration of a mesh-free Computational Fluid Dynamics (CFD) code is performed using CUDA. The poster gives an overview of the CUDA implementation strategy and the resulting performance increase.
Author: Gilles Civario (Irish Centre for High-End Computing)

D04 – Airblast Modelling on Multiple Tesla units
We used NVIDIA Tesla GPUs to accelerate the solution of hyperbolic partial differential equations, with application to modelling airblast generated by industrial bench mining operations. Parallelisation over multiple GPUs was achieved using MPI.
Author: Sean Lovett (University of Cambridge)

D05 – Implementation of High-Order Adaptive CFD Methods on GPGPUs
This poster describes our implementation of adaptive high-order CFD methods on GPUs. A speedup factor of up to 44 has been achieved for 2D flow problems.
Author: Z.J. Wang (Iowa State University)

D06 – Computational Fluid Dynamics on GPU
Computational Fluid Dynamics, an important branch of the HPC field, has a history of seeking and requiring higher computational performance. The traditional way to satisfy this quest is to use faster machines or supercomputers. Yet these approaches seem inconvenient and costly to many individual researchers. We investigated the use of GPUs to accelerate CFD codes and tested the performance on the CUDA and OpenCL platforms. We have ported 2D cave flow, 2D Riemann, and 2D flow over a RAE2882 airfoil to the GPU and explored some GPU-specific optimization strategies. In most cases, a speedup of approximately 16x to 63x can be achieved.
Author: Long Wang (Supercomputing Center, Chinese Academy of Sciences)

Computer Graphics

E01 – Dynamic and Implicit Trees for Graphics and Visualization on the GPU
We propose a new way to represent trees that allows for faster algorithms that are simple to implement (especially on the GPU) and have a lower memory overhead than previous approaches. Using our data structure, we have seen significant improvements in both volume ray casting and ray tracing applications over previous state-of-the-art methods.
Author: Nathan Andrysco (Purdue University)

E02 – Fragment-Parallel Composite and Filter
In this poster, we describe our recent work in the area of programmable graphics pipelines by presenting a fragment-parallel formulation of an A-buffer-style composite and filter equation, and describe its implementation on a modern GPU.
Author: Anjul Patney (University of California, Davis)

Computer Vision

F01 – Architecture Aware Design for a Parallel Object Recognition System
We have developed a parallel object recognition system using CUDA, achieving 70x-80x speedup against the original serial implementation. In order to optimize our implementation, we evaluated the performance of different parallelization strategies on some key computations in the object recognition system. Finally we concluded that the parallel implementation performance is sensitive to input data properties. Therefore, we should dynamically adjust the parallelization strategy at runtime to optimize key computations.
Author: Bor-Yiing Su (University of California, Berkeley)

F02 – Dense Point Trajectories by GPU-Accelerated Large Displacement Optical Flow
In this poster we discuss a method for computing point trajectories based on a fast parallel implementation of a recent optical flow algorithm that tolerates fast motion. The parallel implementation of large displacement optical flow runs about 78x faster than the serial C++ version. We use this implementation in a point tracking application. Our resulting technique tracks up to three orders of magnitude more points and is 46% more accurate than the Kanade-Lucas-Tomasi tracker. Compared to the Particle Video tracker, we achieve 66% better accuracy while retaining the ability to handle large displacements and running an order of magnitude faster.
Author: Narayanan Sundaram (University of California, Berkeley)

F03 – Visual Cortex on a Chip: Large-scale, Real-Time Functional Models of Visual Cortex on a GPGPU
Los Alamos National Laboratory’s Petascale Synthetic Visual Cognition project is exploring full-scale, real-time functional models of human visual cortex to understand how human vision achieves its accuracy, robustness and speed. Commercial-off-the-shelf hardware to support this modeling is rapidly improving, e.g., a teraflop GPGPU card costs ~$500 and is about the size of mouse cortex. We present results demonstrating image classification on UAV aerial video with a visual cortex model running on a 240-core NVIDIA GeForce GTX285, and see a speed-up of more than 10x. As this technology continues to improve, cortical modeling on GPGPU devices has the potential to revolutionize computer vision.
Author: Steven Brumby (Los Alamos National Laboratory)

F04 – Fermi in Action: Robust Background Subtraction for Real-time Video Analysis
Background subtraction is an important image processing step for video surveillance and many computer vision problems such as tracking and recognition. However, robust background subtraction that adapts well to changes in the environment is computationally intensive and consumes a large amount of memory, so its practical application is often limited. Here, we aim to expand its usage and tackle vision problems that require high-frame-rate cameras, such as real-time sports analysis and real-time object detection and recognition. Using recent advances in accelerator hardware (the NVIDIA Fermi architecture) and taking advantage of heterogeneous computing, we achieve performance good enough for these practical applications.
Author: Melvin Wong (Institute for Infocomm Research)

F05 – Bridging Neuroscience and GPU Computing to Build General Purpose Computer Vision
The construction of artificial vision systems and the study of biological vision are naturally intertwined as they represent simultaneous efforts to forward- and reverse-engineer systems with similar goals. Here, we present a high-throughput approach to more expansively explore biologically-inspired models by leveraging GPUs. We show that this approach can yield significant gains in performance on object and face recognition (including “Labeled Faces in the Wild” challenge and faces from Facebook), consistently outperforming the state-of-the-art. We highlight how the application of flexible programming tools, such as high-level scripting, template metaprogramming/auto-tuning, can enable large performance gains, while managing complexity for the developer.
Author: Nicolas Pinto (Massachusetts Institute of Technology)

F06 – CUDA for Vision and Imaging Library
CUVI Lib (CUDA for Vision and Imaging Library) is a software library that provides a set of GPU-accelerated computer vision and image processing functions. CUVI can be used either as an add-on library for NVIDIA’s NPP (NVIDIA Performance Primitives), since it complements the functionality present in NPP, or as a standalone library ready to be plugged into end-user C/C++ applications.
Author: Salman Ul Haq (TunaCode)

F07 – GPU-Friendly Multi-View Stereo Reconstruction Using Surfel Representation and Graph Cuts
We present a new surfel (surface element) based multi-view stereo algorithm which runs entirely on the GPU. We combine the flexibility of surfel-based 3D shape representation and global optimization by graph cuts in a single framework. The orientation of the constructed surfel candidates imposes an effective constraint that reduces the effect of the minimal surface bias. The entire processing pipeline is implemented on the latest GPU to speed up the processing significantly. Experimental results show that the proposed approach reconstructs the 3D shape of an object accurately and efficiently, running more than 100 times faster than on a CPU.
Author: In Kyu Park (Inha University)

F08 – CUDA Accelerated Face Recognition
A GPU-based implementation of a face recognition solution using PCA with the Eigenfaces algorithm.
Author: Jayadeep Vijayan (NeST Software)

F09 – GPU Driven Dense Reconstruction for Community Photo Collections
We present a system to reconstruct dense 3D models from community photo collections. First, images are described using GIST and clustered using Hamming distances. Each of these clusters is geometrically verified and connected using geotags. Connected clusters are bundle adjusted, and the obtained registration is used to estimate depth maps that are finally fused to obtain dense 3D models. Each of the above steps, except bundle adjustment, is implemented in CUDA and runs on multiple GPUs. Our pipeline is two orders of magnitude faster, on an order of magnitude more images, than the state-of-the-art method.
Author: Jan-Michael Frahm (University of North Carolina, Chapel Hill)

F10 – Portable Central Vision Enhancement System for Macular Degeneration Patients
Vision enhancement systems are alternative visual aids that enhance the remaining vision of visually impaired subjects. Our aim is to develop a mobile central vision enhancement system for macular degeneration patients. Three different types of enhancement algorithms have been developed and their efficiency was tested on low vision patients. These three algorithms have been implemented on a portable low-power device. The NVIDIA system-on-a-chip Tegra has been chosen for this implementation.
Author: Chloe Vaniet (Imperial College London)

F11 – Dense Stereo Vision on GPU
A dense stereo vision system for a material-handling dual-arm industrial robot has been implemented, with rectification, stereo correspondence and 3D pose from depth ported to the GPU using CUDA.
Author: Esubalew Bekele (Universal Robotics Inc.)

F12 – Upsampling Range Data in Dynamic Environments
We present a flexible, parallelized method for fusing information from optical and range sensors based on an accelerated high-dimensional filtering approach. Our system takes as input a sequence of monocular camera images as well as a stream of sparse range measurements as obtained from a laser or other sensor system. Our method produces a dense, high-resolution depth map of the scene, automatically generating confidence values for every interpolated depth point. We describe how to integrate priors on object shape, motion and appearance and how to achieve an efficient implementation using parallel processing hardware such as GPUs.
Author: Hendrik Dahlkamp (Stanford University)

F13 – GPU Accelerated Marker-less Motion Capture
In this work, we derive an efficient filtering algorithm for tracking human pose at 4-10 frames per second using a stream of monocular depth images. The key idea is to combine an accurate generative model (which is achievable in this setting using state-of-the-art GPU hardware) with a discriminative model that feeds data-driven evidence about body part locations. We describe a novel algorithm for propagating the noisy evidence about body part locations up the kinematic chain using the unscented transform. We provide extensive experimental results on 28 real-world sequences using automatic ground-truth annotations from a commercial motion capture system.
Author: Varun Ganapathi (Stanford University)

F14 – 3D Facial Feature Modeling with Active Appearance Models
The Active Appearance Model (AAM) is a powerful tool for modeling and matching objects under shape deformations and texture variations. It learns characteristics of objects by building a compact statistical model from applying Principal Component Analysis (PCA) to a set of labeled data. Although AAM has been widely applied in computer vision thanks to its flexible framework, it still cannot satisfy real-time requirements. To alleviate this problem, we address the computational complexity of the fitting procedure by running the AAM optimization algorithm on a GPU using a hybrid CPU/GPU block processing architecture.
Author: Tim Llewellynn (nViso / EPFL)

F15 – OpenCV on GPU
OpenCV is a free open source library of computer vision algorithms. Recently a new module consisting of functions implemented on GPU was introduced in OpenCV. It consists of several methods for calculating stereo correspondence between two images that is used to reconstruct a 3D scene. A simple block-matching algorithm works up to 10x faster compared to a CPU implementation in OpenCV providing real-time processing of HD stereo pairs on Tesla cards. Belief propagation-based algorithms show 20-50x speedup compared to a CPU implementation.
Author: Anatoly Baksheev (ITEEZ)

Databases & Data Mining

G02 – Speculative Query Processing
With an increasing amount of data and user demands for fast query processing, the optimization of database operations continues to be a challenging task. A common optimization method is to leverage parallel hardware architectures. With the introduction of general-purpose GPU computing, massively parallel hardware has become available within commodity hardware. To efficiently exploit this technology, we introduce the method of speculative query processing, which works on index structures to efficiently support heavily used database operations. To show the benefits and opportunities of our approach, we present fine-grained and coarse-grained implementations for multidimensional queries.
Author: Peter Volk (Technische Universität Dresden)

G03 – Virtual Local Stores
We propose a mechanism to provide the benefits of a software-managed memory hierarchy on top of a hierarchy of hardware-managed caches. A virtual local store (VLS) is mapped into the virtual address space of a process and backed by physical main memory, but is stored in a partition of the hardware-managed cache when active. This reduces context switch cost, and allows VLSs to migrate with their process thread. The partition allocated to the VLS can be rapidly reconfigured without flushing the cache, allowing programmers to selectively use VLS in a library routine with low overhead.
Author: Henry Cook (University of California, Berkeley)

Embedded & Automotive

H01 – Driver Assistance: Speed-Limit Sign Recognition on the GPU
We investigate the use of different GPU-based implementations for performing real-time speed limit sign recognition on a resource-constrained embedded system. The system recognized US and European Union speed-limits at over 88% accuracy while running in real-time. The system is hardware-accelerated using CUDA and OpenGL. It introduces a novel technique for detecting speed-limit signs which is only possible with the aid of GPU processing.
Author: Vladimir Glavtchev (BMW)

H02 – Complex Automotive Applications
The NVIDIA GPU architecture is becoming a very interesting hardware target for complex automotive applications. We implemented the same automotive application on several different hardware targets and analyzed the maximum frame rate and the effective CPU load. This paper shows how real-time applications like pedestrian detection and driving assistance benefit from a massively parallel “central” architecture like GPU/CUDA. Real-time performance and zero-delay transfers can be achieved using a fully asynchronous implementation. The same approach can multiply the application performance by the number of GPU devices present on the embedded system, at a reasonable power consumption.
Author: Marius Vasiliu (University of Paris Sud)

High Performance Computing

I01 – A GPU-based Architecture for Real-Time Data Assessment at Synchrotron Experiments
Modern X-ray imaging cameras provide millions of pixels and several thousand frames per second. To process such an amount of information we have optimized the reconstruction software employed at the tomography beamlines of the ANKA and ESRF synchrotrons to use the computational power of modern graphics cards. Using GPUs as compute coprocessors we were able to reduce the reconstruction time by a factor of 30 and process a typical data set of 20GB in 40 seconds. The time needed for the first evaluation of the reconstructed sample is reduced significantly and quasi real-time visualization is now possible.
Author: Suren Chilingaryan (Karlsruhe Institute of Technology)

I02 – Automatic High-Performance GPU code Generation using CUDA-CHiLL
This poster presents a system to automatically generate high-performance GPU code starting from an input sequential loop nest computation. The compiler analyzes the input computation in C and automatically generates a set of equivalent code variants represented by transformation recipes. These recipes guide the underlying code transformation and generation framework to apply code transformations and ultimately produce CUDA code.
We use the system to generate high-performing CUDA code for four BLAS functions, matrix transpose and convolution stencils. The results mostly outperform CUBLAS2.2/CUDA_SDK2.2 and naive GPU kernels and can achieve performance of up to 435GF(mm) with an average speedup of up to 1.78x.
Author: Malik M Khan (USC/ UoU)

I03 – CSIRO Advances in GPU Computing. What could you do with 256 GPUs?
The Commonwealth Scientific and Industrial Research Organisation (CSIRO) is Australia’s national science agency. CSIRO is currently applying GPU computing on a scale ranging from single GPU workstations through to their 256 GPU cluster. This poster showcases some of CSIRO’s work in the areas of GPU-accelerated biological imaging, image deconvolution, synchrotron science and CT reconstruction, and statistical inference in complex environmental models. Speedups of between 8x and 230x have been seen across these application areas using a broad range of GPU computing platforms.
Author: Luke Domanski (CSIRO)

I04 – High Performance Agent-Based Simulation with FLAME for the GPU
The Flexible Large-scale Agent Modelling Environment for the GPU (FLAME GPU) addresses the performance and architecture limitations of previous work by presenting a flexible framework approach to agent-based modelling (ABM) on the GPU. Most importantly, it addresses the issue of agent heterogeneity through the use of a state machine based agent representation. This representation allows agents to be separated into associated state lists which are processed in batches, allowing a very diverse population of agents whilst avoiding large divergence in parallel code kernels. The use of the GPU allows AB models to be visualised in real time, which further widens the application of ABM to real-time simulations.
Author: Paul Richmond (University of Sheffield)

I05 – The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite
SHOC is a benchmark suite for heterogeneous systems. This poster describes the suite and presents recent performance measurements.
Author: Kyle Spafford (Oak Ridge National Lab)

I06 – HyperFlow: An Efficient Dataflow Architecture for Multi CPU-GPU Systems
We propose a new pipeline architecture that can take advantage of the many processing elements available in modern CPU-GPU systems to maximize performance in visualization and computational tasks. Our architecture is very flexible and allows the construction of classical parallel algorithms such as data streamers and map/reduce templates. We also discuss examples and performance benchmarks that demonstrate the potential of our system.
Author: Huy Vo (University of Utah)

I07 – MPI-CUDA Applications Checkpointing
We propose a checkpoint/restart tool for multi-GPU applications such as MPI-CUDA applications.
Author: Nguyen Toan (Tokyo Institute of Technology)

I08 – Particle Simulations using DEM on GPUs
Particle-based numerical methods have been an emerging field since the GPU/CUDA technique became widely accepted in recent years.
80% of all material used in pharmaceutical technology is powder. Numerical simulation of such material is possible using the Discrete Element Method (DEM). The main restriction here is compute power together with the problem size. Only a few tens of thousands of particles lead to weeks to months of compute time in order to reflect processes lasting a few minutes in real time. DEM scales excellently in the massively parallel CUDA environment, enabling us to access the million-particle range in acceptable job runtimes.
Author: Charles Radeke (University Graz)

I09 – Mastering Multi-GPU Computing on a Torus Network
We describe APEnet+, the new generation of our 3D torus network which scales up to tens of thousands of cluster nodes with linear cost. The basic component is a custom PCIe adapter with six high-speed links, designed around a programmable HW component (FPGA), a nice environment for studying integration techniques between GPUs and network interfaces. The high-level programming model is MPI, while a low-level RDMA API is also available.
Author: Davide Rossetti (National Institute of Nuclear Physics)

I10 – Poster: Atmospheric Modelling, Simulation and Visualization using CUDA
The Laboratory Meteorological Dynamics (LMD) weather model by CNRS is used extensively for research and weather forecasting purposes.
Simulation of atmospheric climate is one of the most challenging computational tasks because of its numerical complexity and simulation time. To be usable in decision support, the numerical simulations must run faster than real time.
Author: Priyanka Sah (Indian Institute of Technology, Delhi)

I11 – Automatic Program Generation for the Fermi – DFT Transform
The goal of SPIRAL is to push the limits of automation in software and hardware development and the optimization of numerical kernels beyond what is possible with current tools. In this research, we address the problem of automatically generating efficient high-performance libraries for NVIDIA GPU architectures. Spiral generates code that automatically bypasses the architectural restrictions on GPUs (shared memory bank conflicts, global memory coalescing) and pushes code to the limits (maximum number of threads, register pressure, etc.). The procedure of code generation is fast, platform dependent, easy to rewrite and problem adaptable.
Author: Christos Angelopoulos (Carnegie Mellon University)

I12 – Fast N-body Algorithms for Dynamic Problems on the GPU
We present an extension of the earlier algorithm by Gumerov & Duraiswami (J. Comput. Phys., 2008) which adapts the FMM to the GPU, where the data structures are efficiently generated on the GPU as well. Details and performance on current architectures will be presented.
Author: Qi Hu (University of Maryland)

I13 – GPU Acceleration of Cube Calculus Operations
In our current work, we present the first massively parallel, GPU-accelerated implementation of the Cube Calculus operations for multivalued and binary logic, also called the Cube Calculus Machine (CCM). Substantial speedups of up to the order of 85x are achieved using the CUDA-enabled NVIDIA Tesla GPU compared to the CPU implementation on a sequential processor. CC is a very efficient and convenient mathematical formalism for the representation, processing and synthesis of binary and multivalued logic, which has significant applications in logic synthesis, image processing and machine learning. Thus, the massive speedups achieved using GPUs are very encouraging for building future parallel VLSI EDA systems.
Author: Vamsi Parasa (Portland State University)

I14 – An Atomic Tesla
We examined the possibility of using an Atom-based host system to control a Tesla S1070. Our simple benchmarks found that Atom-based systems should be viable for codes with serial portions small enough to make Amdahl’s Law irrelevant. Such systems would have a much lower power draw than ‘traditional’ GPU clusters.
Author: Richard Edgar (Massachusetts General Hospital)
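
The Amdahl's Law argument mentioned in the I14 abstract can be made concrete with a small calculation. The following is a minimal sketch and is not taken from the poster: the function names, the `host_slowdown` parameter and the sample values are hypothetical, chosen only to show how a slower host CPU affects the overall speedup when the serial portion is small.

```python
def amdahl_speedup(parallel_fraction, accel_factor):
    """Overall speedup when `parallel_fraction` of the runtime is
    accelerated by `accel_factor` (classic Amdahl's Law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if accel_factor <= 0:
        raise ValueError("accel_factor must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accel_factor)


def slow_host_speedup(parallel_fraction, accel_factor, host_slowdown):
    """Hypothetical variant: the serial part additionally runs
    `host_slowdown` times slower, e.g. on a low-power host CPU."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction * host_slowdown
                  + parallel_fraction / accel_factor)


if __name__ == "__main__":
    # Illustrative numbers only: 99.9% of the work offloaded, 100x on the GPU.
    print(amdahl_speedup(0.999, 100.0))
    print(slow_host_speedup(0.999, 100.0, 4.0))
```

With a host slowdown of 1 the second function reduces to the first, so the gap between the two results shows how much a slow host costs for a given serial fraction.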

I15 – ICHEC’s GPU Research: Porting of Scientific Application on NVIDIA GPU
ICHEC is the Irish National HPC centre, with a mission to provide both high performance computing resources and expertise for the Irish research community. In addition to its core mission of research enablement, ICHEC started an exploratory activity in GPGPU and CUDA programming in May 2009. Quantum Espresso is an increasingly popular molecular dynamics package, mainly developed by the DEMOCRITOS group in Trieste (IT). PWscf is the part of the Quantum Espresso suite which performs electronic and ionic structure calculations. An interesting part of the PWscf port is a high-performance [ZD]gemm which executes in parallel across the CPU and GPU.
Author: Ivan Girotto (Irish Centre for High-End Computing)

I16 – Implementation of Smith-Waterman algorithm in OpenCL for GPUs
This poster presents an implementation of the Smith-Waterman algorithm in OpenCL. The implementation is capable of computing similarity indexes between query sequences and a reference sequence, with or without sequence alignment paths. In accordance with the requirements of the target application in cancer research, the implementation supports processing of very long reference sequences (on the order of millions of nucleotides). Performance compares favorably against the CPU, being on the order of 14 to 610 times faster, and 4.5 times faster than Farrar’s implementation. It is also on par with CUDASW++ v2.0.1 performance, but with fewer constraints on sequence length.
Author: Dzmitry Razmyslovich (Institute of Computer Engineering, University of Heidelberg)
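
For readers unfamiliar with the Smith-Waterman recurrence referred to in I16, here is a minimal sequential sketch in Python. It is not the poster's OpenCL code: it computes only the best local alignment score (no alignment path), uses a simple linear gap penalty, and the scoring values `match`, `mismatch` and `gap` are assumptions picked for illustration.

```python
def smith_waterman_score(query, reference, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences.

    Assumed scoring scheme (illustrative, not from the poster):
    linear gap penalty, no traceback. Only one previous row of the
    dynamic-programming matrix is kept in memory.
    """
    prev = [0] * (len(reference) + 1)
    best = 0
    for q in query:
        curr = [0]  # first column is always 0 for local alignment
        for j, r in enumerate(reference, start=1):
            diag = prev[j - 1] + (match if q == r else mismatch)
            up = prev[j] + gap        # gap in the reference
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)
            curr.append(cell)
            if cell > best:
                best = cell
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTTT"))
```

Keeping a single previous row makes the memory use proportional to the reference length rather than to the full matrix, which is one common way to cope with long sequences when only the score is needed.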

I17 – Computing Strongly Connected Components in Parallel on CUDA
The problem of decomposing a directed graph into its strongly connected components is a fundamental graph problem inherently present in many scientific and commercial applications. We show how existing parallel algorithms can be reformulated in order to be accelerated by NVIDIA CUDA technology. We design a new CUDA-aware procedure for pivot selection and we redesign the parallel algorithms in order to allow for CUDA-accelerated computation. We experimentally demonstrate that with a single GTX 280 GPU card we can easily outperform the optimal serial CPU algorithm.
Author: Milan Ceska (Masaryk University)

I18 – A CUDA Runtime Target for the Sequoia Compiler
We describe an implementation of the Sequoia Runtime interface in CUDA that enables the Sequoia compiler to target programs written in Sequoia for single and multiple GPU systems.
Author: Michael Bauer (Stanford University)

I19 – GPU Computing for Real-Time Optical Measurement Techniques
Measuring displacement and strain during the deformation of advanced materials which are too small, big, compliant, soft or hot is a typical scenario where non-contact techniques are needed. Using Digital Image Correlation and Tracking, strain can be calculated from a series of consecutive images with sub-pixel resolution. However, the image processing is a computation-intensive task and can’t be performed in real time using general purpose processors. We implemented a 3-stage pipelined architecture: images are loaded, preprocessed on the CPU, and correlated on GPUs. Using two GTX295 cards we were able to reach a 35x speedup compared to the fastest Core i7 processor.
Author: Suren Chilingaryan (Karlsruhe Institute of Technology)

I20 – An MPI/CUDA Implementation of Discontinuous Galerkin Time Domain Method for Maxwell’s Equations
We describe an MPI/CUDA approach to solving Maxwell’s equations in the time domain by means of an Interior Penalty Discontinuous Galerkin Time Domain Method and a local time stepping algorithm. We show that MPI/CUDA provides a 10x speedup versus MPI/CPU, in double precision. Moreover, we present scalability results and an 85% parallelization efficiency up to 40 GPUs on the Glenn cluster of the Ohio Supercomputing Center. Finally, we study an electromagnetic cloaking example for a broadband signal (8-11 GHz), to show the potential of our approach to solve real-life examples in short simulation times.
Author: Stylianos Dosopoulos (Ohio State University)

I22 – Development and Application of a Peta-Scale GPU Cluster for Multi-Scale Discrete Simulation – Mole-8.5
Mole-8.5 is the world’s first petascale GPGPU supercomputer using Tesla C2050, designed and established in April 2010 by the Institute of Process Engineering (IPE), Chinese Academy of Sciences. It embodies a design philosophy that utilizes the similarity between hardware, software and the problems to be solved, based on the multi-scale method and discrete simulation approaches developed at IPE. With the multi-scale discrete software developed by IPE, Mole-8.5 has already carried out large-scale simulations of high scientific significance covering areas such as chemical engineering, oil exploitation and metallurgy, demonstrating the supercomputer as a paradigm of green computation in innovative architecture.
Author: Xiaowei Wang (Institute of Process Engineering, Chinese Academy of Sciences)

I23 – Early Linpack Performance Benchmarking on IPE Mole-8.5 Fermi GPU Cluster
Linpack is a de facto standard benchmark for supercomputers. We introduce the implementation and tuning technology of the Linpack benchmark on the IPE Mole-8.5 Cluster equipped with NVIDIA Tesla C2050 (Fermi) GPUs, including CPU/GPU overlap, streaming (pipeline) technology and CPU/GPU affinity. As a result, we achieved 207.3 TFlops and the IPE Mole-8.5 Cluster ranked No. 19 on the Top500 June 2010 list. In addition, we analyze the bottleneck of the Linpack benchmark on this system.
Author: Xianyi Zhang (Institute of Software, Chinese Academy of Sciences)

I24 – Atomic Hedgehog: Productive High-Performance Computing with Python
cl.oquence is a new programming language which embeds OpenCL’s semantics into Python as a library, allowing the intermixing of dynamically typed Python code and statically typed OpenCL code and demonstrating new concepts in programming language design. By utilizing automatic type inference and other features, it aims to make programming highly productive without sacrificing any of the performance associated with GPU languages. We describe this system as well as an application of it to large-scale simulations, particularly those used in theoretical neurobiology.
Author: Cyrus Omar (Carnegie Mellon University)

Imaging

J01 – Neurite Detection using CUDA, GPU Accelerated Biological Imaging for High-Content Analysis
The analysis of microscopic neurite structures in images is important for studying the effects of lead compounds on brain diseases or the regeneration of brain cells after trauma. In High-Content Analysis (HCA), 100s to 1000s of microscopy images are processed during automated experiments. The speed of the image processing in these situations greatly affects the workflow throughput. We report some early results on GPU acceleration of the Neurite Detection module in our group’s HCA-Vision. The most time-consuming algorithm steps are accelerated by up to 13.6x, resulting in a 3.3x speedup for the entire algorithm (70% of the theoretical maximum).
Author: Luke Domanski (CSIRO)

J02 – Fast Radon Transform via Fast Non-uniform FFTs on GPUs
Fast Radon Transform is required in X-ray Phase Contrast Tomography performed at the Advanced Light Source, Lawrence Berkeley National Lab. We describe a fast implementation based on fast non-uniform FFTs on GPUs.
Author: Chao Yang (Lawrence Berkeley National Laboratory)

J03 – Projected Conjugate Gradient Solvers on GPU and its Applications
In this work, the focus is specifically on how to speed up the projected CG algorithm using the GPU. It is shown that the projected CG method can be used within the single precision accuracy of the current GPU. One benefit gained through use of the projected CG is that it reduces the total number of matrix-vector multiplications, which is usually a bottleneck for an efficient GPU-based Krylov algorithm. A modified projection-based CG algorithm is further proposed in the thesis, which shows better performance. Numerical results using the GPU are provided to support the proposed algorithm.
Author: Youzuo Lin (Arizona State University)

J04 – Real-time Direct Georeferencing of Images from Airborne Line Scan Cameras
The Norwegian Defence Research Establishment (FFI) is developing a technology demonstrator for airborne real-time hyperspectral target detection. The system includes two nadir-pointing line scan cameras. The line-scanned images are georeferenced in real time by intersecting rays cast from the cameras with a 3D model of the terrain underneath. The georeferenced images may then easily be ortho-rectified (e.g. by using texture mapping in OpenGL) and overlaid on digital maps. This poster presents the performance of a CUDA implementation of the georeferencing method.
Author: Trym Vegard Haavardsholm (Norwegian Defence Research Establishment (FFI))

J05 – CUDA Acceleration of Color Histogram Matching
Histogram matching techniques are methods for adjusting color between a pair of images. They can be used as a preliminary stage for several video applications, such as 3D content creation. In such an application, two cameras separated by a known distance acquire video streams that can be combined in order to compute a depth map. As both cameras capture slightly different scenes, they can be lit by different sources, producing a possible color shift between their streams and thus degrading the quality and the user experience. Our approach considers the use of an NVIDIA 3D broadcast solution system with professional HD cameras.
Author: Antonio Sanz (Universidad Rey Juan Carlos)
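
To illustrate what histogram matching does in general (this is not the poster's CUDA implementation), the sketch below remaps one 8-bit channel of a source image so that its cumulative histogram follows that of a target image. The function names and the flat-list image representation are assumptions made for this example, and both inputs are assumed to be non-empty.

```python
def histogram(values, levels=256):
    """Count how often each intensity level occurs."""
    counts = [0] * levels
    for v in values:
        counts[v] += 1
    return counts


def cdf(counts):
    """Normalized cumulative distribution of a histogram."""
    total = sum(counts)
    running = 0
    out = []
    for c in counts:
        running += c
        out.append(running / total)
    return out


def matching_lut(source, target, levels=256):
    """Lookup table: each source level maps to the smallest target
    level whose CDF value is at least the source CDF value."""
    src_cdf = cdf(histogram(source, levels))
    tgt_cdf = cdf(histogram(target, levels))
    lut = []
    t = 0  # both CDFs are non-decreasing, so t never moves backwards
    for s in range(levels):
        while t < levels - 1 and tgt_cdf[t] < src_cdf[s]:
            t += 1
        lut.append(t)
    return lut


def match_channel(source, target, levels=256):
    """Apply the lookup table to every pixel of the source channel."""
    lut = matching_lut(source, target, levels)
    return [lut[v] for v in source]


if __name__ == "__main__":
    print(match_channel([10, 10, 20, 20], [100, 100, 200, 200]))
```

For a color image the same procedure would typically be applied per channel. The per-pixel lookup in `match_channel` is independent for every pixel, which is the part that maps naturally onto a GPU.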

Life Sciences

K01 – Generalized Linear Model (GLM) Based Quantitative Trait Locus (QTL) Analysis
Relating Genotype to Phenotype in Complex Environments has been identified as one of the grand challenges of plant sciences. Under the umbrella of the iPlant Collaborative, funded by the Plant Science Cyberinfrastructure Collaborative program of the NSF, our goal is to develop a GPU implementation of the General Linear Model (GLM) to statistically link genotype to phenotype and dramatically decrease the execution time for GLM analyses. The GPU-based, highly parallelized Forward Regression stage of the GLM achieved a 177x speedup over the Matlab-based serial version. Results of this study will enable larger, more intensive genetic mapping analyses to be conducted.
Author: Ali Akoglu (University of Arizona)

K02 – GPU-REMuSiC: The Implementation of Constrain Multiple Sequence Alignment on Graphics Processing Unit
We implement the RE-MuSiC tool on multiple GPUs (called GPU-REMuSiC) with NVIDIA CUDA. Using a special model implementation, the DP computation in GPU-REMuSiC running on one and two GeForce GTX 260 cards achieves speedups of more than 75x and 130x, respectively, compared to sequential RE-MuSiC running on an Intel i7 920 CPU.
Author: Chun-Yuan Lin (Chang Gung University)

K03 – The Virtual Heart: Working Towards Interactive CUDA Based Simulations of Cardiac Function
Heart disease is the leading cause of death in the developed world. Despite this, our understanding of cardiac dysfunction is limited. Our goal is to create a realistic virtual model of the heart to develop insight into this clinically important problem. The computational complexity of the ‘virtual heart’ has been prohibitive until very recently. However, the continued development of massive parallelization using CUDA and GPU technology has now made this a realistic and achievable goal.
Author: Stefano Charissis (Victor Chang Cardiac Research Institute)

Machine Learning & Artificial Intelligence

L01 – CUDA Creatures
CUDA Creatures applies parallel algorithms to the iterated Prisoner’s Dilemma, a classic study of the evolution of cooperation. We bring interactivity to parameter space exploration by achieving 600x to 800x speedups on a GTX 260.
Author: Andrew Hershberger (Stanford University)
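
As background for L01, the iterated Prisoner's Dilemma itself is easy to state in code. The sketch below is a plain sequential illustration, not the CUDA Creatures implementation: the payoff values (3, 0, 5, 1) are the commonly used textbook ones and are an assumption here, as are the two example strategies and all function names.

```python
# Assumed textbook payoffs: (my move, their move) -> (my score, their score).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the opponent's previous move."""
    return other_history[-1] if other_history else "C"


def always_defect(own_history, other_history):
    """Defect on every round."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Play `rounds` rounds and return the total scores (a, b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play(tit_for_tat, always_defect, 10))
```

Each pairing of strategies is an independent game, so a parameter sweep over many strategy pairs is a natural candidate for running in parallel.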

Medical Imaging & Visualization

M01 – Real-time Ultrasound Data Processing for Regional Anesthesia Guidance
Ultrasound imaging techniques such as Doppler flow imaging and acoustic radiation force impulse (ARFI) imaging require estimation of velocity or displacement from the received echoes. Real-time processing and display of images allows for real-time guidance of procedures, improving patient safety and efficacy. Using CUDA, the processing code has been implemented in pre-clinical regional anesthesia studies investigating new methods for localizing where fluid is being injected. The computation time has been reduced from 20 minutes to 18 seconds, resulting in the rapid display of dynamic images of the fluid being injected.
Author: Stephen Rosenzweig (Duke University)

M02 – GPU-Accelerated Texture Decompression of Biomedical Image Stacks
Histopathology is the microscopic examination of tissue in order to study the manifestations of disease. High-resolution images are vital for accurate diagnoses, and a major obstacle to the use of digital imaging in histopathology has been the inability to display these large images at interactive rates. We have created a tool for interactive visualization of biomedical image stacks using GPU-accelerated on-the-fly texture decompression. The image stacks are compressed using a novel approach custom-tailored for the data we are dealing with, i.e. data exhibiting exceptionally high coherence between the slices of each image stack.
Author: Chirantan Ekbote (Harvard University)

M03 – Accelerated Large Scale Spherical Model Forward Solutions for the EEG/MEG using CUDA
The study presented in the poster looks at the utility of a CUDA-based approach to improve the computational speed of the spherical model EEG and MEG forward solution for large-scale 3-D dipole grids (on the order of 1000 and up) and sensor locations (on the order of 100 and up). Fast computation of the forward solution is critical in improving the speed of the inverse solution in biosource imaging. The inverse solution gives the location of the epileptogenic foci from the EEG and MEG measurements.
Author: Nitin Bangera (MIND Research Network)

M04 – CUDA Accelerated Real Time Volumetric Cardiac Image Enhancement
CUDA enables high data rate real time volumetric cardiac ultrasound image enhancement. Substantial improvements in processing data rate and memory bandwidth demand over a CPU based approach were found with CUDA.
Author: Ismayil Guracar (Siemens Medical Solutions)

M05 – Efficient Visualization of Salient Manifolds in Scalar, Vector, and Tensor Fields
Our research focuses on harnessing the massively parallel compute power of the GPU to visually explore complex datasets. We propose adaptive GPU-based approaches that intertwine computation and rendering. Alongside these, we present novel dynamic data structures for the GPU. Our research includes the visualization of salient structures in vector fields using LCS, extraction of ridge and valley surfaces from volumetric scalar fields with scale analysis, and efficient volume / surface rendering.
Author: Samer Barakat (Purdue University)

M06 – Highly Parallel Image Reconstruction for Positron Emission Tomography (PET)
We present a novel method of computing line projection operations required for list-mode ordered-subsets expectation-maximization (OSEM) for fully 3-D PET image reconstruction on a GPU using the CUDA framework. Our method overcomes challenges such as compute thread divergence and exploits GPU capabilities such as shared memory and atomic operations. This new GPU-CUDA implementation is 120X faster than a reference CPU implementation. The image quality is preserved with root mean squared (RMS) deviation between the images generated using the CPU and the GPU being 0.08%, which has negligible effect in typical clinical applications.
Author: Jingyu Cui (Stanford University)

Molecular Dynamics

N01 – Energy Evaluation of Rosetta Proteins Using CUDA
In this poster, we describe preliminary results using CUDA to accelerate the energy evaluation of proteins folded by the Rosetta software suite.
Author: Will Kohut (University of California, Davis)

N02 – GPU Accelerated Molecular Dynamics Algorithms for Soft Matter Systems using HOOMD-Blue
The rheological, thermodynamic, and self-assembly behavior of liquids, colloids, polymers, foams, gels, granular materials and biological systems is often studied in simulation by using coarse-grained models based on molecular dynamics algorithms. The open source general purpose particle dynamics code HOOMD-Blue has been expanded to include the simulation techniques and pair potentials used to study this class of problems.
Author: Carolyn Phillips (University of Michigan)

N03 – Accelerating Molecular Modeling using GPUs
Computing electrostatic interactions in a biomolecule contributes towards the understanding of its structure and function, e.g., ligand binding, complex formation, and proton transport. However, such calculations on a desktop computer can take on the order of days, or even weeks, to run. Consequently, scientists seek to either reduce the algorithmic complexity, massively accelerate the computation with a GPU, or both. Our approach, based on an analytical linearized Poisson Boltzmann algorithm, delivers a 120-fold speed-up on a GPU (vs. a CPU-optimized -O3 with hand-tuned SSE). When combined with our hierarchical charge partitioning (HCP) multiscale method, however, the delivered speed-up approaches 20,000-fold.
Author: Wuchun Feng (Virginia Tech)

Neuroscience

O01 – Distributed Multi-Level Out-of-Core Volume Rendering
In neuroscience, scans of brain tissue are acquired using electron microscopy, resulting in extremely high-resolution volume data with sizes of many terabytes. To support the work of neurobiologists, interactive exploration of such volumes requires new approaches for distributed out-of-core volume rendering. A major goal of our distributed GPU volume rendering system is to sustain a pixel-to-voxel ratio of about 1:1. This display-aware approach effectively bounds the working set size required for ray-casting, which makes it largely independent of the volume resolution. Currently, our system achieves interactive volume rendering of 43GB and 92GB volumes on 1 to 8 Tesla nodes.
Author: Markus Hadwinger (King Abdullah University of Science and Technology)

Programming Languages & Techniques

P01 – GPU-to-CPU Callbacks
Our poster outlines GPU-to-CPU callbacks, a method for the GPU to request work from the CPU. We give some motivation, demonstrate the code architecture, and give samples of CPU and GPU code that show callbacks being executed.
Author: Jeff Stuart (University of California, Davis)

Physics Simulation

Q01 – Acceleration of Computational Electromagnetics Physical Optics – Shooting and Bouncing Ray Method
Electromagnetic fields radiated by a 1964 Ford Thunderbird are calculated over 50 times faster than on a standard CPU by using a Quadro FX 5800 GPU.
Author: Huan-Ting Meng (University of Illinois at Urbana-Champaign)

Q02 – Massively Parallel Micromagnetic FEM Calculations with Graphical Processing Units
We adapted our Micromagnetic Simulator “TetraMag” to NVIDIA’s CUDA architecture, resulting in a significant increase in calculation speed and cost efficiency over the most recent PC-based machines. The poster gives an outline of the general challenges and the methods used to adapt the solutions to GPUs as well as benchmark results obtained using standard micromagnetic problems.
Author: Elmar Westphal (Forschungszentrum Juelich)

Q03 – Multiplying Speedups: GPU-Accelerated Fast Multipole BEM, for Applications in Protein Electrostatics
We have developed a fast multipole boundary element method (BEM) for biomolecular electrostatics. With GPU acceleration of the FMM, there is a multiplicative speed-up resulting from the fast O(N) algorithm and GPU hardware. With this method, we can obtain converged results for multi-million atom systems in less than an hour, using multi-GPU clusters.
Author: Lorena Barba (Boston University)

Q04 – GPU-Powered Control of a Compliant Humanoid Robot
The ECCEROBOT project deals with the construction and control of a robot with a humanoid skeleton and muscle-like compliant, elastic actuators. The nonlinear passive and active coupling between the skeletal elements, combined with the effect of environmental interaction, presents an extremely complex control problem. Our solution: motor programs are found using physics-based simulation of both the robot and its environment to locate candidate movements. For real-time control, multiple copies of the simulation must be run faster than real time, requiring the use of GPU acceleration. Further, in order to capture the environment we use GPU-accelerated dense reconstruction vision.
Author: Alan Diamond (University Of Sussex, UK)

Programming Languages & Techniques

R01 – A Speech Recognition Application Framework for Highly Parallel Implementations on the GPU
Data layout, data placement, and synchronization processes are not usually part of a speech application expert’s daily concerns. Yet failure to carefully take these concerns into account in a highly parallel implementation on the graphics processing units (GPU) could mean an order of magnitude of loss in application performance. We present an application framework for parallel programming of automatic speech recognition (ASR) applications that allows a speech application expert to effectively implement speech applications on the GPU, and demonstrate how the ASR application framework has enabled a Matlab/Java programmer to achieve a 20x speedup in application performance on a GPU.
Author: Jike Chong (Parasians, LLC)

R02 – Scalable Computer Vision Applications
We are developing a domain specific language for computer vision algorithms that facilitates rapid implementation of algorithms that are scalable and portable across CPU-GPU architectures. The presented approach significantly lowers the barrier of implementation of computer vision algorithms for heterogeneous CPU-GPU architectures, and enables a single implementation to automatically scale to use additional hardware as it becomes available.
Author: Rami Mukhtar (NICTA)

R03 – Language and Compiler Extensions for Heterogeneous Computing
GPGPU architectures offer large performance gains over their traditional CPU counterparts for many applications. However, current GPU programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and performance optimization challenges. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPUs using the high productivity programming language Chapel.
Author: Albert Sidelnik (University of Illinois at Urbana-Champaign)

Signal processing

S01 – Achieving 1 TFLOP for the Radio Astronomy Correlator
In this work we apply CUDA, using the Fermi architecture, to the problem of cross-correlation arising in radio astronomy. This accounts for the bulk of computation in radio astronomy, and is essentially described by vector outer products. Traditionally this task is performed using FPGAs, and the goal of this work was to see how efficiently GPUs could be used for this task. We describe the tiling strategies and optimization techniques employed to maximize performance. We achieve in excess of 1 teraflop per second using a single GeForce GTX 480, which corresponds to 78% of peak performance.
Author: Michael Clark (Harvard University)
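
The statement in S01 that cross-correlation is "described by vector outer products" can be sketched as follows. This is a plain Python illustration under assumptions, not the poster's tiled CUDA kernel: each time sample is taken to be a vector with one complex value per antenna, and the function name and data layout are hypothetical.

```python
def accumulate_correlation(samples):
    """Sum the outer product x * conj(x)^T over all time samples.

    `samples` is a non-empty list of per-time-step vectors, each holding
    one complex value per antenna. Returns an n x n matrix (list of lists).
    """
    n = len(samples[0])
    vis = [[0j] * n for _ in range(n)]
    for x in samples:
        if len(x) != n:
            raise ValueError("all sample vectors must have the same length")
        for i in range(n):
            for j in range(n):
                vis[i][j] += x[i] * x[j].conjugate()
    return vis


if __name__ == "__main__":
    data = [[1 + 0j, 1j], [2 + 0j, 1 + 0j]]
    for row in accumulate_correlation(data):
        print(row)
```

The result is Hermitian (entry `[j][i]` is the complex conjugate of entry `[i][j]`), so an optimized implementation would only need to compute one triangle of the matrix.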

S02 – CUDA Implementation of Software for Identifying Post-Translational Modifications
InsPecT is software for identifying post-translational modifications (PTMs) of proteins. With the help of the MS-Alignment algorithm, InsPecT can search for PTMs in unrestrictive mode and even reveal unknown types of modifications. However, MS-Alignment has a tremendous time complexity and takes more than 99% of the computing time of InsPecT. We accelerated MS-Alignment on GPUs. After optimization and parallelization with MPI, the result is cuda-InsPecT, a new, highly efficient open source software package based on MPI+CUDA.
Author: Long Wang (Supercomputing Center, Chinese Academy of Sciences)

Tools & Libraries

U01 – Mint: An OpenMP to CUDA Translator
We aim to facilitate GPU programming for finite difference applications. We have developed Mint, a source-to-source compiler that generates CUDA code from OpenMP code. Mint transforms omp parallel for loops into CUDA kernels and applies domain-specific optimizations such as shared memory, register and kernel fuse optimizations. Since our translator targets structured grid problems, it optimizes the code better than general-purpose compilers. In this poster, we present the translation and optimization steps along with our initial performance results.
Author: Didem Unat (University of California, San Diego)

U02 – Real-Time Particle Simulation in the Blender Game Engine with OpenCL
The goal of this project is to produce interactive scientific visualizations that can be used in educational games. We use the computational power of OpenCL to enable features in the Blender Game Engine that would otherwise not be possible in real-time. By adding an interactive particle system to the game engine, we set the stage to demonstrate many interesting scientific phenomena (molecular dynamics, fluid dynamics, statistics) with the added benefit of real-time special effects for games in general.
Author: Ian Johnson (Florida State University)

U03 – GStream: A General-Purpose Data Streaming Framework on GPU Clusters
In this poster, we propose GStream, a general-purpose, scalable data streaming framework on GPUs. The contributions of GStream are as follows: (1) We provide powerful, yet concise language abstractions suitable to describe conventional algorithms as streaming problems. (2) We project these abstractions onto GPUs to fully exploit their inherent massive data-parallelism. (3) We demonstrate the viability of streaming on accelerators. Experiments show that the proposed framework provides flexibility, programmability and performance gains for various benchmarks from a variety of domains, including but not limited to data streaming, data parallel problems, numerical codes and text search.
Author: Yongpeng Zhang (North Carolina State University)

U04 – NukadaFFT : An Auto-Tuning FFT Library for CUDA GPUs
We have released our FFT library for CUDA GPUs. Most of the algorithms and auto-tuning technologies of FFT for CUDA are already published. The library now supports the new Fermi architecture and works with CUDA 3.0 or later.
Author: Akira Nukada (Tokyo Institute of Technology)

Video Processing

V01 – Real-Time Color Space Conversion for High Resolution Video
Color space conversion or color correction is a widely used technique to adapt the color characteristics of video material to the display technology employed (e.g. CRT, LCD, projection) or to create a certain artistic look. As color correction is often an interactive task and colorists need a direct response, state-of-the-art real-time color correction systems for video have so far been based on expensive dedicated hardware. This submission shows the feasibility of replacing dedicated color correction systems with General Purpose GPUs. It is shown that a single Tesla C2050 GPU supports real-time color correction up to a resolution of 4096×2048 pixels.
Author: Klaus Gaedke (Technicolor)
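
As a concrete example of a color space conversion of the kind V01 discusses, the sketch below converts normalized RGB values to Y'CbCr using the ITU-R BT.709 luma coefficients. The choice of BT.709 is an assumption made for illustration, since the abstract does not say which conversions the poster implements, and the function names are hypothetical.

```python
# ITU-R BT.709 luma coefficients (assumed here for illustration).
KR, KG, KB = 0.2126, 0.7152, 0.0722


def rgb_to_ycbcr709(r, g, b):
    """Convert one pixel from normalized R'G'B' (0..1) to Y'CbCr.

    Y' is in 0..1; Cb and Cr are in -0.5..0.5.
    """
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2.0 * (1.0 - KB))
    cr = (r - y) / (2.0 * (1.0 - KR))
    return y, cb, cr


def convert_frame(pixels):
    """Convert a frame given as a flat list of (r, g, b) tuples."""
    return [rgb_to_ycbcr709(r, g, b) for (r, g, b) in pixels]


if __name__ == "__main__":
    print(convert_frame([(1.0, 1.0, 1.0), (0.0, 0.0, 1.0)]))
```

Every pixel is converted independently of all others, and a 4096×2048 frame contains 8,388,608 of them, which is why this kind of workload suits a GPU.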

V02 – 3D Object Detection in Digital Holographic Microscope Images
Digital Holographic Microscopy (DHM) is based on the classical holographic principle invented by Hungarian physicist Dennis Gabor. The holographic images are acquired by a CCD camera. Depth slices can be reconstructed using the Fourier transform. The numerical reconstruction and further image processing for object detection are done using General Purpose Graphical Processor Units (GPGPU).
Author: Vilmos Szabo (Pazmany Peter Catholic University)