OpenCL Studio 1.13 – Geist Software Labs Inc.

Geist Software Labs Inc. has released a new development environment for OpenCL known as OpenCL Studio. OpenCL Studio combines OpenCL and OpenGL into a single integrated development environment that lets you visualize OpenCL computation using 3D rendering techniques. It allows you to develop OpenCL programs in real time using integrated source code editors, compilers and profilers; analyze work progress using predefined visualization capabilities based on OpenGL interoperability; and build sophisticated prototyping environments using the Lua scripting language and a toolbox of user interface widgets.

Key Features of OpenCL Studio
  • Tight integration of OpenCL and OpenGL into a single integrated development environment for high performance computing.
  • Feature-rich source code editors and Lua scripting capabilities for the rapid development of kernels, shaders and control logic.
  • Toolbox of predefined 3D rendering constructs and 2D user interface widgets for complex visualization tasks and interactivity.
  • Flexible plug-in architecture to import proprietary data and integrate with existing processing pipelines.
  • Growing library of parallel algorithms for common parallel processing and visualization tasks.
  • Cross-platform runtime engine.


On February 28, 2011, NVIDIA® announced CUDA® 4.0, a major toolkit release for developing parallel applications on NVIDIA GPUs.

The CUDA Toolkit 4.0 is designed to make parallel programming easier and to enable more developers to port their applications to GPUs.
CUDA 4.0 introduces three distinctive features:

  • NVIDIA GPUDirect™ 2.0 Technology: Offers support for peer-to-peer communication among GPUs within a single server or workstation. This makes multi-GPU programming easier and faster and improves application performance.
  • Unified Virtual Addressing (UVA): Provides a single merged-memory address space for the main system memory and the GPU memories, enabling quicker and easier parallel programming.
  • Thrust C++ Template Performance Primitives Libraries: Provides a collection of powerful open source C++ parallel algorithms and data structures that ease programming for C++ developers. According to NVIDIA, routines such as parallel sorting are 5X to 100X faster with Thrust than with the Standard Template Library (STL) and Threading Building Blocks (TBB).

The CUDA 4.0 architecture release includes a number of other key features and capabilities, including:

  • MPI Integration with CUDA Applications
    Modified MPI implementations automatically move data to and from the GPU memory over InfiniBand when an application does an MPI send or receive call.
  • Multi-thread Sharing of GPUs
    Multiple CPU host threads can share contexts on a single GPU, making it easier for multi-threaded applications to share a single GPU.
  • Multi-GPU Sharing by Single CPU Thread
    A single CPU host thread can access all GPUs in a system. Developers can easily coordinate work across multiple GPUs for tasks such as “halo” exchange in applications.
  • New NPP Image and Computer Vision Library
    A rich set of image transformation operations that enable rapid development of imaging and computer vision applications.
  • New and Improved Capabilities
    • Auto performance analysis in the Visual Profiler
    • New features in cuda-gdb and added support for Mac OS
    • Added support for C++ features like new/delete and virtual functions
    • New GPU binary disassembler
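
The "halo" exchange mentioned in the list above can be illustrated without any GPU at all. The following is a minimal Python sketch, not CUDA code: the two lists stand in for the memory of two GPUs, and the function names (`exchange_halos`, `smooth`, `run_split`, `run_single`) are hypothetical helpers invented for this illustration. Each partition carries one extra "ghost" cell that mirrors its neighbour's edge value, and a single host thread refreshes those cells before every step.

```python
def exchange_halos(left, right):
    """Copy each partition's edge cell into the neighbour's ghost cell."""
    left[-1] = right[1]   # right's first owned cell -> left's trailing ghost
    right[0] = left[-2]   # left's last owned cell  -> right's leading ghost


def smooth(part):
    """Three-point average of the inner cells; both end cells are left as-is."""
    inner = [(part[i - 1] + part[i] + part[i + 1]) / 3.0
             for i in range(1, len(part) - 1)]
    return [part[0]] + inner + [part[-1]]


def run_split(domain, steps):
    """Smooth the domain as two partitions coordinated by one host loop."""
    mid = len(domain) // 2
    left = domain[:mid] + [0.0]    # owned cells plus a trailing ghost cell
    right = [0.0] + domain[mid:]   # leading ghost cell plus owned cells
    for _ in range(steps):
        exchange_halos(left, right)            # the "halo" exchange
        left, right = smooth(left), smooth(right)
    return left[:-1] + right[1:]               # drop ghosts, stitch together


def run_single(domain, steps):
    """Reference: the same smoothing on one undivided array."""
    data = list(domain)
    for _ in range(steps):
        data = smooth(data)
    return data


if __name__ == "__main__":
    domain = [float(i * i) for i in range(10)]
    print(run_split(domain, 5) == run_single(domain, 5))
```

The split run matches the undivided run only because the ghost cells are refreshed before each step; on real hardware that refresh is the part that would travel between GPUs.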

A release candidate of CUDA Toolkit 4.0 will be available free of charge beginning March 4, 2011, by enrolling in the CUDA Registered Developer Program.

GpuCV: GPU-accelerated Computer Vision

GpuCV is an open-source GPU-accelerated image processing and computer vision library. It offers a programming interface modeled on Intel's OpenCV, which makes it easy to port existing OpenCV applications while taking advantage of the high level of parallelism and computing power available from recent graphics processing units (GPUs). It is distributed as free software under the CeCILL-B license.

GpuCV offers GPU-accelerated replacement routines that are fully compatible with their OpenCV counterparts. Programmers of image processing applications do not have to deal with the graphics context or the hardware. Example applications are provided. Operator developers may access graphics functionality through the GpuCV framework, which automatically manages shader programs, textures and advanced OpenGL extensions.


TPC (Texture Processing Cluster)

The TPC is a concept found on NVIDIA GPUs. On the G80 and GT200 architectures, a TPC, or Texture / Processor Cluster, is a group made up of several SMs, a texture unit and some control logic.

An SM is a Streaming Multiprocessor. It is made up of several SPs (Streaming Processors) and several SFUs (Special Function Units, used for transcendental functions such as sine or cosine). A Streaming Processor is also called a CUDA core in the new Fermi terminology.

The TPC of a G80 GPU has 2 SMs while the TPC of a GT200 has 3 SMs.

An SP includes several ALUs and FPUs. An ALU is an Arithmetic Logic Unit and an FPU is a Floating Point Unit. The SP is the actual processing element that acts on vertex or pixel data.

Several TPCs can be grouped into a higher-level entity called a Streaming Processor Array.

In OpenCL terminology, an SM is called a Compute Unit or CU.

In NVIDIA's new GPU, the GF100 / Fermi, the TPC no longer exists as a separate level: only the SMs remain. Put another way, on the Fermi architecture a TPC is equivalent to a single SM.

In the Fermi architecture, an SM is made up of two 16-way SIMD units. Each 16-way SIMD unit has 16 SPs, so a Fermi SM has 32 SPs, or 32 CUDA cores.
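
The hierarchy described above (clusters containing SMs, SMs containing SPs) can be turned into a quick core-count calculation. This is only a sketch: the SMs-per-TPC figures and the 2 × 16 Fermi layout come from the text, while the cluster counts (8 for G80, 10 for GT200), the 8 SPs per SM on G80 and GT200, and the 16 SMs of a full GF100 are assumptions taken from NVIDIA's published full-chip specifications rather than from this article. The `GpuLayout` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    name: str
    clusters: int         # TPCs; on Fermi each "cluster" is just one SM
    sms_per_cluster: int  # 2 on G80, 3 on GT200 (from the text)
    sps_per_sm: int       # SPs (CUDA cores) per SM

    @property
    def sm_count(self) -> int:
        return self.clusters * self.sms_per_cluster

    @property
    def sp_count(self) -> int:
        return self.sm_count * self.sps_per_sm


# Cluster counts and G80/GT200 SPs-per-SM are assumptions (full-chip specs).
G80 = GpuLayout("G80", clusters=8, sms_per_cluster=2, sps_per_sm=8)
GT200 = GpuLayout("GT200", clusters=10, sms_per_cluster=3, sps_per_sm=8)
# Fermi: TPC = SM, and each SM has two 16-way SIMD units.
GF100 = GpuLayout("GF100", clusters=16, sms_per_cluster=1, sps_per_sm=2 * 16)

if __name__ == "__main__":
    for gpu in (G80, GT200, GF100):
        print(f"{gpu.name}: {gpu.sm_count} SMs, {gpu.sp_count} SPs")
```

Shipping products may enable fewer units than the full chip, so these totals describe the architecture rather than any particular card.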

Prior GPUs used IEEE 754-1985 floating-point arithmetic. The Fermi architecture implements the newer IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves on a multiply-add (MAD) instruction by performing the multiplication and addition with a single final rounding step, so the intermediate product is not rounded before the addition. FMA is therefore more accurate than performing the two operations separately. GT200 already implemented FMA, but only for double precision.
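
The effect of the single rounding step can be reproduced on a CPU. The sketch below is an illustration in Python, not GPU code: `mad` rounds the product and then the sum (two roundings), while `fma_emulated` is a hypothetical helper that computes the exact result with rational arithmetic and rounds once, which is what a hardware FMA does. Python floats are IEEE double precision, so the example uses doubles.

```python
from fractions import Fraction


def mad(a, b, c):
    """Multiply, round, then add and round again (two rounding steps)."""
    return a * b + c


def fma_emulated(a, b, c):
    """Compute a*b + c exactly, then round once to the nearest double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
    # so the separate multiply-then-add loses the small term entirely.
    print("separate:", mad(a, b, c))
    print("fused:   ", fma_emulated(a, b, c))
```

With these inputs the separate version returns zero, while the fused version keeps the residual term of magnitude 2^-60.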

In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for full 32-bit integer multiplication. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse, and population count.
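
To see why a 24-bit multiplier forces multi-instruction sequences, consider one way a 32-bit product could be assembled from narrower multiplies. This is a hedged sketch, not the instruction sequence GT200 actually used: `mul24` is a hypothetical model of a multiplier that only accepts 24-bit operands, and `mul32_emulated` splits each operand into 16-bit halves so that every partial product fits that restriction.

```python
MASK32 = 0xFFFFFFFF


def mul24(a, b):
    """Hypothetical 24-bit multiplier: 24-bit operands, low 32 bits of result."""
    assert 0 <= a < (1 << 24) and 0 <= b < (1 << 24)
    return (a * b) & MASK32


def mul32_emulated(a, b):
    """Low 32 bits of a*b for unsigned 32-bit a and b, using only mul24."""
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    low = mul24(a_lo, b_lo)                       # contributes bits 0..31
    cross = (mul24(a_hi, b_lo) + mul24(a_lo, b_hi)) & MASK32
    # a_hi * b_hi would be shifted left by 32 bits, so it never reaches
    # the low 32 bits and can be skipped.
    return (low + (cross << 16)) & MASK32


if __name__ == "__main__":
    a, b = 123456789, 987654321
    print(mul32_emulated(a, b) == (a * b) & MASK32)
```

Three multiplies plus shifts, masks and adds replace what is a single instruction on a full 32-bit ALU, which is the kind of overhead the Fermi design removes.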


1,760 PS3s in a Cluster: USAF

The USAF has built a cluster made up of 1,760 Sony PlayStation 3 consoles. Defense Department engineers and scientists claim to have developed the biggest, fastest interactive computer.

The supercomputer, nicknamed the Condor Cluster, will allow very fast analysis of large high-resolution imagery (billions of pixels a minute), cutting work that used to take several hours down to mere seconds. Its sophisticated algorithms will also allow scientists to better identify objects flying in space, where movement and distance create blurring, with higher-quality images than were possible before. Its capacity makes the PlayStation 3 cluster about the 33rd largest computer in the world.

Condor will be housed in upstate New York at a center that is part of the Dayton, Ohio-based Air Force Research Laboratory network; other service branches and centers will be able to access it.