CUDA 5 Release Candidate (Pre-Production)

CUDA 5 introduces several new tools and features that make it easier than ever to add GPU acceleration to your applications, including:

New Nsight, Eclipse Edition helps you explore the power of GPU computing with the productivity of Eclipse on Linux and MacOS

  • Develop, debug, and profile your GPU application all within a familiar Eclipse-based IDE
  • Integrated expert analysis system provides automated performance analysis and step-by-step guidance to fix performance bottlenecks in the code
  • Easily port CPU loops to CUDA kernels with automatic code refactoring
  • Semantic highlighting of CUDA code makes it easy to differentiate GPU code from CPU code
  • Integrated CUDA code samples make it quick and easy to get started
  • Generate code faster with CUDA-aware auto code completion and inline help

GPU callable libraries now possible with GPU Library Object Linking

  • Compile independent sources to GPU object files and link together into a larger application
  • Design plug-in APIs that allow developers to extend the functionality of your kernels
  • Efficient and familiar process for developing large GPU applications
  • Enables 3rd party ecosystem for GPU callable libraries

GPUDirect RDMA provides the fastest possible communication between GPUs and other PCI-E devices

  • Direct memory access (DMA) supported between NIC and GPU without the need for CPU-side data buffering
  • Significantly improved MPISendRecv efficiency between GPU and other nodes in a network
  • Eliminates CPU bandwidth and latency bottlenecks
  • Works with a variety of 3rd party network and storage devices

Dynamic Parallelism enables programmers to easily accelerate parallel nested loops on the new Kepler GK110 GPUs

  • Developers can easily spawn new parallel work from within GPU code
  • Minimizes the back and forth between the CPU and GPU
  • Enables GPU acceleration for a broader set of popular algorithms, including adaptive mesh refinement used in aerospace and automotive computational fluid dynamics (CFD) simulations
  • Supported natively on Kepler II architecture GPUs; a preview programming guide and whitepaper are available today


CUDA compiler goes Open Source

NVIDIA is announcing the latest versions of its CUDA development toolkits. Parallel Nsight, NVIDIA’s Visual Studio development toolkit, has just received its second release candidate for version 2.1. CUDA 4.1 is also being released as a release candidate.

Until now the CUDA compiler toolchain was developed entirely within NVIDIA as a proprietary product; developers could write tools that generated PTX code (NVIDIA’s intermediate virtual ISA), but compiling PTX to binary code was handled by NVIDIA’s tools. Next week at GTC Asia, CUDA 4.1 takes a different route: the CUDA compiler is now being built on LLVM, the modular compiler.

LLVM isn’t a complete compiler by itself: it needs a language front end to feed it code. As a modular compiler, though, it has earned quite a reputation for generating efficient intermediate code and for being easy to extend with new languages and architectures. If you can generate code that goes into LLVM, then you can get out code for any architecture LLVM supports, and it will probably be pretty efficient too. LLVM has been around for quite some time, and is most famously used as the compiler for Mac OS X and iOS starting with Mac OS X 10.6, but this is the first time it has been used for a GPU in this fashion.

Benefits of LLVM for CUDA

  • Immediate benefits include shorter compile times (up to 50% faster) and slightly faster code for CUDA developers. The nature of GPU computing means that application/kernel performance won’t improve nearly as much as compile times (LLVM can’t parallelize your code for you), but it should be able to generate slightly smarter code, particularly from non-NVIDIA languages where developers haven’t been able to invest as much time in optimizing their PTX code generation.
  • Moving to LLVM is not only about immediate performance benefits; it also marks the start of a longer transition by NVIDIA.
  • In NVIDIA’s case, moving to LLVM not only opens up GPU computing to additional developers by making it possible to support more languages, it also allows CUDA developers to build CUDA applications for more architectures than just NVIDIA’s GPUs. Currently it’s possible to compile CUDA down to x86 through The Portland Group’s proprietary x86 CUDA compiler, and the move to LLVM would allow NVIDIA to target not just x86, but ARM too. ARM is in fact more than likely the key to all of this: just as developers want to be able to use CUDA on their x86 + NVGPU clusters, they will want to be able to use CUDA on their Denver (ARM) + NVGPU clusters.

According to the source, NVIDIA will not be releasing CUDA LLVM in a truly open source manner. Instead it will release the source in a manner akin to Microsoft’s “shared source” initiative: eligible researchers and developers will be able to apply to NVIDIA for access to the source code. This allows NVIDIA to share CUDA LLVM with the parties needed to expand its functionality, without sharing it with everyone and exposing the inner workings of the Fermi code generator.


ATI Radeon HD 7000 Series to Hit Q2 2012

ATI Radeon’s next flagship is all set to target Q2 2012.

The ATI Radeon 7k series, codenamed Southern Islands, is expected to be based on the VLIW4 (Very Long Instruction Word) SP arrangement. The same arrangement is used in the high-end HD 6900 series; the HD 6800 series, for reference, was based on the VLIW5 arrangement. Theoretically both architectures provide similar computational power, with VLIW4 offering a reduced die size. Before going further, let’s see what VLIW4 is.

VLIW4 Architecture


VLIW4 is a chip architecture; you can think of it as the GPU’s execution architecture. The 4 in VLIW4 means that a group of 4 SPs (or 4 SIMD processors) forms one functional unit, the AMD Radeon cores (math units) of this design. All 4 SPs have the same computational potential. However, two of them, the 3rd and 4th SP in each group, have a special function that is currently unknown; it may be used for scheduling the other SPs in the group. In the 7k series, each SIMD will now be able to perform two GPE (Graphics Processing Engine) cycles, so a performance increase of almost 3-4x is expected compared to the current generation. VLIW designs execute many operations from the same task in parallel by breaking it up into smaller groupings called wavefronts (in case you are wondering about NVIDIA, this is analogous to a warp of CUDA threads). In AMD’s case a wavefront is a group of 64 pixels/values (an NVIDIA warp, by comparison, is a group of 32 threads) together with the list of instructions to be executed against them. Ideally, in a wavefront a group of 4 or 5 instructions will come down the pipe and be completely non-interdependent, allowing every Radeon core to be fed. [Read more at AnandTech]
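To make the grouping concrete, here is a minimal sketch of how many wavefronts (64 wide) or warps (32 wide) a given number of work items would occupy, and how many lanes in the last group sit idle. The function names `groups_needed` and `idle_lanes` are hypothetical, chosen only for illustration; this is plain arithmetic, not any vendor API.

```python
WAVEFRONT_SIZE = 64  # AMD wavefront width, as described above
WARP_SIZE = 32       # NVIDIA warp width (32 threads)


def groups_needed(work_items: int, group_size: int) -> int:
    """Number of fixed-size groups needed to cover all work items (ceiling division)."""
    if work_items < 0 or group_size <= 0:
        raise ValueError("work_items must be >= 0 and group_size must be > 0")
    return (work_items + group_size - 1) // group_size


def idle_lanes(work_items: int, group_size: int) -> int:
    """Lanes left without work in the final, partially filled group."""
    return groups_needed(work_items, group_size) * group_size - work_items


if __name__ == "__main__":
    n = 1000  # illustrative workload size, not a figure from the article
    print(groups_needed(n, WAVEFRONT_SIZE), idle_lanes(n, WAVEFRONT_SIZE))
    print(groups_needed(n, WARP_SIZE), idle_lanes(n, WARP_SIZE))
```

A workload whose size is not a multiple of the group width always leaves part of the last wavefront or warp unused, which is one reason problem sizes are often padded to a multiple of 64 or 32.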

While the 7k series will be based on VLIW4, the HD 7900 will be built on a different technology, a stepping stone for AMD’s GPU architecture called GCN.

GCN (Graphics Core Next) Architecture

ATI brings in Graphics Core Next (GCN), the architectural basis for AMD’s future GPUs, both for discrete products and for GPUs integrated with CPUs as part of AMD’s APU products. AMD will be introducing a major overhaul of its traditional GPU architecture for future products in order to meet the direction of the market and where it wants to go with its GPUs. The main focus is on the compute capability of the GPUs, targeting high-performance computing for enterprise markets where precision is key. GCN is a completely non-VLIW4 architecture, meaning that it focuses on threads instead of instructions. To be precise, the non-VLIW4 SIMD exploits parallelism at the thread level, known as TLP (Thread Level Parallelism), the approach NVIDIA follows. The new 7k series with the GCN architecture would be able to perform double-precision floating point operations, i.e. FP64. There’s more to it.

Leaked Specification ATI Radeon HD 7k Series


For the first time, the higher-end ATI 7k series GPUs, the 7970 and 7950, will carry XDR™2 memory. The XDR™2 memory architecture is the world’s fastest memory system solution, capable of providing more than twice the peak bandwidth per device compared to a GDDR5-based system. Further, the XDR™2 memory architecture (introduced by Rambus in 2005) delivers this performance at 30% lower power than GDDR5 at equivalent bandwidth.

XDR™2 will provide high performance for gaming, graphics and multi-core compute applications. Each XDR™2 DRAM can deliver up to 80GB/s of peak bandwidth from a single, 4-byte-wide, 20Gbps XDR™2 DRAM device. With this capability, systems can achieve memory bandwidth of over 500GB/s on a single SoC.
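The bandwidth figures above can be checked with a little arithmetic. The sketch below assumes that "20Gbps" is the data rate of each data pin and that a 4-byte-wide device has 32 such pins; that reading is an assumption, though it is the one consistent with the quoted 80GB/s. The function names are hypothetical.

```python
import math


def peak_bandwidth_gb_s(width_bytes: int, rate_gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s, assuming one data pin per bit of interface width."""
    pins = width_bytes * 8                 # 4 bytes wide -> 32 data pins (assumption)
    total_gbit_s = pins * rate_gbps_per_pin
    return total_gbit_s / 8                # gigabits per second -> gigabytes per second


def devices_to_exceed(target_gb_s: float, per_device_gb_s: float) -> int:
    """Smallest number of devices whose combined peak is strictly above the target."""
    return math.floor(target_gb_s / per_device_gb_s) + 1


if __name__ == "__main__":
    per_device = peak_bandwidth_gb_s(4, 20)      # one 4-byte-wide, 20Gbps device
    count = devices_to_exceed(500, per_device)   # devices needed for "over 500GB/s"
    print(per_device, count, count * per_device)
```

Under these assumptions one device peaks at 80GB/s, and seven devices together reach 560GB/s, which matches the "over 500GB/s on a single SoC" claim.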

AMD launching Southern Islands in Q2 2012 looks very exciting. AMD will launch its 7k series on a 28nm GPU with VLIW4 and will later introduce Graphics Core Next. With more and more titles adopting DX11, the 7k series should definitely improve on current tessellation performance and, above all, deliver the long-awaited GPU compute capability.



Parallel NSight Debugging on Single GPU!!

If I remember correctly, in October 2009 (to be very precise 😉) NVIDIA launched its first development environment for massively parallel computing. It lived inside Microsoft Visual Studio, the world’s most popular development environment, and was known as Nexus.

By the way, the product is now known as Parallel Nsight, and the current release is 2.0. NVIDIA® Parallel Nsight™ brings a rich feature set to massively parallel programmers, giving them access to more of the tools and workflows they expect from developing on the CPU: support for Microsoft Visual Studio 2008/2010, support for CUDA Toolkit versions 3.2/4.0, attach-to-process support, PTX/SASS assembly debugging, other advanced debugging and analysis capabilities, and graphics performance and stability enhancements. No doubt the environment is fantastic and helpful in many ways.

In the past, lots of developers and enthusiasts have shown their interest in this particular tool. On top of that, the tool is free of charge for Visual Studio developers. Remember that I’m stressing “Visual Studio”!! In fact the environment is only available on Windows (Windows Vista and Windows 7, both x86 and x64 platforms) with Visual Studio.

NVIDIA® Parallel Nsight™ software supports GeForce 9, 200 and 400 Series graphics processors, as well as select Quadro and Tesla GPUs. For the complete list of supported GPUs, see here. That means you must have at least one of them to use Nsight.

Nsight supports several hardware configurations, namely:

  • Single GPU
  • Dual GPU
  • Two Systems with Single GPU in each
  • Dual GPU System, SLI MultiOS

To learn more about these configurations, see Hardware Configurations.

I guess that’s enough of an introduction to Nsight. If you are interested in massively parallel programming, or are already doing it, try this tool as soon as possible; grab a copy from the Nsight Developer site.

Coming back to the actual motive of this article: “Debugging on a Single GPU System”.

Over the past few months I have seen a lot of people asking, “How can I debug my CUDA program on a single GPU system? I can’t debug my program on a single GPU system; it’s worthless downloading Nsight for single GPU owners.” To be honest, Nsight was originally targeted at enterprise development and scientific R&D. Over the years GPUs have become powerful and affordable, and NVIDIA CUDA made that power accessible. So GPGPU is no longer only an enterprise or scientific R&D field.

Many enthusiasts have shown their interest and have started integrating CUDA into their applications. CUDA has also found its place in student projects, and most of those use single GPU systems.

Back to the topic:

Debugging CUDA applications on a single GPU machine has always been a priority among enthusiasts. This type of debugging is also known as “Local Debugging”.

It is called local debugging because the host and the target are the same machine. According to NVIDIA® Parallel Nsight™, you cannot perform C/C++ debugging unless the machine has at least two GPUs, each of which must be on the supported list. See the available hardware configurations that NVIDIA suggests. So technically speaking you cannot debug on a single GPU machine. This is because debugging may produce undesirable results that can hang or force a restart of the display driver, and with only a single GPU you might not be able to debug the application as you would like.

But don’t worry, this article will show you how to do it even if you have a single GPU system. There is a simple trick behind the procedure.

So before starting to explore this tip/trick, check your gear first (Going with the current versions):

I assume that you are familiar with GPU programming in CUDA and have some background in Nsight. If you have only used CUDA and have no idea about this environment, take a good look at the features Nsight has to offer before continuing.

  • Download and Install NVIDIA CUDA Toolkit 4.0 (You must be a registered developer at NVIDIA Developer Site)
  • Install Microsoft Visual Studio 2010
  • Download and Install NVIDIA Parallel Nsight 2.0 (You must be a registered developer at NVIDIA Developer Site)
  • NVIDIA Drivers (270.61 WHQL as of writing this article)
  • Motherboard Drivers and Display drivers.

To achieve debugging on a single GPU system,

  • You must switch off your system. Oh yes, and I am not joking.
  • Boot your system and enter the BIOS settings. Go to the advanced BIOS settings or search for an option that says “Display Init First” or something similar. The default should be PCI or PEG (PCI Express Graphics). Change it to “On-board”. Switch off again.
  • Now physically connect your display to the on-board display output. If you have multiple monitors, switch to the one that is connected to the on-board display. If you have a single display, you can also use an external display switch.
  • Boot the PC and log on to Windows. Install the motherboard drivers and display drivers if you haven’t already.
  • Now you can run Visual Studio and Nsight to debug your program.
  • That’s all, you are done.

At this point your primary display is the on-board one, and the PCIe GPU acts as the secondary GPU. When you launch Nsight, it checks for a dual display setup in the Windows registry. So, simply speaking, a system with a single NVIDIA GPU can be used for debugging.

There is one more thing I would like to clear up at this stage. NVIDIA suggests and advertises that Nsight is only compatible with the GeForce 9 Series and above. In reality Nsight can also be used with a few GeForce 8 Series cards. For this your card must support at least Compute Capability (CC) 1.1. For example, the GeForce 8400GS and 8500GT also support Nsight. In fact, all cards that support CC 1.1 and above are supported by Nsight.

Learn more about Nsight at Dr. Dobb’s
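The rule of thumb above (CC 1.1 or higher) amounts to a version comparison on the (major, minor) pair. A minimal sketch, assuming you already know your card’s compute capability; the function name `meets_minimum_cc` and the constant are hypothetical and not part of any NVIDIA tool:

```python
# Minimum compute capability the article reports as working with Nsight.
MIN_NSIGHT_CC = (1, 1)


def meets_minimum_cc(major: int, minor: int, minimum=MIN_NSIGHT_CC) -> bool:
    """True if compute capability major.minor is at least the given minimum.

    Python compares tuples element by element, so (2, 0) >= (1, 1) holds
    even though the minor number alone is smaller.
    """
    return (major, minor) >= minimum


if __name__ == "__main__":
    print(meets_minimum_cc(1, 1))  # a CC 1.1 card, e.g. the 8400GS per the article
    print(meets_minimum_cc(1, 0))  # a CC 1.0 card falls below the threshold
    print(meets_minimum_cc(2, 0))  # a CC 2.0 card passes
```

Comparing the pair as a tuple avoids the common mistake of treating the version as a decimal number.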

ENZO 2011

PathScale® ENZO is a complete GPGPU and multi-core solution, which tightly couples the best programming models with highly optimizing code generation for NVIDIA Tesla. ENZO reflects our dedication to and investment in GPGPU with over a decade of combined engineering time invested so far. By leveraging the HMPP open standard directives, ENZO does optimizations that quickly transform any existing C, C++ or Fortran codebase into highly efficient parallel code for GPU or multi-core systems.

ENZO highlights

  • High performance C, C++, and Fortran EKOPath compilers
  • HMPP C, C++ and Fortran compilers
  • PathScale C++ template and class libraries for GPGPU
  • CUDA compatible compiler
  • PathDB debugger with GPGPU support
  • PathAS assembler with GPGPU support
  • PSCNV open source compute Tesla driver
  • True GPGPU network Zero copy
  • Productivity tools for GPGPU programming
  • Only GPGPU solution for Linux, Solaris and FreeBSD
