General Purpose GPUs - 60x the Power of Today's CPUs?
Original Article Date: 2009-02-17
The nVidia Tesla S1070
- 960 cores and 4TFLOPS in a 1U enough for you?
A couple of years ago I had a hunch, and I passed on this hunch to a number of you.
Basically, I said
that a third manufacturer would arise to challenge Intel and AMD in the CPU market.
I had seen the phenomenal growth and maturity of nVidia,
the graphics chip manufacturer, in the half-decade previous. I'd noted that a new powerhouse within the PC was rising to rival the computing power
of the main CPU. That powerhouse was the Graphics Processing Unit (or "GPU").
Fuelled by the millions of dollars from insatiable gamers in need of faster frame-rates
while playing Doom, Quake, Half-Life and Far Cry (in that chronological order...), nVidia
developed more and more powerful chips to perform the complex series of floating-point
calculations necessary to create a virtual world of objects and scenery on the screen.
Their great rival, ATi, was also responding to gamers' needs by developing faster
and more powerful chips, but it was clear that they were no match for the technological
brains and management acumen of nVidia. As many of you know, ATi was acquired by
AMD in 2006, but that's another story. It's just worth mentioning here the irony
that nVidia's greatest rival was bought by a CPU manufacturer, in an article about
nVidia morphing into a manufacturer of high performance "CPUs".
The graphics company paralleled their success in the gaming market with display
cards aimed at the workstation sector, specifically the QuadroFX
series, that are optimised for line and layer based graphical rendering common to
3D design, engineering and scientific displays, as opposed to heavily textured gaming scenes.
From Processing Graphics to Processing... Anything
Several years ago, out of the GeForce and QuadroFX product lines, a feature was quietly spawned that enabled programmers to write code in the C programming language
that could run on the graphics card's GPU. And that code could be anything, not just
graphics routines. It didn't occur to anyone at the time, perhaps, that this feature
would lead to a revolution in the power of scientific and high-performance computing
that is still in its early stages, but which promises to potentially change the world.
For it seemed that graphics processors were particularly good at performing certain
types of massively-parallel calculations.
Much better, in fact, than a conventional
CPU, which had evolved along a very different path due to the more general
demands of managing an entire PC.
Perhaps this remarkable performance isn't surprising when one considers the architecture
of the typical GPU. Multi-core CPUs are now the standard, but have only become so
in the last couple of years, and are currently limited to four cores.
GPUs, on the other hand, perhaps due to the unique demands of large-scale 3D
graphics, evolved with multiple cores at a relatively early stage. Today,
nVidia have a GPU with no less than 240 cores! That's sixty
times that of a conventional quad-core CPU!
Couple this with the super-high memory bandwidth between a GPU and its RAM, and
you have something that can theoretically deliver an order of magnitude or more of performance
gain over a traditional CPU.
To compare, the nVidia GeForce GTX285 advertises 159GB/s
bandwidth between its 240 cores and 1GB of GDDR3 RAM. This compares with 15-20GB/s
achievable with the fastest conventional CPUs today. So as long as
you can write your
code to fit within the RAM of the graphics card, then you're looking at an order
of magnitude increase in data transfer.
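To see what that bandwidth means in practice, here's a small sketch of my own (not nVidia's - it assumes a CUDA-capable card and the CUDA toolkit installed) that times a memory-bound copy kernel on the device and works out an effective bandwidth from the elapsed time:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Memory-bound kernel: each thread copies one element, so run time
// is dominated by global-memory traffic rather than arithmetic.
__global__ void copyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                  // 16M floats = 64MB per array
    const size_t bytes = n * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemset(d_in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    copyKernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One read plus one write per element = 2 * bytes moved in total.
    double gbps = (2.0 * bytes / (ms / 1000.0)) / 1e9;
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

On a GTX285-class card this should report a healthy fraction of the quoted 159GB/s; push the same element-by-element copy through a CPU and main memory and you'd be capped at the 15-20GB/s mentioned above.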
So GPUs are no longer just for graphics, it seems.
To reflect the difference in purpose between GPUs as graphics processors and GPUs
as simple number crunchers for general computing, the term GPGPU
(General Purpose Graphics Processing Unit) has been coined. But could GPGPUs really replace the CPU we know and love?
60x the Performance of a CPU? Seriously?
Could one expect to get sixty times the performance in parallel computing applications?
In an industry where claims of performance gains of 20% or 50% are noticed, to suggest
something would be an order of magnitude (i.e. ten times, or 900% greater)
faster in performance would seem wild indeed. And yet this is exactly what nVidia
are claiming. Performance improvements of 30x, 40x, even 100x are not only claimed
by the manufacturer, but appear to be backed up by many end-users in the scientific community
who have switched over to running their simulations on GPGPUs.
Sounds unbelievable, doesn't it? And yet it appears that orders of magnitude of
performance gain are indeed achievable. And since we're talking about using commodity
graphics cards, the price tag is quite reasonable. So what's the catch?
Well, firstly, to really get the benefit of the multi-core architecture of a GPGPU,
your code has to be written in a massively multi-threaded (i.e. parallel) way. So
general computing, which relies on complex interactions and wait states between
different devices and modules is not especially suitable for this architecture. So don't say goodbye to your beloved CPU. But applications
which allow massive delegation of repetitive code tasks out to multiple execution threads could
benefit enormously. These include:
- Scientific and engineering modelling of real-world
phenomena, such as fluid dynamics, quantum mechanics, astrophysics, weather patterns, protein and molecular simulation
- Image and video processing and rendering - this is a special topic, which I'll cover
in its own section below
- Financial modelling and economic systems analysis and prediction.
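To make that "massive delegation" concrete, here's an illustrative sketch of my own (not from any shipping package) of a simple image-processing task - brightening a greyscale image - where every pixel gets its own thread:

```cuda
// Illustrative image-processing kernel: one thread per pixel.
// Each thread does the same small, repetitive job independently,
// which is exactly the pattern that maps well onto a GPGPU.
__global__ void brighten(unsigned char *pixels, int width, int height,
                         int amount)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;                       // threads past the image edge do nothing

    int idx = y * width + x;
    int v = pixels[idx] + amount;
    pixels[idx] = (v > 255) ? 255 : (unsigned char)v;   // clamp to 8 bits
}

// Launched with a 16x16 block of threads per tile of the image:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   brighten<<<grid, block>>>(d_pixels, width, height, 40);
```

A two-megapixel frame becomes two million tiny, identical jobs, and the GPU's hundreds of cores chew through them together - the same shape of problem as the scientific and financial examples above.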
The second "catch" is that you have to write your code especially
for the GPU.
A regular compiler would simply target the main system CPU and ignore all
that power sitting on your graphics card. To make things easier, nVidia have developed a special C compiler that compiles
code to run on the GPU, as part of a software framework called CUDA. Full details
of this framework can be found at nVidia's CUDA website.
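For the curious, a complete "hello world" of GPU computing fits on a single page. The sketch below is my own minimal example, not taken from nVidia's documentation: the __global__ keyword tells nVidia's compiler, nvcc, to generate GPU code for that function, which ordinary C code on the host then launches across hundreds of threads:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__ marks a function that runs on the GPU, launched from
// the host with the <<<blocks, threads>>> syntax.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];           // each thread handles one element
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Allocate on the card and copy the inputs across.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block, enough blocks to cover all n elements.
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %.1f\n", h_c[10]);    // expect 30.0 (10 + 20)

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Compile with nvcc (e.g. nvcc vecadd.cu -o vecadd) on a machine with a CUDA-capable card and the toolkit installed - which is exactly why code must be written especially for the GPU.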
So, with a small number of exceptions (see below), it's not something you can just plug in overnight and expect to suddenly deliver
60x the power of a traditional CPU in your software. You have to do some
work. But with performance gains like this, most scientists, engineers, analysts
and video professionals would think it would be worth at least investigating. And besides, scientists are used to writing
their own code!
Lastly, those concerned with working in double-precision floating
point math shouldn't get too excited yet - the quoted performance figures are for single-precision
numerics; double-precision performance is significantly lower (about one-tenth).
nVidia say that they are addressing this issue and it is expected that future GPGPUs
will perform much better in double-precision. But that said, even
the double-precision power of a GPGPU may well exceed that of a traditional CPU.
Video and Image Rendering
The application of GPGPUs to video and image rendering is somewhat different
to scientific computing in that it lends itself to commercial, off-the-shelf software
solutions that allow the user to leverage the power of the GPGPU.
Adobe Creative Suite 4 has a special plug-in for CUDA-enabled GPGPU rendering of video,
which is included with the nVidia Quadro CX graphics card. This card is essentially
a Quadro FX4800 with the plug-in software included. According to nVidia and Adobe,
HD video can be rendered with Premiere Pro in only 25% of the time
compared with a traditional dual-core CPU, suggesting that the Quadro CX has the
rendering power equivalent to about eight traditional CPU cores. That's impressive.
With an eight CPU core system (i.e. 2 x quad-core Opterons or Xeons), therefore, including a Quadro CX would enable
rendering of HD video in about half the time.
The Quadro CX provides additional benefits to other titles in Adobe's Creative Suite
- nVidia has a web page describing these features.
Adobe CS4 is just one off-the-shelf application that provides GPU acceleration right
now to video professionals. But expect more video and effects software vendors to
follow on from this lead in the coming months and years. This is because video processing
lends itself so well to massively parallel computations that GPGPUs excel in, and
a software vendor has it in their interests to offer packaged solutions for GPU
computing that make their products more attractive to demanding video professionals.
Hardware Solutions for GPU Computing
nVidia Tesla C1060
- 240-cores and 4GB RAM. Where's the DVI connector?
In addition to the Quadro CX mentioned above, nVidia offer two hardware solutions
specifically for General Purpose GPU Computing, under the Tesla product brand.
The first solution is for workstations. The Tesla C1060
is basically a Quadro FX5800 without the video output, or a GeForce GTX285 with
extra RAM. It
comes with 240-cores, 4GB of GDDR3 RAM, 102GB/s memory bandwidth
and a single precision computing power of approximately 1000GFLOPS
(one trillion floating-point operations per second, near enough). That is some
serious number crunching - when compared with a current conventional quad-core CPU delivering,
at most, 100GFLOPS. The C1060 mounts into a
standard PCI-Express x16
slot. Multiple C1060s can be incorporated into a workstation, the only limitation
being the number of available PCI-Express x16 slots on the motherboard, and space
(the card is double-width).
The main advantage to using a C1060 over the equivalent (and much cheaper) GeForce
GTX285 card is that the C1060 has four times the RAM available. This
allows more room for RAM-hungry, implicit, matrix-based calculations common to
simulations. Explicit code, however, which computes on-the-fly, requires less RAM,
and so you could do well on a budget with 3 or 4 GTX285s in your workstation (about
the cost of a single C1060), if you can fit your code into a smaller memory footprint.
For rack-based compute clusters, nVidia have the 1U Tesla S1070
box (see picture at article head), which is four C1060s mounted horizontally on two central PCI-Express buses.
These buses connect to your main server's PCI-Express slots using special cabling.
The cards are powered by a power supply within the 1U box. Having four times the
power of a C1060, the figures are impressive - 960-cores, 16GB GDDR3 RAM, 410GB/s
bandwidth, 4TFLOPS computing power. Pretty beefy, and all within just 1U.
Both Tesla solutions, plus the lower-cost GTX285s, are now available
on many of our workstation
and server lines. Check out the new GPU Computing section, for instance, on our
two biggest selling items - the CADIZ Dual Opteron workstation and VEGA Dual Xeon rack server. More importantly, perhaps, we have specially specced
out our MESSINA Workstation with a 4-way PCIE 2.0 slot board, enabling
you to operate up to FOUR GPGPUs in a single box!
As it stands currently, one has to either develop one's own code to run on the massively-parallel
GPGPUs in nVidia's graphics cores, or pick through many of the open-source libraries
appearing in the fledgling CUDA community. Over time, however, I expect that more
and more software vendors will package functionality in off-the-shelf products that
will unlock the power of GPGPUs. Adobe, for example, already has the CUDA-enabled
plug-in for Creative Suite 4, and the popular math suite MATLAB has a software development
kit (SDK) for CUDA.
GPU computing is one of those rare technological advances that does not
come with a hefty price tag. So whilst most GPU computing is still currently in
the hands of specialist programmers, rapid adoption of this new technology is likely,
as long as it is disseminated broadly. I don't think it'll be too long, therefore,
before most workstation and high-performance computing users will be referring to
the graphics chip as often as their main processor when it comes to "getting work done".
With the advent of computing power an order of magnitude over what was previously
available, the door opens to many more scientists, engineers, video professionals
and statisticians being able to do things with numbers that were previously impossible on their budgets.
More creative, realistic and gorgeous looking video effects could entertain us at
the movies or on TV. And as simulation plays a pivotal role in scientific advance
today, a quantum leap in computing power leads to more complex and more realistic
simulations of physical phenomena. These in turn could directly lead to new discoveries,
new understandings and new inventions that could change our lives and, quite possibly, the world.
Chief Systems Engineer