NVIDIA just announced full specifications of Pascal GP100.
NVIDIA Pascal GP100 has 3840 CUDA cores
NVIDIA has unveiled the specifications of the so-called Big Pascal. The GPU architecture has been modified: with Pascal, each Streaming Multiprocessor (SM) now has 64 CUDA cores (Maxwell had 128). There are 60 SMs in the full GP100, for a total of 3840 CUDA cores. Each SM has 4 TMUs (Texture Mapping Units), which gives 240 TMUs.
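Each SM has a 2:1 ratio of FP32 to FP64 units, that is, 32 FP64 units alongside its 64 FP32 cores. This means that FP64 performance has been massively improved compared to Kepler and Maxwell: the Tesla P100 is rated at 5304 FP64 GFLOPs, against 1680 for the Tesla K40 and 213 for the Tesla M40.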
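The unit counts above are simple per-SM multiplication. Here is a minimal sketch, assuming the per-SM figures quoted in this article (64 FP32 cores, a 2:1 FP32-to-FP64 ratio, 4 TMUs) and the 56 enabled SMs listed for Tesla P100 in the specification table. The function name `gp100_units` is made up for illustration.

```python
def gp100_units(sm_count: int) -> dict:
    """Total unit counts for a GP100 configuration with `sm_count` SMs."""
    fp32_per_sm = 64                # CUDA cores per SM on GP100
    fp64_per_sm = fp32_per_sm // 2  # 2:1 ratio of FP32 to FP64 units
    tmu_per_sm = 4                  # texture units per SM
    return {
        "fp32": sm_count * fp32_per_sm,
        "fp64": sm_count * fp64_per_sm,
        "tmu": sm_count * tmu_per_sm,
    }


full_chip = gp100_units(60)   # full GP100
tesla_p100 = gp100_units(56)  # Tesla P100, per the spec table

print("Full GP100:", full_chip)
print("Tesla P100:", tesla_p100)
```

The Tesla P100 row reproduces the 3584 FP32 cores, 1792 FP64 cores and 224 texture units listed in the table.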
The GPU is made on a 16nm FinFET fabrication node. GP100 has up to 16 GB of HBM2 memory. The processor has eight 512-bit memory controllers with a total width of 4096 bits. Maximum bandwidth is reported at 720 GB/s. Unfortunately, the rather comprehensive blog post on NVIDIA's website does not explain everything.
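As a quick sanity check on the memory figures, the sketch below derives the total bus width and the per-pin data rate implied by the reported bandwidth. It assumes 1 GB = 10^9 bytes and that every bit of the interface transfers at the same rate; the per-pin figure is a back-of-the-envelope value, not a number quoted by NVIDIA in this article.

```python
controllers = 8
bits_per_controller = 512
bus_width_bits = controllers * bits_per_controller  # total interface width

bandwidth_gb_s = 720  # reported maximum bandwidth in GB/s

# Convert GB/s to Gbit/s, then spread it evenly across all data pins.
per_pin_gbit_s = bandwidth_gb_s * 8 / bus_width_bits

print(f"Bus width: {bus_width_bits} bits")
print(f"Implied per-pin rate: {per_pin_gbit_s} Gbit/s")
```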
It’s worth noting that Tesla P100 is not using the full chip: it has 56 of the 60 SMs enabled, which gives 3584 CUDA cores.
Key features of GP100:
- Extreme performance—powering HPC, deep learning, and many more GPU Computing areas;
- NVLink™—NVIDIA’s new high speed, high bandwidth interconnect for maximum application scalability;
- HBM2—Fastest, high capacity, extremely efficient stacked GPU memory architecture;
- Unified Memory and Compute Preemption—significantly improved programming model;
- 16nm FinFET—enables more features, higher performance, and improved power efficiency.
NVIDIA GP100 Specifications | |||
---|---|---|---|
Tesla Products | Tesla K40 | Tesla M40 | Tesla P100 |
GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) |
SMs | 15 | 24 | 56 |
TPCs | 15 | 24 | 28 |
FP32 CUDA Cores / SM | 192 | 128 | 64 |
FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 |
FP64 CUDA Cores / SM | 64 | 4 | 32 |
FP64 CUDA Cores / GPU | 960 | 96 | 1792 |
Base Clock | 745 MHz | 948 MHz | 1328 MHz |
GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz |
FP64 GFLOPs | 1680 | 213 | 5304 |
Texture Units | 240 | 192 | 224 |
Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 |
Memory Size | Up to 12 GB | Up to 24 GB | 16 GB |
L2 Cache Size | 1536 KB | 3072 KB | 4096 KB |
Register File Size / SM | 256 KB | 256 KB | 256 KB |
Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB |
TDP | 235 Watts | 250 Watts | 300 Watts |
Transistors | 7.1 billion | 8 billion | 15.3 billion |
GPU Die Size | 551 mm² | 601 mm² | 610 mm² |
Manufacturing Process | 28-nm | 28-nm | 16-nm |
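The FP64 GFLOPs and register file rows can be reproduced from the other rows of the table. The sketch below assumes each core retires one fused multiply-add (2 FLOPs) per clock at the boost clock, and that results are truncated to whole GFLOPs; for the K40 it uses the higher 875 MHz boost clock, since that is the value that matches the table. The FP32 figures it prints are derived the same way and do not appear in the table. The names `TESLA`, `peak_gflops` and `register_file_kb` are illustrative.

```python
# Values copied from the specification table above.
TESLA = {
    "K40":  {"sms": 15, "fp32": 2880, "fp64": 960,  "boost_mhz": 875},
    "M40":  {"sms": 24, "fp32": 3072, "fp64": 96,   "boost_mhz": 1114},
    "P100": {"sms": 56, "fp32": 3584, "fp64": 1792, "boost_mhz": 1480},
}


def peak_gflops(cores: int, boost_mhz: int) -> int:
    """Peak GFLOPs assuming 2 FLOPs (one FMA) per core per clock."""
    return cores * 2 * boost_mhz // 1000


def register_file_kb(sms: int, kb_per_sm: int = 256) -> int:
    """Total register file size, given 256 KB per SM on all three GPUs."""
    return sms * kb_per_sm


for name, gpu in TESLA.items():
    fp64 = peak_gflops(gpu["fp64"], gpu["boost_mhz"])
    fp32 = peak_gflops(gpu["fp32"], gpu["boost_mhz"])  # derived, not in the table
    regs = register_file_kb(gpu["sms"])
    print(f"{name}: FP64 {fp64} GFLOPs, FP32 {fp32} GFLOPs, registers {regs} KB")
```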
Compute Capability
The Compute Capability has been updated to 6.0.
Pascal Compute Capability | |||
---|---|---|---|
GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 |
Compute Capability | 3.5 | 5.2 | 6.0 |
Threads / Warp | 32 | 32 | 32 |
Max Warps / Multiprocessor | 64 | 64 | 64 |
Max Threads / Multiprocessor | 2048 | 2048 | 2048 |
Max Thread Blocks / Multiprocessor | 16 | 32 | 32 |
Max 32-bit Registers / SM | 65536 | 65536 | 65536 |
Max Registers / Block | 65536 | 32768 | 65536 |
Max Registers / Thread | 255 | 255 | 255 |
Max Thread Block Size | 1024 | 1024 | 1024 |
CUDA Cores / SM | 192 | 128 | 64 |
Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K |
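To see how the per-SM limits in this table interact, here is a rough sketch that estimates how many thread blocks can be resident on one GP100 SM. It is a simplification: it ignores register and shared memory allocation granularity, so a real occupancy tool may report different numbers. `SMLimits`, `resident_blocks` and `occupancy` are hypothetical names, not part of any NVIDIA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    max_threads: int       # Max Threads / Multiprocessor
    max_blocks: int        # Max Thread Blocks / Multiprocessor
    max_registers: int     # Max 32-bit Registers / SM
    shared_mem_bytes: int  # Shared Memory Size / SM


# Pascal GP100 column of the table above.
GP100 = SMLimits(max_threads=2048, max_blocks=32,
                 max_registers=65536, shared_mem_bytes=64 * 1024)


def resident_blocks(limits: SMLimits, threads_per_block: int,
                    regs_per_thread: int, shared_per_block: int = 0) -> int:
    """Blocks that fit on one SM; the tightest limit wins."""
    if not 1 <= threads_per_block <= 1024:
        raise ValueError("thread block size must be between 1 and 1024")
    if not 1 <= regs_per_thread <= 255:
        raise ValueError("registers per thread must be between 1 and 255")
    by_threads = limits.max_threads // threads_per_block
    by_regs = limits.max_registers // (regs_per_thread * threads_per_block)
    if shared_per_block > 0:
        by_shared = limits.shared_mem_bytes // shared_per_block
    else:
        by_shared = limits.max_blocks  # shared memory is not a constraint
    return min(limits.max_blocks, by_threads, by_regs, by_shared)


def occupancy(limits: SMLimits, threads_per_block: int,
              regs_per_thread: int, shared_per_block: int = 0) -> float:
    """Fraction of the SM's thread slots that are filled."""
    blocks = resident_blocks(limits, threads_per_block,
                             regs_per_thread, shared_per_block)
    return blocks * threads_per_block / limits.max_threads


print(resident_blocks(GP100, 256, 32, 8 * 1024), occupancy(GP100, 256, 32, 8 * 1024))
print(resident_blocks(GP100, 256, 64), occupancy(GP100, 256, 64))
```

In this model, doubling register use from 32 to 64 per thread halves the number of resident 256-thread blocks, because the 65536-register file becomes the binding limit.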
The Pascal GP100 Architecture: Faster in Every Way
With every new GPU architecture, NVIDIA introduces major improvements to performance and power efficiency. The heart of the computation in Tesla GPUs is the SM, or streaming multiprocessor. The streaming multiprocessor creates, manages, schedules and executes instructions from many threads in parallel.
Like previous Tesla GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), and memory controllers. GP100 achieves its colossal throughput by providing six GPCs, up to 60 SMs, and eight 512-bit memory controllers (4096 bits total). The Pascal architecture’s computational prowess is more than just brute force: it increases performance not only by adding more SMs than previous GPUs, but by making each SM more efficient. Each SM has 64 CUDA cores and four texture units, for a total of 3840 CUDA cores and 240 texture units.
Delivering higher performance and improving energy efficiency are two key goals for new GPU architectures. A number of changes to the SM in the Maxwell architecture improved its efficiency compared to Kepler. Pascal builds on this and incorporates additional improvements that increase performance per watt even further over Maxwell. While TSMC’s 16nm Fin-FET manufacturing process plays an important role, many GPU architectural modifications were also implemented to further reduce power consumption while maintaining high performance.
The following table provides a high-level comparison of Tesla P100 specifications with those of previous-generation Tesla GPU accelerators.