Just a heads up to those who were waiting to learn more about Pascal, NVIDIA’s latest architecture. Each major release is usually accompanied by a whitepaper that breaks down all the important aspects of the new hardware, and the Tesla P100 launch is no exception.
Full GP100 GPU
Figure 7 shows a full GP100 GPU with 60 SM units (different products can use different configurations of GP100). The Tesla P100 accelerator uses 56 SM units.
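If you want to check which GP100 configuration a given board actually exposes, the SM count is visible through the CUDA runtime API. A minimal sketch (assuming CUDA is installed and the P100 is device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the number of SMs on the installed GPU. On a Tesla P100 this
// should report 56 SMs, not the full 60 of the GP100 die.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    printf("%s: %d SMs (compute capability %d.%d)\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}
```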
Pascal Streaming Multiprocessor
GP100’s SM incorporates 64 single-precision (FP32) CUDA Cores. In contrast, the Maxwell and Kepler SMs had 128 and 192 FP32 CUDA Cores, respectively. The GP100 SM is partitioned into two processing blocks, each having 32 single-precision CUDA Cores, an instruction buffer, a warp scheduler, and two dispatch units. While a GP100 SM has half the total number of CUDA Cores of a Maxwell SM, it maintains the same register file size and supports similar occupancy of warps and thread blocks.
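The occupancy point is easy to probe from code: the CUDA runtime reports the register file and thread limits per SM, and can estimate how many blocks of a given kernel fit on one SM. A rough sketch, using a dummy kernel and an assumed block size of 256 threads:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel standing in for a real workload.
__global__ void dummyKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

    int blockSize = 256;        // assumed block size for this sketch
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of dummyKernel can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  dummyKernel, blockSize, 0);

    printf("Registers per SM : %d\n", prop.regsPerMultiprocessor);
    printf("Max threads/SM   : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Blocks of %d threads resident per SM: %d\n",
           blockSize, maxBlocksPerSM);
    return 0;
}
```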
FP64 Cores
Each SM in GP100 features 32 double precision (FP64) CUDA Cores, which is one-half the number of FP32 single precision CUDA Cores. A full GP100 GPU has 1920 FP64 CUDA Cores. This 2:1 ratio of single precision (SP) units to double precision (DP) units aligns better with GP100’s new datapath configuration, allowing the GPU to process DP workloads more efficiently.
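As a back-of-the-envelope check, 60 SMs × 32 FP64 cores gives the 1920 figure, and for the 56-SM Tesla P100 the peak double-precision rate follows from the core count and clock. A small sketch, assuming the roughly 1480 MHz boost clock NVIDIA quotes for the P100 and counting an FMA as 2 FLOPs:

```cuda
#include <cstdio>

// Back-of-the-envelope peak FP64 throughput for the Tesla P100.
// Assumptions (not from this post): 1480 MHz boost clock, 2 FLOPs per FMA.
int main() {
    const int    smCount     = 56;     // SMs enabled on Tesla P100
    const int    fp64PerSM   = 32;     // FP64 cores per SM
    const int    flopsPerFma = 2;      // one FMA = multiply + add
    const double boostClkGHz = 1.480;  // assumed boost clock in GHz

    double peakTflops = smCount * fp64PerSM * flopsPerFma * boostClkGHz / 1000.0;
    printf("Peak FP64: %.1f TFLOPS\n", peakTflops);  // ~5.3 TFLOPS
    return 0;
}
```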
L1/L2 cache
GP100 features a unified 4096 KB L2 cache that provides efficient, high-speed data sharing across the GPU. In comparison, GK110’s L2 cache was 1536 KB, while GM200 shipped with 3072 KB of L2 cache. With more cache located on-chip, fewer requests to the GPU’s DRAM are needed, which reduces overall board power, reduces memory bandwidth demand, and improves performance.
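The L2 size can also be read back at runtime; the cudaDeviceProp structure exposes it in bytes. A minimal sketch, again assuming device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the L2 cache size reported by the driver.
// On a Tesla P100 this should print 4096 KB.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    printf("%s L2 cache: %d KB\n", prop.name, prop.l2CacheSize / 1024);
    return 0;
}
```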