TensorFloat-32
TensorFloat-32 (TF32) is a floating-point number format designed for the Tensor Cores of certain Nvidia GPUs. It was first implemented in the Ampere architecture.[1] TensorFloat-32 combines the 8-bit exponent of IEEE 754 single precision with the 10-bit significand (mantissa) of half precision.
Format
The binary format is:
- 1 sign bit
- 8 exponent bits
- 10 significand bits (also called mantissa, or precision bits)
The 19-bit format fits within a double word (32 bits). Although it has less precision than a standard 32-bit IEEE 754 floating-point number, it provides much faster computation: up to 8 times faster on an A100 than FP32 on a V100.[2]
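The field layout can be illustrated with a minimal Python sketch. It assumes that a TF32 value corresponds to the top 19 bits of the FP32 bit pattern, and it simply drops the low 13 fraction bits (truncation) to keep the example short. The helper names `fp32_bits` and `tf32_fields` are hypothetical and are not part of any Nvidia API.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def tf32_fields(x: float) -> tuple[int, int, int]:
    """Split x into (sign, biased exponent, 10-bit fraction).

    Assumption for illustration: the low 13 fraction bits of the FP32
    pattern are dropped (truncated), not rounded.
    """
    bits = fp32_bits(x)
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, bias 127 as in FP32
    fraction = (bits >> 13) & 0x3FF  # top 10 of the 23 FP32 fraction bits
    return sign, exponent, fraction

print(tf32_fields(1.0))
print(tf32_fields(1.0 + 2**-10))  # smallest step above 1.0 that the 10-bit fraction can hold
print(tf32_fields(1.0 + 2**-11))  # this increment falls in the dropped bits
```

In this sketch, `1.0 + 2**-11` yields the same fields as `1.0`, which shows the precision lost relative to FP32.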
Because TF32 values occupy the same 32 bits as FP32, TF32 is not a distinct storage format but a specification for reduced-precision FP32 multiply–accumulate operations. FP32 inputs are rounded to TF32, multiplied to produce a 21-bit product (with the implicit most significant bit included, this is an 11×11→22-bit multiply), and summed into a standard FP32 accumulator.[3]
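The round, multiply, and accumulate steps can be emulated in software. The following Python sketch is an approximation under stated assumptions, not a description of the hardware: it assumes round-to-nearest-even when reducing inputs to 10 fraction bits (the rounding mode used by actual Tensor Cores may differ), handles only finite inputs within FP32 range, and performs the addition in binary64 before rounding the sum to FP32. The function names are hypothetical.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (binary64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_to_tf32(x: float) -> float:
    """Round x to FP32, then to 10 fraction bits.

    Assumption: round-to-nearest-even on the 13 dropped bits.
    Finite inputs only; NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 13) & 1                         # lowest bit that is kept
    bits = (bits + 0x0FFF + lsb) & 0xFFFFE000      # round, then clear the low 13 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def tf32_dot(a, b):
    """Emulated TF32 dot product with an FP32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two 11-bit significands needs at most 22 bits,
        # so it is exact in binary64.
        product = round_to_tf32(x) * round_to_tf32(y)
        acc = to_fp32(acc + product)               # sum into an FP32 accumulator
    return acc

print(tf32_dot([1.0, 2.0], [3.0, 4.0]))
print(round_to_tf32(0.1))
```

Small integers survive the rounding step unchanged, whereas a value such as 0.1 is moved to the nearest number representable with 10 fraction bits before it is multiplied.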
References
- Kharya, Paresh (14 May 2020). "TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x". Retrieved 7 January 2026.
- "NVIDIA TF32". 8 February 2023. Retrieved 23 May 2024.
- Stosic, Dusan; Micikevicius, Paulius (27 January 2021). "Accelerating AI Training with NVIDIA TF32 Tensor Cores". Retrieved 10 December 2025.