Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) – Use if BF16 acceleration is available
- A 16‑bit floating‑point format designed for faster computation while retaining good precision.
- Provides the same dynamic range as FP32 at half the memory per value.
- Recommended if your hardware supports BF16 acceleration (check your device's specs; a quick probe is sketched after this list).
- Ideal for high‑performance inference with reduced memory footprint compared to FP32.
📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.
📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
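If you are unsure whether your GPU reports native BF16, a quick capability probe can save a download. A minimal sketch, assuming PyTorch happens to be installed (the probe is independent of whichever runtime actually serves the model):

```python
# Minimal probe for native BF16 support; assumes PyTorch is installed.
import torch

def bf16_available() -> bool:
    """True if the active CUDA device reports native BF16 support."""
    if torch.cuda.is_available():
        return torch.cuda.is_bf16_supported()
    # CPU-only: BF16 support varies by CPU generation, so stay conservative.
    return False

print("Prefer BF16" if bf16_available() else "Fall back to F16 or a quantized format")
```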
F16 (Float 16) – More widely supported than BF16
- A 16‑bit floating‑point format with high precision but a smaller dynamic range than BF16.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computations.
📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ Your device cannot fit the full 16‑bit model in memory (a quick size estimate is sketched after this list).
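A quick way to judge whether a 16‑bit file fits: roughly 2 bytes per parameter, plus runtime overhead for the KV cache and buffers. The parameter counts and the ~20% overhead factor below are illustrative assumptions:

```python
# Back-of-the-envelope 16-bit (F16/BF16) memory estimate.
# The ~20% overhead factor is an assumption covering KV cache and runtime buffers.
def fp16_footprint_gb(n_params: float, overhead: float = 1.2) -> float:
    """Approximate RAM/VRAM needed: 2 bytes per parameter plus overhead."""
    return n_params * 2 * overhead / 1e9

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{fp16_footprint_gb(params):.0f} GB at 16-bit")
# ~17 GB, ~31 GB, ~168 GB respectively; if that exceeds your RAM/VRAM,
# move down to a hybrid or quantized format.
```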
Hybrid Precision Models (e.g., bf16_q8_0, f16_q4_K) – Best of Both Worlds
These formats selectively quantize non‑essential layers while keeping key layers in full precision (e.g., attention and output layers).
- Named like bf16_q8_0, meaning the core layers stay in full‑precision BF16 while the remaining layers are quantized to Q8_0.
- Strike a balance between memory efficiency and accuracy: better quality than fully quantized models without the full memory cost of BF16/F16 (a rough size comparison is sketched below).
📌 Use Hybrid Models if:
✔ You need better accuracy than quant‑only models but can’t afford full BF16/F16 everywhere.
✔ Your device supports mixed‑precision inference.
✔ You want to optimize trade‑offs for production‑grade models on constrained hardware.
📌 Avoid Hybrid Models if:
❌ Your target device doesn’t support mixed or full‑precision acceleration.
❌ You are operating under ultra‑strict memory limits (use fully quantized formats instead).
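To see why hybrids land between the two extremes, compare rough file sizes. The 20% share of weights kept in BF16 and the ~8.5 bits per weight for Q8_0 are assumptions for illustration; real splits depend on the model and the quantization recipe:

```python
# Rough size comparison: pure BF16 vs. a bf16_q8_0-style hybrid vs. pure Q8_0.
# The 20% BF16 share and ~8.5 bits/weight for Q8_0 are illustrative assumptions.
def size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9            # assumed 7B-parameter model
bf16_share = 0.20  # assumed fraction of weights kept in BF16

hybrid = size_gb(n * bf16_share, 16) + size_gb(n * (1 - bf16_share), 8.5)
print(f"pure BF16 : ~{size_gb(n, 16):.1f} GB")   # ~14.0 GB
print(f"hybrid    : ~{hybrid:.1f} GB")           # ~8.8 GB
print(f"pure Q8_0 : ~{size_gb(n, 8.5):.1f} GB")  # ~7.4 GB
```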
Quantized Models (Q4_K, Q6_K, Q8_0, etc.) – For CPU & Low‑VRAM Inference
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- Lower‑bit models (Q4_K) → Best for minimal memory usage, may have lower precision.
- Higher‑bit models (Q6_K, Q8_0) → Better accuracy, but require more memory.
📌 Use Quantized Models if:
✔ You are running inference on a CPU and need an optimized model (a loading sketch follows this list).
✔ Your device has low VRAM and cannot load full‑precision models.
✔ You want to reduce memory footprint while keeping reasonable accuracy.
📌 Avoid Quantized Models if:
❌ You need maximum accuracy (full‑precision models are better for this).
❌ Your hardware has enough VRAM for higher‑precision formats (BF16/F16).
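As one concrete path, assuming you run GGUF files through llama-cpp-python (llama.cpp's CLI and similar runtimes expose equivalent options), loading a quantized model for CPU or low-VRAM inference looks roughly like this; the file name and tuning values are placeholders:

```python
# Sketch: running a quantized GGUF on CPU with llama-cpp-python.
# Model path, context size, and thread count are placeholder assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",  # hypothetical file name
    n_ctx=4096,      # context window; larger values need more RAM
    n_threads=8,     # match your physical CPU cores
    n_gpu_layers=0,  # 0 = pure CPU; raise it to offload layers if some VRAM exists
)

out = llm("Q: Why quantize a model? A:", max_tokens=64)
print(out["choices"][0]["text"])
```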
Very Low‑Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
Optimized for very high memory efficiency, ideal for low‑power devices or large‑scale deployments where memory is critical (ballpark sizes are sketched after the list below).
- IQ3_XS: Ultra‑low‑bit (3‑bit), very high memory efficiency – use when even Q4_K is too large.
- IQ3_S: Slightly less aggressive than XS – use on low‑memory devices.
- IQ3_M: Medium block size – better accuracy than IQ3_S.
- Q4_K: 4‑bit with block‑wise optimization – good for low‑memory devices where Q6_K is too large.
- Q4_0: Pure 4‑bit, optimized for ARM devices – ideal for low‑memory/ARM inference.
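Ballpark file sizes make these trade-offs concrete. The bits-per-weight figures below are rough estimates for llama.cpp quant types and vary by model and recipe; treat them purely as assumptions:

```python
# Ballpark on-disk sizes for an assumed 7B model at common quant levels.
# Bits-per-weight values are rough assumptions and differ per model/recipe.
APPROX_BPW = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K": 4.8,
    "Q4_0": 4.5,
    "IQ3_M": 3.7,
    "IQ3_XS": 3.3,
}

n_params = 7e9  # assumed 7B model
for fmt, bpw in APPROX_BPW.items():
    print(f"{fmt:>7}: ~{n_params * bpw / 8 / 1e9:.1f} GB")
```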
Ultra Low‑Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)
Ultra‑low‑bit (1‑2 bit) with extreme memory efficiency.
- Use case: fit the model into very constrained memory.
- Trade‑off: very low accuracy – may not function as expected; test fully before using (a quick smoke‑test sketch follows).
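Because 1-2 bit quants can degrade sharply, a quick side-by-side smoke test against a higher-bit reference is worth running before deployment. A minimal sketch, again assuming llama-cpp-python and placeholder file names:

```python
# Qualitative smoke test: compare an ultra-low-bit quant against a higher-bit
# reference on the same prompts. File names are hypothetical placeholders.
from llama_cpp import Llama

PROMPTS = [
    "Q: What is the capital of France? A:",
    "Summarize in one sentence: The quick brown fox jumps over the lazy dog.",
]

for path in ["model-Q6_K.gguf", "model-IQ2_XXS.gguf"]:
    llm = Llama(model_path=path, n_ctx=2048, verbose=False)
    print(f"--- {path} ---")
    for prompt in PROMPTS:
        out = llm(prompt, max_tokens=48, temperature=0.0)  # greedy-ish for comparison
        print(out["choices"][0]["text"].strip())
```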
Summary Table: Model Format Selection
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|---|---|---|---|---|
| BF16 | Very High | High | BF16‑supported GPU/CPU | High‑speed inference with reduced memory |
| F16 | High | High | FP16‑supported GPU/CPU | Inference when BF16 isn’t available |
| Q4_K | Medium‑Low | Low | CPU or low‑VRAM devices | Memory‑constrained inference |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy with quantization |
| Q8_0 | High | Moderate | GPU/CPU with moderate VRAM | Highest accuracy among quantized models |
| IQ3_XS | Low | Very Low | Ultra‑low‑memory devices | Maximum memory efficiency, low accuracy |
| IQ3_S | Low | Very Low | Low‑memory devices | Slightly more usable than IQ3_XS |
| IQ3_M | Low‑Medium | Low | Low‑memory devices | Better accuracy than IQ3_S |
| Q4_0 | Low | Low | ARM‑based/embedded devices | ARM‑optimized inference |
| Ultra Low‑Bit (IQ1/2_*) | Very Low | Extremely Low | Tiny edge/embedded devices | Extreme memory savings at low accuracy |
| Hybrid (e.g., bf16_q8_0) | Medium–High | Medium | Mixed‑precision‑capable hardware | Balanced performance and memory with near‑full‑precision accuracy |
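The table's selection logic can be condensed into a small helper. The thresholds and flags below are simplifying assumptions, not hard rules; adjust them for your hardware:

```python
# Illustrative helper encoding the table's selection logic.
# Thresholds (fractions of the 16-bit size) are simplifying assumptions.
def pick_format(mem_gb: float, has_bf16: bool, has_fp16: bool, size_16bit_gb: float) -> str:
    if mem_gb >= size_16bit_gb:
        if has_bf16:
            return "BF16"
        if has_fp16:
            return "F16"
    if mem_gb >= 0.60 * size_16bit_gb:
        return "Q8_0 or a hybrid (e.g., bf16_q8_0)"
    if mem_gb >= 0.45 * size_16bit_gb:
        return "Q6_K"
    if mem_gb >= 0.35 * size_16bit_gb:
        return "Q4_K (or Q4_0 on ARM)"
    if mem_gb >= 0.28 * size_16bit_gb:
        return "IQ3_M / IQ3_S / IQ3_XS"
    return "IQ1_*/IQ2_* (test accuracy carefully)"

# Example: ~16 GB of RAM for a 7B model (~14 GB at 16-bit) -> "F16"
print(pick_format(16, has_bf16=False, has_fp16=True, size_16bit_gb=14))
```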