Choosing the Right Model Format

Selecting the correct model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16) – Use if BF16 acceleration is available

📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.

📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
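
If you are unsure whether your GPU exposes native BF16, a quick probe like the one below can help. This is a minimal sketch assuming PyTorch is installed; it only checks the active CUDA device, and CPU-side BF16 acceleration (e.g. AVX512_BF16/AMX) would need a separate check.

```python
# Minimal sketch: probe for native BF16 support before choosing a BF16 file.
# Assumes PyTorch is installed; only the active CUDA device is checked here.
import torch

def has_native_bf16() -> bool:
    """Return True if the active CUDA device reports BF16 support."""
    return torch.cuda.is_available() and torch.cuda.is_bf16_supported()

if __name__ == "__main__":
    if has_native_bf16():
        print("BF16 looks safe: prefer the BF16 model.")
    else:
        print("No native BF16 detected: consider F16 or a quantized format.")
```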

F16 (Float 16) – More widely supported than BF16

📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computations.

📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ You have memory limitations.
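
As a rough FP16 suitability check on NVIDIA GPUs, you can inspect the device's compute capability. The 7.0 threshold below (Volta-class tensor cores) is an assumption on my part; older devices can still execute FP16, just often without the expected speedup.

```python
# Hedged sketch: rough FP16 suitability check via CUDA compute capability.
# The (7, 0) threshold approximates "has FP16 tensor cores"; adjust as needed.
import torch

def fp16_is_fast() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (7, 0)

if __name__ == "__main__":
    print("F16 model recommended" if fp16_is_fast()
          else "Consider BF16 (if supported) or a quantized format")
```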

Hybrid Precision Models (e.g., bf16_q8_0, f16_q4_K) – Best of Both Worlds

These formats selectively quantize non‑essential layers while keeping key layers in full precision (e.g., attention and output layers).

📌 Use Hybrid Models if:
✔ You need better accuracy than quant‑only models but can’t afford full BF16/F16 everywhere.
✔ Your device supports mixed‑precision inference.
✔ You want to optimize trade‑offs for production‑grade models on constrained hardware.

📌 Avoid Hybrid Models if:
❌ Your target device doesn’t support mixed or full‑precision acceleration.
❌ You are operating under ultra‑strict memory limits (use fully quantized formats instead).
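
If you build such a hybrid yourself with llama.cpp, the quantize tool can keep selected tensors at higher precision while quantizing the rest. The sketch below shells out to `llama-quantize`; the tool name and the `--output-tensor-type` / `--token-embedding-type` flags reflect recent llama.cpp builds (check `llama-quantize --help` for your version), and the file names are placeholders.

```python
# Hedged sketch: build a hybrid GGUF by keeping the output and token-embedding
# tensors in F16 while quantizing the remaining tensors to Q4_K_M.
# Tool name and flags assume a recent llama.cpp build; paths are placeholders.
import subprocess

cmd = [
    "llama-quantize",
    "--output-tensor-type", "f16",    # keep the output projection at F16
    "--token-embedding-type", "f16",  # keep token embeddings at F16
    "model-f16.gguf",                 # full-precision input (placeholder)
    "model-q4_k_m-hybrid.gguf",       # hybrid output (placeholder)
    "Q4_K_M",                         # quantization type for everything else
]
subprocess.run(cmd, check=True)
```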

Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low‑VRAM Inference

Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

📌 Use Quantized Models if:
✔ You are running inference on a CPU and need an optimized model.
✔ Your device has low VRAM and cannot load full‑precision models.
✔ You want to reduce memory footprint while keeping reasonable accuracy.

📌 Avoid Quantized Models if:
❌ You need maximum accuracy (full‑precision models are better for this).
❌ Your hardware has enough VRAM for higher‑precision formats (BF16/F16).
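
To judge whether a quantized file will fit your RAM/VRAM, a back-of-the-envelope estimate from bits per weight is usually enough. The figures below are approximations (actual file sizes vary with the tensor mix and metadata), and the helper is purely illustrative.

```python
# Rough size estimator: parameters × approximate bits-per-weight.
# The bpw values are approximations for common GGUF formats; leave headroom
# for the KV cache and runtime overhead when comparing against VRAM.
APPROX_BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
    "IQ3_XS": 3.3,
    "IQ2_XS": 2.4,
}

def approx_size_gib(n_params_billion: float, fmt: str) -> float:
    bits = n_params_billion * 1e9 * APPROX_BPW[fmt]
    return bits / 8 / 1024**3

if __name__ == "__main__":
    for fmt in APPROX_BPW:
        print(f"8B model @ {fmt}: ~{approx_size_gib(8, fmt):.1f} GiB")
```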

Very Low‑Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These formats are optimized for very high memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is critical.

Ultra Low‑Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)

These formats use 1–2 bit quantization for extreme memory efficiency. Expect a noticeable accuracy drop, so reserve them for devices where nothing larger fits.


Summary Table: Model Format Selection

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|---|---|---|---|---|
| BF16 | Very High | High | BF16‑supported GPU/CPU | High‑speed inference with reduced memory |
| F16 | High | High | FP16‑supported GPU/CPU | Inference when BF16 isn’t available |
| Q4_K | Medium‑Low | Low | CPU or low‑VRAM devices | Memory‑constrained inference |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy with quantization |
| Q8_0 | High | Moderate | GPU/CPU with moderate VRAM | Highest accuracy among quantized models |
| IQ3_XS | Low | Very Low | Ultra‑low‑memory devices | Maximum memory efficiency, low accuracy |
| IQ3_S | Low | Very Low | Low‑memory devices | Slightly more usable than IQ3_XS |
| IQ3_M | Low‑Medium | Low | Low‑memory devices | Better accuracy than IQ3_S |
| Q4_0 | Low | Low | ARM‑based/embedded devices | ARM‑optimized inference |
| Ultra Low‑Bit (IQ1/2_*) | Very Low | Extremely Low | Tiny edge/embedded devices | Extreme memory fit; low accuracy |
| Hybrid (e.g., bf16_q8_0) | Medium–High | Medium | Mixed‑precision capable hardware | Balanced performance and memory, near‑FP accuracy |
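
The table above can be folded into a simple selection heuristic. The sketch below is only illustrative: the thresholds and priority order are assumptions that mirror the table, not rules from any particular runtime.

```python
# Illustrative heuristic mirroring the summary table: pick a format from
# hardware capabilities and available memory. All thresholds are assumptions
# derived from approximate quantized-to-F16 size ratios.
def choose_format(has_bf16: bool, has_fp16: bool,
                  mem_gib: float, full_precision_size_gib: float) -> str:
    if mem_gib >= full_precision_size_gib:
        if has_bf16:
            return "BF16"
        if has_fp16:
            return "F16"
    if mem_gib >= full_precision_size_gib * 0.55:   # ~Q8_0 footprint
        return "Q8_0"
    if mem_gib >= full_precision_size_gib * 0.45:   # ~Q6_K footprint
        return "Q6_K"
    if mem_gib >= full_precision_size_gib * 0.30:   # ~Q4_K footprint
        return "Q4_K"
    return "IQ3_XS or lower (expect reduced accuracy)"

# Example: a 16 GiB F16 model on a machine with 12 GiB of usable memory.
print(choose_format(has_bf16=False, has_fp16=True,
                    mem_gib=12, full_precision_size_gib=16))
```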