Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) – Use if BF16 acceleration is available
- A 16‑bit floating‑point format designed for faster computation while retaining good precision.
- Provides the same dynamic range as FP32 at half the memory per value.
- Recommended if your hardware supports BF16 acceleration (check your device's specs; a quick probe is sketched after this list).
- Ideal for high‑performance inference with reduced memory footprint compared to FP32.
📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.
📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.
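If you are unsure whether your GPU reports native BF16, a quick capability probe can save a download. A minimal sketch, assuming PyTorch happens to be installed (the probe is independent of whichever runtime actually serves the model):

```python
# Minimal probe for native BF16 support; assumes PyTorch is installed.
import torch

def bf16_available() -> bool:
    """True if the active CUDA device reports native BF16 support."""
    if torch.cuda.is_available():
        return torch.cuda.is_bf16_supported()
    # CPU-only: BF16 support varies by CPU generation, so stay conservative.
    return False

print("Prefer BF16" if bf16_available() else "Fall back to F16 or a quantized format")
```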
F16 (Float 16) – More widely supported than BF16
- A 16‑bit floating‑point format with high precision but a smaller dynamic range than BF16.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computations.
📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ Your device cannot fit the full 16‑bit model in memory (a quick size estimate is sketched after this list).
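A quick way to judge whether a 16‑bit file fits: roughly 2 bytes per parameter, plus runtime overhead for the KV cache and buffers. The parameter counts and the ~20% overhead factor below are illustrative assumptions:

```python
# Back-of-the-envelope 16-bit (F16/BF16) memory estimate.
# The ~20% overhead factor is an assumption covering KV cache and runtime buffers.
def fp16_footprint_gb(n_params: float, overhead: float = 1.2) -> float:
    """Approximate RAM/VRAM needed: 2 bytes per parameter plus overhead."""
    return n_params * 2 * overhead / 1e9

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{fp16_footprint_gb(params):.0f} GB at 16-bit")
# ~17 GB, ~31 GB, ~168 GB respectively; if that exceeds your RAM/VRAM,
# move down to a hybrid or quantized format.
```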
Hybrid Precision Models (e.g., bf16_q8_0, f16_q4_K) – Best of Both Worlds
These formats selectively quantize non‑essential layers while keeping key layers in full precision (e.g., attention and output layers).
- Named like bf16_q8_0, meaning the core layers stay in full‑precision BF16 while the remaining layers are quantized to Q8_0.
- Strike a balance between memory efficiency and accuracy: better quality than fully quantized models without the full memory cost of BF16/F16 (a rough size comparison is sketched below).
📌 Use Hybrid Models if:
✔ You need better accuracy than quant‑only models but can’t afford full BF16/F16 everywhere.
✔ Your device supports mixed‑precision inference.
✔ You want to optimize trade‑offs for production‑grade models on constrained hardware.
📌 Avoid Hybrid Models if:
❌ Your target device doesn’t support mixed or full‑precision acceleration.
❌ You are operating under ultra‑strict memory limits (use fully quantized formats instead).
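To see why hybrids land between the two extremes, compare rough file sizes. The 20% share of weights kept in BF16 and the ~8.5 bits per weight for Q8_0 are assumptions for illustration; real splits depend on the model and the quantization recipe:

```python
# Rough size comparison: pure BF16 vs. a bf16_q8_0-style hybrid vs. pure Q8_0.
# The 20% BF16 share and ~8.5 bits/weight for Q8_0 are illustrative assumptions.
def size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9            # assumed 7B-parameter model
bf16_share = 0.20  # assumed fraction of weights kept in BF16

hybrid = size_gb(n * bf16_share, 16) + size_gb(n * (1 - bf16_share), 8.5)
print(f"pure BF16 : ~{size_gb(n, 16):.1f} GB")   # ~14.0 GB
print(f"hybrid    : ~{hybrid:.1f} GB")           # ~8.8 GB
print(f"pure Q8_0 : ~{size_gb(n, 8.5):.1f} GB")  # ~7.4 GB
```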
Quantized Models (Q4_K, Q6_K, Q8_0, etc.) – For CPU & Low‑VRAM Inference
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- Lower‑bit models (Q4_K) → Best for minimal memory usage, may have lower precision.
- Higher‑bit models (Q6_K, Q8_0) → Better accuracy, but require more memory.
📌 Use Quantized Models if:
✔ You are running inference on a CPU and need an optimized model (a loading sketch follows this list).
✔ Your device has low VRAM and cannot load full‑precision models.
✔ You want to reduce memory footprint while keeping reasonable accuracy.
📌 Avoid Quantized Models if:
❌ You need maximum accuracy (full‑precision models are better for this).
❌ Your hardware has enough VRAM for higher‑precision formats (BF16/F16).
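As one concrete path, assuming you run GGUF files through llama-cpp-python (llama.cpp's CLI and similar runtimes expose equivalent options), loading a quantized model for CPU or low-VRAM inference looks roughly like this; the file name and tuning values are placeholders:

```python
# Sketch: running a quantized GGUF on CPU with llama-cpp-python.
# Model path, context size, and thread count are placeholder assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",  # hypothetical file name
    n_ctx=4096,      # context window; larger values need more RAM
    n_threads=8,     # match your physical CPU cores
    n_gpu_layers=0,  # 0 = pure CPU; raise it to offload layers if some VRAM exists
)

out = llm("Q: Why quantize a model? A:", max_tokens=64)
print(out["choices"][0]["text"])
```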
Very Low‑Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
Optimized for very high memory efficiency, ideal for low‑power devices or large‑scale deployments where memory is critical (ballpark sizes are sketched after the list below).
- IQ3_XS: Ultra‑low‑bit (3‑bit), very high memory efficiency – use when even Q4_K is too large.
- IQ3_S: Slightly less aggressive than XS – use on low‑memory devices.
- IQ3_M: Medium block size – better accuracy than IQ3_S.
- Q4_K: 4‑bit with block‑wise optimization – good for low‑memory devices where Q6_K is too large.
- Q4_0: Pure 4‑bit, optimized for ARM devices – ideal for low‑memory/ARM inference.
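Ballpark file sizes make these trade-offs concrete. The bits-per-weight figures below are rough estimates for llama.cpp quant types and vary by model and recipe; treat them purely as assumptions:

```python
# Ballpark on-disk sizes for an assumed 7B model at common quant levels.
# Bits-per-weight values are rough assumptions and differ per model/recipe.
APPROX_BPW = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K": 4.8,
    "Q4_0": 4.5,
    "IQ3_M": 3.7,
    "IQ3_XS": 3.3,
}

n_params = 7e9  # assumed 7B model
for fmt, bpw in APPROX_BPW.items():
    print(f"{fmt:>7}: ~{n_params * bpw / 8 / 1e9:.1f} GB")
```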
Ultra Low‑Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)
Ultra‑low‑bit (1‑2 bit) with extreme memory efficiency.
- Use case: fit the model into very constrained memory.
- Trade‑off: very low accuracy – may not function as expected; test fully before using (a quick smoke‑test sketch follows).
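Because 1-2 bit quants can degrade sharply, a quick side-by-side smoke test against a higher-bit reference is worth running before deployment. A minimal sketch, again assuming llama-cpp-python and placeholder file names:

```python
# Qualitative smoke test: compare an ultra-low-bit quant against a higher-bit
# reference on the same prompts. File names are hypothetical placeholders.
from llama_cpp import Llama

PROMPTS = [
    "Q: What is the capital of France? A:",
    "Summarize in one sentence: The quick brown fox jumps over the lazy dog.",
]

for path in ["model-Q6_K.gguf", "model-IQ2_XXS.gguf"]:
    llm = Llama(model_path=path, n_ctx=2048, verbose=False)
    print(f"--- {path} ---")
    for prompt in PROMPTS:
        out = llm(prompt, max_tokens=48, temperature=0.0)  # greedy-ish for comparison
        print(out["choices"][0]["text"].strip())
```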
Summary Table: Model Format Selection
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|---|---|---|---|---|
| BF16 | Very High | High | BF16‑supported GPU/CPU | High‑speed inference with reduced memory |
| F16 | High | High | FP16‑supported GPU/CPU | Inference when BF16 isn’t available |
| Q4_K | Medium‑Low | Low | CPU or low‑VRAM devices | Memory‑constrained inference |
| Q6_K | Medium | Moderate | CPU with more memory | Better accuracy with quantization |
| Q8_0 | High | Moderate | GPU/CPU with moderate VRAM | Highest accuracy among quantized models |
| IQ3_XS | Low | Very Low | Ultra‑low‑memory devices | Maximum memory efficiency, low accuracy |
| IQ3_S | Low | Very Low | Low‑memory devices | Slightly more usable than IQ3_XS |
| IQ3_M | Low‑Medium | Low | Low‑memory devices | Better accuracy than IQ3_S |
| Q4_0 | Low | Low | ARM‑based/embedded devices | ARM‑optimized inference |
| Ultra Low‑Bit (IQ1/2_*) | Very Low | Extremely Low | Tiny edge/embedded devices | Extreme memory savings at low accuracy |
| Hybrid (e.g., bf16_q8_0) | Medium–High | Medium | Mixed‑precision‑capable hardware | Balanced performance and memory with near‑full‑precision accuracy |
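The table's selection logic can be condensed into a small helper. The thresholds and flags below are simplifying assumptions, not hard rules; adjust them for your hardware:

```python
# Illustrative helper encoding the table's selection logic.
# Thresholds (fractions of the 16-bit size) are simplifying assumptions.
def pick_format(mem_gb: float, has_bf16: bool, has_fp16: bool, size_16bit_gb: float) -> str:
    if mem_gb >= size_16bit_gb:
        if has_bf16:
            return "BF16"
        if has_fp16:
            return "F16"
    if mem_gb >= 0.60 * size_16bit_gb:
        return "Q8_0 or a hybrid (e.g., bf16_q8_0)"
    if mem_gb >= 0.45 * size_16bit_gb:
        return "Q6_K"
    if mem_gb >= 0.35 * size_16bit_gb:
        return "Q4_K (or Q4_0 on ARM)"
    if mem_gb >= 0.28 * size_16bit_gb:
        return "IQ3_M / IQ3_S / IQ3_XS"
    return "IQ1_*/IQ2_* (test accuracy carefully)"

# Example: ~16 GB of RAM for a 7B model (~14 GB at 16-bit) -> "F16"
print(pick_format(16, has_bf16=False, has_fp16=True, size_16bit_gb=14))
```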