Hugging Face GGUF Selection Guide | Layer Bumping with llama.cpp

Coming from a Hugging Face model card?

Some Hugging Face GGUF repos link here using wording like “Layer bumping with llama.cpp”. This page explains (1) how those GGUF variants were produced, and (2) what the names mean so you can select an appropriate download for your local runtime and memory limits.

Layer Bumping with llama.cpp: what it means for GGUF variants

GGUF variants are typically produced by converting a base model (commonly bf16 or f16) into GGUF, then applying quantization formats to reduce file size and runtime memory. Many users then run the GGUFs via llama.cpp or compatible runtimes such as Jan, LM Studio, and Ollama.

On readyforquantum.com, these GGUF variants are also commonly used in evaluation and operational workflows (for example the Network Monitor Assistant and its TestLLM workflow tuning). The same GGUF naming and conversion details apply whether you are downloading from Hugging Face for local chat, benchmarking, or pipeline testing.

🔧 GGUF Conversion Notes (llama.cpp)

These GGUF files are generated with the llama.cpp conversion / quantization pipeline. The pipeline converts a source model (commonly bf16 or f16) into GGUF, then applies quantization formats (K-family or IQ-family) to reduce model size.
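As a sketch of those two stages (the model paths and quant names below are hypothetical; `convert_hf_to_gguf.py` and `llama-quantize` are the llama.cpp tools typically used, but verify flags against your build):

```python
# Sketch of the two-stage GGUF pipeline: HF checkpoint -> full-precision GGUF -> quantized GGUFs.
# Paths and quant format names are illustrative only.

def pipeline_commands(hf_dir: str, out_prefix: str, quants: list[str]) -> list[list[str]]:
    """Build the argv lists: one conversion run, then one quantize run per target format."""
    f16_path = f"{out_prefix}-f16.gguf"
    cmds = [
        # Stage 1: convert the source model (commonly bf16/f16 weights) into GGUF.
        ["python", "convert_hf_to_gguf.py", hf_dir, "--outfile", f16_path, "--outtype", "f16"],
    ]
    for q in quants:
        # Stage 2: requantize the full-precision GGUF into each target format.
        cmds.append(["./llama-quantize", f16_path, f"{out_prefix}-{q}.gguf", q])
    return cmds

for cmd in pipeline_commands("./my-model", "my-model", ["Q4_K_M", "Q6_K", "Q8_0"]):
    print(" ".join(cmd))
```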

Layer / tensor “bumping”
When creating some quants, the llama.cpp tooling may keep certain tensors at higher precision than the rest (often described as “bumping” key tensors/layers). In practice, this is typically done via llama.cpp options such as --tensor-type, where specific tensors are assigned a higher-precision type while the remaining weights are quantized more aggressively. This can improve quality at lower bit-rates while slightly increasing file size.
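As a hedged sketch of how such an invocation could be assembled (the tensor-name patterns are hypothetical, and the exact `--tensor-type` syntax varies by llama.cpp version, so check `llama-quantize --help`):

```python
# Illustrative only: build a llama-quantize invocation that "bumps" selected tensors
# to a higher-precision type while the rest use the base quant.

def bumped_quantize_cmd(src_gguf: str, dst_gguf: str, base_quant: str,
                        bumps: dict[str, str]) -> list[str]:
    cmd = ["./llama-quantize"]
    for pattern, qtype in bumps.items():
        # Each mapping keeps tensors matching `pattern` at the higher-precision `qtype`.
        cmd += ["--tensor-type", f"{pattern}={qtype}"]
    cmd += [src_gguf, dst_gguf, base_quant]
    return cmd

# Hypothetical tensor patterns: bump attention-value and FFN-down tensors.
print(" ".join(bumped_quantize_cmd(
    "model-f16.gguf", "model-Q4_K_M-bumped.gguf", "Q4_K_M",
    {"attn_v": "q8_0", "ffn_down": "q6_K"})))
```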

Quant family consistency
These releases are created with a single quantization family per GGUF: K-family quants (Q*_K_*) are not mixed with IQ-family quants (IQ*) inside the same file.

BF16 (Brain Float 16) – Full precision base

📌 Notes on BF16 files in these releases
✔ Included when the release provides a high-precision baseline or a starting point for requantization.
✔ Useful for comparing against quantized variants.

📌 Compatibility note
❗ Hardware without BF16 acceleration may run BF16 workloads more slowly (often via fallback paths).
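For intuition (a minimal standalone sketch, not llama.cpp code): BF16 is simply the top 16 bits of an IEEE-754 float32, so it keeps float32's exponent range but only about 8 bits of mantissa precision:

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping the high 16 bits (round-toward-zero)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Expand bfloat16 bits back to float32 by zero-padding the low mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

print(bf16_bits_to_f32(f32_to_bf16_bits(1.0)))      # powers of two survive exactly
print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159)))  # slightly truncated
```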

F16 (Float 16) – Alternative full precision base

📌 Notes on F16 files in these releases
✔ Included when FP16 is preferred as a full-precision baseline or conversion source.

📌 Compatibility note
❗ Devices lacking native FP16 acceleration may not see the expected performance benefits.

Hybrid Precision Models (e.g., bf16_q8_0, f16_q4_K) – Mixed tensor types

These formats reflect a conversion where some tensors remain higher precision while others are quantized. In practice this is implemented by llama.cpp’s conversion/quantization options (commonly via tensor-type selection / “bumping”).

📌 How hybrids are produced
✔ Converted using llama.cpp with mixed tensor types (often via --tensor-type mapping).
✔ “Bumped” tensors (kept higher precision) are usually those that are more sensitive to aggressive quantization.

📌 Practical implication
❗ Hybrid files may be larger than fully quantized files of the same nominal quant level, because some tensors remain higher precision.
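As a back-of-the-envelope sketch of that size effect (the tensor names, parameter counts, and effective bits-per-weight below are hypothetical, and real GGUF files add block metadata and headers):

```python
# Rough file-size estimate: sum of (parameter count * effective bits per weight) per tensor.
# Bit-widths are illustrative approximations, not exact GGUF format sizes.

BITS = {"q4_K": 4.5, "q8_0": 8.5, "f16": 16.0}  # assumed effective bits/weight

def estimated_bytes(tensors: dict[str, int], base: str, bumps: dict[str, str]) -> int:
    total_bits = 0.0
    for name, n_params in tensors.items():
        qtype = bumps.get(name, base)  # bumped tensors use their higher-precision type
        total_bits += n_params * BITS[qtype]
    return int(total_bits / 8)

tensors = {"attn_v": 50_000_000, "ffn_down": 200_000_000, "other": 750_000_000}
plain = estimated_bytes(tensors, "q4_K", {})
hybrid = estimated_bytes(tensors, "q4_K", {"attn_v": "q8_0"})
print(plain, hybrid, hybrid - plain)  # hybrid grows by the bumped tensor's extra bits
```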

Quantized Models (Q4_K, Q6_K, Q8, IQ*, etc.) – GGUF quantization formats

Quantization reduces model size and memory usage by representing weights in fewer bits, while using per-block scaling/metadata to preserve as much model behavior as possible.
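A minimal sketch of the per-block idea (modeled loosely on an 8-bit block format such as Q8_0; real llama.cpp kernels differ in block size, scale encoding, and bit packing):

```python
# Toy per-block 8-bit quantizer: each block of 32 weights shares one float scale.
BLOCK = 32

def quantize_block(ws: list[float]) -> tuple[float, list[int]]:
    """Map a block of floats to int8 codes plus one shared scale."""
    scale = max(abs(w) for w in ws) / 127 or 1.0
    return scale, [round(w / scale) for w in ws]

def dequantize_block(scale: float, qs: list[int]) -> list[float]:
    """Reconstruct approximate floats from codes and the shared scale."""
    return [q * scale for q in qs]

weights = [0.1 * i - 1.6 for i in range(BLOCK)]
scale, qs = quantize_block(weights)
restored = dequantize_block(scale, qs)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {err:.5f}")  # bounded by half the block scale
```

Lower-bit formats shrink the per-weight codes further, which is why the shared scales and other block metadata matter more for preserving behavior.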

🧩 Quant Family Consistency in these releases

When creating these GGUFs, layers/tensors are kept within the same quant family per file:

K-family models use Q*_K_* formats consistently across quantized tensors.
IQ-family models use IQ* formats consistently across quantized tensors.
✖ K-family and IQ-family formats are not mixed within the same GGUF.

This mirrors how llama.cpp’s kernels and tensor loaders are typically exercised during inference and keeps the build process and runtime behavior predictable.

Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These formats are produced to prioritize high memory efficiency. IQ-family formats are importance-aware and are designed to preserve quality per byte at very low bit-rates. The Q4 variants provide a common 4-bit baseline in the K-family (and legacy Q4_0).

Ultra Low-Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)

Ultra-low-bit (1- to 2-bit) formats are produced for extreme compression. They can fit very large models into severely constrained memory, but they represent an aggressive quality-for-size tradeoff.
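For a rough memory-fit check (a sketch with hypothetical parameter counts and approximate effective bits-per-weight; real runtime usage also includes the KV cache, context buffers, and runtime overhead):

```python
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in GiB: parameters * bits / 8, ignoring metadata."""
    return n_params * bits_per_weight / 8 / 2**30

# A hypothetical 70B-parameter model at several assumed effective bit-rates:
for name, bpw in [("Q8_0", 8.5), ("Q4_K", 4.5), ("IQ2_XS", 2.3), ("IQ1_S", 1.6)]:
    print(f"{name}: ~{weights_gib(70e9, bpw):.1f} GiB")
```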


Summary Table: Model Format Selection

| Model Format | Precision | Memory Usage | Device Requirements | Typical Scenario |
|---|---|---|---|---|
| BF16 | Very High | High | BF16-supported GPU/CPU | Full-precision baseline; requant source |
| F16 | High | High | FP16-supported GPU/CPU | Full-precision baseline where FP16 is preferred |
| Q4_K | Medium-Low | Low | CPU or low-VRAM devices | K-family 4-bit GGUF quant |
| Q6_K | Medium | Moderate | CPU with more memory | K-family higher-bit GGUF quant |
| Q8_0 | High | Moderate | GPU/CPU with moderate VRAM | High-fidelity quantized variant |
| IQ3_XS | Low | Very Low | Ultra-low-memory devices | IQ-family 3-bit ultra-compressed GGUF |
| IQ3_S | Low | Very Low | Low-memory devices | IQ-family 3-bit compressed GGUF |
| IQ3_M | Low-Medium | Low | Low-memory devices | IQ-family 3-bit with improved reconstruction |
| Q4_0 | Low | Low | ARM-based/embedded devices | Legacy 4-bit format; compatibility baseline |
| Ultra Low-Bit (IQ1/2_*) | Very Low | Extremely Low | Tiny edge/embedded devices | Extreme compression variants |
| Hybrid (e.g., bf16_q8_0) | Medium-High | Medium | Mixed-precision-capable hardware | Mixed tensor types ("bumped" tensors) |

🔗 References and related tools

llama.cpp implementation
The GGUF quantization formats and tensor-type mapping behavior described above are implemented in llama.cpp: github.com/ggerganov/llama.cpp

Network Monitor Assistant (readyforquantum.com)
If you are evaluating or operating models via Network Monitor Assistant, the same GGUF naming and selection concepts apply to TestLLM workflows (memory fit, speed, and quality targets). Related pages: Dashboard (Assistant), FAQ, Agent Download.