Layer Bumping with llama.cpp: what it means for GGUF variants

Some Hugging Face GGUF repos link here using wording like “Layer bumping with llama.cpp”. This page explains (1) how those GGUF variants were produced, and (2) what the names mean so you can select an appropriate download for your local runtime and memory limits.
GGUF variants are typically produced by converting a base model (commonly bf16 or f16)
into GGUF, then applying quantization formats to reduce file size and runtime memory.
Many users then run the GGUFs via llama.cpp or compatible runtimes such as Jan,
LM Studio, and Ollama.
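As a rough sketch, that pipeline is usually two commands: a conversion script and the quantize tool from the llama.cpp repo. The file paths and model names below are placeholders, and the commands are only printed here, not executed:

```python
# Sketch of the usual two-step llama.cpp pipeline (paths are placeholders):
# 1) convert the source checkpoint to a full-precision GGUF,
# 2) quantize that GGUF to a smaller format.
convert_cmd = [
    "python", "convert_hf_to_gguf.py", "path/to/hf-model",
    "--outtype", "bf16",                     # keep full precision for step 2
    "--outfile", "model-bf16.gguf",
]
quantize_cmd = [
    "./llama-quantize", "model-bf16.gguf", "model-Q4_K_M.gguf", "Q4_K_M",
]

print(" ".join(convert_cmd))
print(" ".join(quantize_cmd))
```

Exact flags vary between llama.cpp versions, so check `--help` on the tools you have built.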
On readyforquantum.com, these GGUF variants are also commonly used in evaluation and operational workflows (for example the Network Monitor Assistant and its TestLLM workflow tuning). The same GGUF naming and conversion details apply whether you are downloading from Hugging Face for local chat, benchmarking, or pipeline testing.
These GGUF files are generated with the llama.cpp conversion / quantization pipeline. The pipeline converts a source model (commonly bf16 or f16) into GGUF, then applies quantization formats (K-family or IQ-family) to reduce model size.

Layer / tensor “bumping”
When creating some quants, the llama.cpp tooling may keep certain tensors at higher precision than the rest (often described as “bumping” key tensors/layers). In practice, this is typically done via llama.cpp options such as --tensor-type, where specific tensors are assigned a higher-precision type while the remaining weights are quantized more aggressively.
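The effect of such a mapping can be sketched in a few lines. The pattern syntax and defaults below are illustrative only, not llama.cpp’s actual matching rules:

```python
import fnmatch

# Hypothetical illustration of tensor-type "bumping": map glob-style tensor-name
# patterns to higher-precision types; everything else gets the default quant.
def pick_type(tensor_name, overrides, default="Q4_K"):
    for pattern, qtype in overrides.items():
        if fnmatch.fnmatch(tensor_name, pattern):
            return qtype   # bumped: kept at higher precision
    return default         # quantized more aggressively

overrides = {"*attn_v*": "Q8_0", "*output*": "Q6_K"}  # example patterns, not canonical
print(pick_type("blk.0.attn_v.weight", overrides))  # bumped to Q8_0
print(pick_type("blk.0.ffn_up.weight", overrides))  # default Q4_K
```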
This can improve quality at lower bit-rates while slightly increasing file size.

Quant family consistency
These releases are created with a single quantization family per GGUF: K-family quants (Q*_K_*) are not mixed with IQ-family quants (IQ*) inside the same file.
BF16 (Brain Float 16) – Full precision base
- A 16-bit floating-point format designed for faster computation while retaining good precision.
- Provides a dynamic range similar to FP32 but with lower memory usage.
- Often used as a source format for GGUF conversion before quantization.
- Ideal as a high-precision reference and for requantization into other formats.
✔ Included when a high-precision baseline or a requantization starting point is provided.
✔ Useful for comparing against quantized variants.
📌 Compatibility note
❗ Hardware without BF16 acceleration may run BF16 workloads more slowly (often via fallback paths).
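The BF16/FP32 relationship can be made concrete: BF16 keeps FP32’s 8 exponent bits and truncates the mantissa to 7 bits, so a BF16 value is, to a first approximation, the top 16 bits of the FP32 encoding. This sketch uses plain truncation; real converters typically round-to-nearest:

```python
import struct

# BF16 = FP32 with the low 16 mantissa bits dropped (truncation variant).
def fp32_to_bf16_bits(x):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16              # keep sign, exponent, top 7 mantissa bits

def bf16_bits_to_fp32(b):
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

print(bf16_bits_to_fp32(fp32_to_bf16_bits(1.0)))      # 1.0 (exactly representable)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159)))  # 3.140625 (mantissa truncated)
```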
F16 (Float 16) – Alternative full precision base
- A 16-bit floating-point format with high precision but a narrower dynamic range than BF16.
- Commonly used when FP16 acceleration is broadly supported by the target stack.
- Slightly lower numerical precision than BF16 but generally sufficient for inference and as a conversion base.
✔ Included when FP16 is preferred as a full-precision baseline or conversion source.
📌 Compatibility note
❗ Devices lacking native FP16 acceleration may not see the expected performance benefits.
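One way to see the range difference between the two 16-bit formats is with Python’s half-precision struct format as a stand-in for FP16: FP16’s 5 exponent bits cap its largest finite value at 65504, while BF16 shares FP32’s 8 exponent bits and reaches roughly 3.4e38:

```python
import struct

# FP16 ("e" format) tops out at 65504; larger magnitudes overflow.
print(struct.unpack("<e", struct.pack("<e", 65504.0))[0])  # largest finite FP16

try:
    struct.pack("<e", 70000.0)        # does not fit in FP16's range
except OverflowError as err:
    print("FP16 overflow:", err)

# The same magnitude is trivially representable in FP32 (and hence in BF16):
print(struct.unpack("<f", struct.pack("<f", 70000.0))[0])
```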
Hybrid Precision Models (e.g., bf16_q8_0, f16_q4_K) – Mixed tensor types
These formats reflect a conversion where some tensors remain higher precision while others are quantized. In practice this is implemented by llama.cpp’s conversion/quantization options (commonly via tensor-type selection / “bumping”).
- Named like bf16_q8_0, indicating a BF16-based conversion where Q8_0 quantization is applied to selected tensors.
- Provide a balance between memory efficiency and accuracy, typically improving quality versus fully-quantized models at similar size.
- The exact set of tensors retained at higher precision depends on the conversion configuration used.
✔ Converted using llama.cpp with mixed tensor types (often via --tensor-type mapping).
✔ “Bumped” tensors (kept at higher precision) are usually those most sensitive to aggressive quantization.
📌 Practical implication
❗ Hybrid files may be larger than fully quantized files of the same nominal quant level, because some tensors remain higher precision.
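The naming convention for hybrid variants can be split mechanically: the part before the first underscore is the base precision, the rest is the quant type applied to the remaining tensors. This parser follows the examples on this page; other repos may use different conventions:

```python
# Hypothetical parser for hybrid variant names such as "bf16_q8_0" or "f16_q4_K".
def parse_hybrid_name(name):
    base, _, quant = name.partition("_")   # split on the first underscore only
    return {"base": base, "quant": quant}

print(parse_hybrid_name("bf16_q8_0"))  # {'base': 'bf16', 'quant': 'q8_0'}
print(parse_hybrid_name("f16_q4_K"))   # {'base': 'f16', 'quant': 'q4_K'}
```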
Quantized Models (Q4_K, Q6_K, Q8, IQ*, etc.) – GGUF quantization formats
Quantization reduces model size and memory usage by representing weights in fewer bits, while using per-block scaling/metadata to preserve as much model behavior as possible.
- Lower-bit models (Q4_K) → smaller files and lower memory footprint; more aggressive compression.
- Higher-bit models (Q6_K, Q8_0) → larger files; generally closer to full-precision behavior.
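The idea of per-block scaling can be sketched in a few lines, in the spirit of Q8_0 (one scale per block of 32 weights plus 8-bit integers). This is an illustration of the principle, not llama.cpp’s exact kernel:

```python
# Simplified per-block quantization: each block of 32 weights stores one
# float scale plus signed 8-bit integers in [-127, 127].
def quantize_block(weights):                 # expects len(weights) == 32
    scale = max(abs(w) for w in weights) / 127 or 1.0   # avoid divide-by-zero
    q = [round(w / scale) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

block = [0.5, -1.0, 0.25, 0.75] * 8          # 32 toy weights
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
print(max(abs(a - b) for a, b in zip(block, restored)))  # small reconstruction error
```

The scale is the per-block metadata that lets a few bits per weight still track the original value range.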
When creating these GGUFs, layers/tensors are kept within the same quant family per file:
✔ K-family models use Q*_K_* formats consistently across quantized tensors.
✔ IQ-family models use IQ* formats consistently across quantized tensors.
✖ K-family and IQ-family formats are not mixed within the same GGUF.
This mirrors how llama.cpp’s kernels and tensor loaders are typically exercised during inference and keeps the build process and runtime behavior predictable.
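The consistency rule can be expressed as a small check. This is a sketch using the type names listed on this page, treating anything starting with IQ as IQ-family and other Q* types as K-family/legacy:

```python
# Check that a file's quantized tensor types stay within a single family
# (a convention of these releases, not a llama.cpp requirement).
def is_iq(qtype):
    return qtype.startswith("IQ")

def single_family(types):
    quantized = [t for t in types if t.startswith(("Q", "IQ"))]
    return len({is_iq(t) for t in quantized}) <= 1

print(single_family(["Q4_K", "Q6_K", "Q8_0"]))   # True: all K-family/legacy quants
print(single_family(["Q4_K", "IQ3_XS"]))         # False: families mixed
```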
Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
These formats are produced to prioritize high memory efficiency. IQ-family formats are importance-aware and are designed to preserve quality per byte at very low bit-rates. The Q4 variants provide a common 4-bit baseline in the K-family (and legacy Q4_0).
- IQ3_XS: Ultra-low-bit (3-bit), very high compression.
- IQ3_S: Slightly less aggressive than XS.
- IQ3_M: Medium block size – typically higher reconstruction quality than IQ3_S.
- Q4_K: 4-bit K-family with block-wise optimization.
- Q4_0: Legacy pure 4-bit format often referenced for compatibility / older pipelines.
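A quick way to compare these options for memory fit is to estimate file size from parameter count and approximate bits-per-weight. The bpw figures below are ballpark values for the llama.cpp formats; actual sizes vary with architecture and any “bumped” tensors:

```python
# Rough file-size estimate: parameters x bits-per-weight / 8, in GiB.
# bpw values are approximate (block metadata included), not exact.
APPROX_BPW = {"Q4_0": 4.5, "Q4_K": 4.5, "Q6_K": 6.56, "Q8_0": 8.5, "IQ3_XS": 3.3}

def estimate_gib(n_params, qtype):
    return n_params * APPROX_BPW[qtype] / 8 / 2**30

for q in ("IQ3_XS", "Q4_K", "Q8_0"):
    print(f"7B {q}: ~{estimate_gib(7e9, q):.1f} GiB")
```

Remember that runtime memory also includes the KV cache and context buffers, so leave headroom beyond the file size.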
Ultra Low-Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)
Ultra-low-bit (1-2 bit) formats are produced for extreme compression. They can be useful for fitting large models into very constrained memory, but they represent an aggressive tradeoff.
- Use case: extremely constrained memory environments or experimentation with maximal compression.
- Trade-off: very low accuracy – validate on your target workload before relying on it.
Summary Table: Model Format Selection
| Model Format | Precision | Memory Usage | Device Requirements | Typical Scenario |
|---|---|---|---|---|
| BF16 | Very High | High | BF16-supported GPU/CPU | Full-precision baseline; requant source |
| F16 | High | High | FP16-supported GPU/CPU | Full-precision baseline where FP16 is preferred |
| Q4_K | Medium-Low | Low | CPU or Low-VRAM devices | K-family 4-bit GGUF quant |
| Q6_K | Medium | Moderate | CPU with more memory | K-family higher-bit GGUF quant |
| Q8_0 | High | Moderate | GPU/CPU with moderate VRAM | High-fidelity quantized variant |
| IQ3_XS | Low | Very Low | Ultra-low-memory devices | IQ-family 3-bit ultra-compressed GGUF |
| IQ3_S | Low | Very Low | Low-memory devices | IQ-family 3-bit compressed GGUF |
| IQ3_M | Low-Medium | Low | Low-memory devices | IQ-family 3-bit with improved reconstruction |
| Q4_0 | Low | Low | ARM-based/embedded devices | Legacy 4-bit format; compatibility baseline |
| Ultra Low-Bit (IQ1/2_*) | Very Low | Extremely Low | Tiny edge/embedded devices | Extreme compression variants |
| Hybrid (e.g., bf16_q8_0) | Medium–High | Medium | Mixed-precision capable hardware | Mixed tensor types (“bumped” tensors) |
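The table above can be turned into a simple selection rule: pick the highest-fidelity format whose estimated size fits your memory budget. This hypothetical helper reuses rough bits-per-weight figures and ignores runtime overhead (KV cache, context), so treat its answer as a starting point, not a guarantee:

```python
# Pick the highest-fidelity format that fits a memory budget (GiB).
# bpw values are approximate; dict is ordered low -> high fidelity.
APPROX_BPW = {"IQ3_XS": 3.3, "Q4_K": 4.5, "Q6_K": 6.56, "Q8_0": 8.5, "F16": 16.0}

def pick_format(n_params, budget_gib):
    best = None
    for qtype, bpw in APPROX_BPW.items():
        size = n_params * bpw / 8 / 2**30
        if size <= budget_gib:
            best = qtype          # keep upgrading while it still fits
    return best

print(pick_format(7e9, 8.0))   # fits up to Q8_0 on an 8 GiB budget
print(pick_format(7e9, 4.0))   # only the smaller quants fit
```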
llama.cpp implementation
The GGUF quantization formats and tensor-type mapping behavior described above are implemented in llama.cpp: github.com/ggerganov/llama.cpp
Network Monitor Assistant (readyforquantum.com)
If you are evaluating or operating models via Network Monitor Assistant, the same GGUF naming and selection concepts apply to TestLLM workflows (memory fit, speed, and quality targets). Related pages: Dashboard (Assistant), FAQ, Agent Download.