llama.cpp

llama.cpp provides highly optimized CPU and GPU inference for quantized models in the GGUF format.

Quick Start

from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, LLMConfig

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="llamacpp",
        model="llama-3.1-8b",
        api_base="http://localhost:8080/v1"
    )
)

async with Sentimatrix(config) as sm:
    summary = await sm.summarize_reviews(reviews)

Setup

Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

Start Server

./llama-server \
    -m models/llama-3.1-8b-instruct-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080

Configuration

LLMConfig(
    provider="llamacpp",
    model="llama-3.1-8b",
    api_base="http://localhost:8080/v1",
    temperature=0.7,
    max_tokens=4096,
    timeout=120,  # CPU inference can be slower
)

Features

  • CPU Optimized: AVX, AVX2, AVX-512 support
  • Quantization: 2-8 bit quantization
  • Metal Support: Apple Silicon acceleration
  • CUDA/ROCm: GPU acceleration
  • Low Memory: Run a 70B model in ~32GB RAM with aggressive (Q2/Q3) quantization

Quantization Levels

Format    Size Reduction   Quality
Q8_0      50%              Excellent
Q6_K      60%              Very Good
Q5_K_M    65%              Good
Q4_K_M    75%              Good
Q3_K_M    80%              Acceptable
Q2_K      87%              Lower

Memory Requirements (Q4_K_M)

Model           Q4_K_M Size   RAM Required
Llama 3.2 3B    2GB           4GB
Llama 3.1 8B    5GB           8GB
Mistral 7B      4GB           8GB
Llama 3.1 70B   40GB          48GB
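As a rough rule of thumb, on-disk size is parameter count × bits per weight ÷ 8. The bits-per-weight figures below are approximate community estimates for each GGUF format, not official numbers; a minimal sketch:

```python
# Approximate bits per weight for common GGUF quantization formats
# (rough estimates; actual values vary slightly by model).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}

def estimate_model_size_gb(n_params: float, quant: str) -> float:
    """Estimate the on-disk size of a quantized model in GB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# An 8B model at Q4_K_M comes out near the ~5GB shown in the table above.
size_8b = estimate_model_size_gb(8e9, "Q4_K_M")
```

Leave a few extra gigabytes of headroom on top of the file size for the KV cache and runtime buffers, which is why the RAM column above exceeds the model size.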

Example: CPU-Only Deployment

# Perfect for servers without GPUs
config = SentimatrixConfig(
    llm=LLMConfig(
        provider="llamacpp",
        api_base="http://localhost:8080/v1"
    )
)

async with Sentimatrix(config) as sm:
    result = await sm.analyze("Great product!")

Server Options

# -c: context length; -ngl: layers offloaded to GPU (0 for CPU-only);
# --threads: CPU threads; --parallel: concurrent request slots.
# (Inline comments after a trailing backslash would break the command.)
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    -ngl 35 \
    --threads 8 \
    --parallel 4
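llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, which is what the `api_base` above points at. The sketch below builds the request payload by hand with only the standard library; the `model` field is informational, since llama-server answers with whichever GGUF it was started with:

```python
import json

def build_chat_request(prompt: str, temperature: float = 0.7,
                       max_tokens: int = 4096) -> bytes:
    # Payload shape for llama-server's OpenAI-compatible
    # /v1/chat/completions endpoint.
    payload = {
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("Summarize these reviews.")
# POST this to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json (e.g. via urllib.request or curl).
```

This is the same request Sentimatrix issues under the hood, so it doubles as a quick way to verify the server is reachable independently of the library.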