llama.cpp

llama.cpp provides highly optimized CPU and GPU inference for quantized models in the GGUF format.

Quick Start

from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, LLMConfig

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="llamacpp",
        model="llama-3.1-8b",
        api_base="http://localhost:8080/v1"
    )
)

async with Sentimatrix(config) as sm:
    summary = await sm.summarize_reviews(reviews)

Setup

Build llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

Start Server

./llama-server \
    -m models/llama-3.1-8b-instruct-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080

Configuration

LLMConfig(
    provider="llamacpp",
    model="llama-3.1-8b",
    api_base="http://localhost:8080/v1",
    temperature=0.7,
    max_tokens=4096,
    timeout=120,  # CPU inference can be slower
)

Features

  • CPU Optimized: AVX, AVX2, AVX-512 support
  • Quantization: 2-8 bit quantization
  • Metal Support: Apple Silicon acceleration
  • CUDA/ROCm: GPU acceleration
  • Low Memory: Run a 70B model in ~32GB RAM with aggressive (Q2/Q3) quantization

Quantization Levels

Format    Size Reduction   Quality
Q8_0      50%              Excellent
Q6_K      60%              Very Good
Q5_K_M    65%              Good
Q4_K_M    75%              Good
Q3_K_M    80%              Acceptable
Q2_K      87%              Lower

Memory Requirements (Q4_K_M)

Model           Q4_K_M Size   RAM Required
Llama 3.2 3B    2GB           4GB
Llama 3.1 8B    5GB           8GB
Mistral 7B      4GB           8GB
Llama 3.1 70B   40GB          48GB
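As a rough rule of thumb, on-disk size is parameter count × bits per weight ÷ 8. The bits-per-weight figures below are approximate community estimates for each GGUF format, not official numbers; a minimal sketch:

```python
# Approximate bits per weight for common GGUF quantization formats
# (rough estimates; actual values vary slightly by model).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}

def estimate_model_size_gb(n_params: float, quant: str) -> float:
    """Estimate the on-disk size of a quantized model in GB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# An 8B model at Q4_K_M comes out near the ~5GB shown in the table above.
size_8b = estimate_model_size_gb(8e9, "Q4_K_M")
```

Leave a few extra gigabytes of headroom on top of the file size for the KV cache and runtime buffers, which is why the RAM column above exceeds the model size.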

Example: CPU-Only Deployment

# Perfect for servers without GPUs
config = SentimatrixConfig(
    llm=LLMConfig(
        provider="llamacpp",
        api_base="http://localhost:8080/v1"
    )
)

async with Sentimatrix(config) as sm:
    result = await sm.analyze("Great product!")

Server Options

# -c: context length; -ngl: layers offloaded to GPU (0 for CPU-only);
# --threads: CPU threads; --parallel: concurrent request slots.
# (Inline comments after a trailing backslash would break the command.)
./llama-server \
    -m model.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    -ngl 35 \
    --threads 8 \
    --parallel 4
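llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, which is what the `api_base` above points at. The sketch below builds the request payload by hand with only the standard library; the `model` field is informational, since llama-server answers with whichever GGUF it was started with:

```python
import json

def build_chat_request(prompt: str, temperature: float = 0.7,
                       max_tokens: int = 4096) -> bytes:
    # Payload shape for llama-server's OpenAI-compatible
    # /v1/chat/completions endpoint.
    payload = {
        "model": "llama-3.1-8b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("Summarize these reviews.")
# POST this to http://localhost:8080/v1/chat/completions with
# Content-Type: application/json (e.g. via urllib.request or curl).
```

This is the same request Sentimatrix issues under the hood, so it doubles as a quick way to verify the server is reachable independently of the library.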