Ollama

Ollama allows you to run large language models locally on your machine. Perfect for privacy-sensitive applications and offline use.

Stable

Quick Facts

Property Value
Cost Free (local hardware)
Privacy Full (no data leaves your machine)
Models LLaMA, Mistral, Phi, Gemma, and 100+ more
Streaming Supported
Functions Supported
Vision Supported (LLaVA, etc.)
Embeddings Supported

Setup

Install Ollama

# macOS (Homebrew)
brew install ollama

# Linux (install script)
curl -fsSL https://ollama.com/install.sh | sh

Alternatively, download an installer from ollama.com/download.

Start the Server

ollama serve

The server runs on http://localhost:11434 by default.
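
To verify the server is reachable, Ollama exposes a small REST API; the /api/tags endpoint lists installed models. A minimal check using only the standard library (the helper name is illustrative, not part of Sentimatrix):

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"
TAGS_ENDPOINT = f"{OLLAMA_HOST}/api/tags"

def list_installed_models(endpoint: str = TAGS_ENDPOINT) -> list[str]:
    """Return the names of models installed on a running Ollama server."""
    with urllib.request.urlopen(endpoint) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]
```

Calling list_installed_models() raises URLError if the server is not running, which doubles as a connectivity check.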

Pull a Model

# LLaMA 3.2 (recommended for general use)
ollama pull llama3.2

# Smaller, faster model
ollama pull phi3

# Vision-capable model
ollama pull llava

Configure Sentimatrix

Set the server address if it differs from the default:

export OLLAMA_HOST="http://localhost:11434"

Python:

from sentimatrix.config import SentimatrixConfig, LLMConfig

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="ollama",
        base_url="http://localhost:11434",
        model="llama3.2"
    )
)

YAML:

llm:
  provider: ollama
  base_url: http://localhost:11434
  model: llama3.2

Available Models

Model Size RAM Required Best For
llama3.2 3B 4GB General use, fast
llama3.2:1b 1B 2GB Ultra-fast, basic tasks
llama3.1 8B 8GB Better quality
llama3.1:70b 70B 48GB+ Best quality
mistral 7B 8GB Code, reasoning
phi3 3.8B 4GB Fast, efficient
gemma2 9B 12GB Google's model
llava 7B 8GB Vision tasks

View all models: ollama.com/library
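
The table above is essentially a size-for-RAM trade-off. As a sketch (model names and RAM figures taken from the table; the helper itself is illustrative, not part of Sentimatrix), you could pick the largest listed model that fits in available memory:

```python
# (model, RAM required in GB), smallest to largest, from the table above
MODELS_BY_RAM = [
    ("llama3.2:1b", 2),
    ("llama3.2", 4),
    ("llama3.1", 8),
    ("llama3.1:70b", 48),
]

def pick_model(available_ram_gb: float) -> str:
    """Return the largest listed model that fits in the given RAM."""
    fitting = [name for name, ram in MODELS_BY_RAM if ram <= available_ram_gb]
    if not fitting:
        raise ValueError("Not enough RAM for any listed model")
    return fitting[-1]

print(pick_model(8))   # llama3.1
```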

Usage Examples

Basic Usage

import asyncio
from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, LLMConfig

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="ollama",
        model="llama3.2"
    )
)

async def main():
    async with Sentimatrix(config) as sm:
        summary = await sm.summarize_reviews(reviews)
        print(summary)

asyncio.run(main())

Vision Analysis (LLaVA)

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="ollama",
        model="llava"
    )
)

async with Sentimatrix(config) as sm:
    result = await sm.analyze_image(
        image_path="product.jpg",
        prompt="What emotions does this product image convey?"
    )

Custom Model Configuration

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="ollama",
        model="llama3.1",
        temperature=0.3,      # Lower for more focused output
        num_ctx=8192,         # Context window size
        num_gpu=1,            # GPU layers
    )
)

Embeddings

async with Sentimatrix(config) as sm:
    embeddings = await sm.get_embeddings([
        "Great product!",
        "Terrible experience.",
        "It's okay."
    ])
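
Once you have embeddings, a common next step is comparing them. A minimal cosine-similarity helper in pure Python (independent of Sentimatrix's own API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0, orthogonal vectors 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```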

Hardware Requirements

Minimum Requirements

Model Size RAM GPU VRAM CPU
1-3B 4GB Optional 4 cores
7-8B 8GB 6GB 8 cores
13B 16GB 10GB 8 cores
70B 48GB+ 40GB+ 16 cores

GPU Acceleration

Ollama automatically uses GPU when available:

# Check whether loaded models are running on GPU or CPU
ollama ps

# Force CPU only
OLLAMA_NUM_GPU=0 ollama serve

Configuration Options

LLMConfig(
    provider="ollama",
    base_url="http://localhost:11434",
    model="llama3.2",

    # Model settings
    temperature=0.7,
    num_ctx=4096,          # Context window
    num_predict=512,       # Max tokens to generate
    top_k=40,
    top_p=0.9,
    repeat_penalty=1.1,

    # Hardware
    num_gpu=-1,            # Auto-detect GPU layers
    num_thread=None,       # CPU threads (auto)

    # Reliability
    timeout=120,           # Longer timeout for large models
    max_retries=3,
)
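
The sampling settings above correspond to the options object in Ollama's own /api/generate request body. A sketch of that shape (the helper is a hypothetical illustration; Sentimatrix builds the request for you):

```python
def to_ollama_payload(model: str, prompt: str, **options) -> dict:
    """Build a request body in the shape Ollama's /api/generate expects."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # temperature, num_ctx, num_predict, top_k, top_p, repeat_penalty, ...
        "options": options,
    }

payload = to_ollama_payload(
    "llama3.2", "Summarize this review.",
    temperature=0.7, num_ctx=4096, num_predict=512,
)
print(payload["options"]["num_ctx"])  # 4096
```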

Model Management

List Installed Models

ollama list

# NAME              SIZE      MODIFIED
# llama3.2          2.0 GB    2 hours ago
# mistral           4.1 GB    1 day ago

Pull Models

# Latest version
ollama pull llama3.2

# Specific tag (size variant; :latest is the default)
ollama pull llama3.2:3b

# Quantized version (smaller, faster)
ollama pull llama3.2:q4_0

Remove Models

ollama rm mistral

Create Custom Models

# Create Modelfile
cat << 'EOF' > Modelfile
FROM llama3.2

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

SYSTEM You are a sentiment analysis expert. Always respond with structured analysis.
EOF

# Create model
ollama create sentiment-analyst -f Modelfile
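
The same Modelfile can be generated programmatically, e.g. when you template system prompts per task. A sketch (the helper is illustrative, not part of Ollama or Sentimatrix):

```python
def make_modelfile(base: str, system: str, **params) -> str:
    """Render a minimal Ollama Modelfile string."""
    lines = [f"FROM {base}", ""]
    for name, value in params.items():
        lines.append(f"PARAMETER {name} {value}")
    lines += ["", f"SYSTEM {system}"]
    return "\n".join(lines)

modelfile = make_modelfile(
    "llama3.2",
    "You are a sentiment analysis expert.",
    temperature=0.3, num_ctx=8192,
)
print(modelfile)
```

Write the result to a file named Modelfile, then run ollama create as shown above.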

Remote Access

Expose Ollama on Network

# Set environment variable
export OLLAMA_HOST="0.0.0.0:11434"
ollama serve

Connect Remotely

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="ollama",
        base_url="http://192.168.1.100:11434",
        model="llama3.2"
    )
)
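
When pointing at a remote host it is easy to typo the scheme or port. A small sanity check using the standard library (hypothetical helper, not part of Sentimatrix):

```python
from urllib.parse import urlparse

def check_base_url(url: str) -> str:
    """Validate an Ollama base_url and return it without a trailing slash."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"base_url needs an http(s) scheme: {url!r}")
    if not parsed.hostname:
        raise ValueError(f"base_url needs a host: {url!r}")
    return url.rstrip("/")

print(check_base_url("http://192.168.1.100:11434/"))  # http://192.168.1.100:11434
```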

Best Practices

  1. Choose the Right Model Size

    • 1-3B for fast responses, basic tasks
    • 7-8B for balanced quality/speed
    • 70B+ for best quality (requires powerful hardware)
  2. Use GPU Acceleration

    • Significantly faster than CPU
    • Check with nvidia-smi or ollama ps
  3. Adjust Context Window

    • Larger context = more memory
    • Match to your use case
  4. Use Quantized Models for Speed

    ollama pull llama3.2:q4_0  # Fastest
    ollama pull llama3.2:q8_0  # Better quality
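
Quantization trades precision for size. A rough back-of-envelope for on-disk size (parameters × bits per weight ÷ 8, ignoring overhead):

```python
def approx_model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough on-disk size estimate: parameters * bits / 8, in GB."""
    return params_billions * bits_per_weight / 8

print(approx_model_size_gb(3, 4))   # 1.5 -> a 3B model at 4-bit is about 1.5 GB
print(approx_model_size_gb(3, 8))   # 3.0 -> 8-bit is roughly double
```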
    

Troubleshooting

Connection refused

Ensure Ollama is running:

ollama serve

Model not found

Pull the model first:

ollama pull llama3.2

Out of memory

  • Use a smaller model
  • Use a quantized version (q4_0)
  • Reduce the context window

LLMConfig(model="llama3.2:q4_0", num_ctx=2048)

Slow responses

  • Use GPU acceleration
  • Use a smaller or quantized model
  • Reduce num_predict