# vLLM

vLLM is a high-throughput, memory-efficient inference engine for LLMs built on PagedAttention.
## Quick Start

```python
from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, LLMConfig

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="vllm",
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        api_base="http://localhost:8000/v1",
    )
)

async with Sentimatrix(config) as sm:
    summary = await sm.summarize_reviews(reviews)
```
## Setup

### Install vLLM
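vLLM is distributed on PyPI; the standard install pulls in CUDA-enabled wheels by default:

```bash
pip install vllm
```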
### Start Server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```
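For models too large for one GPU, vLLM can shard the model across devices with its `--tensor-parallel-size` flag; a sketch of the same launch command on two GPUs (the GPU count here is illustrative):

```bash
# Shard the model across 2 GPUs via tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2
```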
## Configuration

```python
LLMConfig(
    provider="vllm",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    api_base="http://localhost:8000/v1",
    temperature=0.7,
    max_tokens=4096,
    timeout=60,
)
```
## Features
- PagedAttention: Efficient memory management
- Continuous Batching: High throughput
- Tensor Parallelism: Multi-GPU support
- OpenAI Compatible: Standard API format
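Because the server speaks the OpenAI API format, any OpenAI-style HTTP client can call it directly, independent of Sentimatrix. A minimal sketch of the request such a client would send to the endpoint configured above (the payload fields are the standard chat-completions ones; the actual POST requires a running server, so it is left commented out):

```python
import json
from urllib import request

API_BASE = "http://localhost:8000/v1"

# Standard OpenAI chat-completions payload; vLLM accepts the same schema.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarize: great battery life."}],
    "temperature": 0.7,
    "max_tokens": 256,
}

req = request.Request(
    f"{API_BASE}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # requires the vLLM server to be running
# print(json.load(resp)["choices"][0]["message"]["content"])
```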
## Performance

vLLM's published benchmarks, compared with standard HuggingFace Transformers serving:

| Feature | vLLM | Standard |
|---|---|---|
| Throughput | Up to 24x higher | Baseline |
| Memory utilization | ~90% | 50-60% |
| Batch size | Dynamic (continuous batching) | Fixed |
## Supported Models

- Llama 2/3/3.1/3.2
- Mistral/Mixtral
- Qwen/Qwen2
- Falcon
- MPT
- Phi-2/3
- And many more...
## Example: High-Throughput Processing

```python
# vLLM excels at batch processing
config = SentimatrixConfig(
    llm=LLMConfig(
        provider="vllm",
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        api_base="http://localhost:8000/v1",
    )
)

async with Sentimatrix(config) as sm:
    # Process large batches efficiently
    results = await sm.analyze_batch(
        reviews,  # 10,000+ reviews
        batch_size=256,
    )
```
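Client-side, `batch_size=256` means the reviews are submitted in chunks of up to 256; the server's continuous batching then keeps the GPU saturated across chunks. A minimal sketch of that chunking (an assumption about what `analyze_batch` does internally — the helper name is illustrative, not part of the Sentimatrix API):

```python
from typing import Iterator, List

def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Split a list of reviews into server-sized batches."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

reviews = [f"review {i}" for i in range(1000)]
batches = list(chunked(reviews, 256))
# 1000 reviews at batch_size=256 -> 4 batches (256, 256, 256, 232)
```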
## Docker Deployment

```bash
docker run --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct
```