# Multi-Modal Analysis
Sentimatrix supports analyzing multiple modalities - text, audio, images, and video - with fusion strategies for comprehensive sentiment understanding.
## Supported Modalities

- :material-text: **Text** — standard text sentiment and emotion analysis with 48 models.
- :material-microphone: **Audio** — speech-to-text transcription followed by sentiment analysis.
- :material-image: **Image** — image captioning and visual sentiment analysis.
- :material-video: **Video** — frame extraction plus audio analysis, with result fusion.
## Audio Analysis

Transcribe audio and analyze the sentiment of spoken content:

```python
from sentimatrix import Sentimatrix

async with Sentimatrix() as sm:
    result = await sm.analyze_audio("customer_call.wav")

    print(f"Transcription: {result.transcription}")
    print(f"Sentiment: {result.sentiment.label}")
    print(f"Emotions: {result.emotions.top_k(3)}")
```
### Supported Audio Formats

- WAV, MP3, FLAC, OGG, M4A
### Transcription Engines (4 Implemented)

| Engine | Provider | Speed | Quality |
|---|---|---|---|
| `whisper-base` | OpenAI (local) | Fast | Good |
| `whisper-medium` | OpenAI (local) | Medium | Better |
| `whisper-large-v3` | OpenAI (local) | Slow | Best |
| `groq-whisper` | Groq API | Fast | Excellent |
### Configuration

```python
from sentimatrix.config import SentimatrixConfig, LLMConfig

config = SentimatrixConfig(
    llm=LLMConfig(
        provider="groq",  # use Groq for fast transcription
    )
)

async with Sentimatrix(config) as sm:
    result = await sm.analyze_audio(
        "call.wav",
        engine="groq-whisper",
    )
```
## Image Analysis

Analyze images for visual sentiment and emotional content:

```python
async with Sentimatrix() as sm:
    result = await sm.analyze_image("product_photo.jpg")

    print(f"Caption: {result.caption}")
    print(f"Sentiment: {result.sentiment.label}")
    print(f"Visual mood: {result.mood}")
```
### Supported Image Formats

- PNG, JPEG, WEBP, GIF, BMP
### Vision Models (7 Implemented)

| Model | Provider | Features |
|---|---|---|
| `llava` | Ollama (local) | General vision |
| `blip-base` | Salesforce | Fast captioning |
| `blip-large` | Salesforce | Better captions |
| `blip2-opt-2.7b` | Salesforce | Best quality |
| `gpt-4-vision` | OpenAI API | Premium quality |
| `claude-vision` | Anthropic API | Safety-focused |
| `gemini-vision` | Google API | Multimodal |
### Configuration

```python
config = SentimatrixConfig(
    llm=LLMConfig(
        provider="openai",
        model="gpt-4-vision-preview",
    )
)

async with Sentimatrix(config) as sm:
    result = await sm.analyze_image(
        "product.jpg",
        prompt="What emotions does this product image convey?",
    )
```
## Video Analysis

Analyze videos by extracting frames and audio:

```python
async with Sentimatrix() as sm:
    result = await sm.analyze_video("review_video.mp4")

    print(f"Duration: {result.duration}s")
    print(f"Frames analyzed: {result.frame_count}")
    print(f"Overall sentiment: {result.sentiment.label}")
    print(f"Audio sentiment: {result.audio_sentiment.label}")
    print(f"Visual sentiment: {result.visual_sentiment.label}")
```
### Supported Video Formats

- MP4, AVI, MOV, WEBM, MKV
### Frame Extraction Methods

| Method | Description | Use Case |
|---|---|---|
| `uniform` | Extract every N frames | General analysis |
| `keyframe` | Extract key frames only | Efficient processing |
| `scene` | Detect scene changes | Narrative analysis |
| `custom` | User-defined intervals | Specific timestamps |
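Uniform extraction is simple enough to sketch in a few lines. The helper below (not part of Sentimatrix — a standalone illustration) computes which frame indices a `uniform` pass would sample, given the clip's duration and frame rate plus the `frame_interval` and `max_frames` settings used in the configuration example:

```python
def uniform_frame_indices(duration_s: float, fps: float,
                          interval_s: float, max_frames: int) -> list[int]:
    """Pick evenly spaced frame indices: one frame every `interval_s`
    seconds, capped at `max_frames` total."""
    total_frames = int(duration_s * fps)
    step = max(1, int(interval_s * fps))           # frames between samples
    return list(range(0, total_frames, step))[:max_frames]

# a 60 s clip at 30 fps, sampled every 5 s, capped at 20 frames
print(uniform_frame_indices(60, 30, 5, 20))
# → [0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1650]
```

The `max_frames` cap matters for long clips: a 10-minute video at the same settings would otherwise yield 120 frames.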
### Configuration

```python
async with Sentimatrix() as sm:
    result = await sm.analyze_video(
        "video.mp4",
        frame_method="uniform",
        frame_interval=5,  # every 5 seconds
        analyze_audio=True,
        max_frames=20,
    )
```
## Fusion Strategies

Combine results from multiple modalities:

### Late Fusion (Default)

Analyze each modality separately, then combine:

```python
result = await sm.analyze_video(
    "video.mp4",
    fusion_strategy="late",
)

# result combines audio + visual sentiments
print(f"Combined: {result.sentiment}")
print(f"Audio: {result.audio_sentiment}")
print(f"Visual: {result.visual_sentiment}")
```
### Weighted Fusion

Apply custom weights to each modality:

```python
result = await sm.analyze_video(
    "video.mp4",
    fusion_strategy="weighted",
    weights={"audio": 0.6, "visual": 0.4},
)
```
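Conceptually, weighted fusion is a weighted average of per-modality label scores (late fusion is the equal-weights case). A minimal sketch — the score dicts are hypothetical, not Sentimatrix's internal representation:

```python
def weighted_fusion(scores: dict[str, dict[str, float]],
                    weights: dict[str, float]) -> dict[str, float]:
    """Blend per-modality {label: score} maps into one map, weighting
    each modality; weights are normalized so they need not sum to 1."""
    total = sum(weights.values())
    fused: dict[str, float] = {}
    for modality, label_scores in scores.items():
        w = weights.get(modality, 0.0) / total
        for label, score in label_scores.items():
            fused[label] = fused.get(label, 0.0) + w * score
    return fused

fused = weighted_fusion(
    {"audio": {"positive": 0.9, "negative": 0.1},
     "visual": {"positive": 0.4, "negative": 0.6}},
    {"audio": 0.6, "visual": 0.4},
)
print(max(fused, key=fused.get))  # → positive (0.6*0.9 + 0.4*0.4 = 0.70)
```

Here the audio modality's strong positive signal outweighs the mildly negative visuals, matching the intuition behind giving audio the larger weight.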
### Dominant Fusion

Use the modality with the highest confidence:

```python
result = await sm.analyze_video(
    "video.mp4",
    fusion_strategy="dominant",
)

print(f"Dominant modality: {result.dominant_modality}")
```
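Dominant fusion reduces to an argmax: take the single modality whose top label is most confident and use its prediction outright. A sketch with the same hypothetical score dicts as above:

```python
def dominant_fusion(scores: dict[str, dict[str, float]]) -> tuple[str, str]:
    """Return (modality, label) for the single most confident prediction
    across all modalities."""
    modality = max(scores, key=lambda m: max(scores[m].values()))
    label = max(scores[modality], key=scores[modality].get)
    return modality, label

print(dominant_fusion(
    {"audio": {"positive": 0.55, "negative": 0.45},
     "visual": {"negative": 0.8, "positive": 0.2}},
))  # → ('visual', 'negative'): visual's top score (0.8) beats audio's (0.55)
```

Unlike weighted fusion, a confident but wrong modality wins outright here, which is why dominant fusion suits inputs where one modality is clearly more reliable.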
## Combined Analysis

Analyze multiple inputs together:

```python
async with Sentimatrix() as sm:
    result = await sm.analyze_multimodal(
        text="Customer review: Great product!",
        audio="voicemail.wav",
        image="product_photo.jpg",
    )

    print(f"Text sentiment: {result.text_sentiment}")
    print(f"Audio sentiment: {result.audio_sentiment}")
    print(f"Image sentiment: {result.image_sentiment}")
    print(f"Overall: {result.combined_sentiment}")
```
## Hardware Requirements
| Modality | CPU | GPU (Recommended) |
|---|---|---|
| Text | 4GB RAM | Optional |
| Audio (Whisper base) | 4GB RAM | 4GB VRAM |
| Audio (Whisper large) | 8GB RAM | 8GB VRAM |
| Image (BLIP) | 4GB RAM | 4GB VRAM |
| Image (LLaVA) | 16GB RAM | 12GB VRAM |
| Video | 8GB+ RAM | 8GB+ VRAM |
## Performance Tips

- **Use GPU acceleration**
- **Batch process frames**
    - Process multiple frames in batches
    - Reduces overhead
- **Choose appropriate models**
    - Use smaller models for speed
    - Larger models for quality
- **Limit frame count**
    - More frames = slower processing
    - 10-20 frames often sufficient
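The batching tip can be sketched with a generic chunking helper. Everything here is illustrative — `analyze` is a hypothetical per-frame callable standing in for whatever analysis function you use, not a Sentimatrix API:

```python
import asyncio

def chunked(items: list, size: int):
    """Yield successive fixed-size batches from a list of frames."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

async def analyze_frames(frames: list, analyze, batch_size: int = 8) -> list:
    """Run one batch of frame analyses concurrently at a time: bounds
    memory use while still amortizing per-call overhead within a batch."""
    results = []
    for batch in chunked(frames, batch_size):
        results.extend(await asyncio.gather(*(analyze(f) for f in batch)))
    return results

async def fake_analyze(frame):
    # stand-in analyzer for demonstration
    return f"sentiment({frame})"

print(asyncio.run(analyze_frames(["f1", "f2", "f3"], fake_analyze, batch_size=2)))
# → ['sentiment(f1)', 'sentiment(f2)', 'sentiment(f3)']
```

Tuning `batch_size` to what fits in GPU memory is usually the main lever: too small wastes the concurrency, too large risks out-of-memory errors.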
## Use Cases
| Use Case | Modalities | Recommended Setup |
|---|---|---|
| Call center analysis | Audio | Whisper + sentiment |
| Product reviews | Text + Image | BLIP + RoBERTa |
| Video testimonials | Audio + Video | Whisper + LLaVA |
| Social media | Text + Image | Fast models |
| Customer support | Audio | Groq Whisper |