Setting Up Local LLM with RAG on Fedora (AMD GPU)
A guide to running a local LLM with retrieval-augmented generation (RAG) on AMD hardware, using Ollama and LightRAG.
Hardware
- AMD Radeon RX 7900M (or similar AMD GPU)
- 128GB unified memory
- Fedora Linux with RADV drivers (Mesa - built-in)
Performance
- 14B model: ~23 tokens/s
- 32B model: ~10 tokens/s
- GPU utilization: 100% during inference
- Embedding: nomic-embed-text (768 dimensions)
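Throughput depends on quantization, context length, and prompt, so treat these as ballpark figures. A minimal way to reproduce them (the prompt below is just an example) is to run each model with --verbose, which makes Ollama print timing stats:
# "eval rate" in the verbose output is generation tokens/s
for model in qwen2.5-coder:14b qwen2.5-coder:32b; do
  echo "== $model =="
  ollama run "$model" "Explain TCP slow start briefly." --verbose 2>&1 | grep -E 'eval rate|total duration'
done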
Installation Steps
1. Install Ollama
# Install Ollama natively on Fedora
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
systemctl status ollama
ollama --version
2. Configure Ollama for AMD GPU
# Optimize for AMD hardware
sudo systemctl edit ollama.service
Add these environment variables:
[Service]
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
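To confirm systemd picked up the drop-in, inspect the merged unit:
# Show the environment variables the service will run with
systemctl show ollama.service --property=Environment
# Or view the full merged unit file including the override
systemctl cat ollama.service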
3. Pull Models
# Download LLM (choose one or both)
ollama pull qwen2.5-coder:14b # Recommended: faster
ollama pull qwen2.5-coder:32b # Better quality, slower
# Download embedding model
ollama pull nomic-embed-text
# Verify
ollama list
4. Test Ollama
# Quick test
time ollama run qwen2.5-coder:14b "Write hello world in Python" --verbose
# Monitor GPU while running
radeontop
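The same test can be run over HTTP, which is the interface LightRAG will use. A quick sketch against Ollama's /api/generate endpoint:
# Non-streaming completion request; jq extracts the generated text
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "Write hello world in Python",
  "stream": false
}' | jq -r '.response'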
5. Setup LightRAG
# Create directory
mkdir -p ~/lightrag
cd ~/lightrag
# Create docker-compose.yml
cat > docker-compose.yml << 'EOF'
version: '3.8'
services:
  lightrag:
    image: ghcr.io/hkuds/lightrag:latest
    container_name: lightrag
    network_mode: host
    volumes:
      - ./lightrag_data:/app/data:z
    environment:
      - LLM_BINDING=ollama
      - OLLAMA_HOST=http://localhost:11434
      - LLM_MODEL=qwen2.5-coder:14b
      - EMBEDDING_BINDING=ollama
      - EMBEDDING_HOST=http://localhost:11434
      - EMBEDDING_MODEL=nomic-embed-text
      - EMBEDDING_DIM=768
      - WORKING_DIR=/app/data
      - INPUT_DIR=/app/data/inputs
      - HOST=0.0.0.0
      - PORT=9621
    restart: unless-stopped
EOF
# Create data directory with proper permissions
mkdir -p lightrag_data
chmod -R 777 lightrag_data
# Start LightRAG
docker compose up -d
# Check logs
docker logs -f lightrag
Note: The :z suffix on the volume mount is critical on Fedora, where SELinux is enforcing by default; it tells Docker to relabel ./lightrag_data so the container can write to it.
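To check that the relabel actually happened after the container starts:
# A container_file_t type in the output means the :z relabel worked
ls -Zd lightrag_data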
Testing the Setup
Test 1: Verify Services
# Check Ollama
curl http://localhost:11434/api/tags
# Check LightRAG
curl http://localhost:9621/health
Test 2: Add a Document
curl -X POST http://localhost:9621/documents/text \
-H "Content-Type: application/json" \
-d '{
"text": "FastAPI is a modern Python web framework for building APIs with automatic documentation. It was created by Sebastián Ramírez and uses Python type hints for validation.",
"description": "FastAPI overview"
}'
Monitor GPU activity:
radeontop
# Should show 100% Graphics pipe during processing
Test 3: Query the Knowledge Base
curl -X POST http://localhost:9621/query \
-H "Content-Type: application/json" \
-d '{
"query": "Who created FastAPI?",
"mode": "hybrid"
}'
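hybrid combines LightRAG's entity-level (local) and graph-level (global) retrieval. To compare the modes on the same question, a small loop (mode names as documented by LightRAG):
for mode in naive local global hybrid; do
  echo "== $mode =="
  curl -s -X POST http://localhost:9621/query \
    -H "Content-Type: application/json" \
    -d "{\"query\": \"Who created FastAPI?\", \"mode\": \"$mode\"}" | jq -r '.response'
done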
Access Points
- LightRAG API Docs: http://localhost:9621/docs (recommended; the bundled WebUI has known bugs)
- OpenWebUI: install separately for a chat interface (see the sketch below)
- Direct API: Use curl or create custom scripts
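For the chat interface, a minimal OpenWebUI container works alongside this setup; host networking mirrors the LightRAG config so it can reach Ollama on localhost (image and env var names per the OpenWebUI project; the UI listens on port 8080 by default):
# OpenWebUI pointed at the local Ollama instance
docker run -d --name open-webui --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://localhost:11434 \
  ghcr.io/open-webui/open-webui:main
# Then open http://localhost:8080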
Common Issues & Solutions
SELinux Permission Denied
Symptom: PermissionError: [Errno 13] Permission denied
Solution: Add the :z suffix to the Docker volume mount
volumes:
  - ./lightrag_data:/app/data:z
host.docker.internal Not Working
Symptom: LightRAG can't connect to Ollama
Solution: Use network_mode: host in docker-compose
Slow Performance
Symptom: < 10 tokens/s
Solution: Use the 14B model instead of the 32B for a 2-3x speed improvement
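A model that only partially fits in VRAM is another common cause; check the PROCESSOR column:
# "100% GPU" means fully offloaded; a CPU/GPU split explains low tokens/s
ollama ps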
Architecture
Fedora Host
├─ RADV drivers (Mesa - automatic)
├─ Ollama (native service, port 11434)
│  ├─ ROCm GPU acceleration
│  ├─ qwen2.5-coder:14b
│  └─ nomic-embed-text
└─ Docker
   └─ LightRAG (port 9621)
      └─ Connects to Ollama via host network
Performance Tips
- Use 14B model for LightRAG (faster, good quality)
- Use 32B model for complex tasks via direct Ollama queries
- Monitor GPU with radeontop to verify acceleration
- Batch documents for efficiency (see the sketch after this list)
- Check ollama ps to see loaded models and memory usage
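For batch ingestion, a sketch that loops over local text files (~/docs is a placeholder path) and posts each one through the same /documents/text endpoint used in Test 2:
# jq -Rs wraps raw file content in a JSON string, so quoting is safe
for f in ~/docs/*.txt; do
  jq -Rs --arg desc "$(basename "$f")" '{text: ., description: $desc}' < "$f" |
    curl -s -X POST http://localhost:9621/documents/text \
      -H "Content-Type: application/json" -d @-
  echo " <- $f"
done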
Quick Query Script
# Create helper script
cat > ~/rag-query << 'EOF'
#!/bin/bash
curl -s -X POST http://localhost:9621/query \
-H "Content-Type: application/json" \
-d "{\"query\": \"$1\", \"mode\": \"hybrid\"}" | jq -r '.response'
EOF
chmod +x ~/rag-query
# Usage
~/rag-query "What is FastAPI?"
Resources
- Ollama: https://ollama.com
- LightRAG: https://github.com/HKUDS/LightRAG
- Model Hub: https://ollama.com/library
Total Setup Time: ~15-20 minutes (excluding model downloads)
Disk Space Required:
- Ollama: ~100MB
- qwen2.5-coder:14b: ~9GB
- nomic-embed-text: ~274MB
- LightRAG: ~500MB