Setting Up Local LLM with RAG on Fedora (AMD GPU)

A guide to installing Ollama and LightRAG for running a local LLM with retrieval-augmented generation (RAG) on AMD hardware.

Hardware

  • AMD Radeon RX 7900M (or similar AMD GPU)
  • 128GB unified memory
  • Fedora Linux with RADV drivers (Mesa - built-in)

Performance

  • 14B model: ~23 tokens/s
  • 32B model: ~10 tokens/s
  • GPU utilization: 100% during inference
  • Embedding: nomic-embed-text (768 dimensions)

Installation Steps

1. Install Ollama

# Install Ollama natively on Fedora
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
systemctl status ollama
ollama --version

2. Configure Ollama for AMD GPU

# Optimize for AMD hardware
sudo systemctl edit ollama.service

Add these environment variables:

[Service]
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama
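
To confirm the drop-in took effect, inspect the merged unit; the variables above should appear in the output:

# Verify the override environment is active
systemctl show ollama.service --property=Environment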

3. Pull Models

# Download LLM (choose one or both)
ollama pull qwen2.5-coder:14b    # Recommended: faster
ollama pull qwen2.5-coder:32b    # Better quality, slower

# Download embedding model
ollama pull nomic-embed-text

# Verify
ollama list

4. Test Ollama

# Quick test
time ollama run qwen2.5-coder:14b "Write hello world in Python" --verbose

# Monitor GPU while running
radeontop
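
For a rough throughput comparison between the two models (assuming both are pulled), --verbose prints an "eval rate" line you can grep for; treat this as a quick sketch, not a proper benchmark:

# Compare generation speed of both models
for m in qwen2.5-coder:14b qwen2.5-coder:32b; do
  echo "=== $m ==="
  ollama run "$m" "Explain Python decorators in one paragraph" --verbose 2>&1 | grep "eval rate"
done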

5. Setup LightRAG

# Create directory
mkdir -p ~/lightrag
cd ~/lightrag

# Create docker-compose.yml
cat > docker-compose.yml << 'EOF'
services:
  lightrag:
    image: ghcr.io/hkuds/lightrag:latest
    container_name: lightrag
    network_mode: host
    volumes:
      - ./lightrag_data:/app/data:z
    environment:
      - LLM_BINDING=ollama
      - OLLAMA_HOST=http://localhost:11434
      - LLM_MODEL=qwen2.5-coder:14b

      - EMBEDDING_BINDING=ollama
      - EMBEDDING_HOST=http://localhost:11434
      - EMBEDDING_MODEL=nomic-embed-text
      - EMBEDDING_DIM=768

      - WORKING_DIR=/app/data
      - INPUT_DIR=/app/data/inputs
      - HOST=0.0.0.0
      - PORT=9621
    restart: unless-stopped
EOF

# Create data directory with proper permissions
mkdir -p lightrag_data
chmod -R 777 lightrag_data

# Start LightRAG
docker compose up -d

# Check logs
docker logs -f lightrag

Note: The :z suffix on the volume mount is critical on SELinux-enforcing systems (Fedora's default); it tells Docker to relabel the directory so the container is allowed to write to it.
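
After the first container start, you can check that Docker actually relabeled the directory:

# The directory should carry a container_file_t SELinux label
ls -dZ ./lightrag_data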

Testing the Setup

Test 1: Verify Services

# Check Ollama
curl http://localhost:11434/api/tags

# Check LightRAG
curl http://localhost:9621/health
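
If you script the setup, it can help to block until the health endpoint responds before ingesting documents; a minimal sketch:

# Wait for LightRAG to come up (uses the same /health endpoint)
until curl -sf http://localhost:9621/health > /dev/null; do
  echo "waiting for LightRAG..."
  sleep 2
done
echo "LightRAG is ready"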

Test 2: Add a Document

curl -X POST http://localhost:9621/documents/text \
  -H "Content-Type: application/json" \
  -d '{
    "text": "FastAPI is a modern Python web framework for building APIs with automatic documentation. It was created by Sebastián Ramírez and uses Python type hints for validation.",
    "description": "FastAPI overview"
  }'

Monitor GPU activity:

radeontop
# Should show 100% Graphics pipe during processing

Test 3: Query the Knowledge Base

curl -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Who created FastAPI?",
    "mode": "hybrid"
  }'
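
Besides hybrid, the LightRAG README documents naive, local, and global retrieval modes; a quick loop to compare answers across modes (same endpoint and .response field as elsewhere in this guide):

# Query the knowledge base in each retrieval mode
for mode in naive local global hybrid; do
  echo "=== $mode ==="
  curl -s -X POST http://localhost:9621/query \
    -H "Content-Type: application/json" \
    -d "{\"query\": \"Who created FastAPI?\", \"mode\": \"$mode\"}" | jq -r '.response'
done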

Access Points

  • LightRAG API Docs: http://localhost:9621/docs (recommended; the bundled WebUI has bugs)
  • OpenWebUI: Install separately for a chat interface (see the sketch after this list)
  • Direct API: Use curl or create custom scripts
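
If you want OpenWebUI, one way to run it against the local Ollama is sketched below; the image name and OLLAMA_BASE_URL variable follow the Open WebUI README, and with host networking the UI defaults to port 8080:

# Run Open WebUI against the local Ollama instance
docker run -d --name open-webui \
  --network host \
  -e OLLAMA_BASE_URL=http://localhost:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

# Then browse to http://localhost:8080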

Common Issues & Solutions

SELinux Permission Denied

Symptom: PermissionError: [Errno 13] Permission denied

Solution: Add the :z suffix to the Docker volume mount

volumes:
  - ./lightrag_data:/app/data:z

host.docker.internal Not Working

Symptom: LightRAG can't connect to Ollama

Solution: Use network_mode: host in docker-compose.yml. On Linux, host.docker.internal only resolves if you add it via extra_hosts (host-gateway); host networking sidesteps the problem entirely.

Slow Performance

Symptom: < 10 tokens/s

Solution: Use 14B model instead of 32B for 2-3x speed improvement
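
Also confirm the model is fully offloaded to the GPU; a CPU/GPU split in the PROCESSOR column means part of the model spilled into system RAM and inference will be much slower:

# The PROCESSOR column should read "100% GPU"
ollama ps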

Architecture

Fedora Host
├─ RADV drivers (Mesa - automatic)
├─ Ollama (native service, port 11434)
│  ├─ ROCm GPU acceleration
│  ├─ qwen2.5-coder:14b
│  └─ nomic-embed-text
└─ Docker
   └─ LightRAG (port 9621)
      └─ Connects to Ollama via host network
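
Because the container shares the host network stack, you can sanity-check the LightRAG-to-Ollama path from inside the container (this assumes curl is present in the LightRAG image; jq runs on the host):

# List the models Ollama exposes, as seen from inside the container
docker exec lightrag curl -s http://localhost:11434/api/tags | jq '.models[].name'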

Performance Tips

  1. Use 14B model for LightRAG (faster, good quality)
  2. Use 32B model for complex tasks via direct Ollama queries
  3. Monitor GPU with radeontop to verify acceleration
  4. Batch documents for efficiency (see the ingest sketch after this list)
  5. Check ollama ps to see loaded models and memory usage
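
For tip 4, a minimal batch-ingest sketch that reuses the /documents/text endpoint from Test 2; the ./docs directory and *.md glob are placeholders for your own files:

# Ingest every Markdown file in a directory
for f in ./docs/*.md; do
  echo "Ingesting $f"
  jq -n --rawfile text "$f" --arg desc "$(basename "$f")" \
    '{text: $text, description: $desc}' |
    curl -s -X POST http://localhost:9621/documents/text \
      -H "Content-Type: application/json" -d @-
  echo
done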

Quick Query Script

# Create helper script
cat > ~/rag-query << 'EOF'
#!/bin/bash
curl -s -X POST http://localhost:9621/query \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"$1\", \"mode\": \"hybrid\"}" | jq -r '.response'
EOF

chmod +x ~/rag-query

# Usage
~/rag-query "What is FastAPI?"

Resources

  • Ollama: https://ollama.com
  • LightRAG: https://github.com/HKUDS/LightRAG
  • Model Hub: https://ollama.com/library

Total Setup Time: ~15-20 minutes (excluding model downloads)

Disk Space Required:

  • Ollama: ~100MB
  • qwen2.5-coder:14b: ~9GB
  • nomic-embed-text: ~274MB
  • LightRAG: ~500MB