Ollama on Ubuntu: Configuring Private AI on Ubuntu Machines

Why Private AI with Ollama on Ubuntu?

Local AI processing is becoming increasingly important for organizations that prioritize data privacy, control, and cost efficiency. Ollama makes private AI deployment straightforward by simplifying the installation and management of large language models (LLMs) on Ubuntu servers.

Advantages of Ollama on Ubuntu:

  • 100% data privacy - no data is sent to external APIs
  • Cost savings - no API fees, even at high volume
  • Full control - models can be customized and tuned
  • Offline capability - no dependence on an internet connection
  • GPU acceleration - strong performance with NVIDIA graphics cards

Typical use cases:

  • Internal processing of documents containing sensitive data
  • Code generation for development teams
  • Chatbots for customer service
  • Content generation for marketing
  • Data analysis and preparation

System Requirements and Preparation

Hardware Requirements

# Hardware requirements for Ollama
minimum_requirements:
  cpu: '4 Cores (Intel i5/AMD Ryzen 5)'
  ram: '16 GB DDR4'
  storage: '50 GB SSD'
  gpu: 'Optional: NVIDIA GTX 1060 6GB'

recommended_requirements:
  cpu: '8+ Cores (Intel i7/AMD Ryzen 7)'
  ram: '32 GB DDR4'
  storage: '100 GB NVMe SSD'
  gpu: 'NVIDIA RTX 3080/4080 or Tesla T4'

production_requirements:
  cpu: '16+ Cores (Intel Xeon/AMD EPYC)'
  ram: '64+ GB DDR4'
  storage: '500 GB NVMe SSD'
  gpu: 'NVIDIA A100/H100 or Tesla V100'
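
Whether a given host actually meets these numbers can be checked with a short script before installing anything. The following is a minimal sketch (it assumes psutil is installed, as used later in this article; the GPU check is skipped when nvidia-smi is missing):

# check_requirements.py - minimal hardware check (assumes psutil is installed;
# the GPU check is skipped if nvidia-smi is not available)
import shutil
import subprocess
import psutil

def check_host(min_cores=4, min_ram_gb=16, min_disk_gb=50):
    cores = psutil.cpu_count(logical=False) or psutil.cpu_count()
    ram_gb = psutil.virtual_memory().total / 1024**3
    disk_gb = shutil.disk_usage("/").free / 1024**3

    print(f"CPU cores: {cores} (minimum {min_cores})")
    print(f"RAM: {ram_gb:.1f} GB (minimum {min_ram_gb})")
    print(f"Free disk space: {disk_gb:.1f} GB (minimum {min_disk_gb})")

    # Optional GPU check via nvidia-smi
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True)
        print("GPU:", result.stdout.strip() or "none detected")
    else:
        print("GPU: nvidia-smi not found (CPU-only setup)")

    return cores >= min_cores and ram_gb >= min_ram_gb and disk_gb >= min_disk_gb

if __name__ == "__main__":
    print("Meets minimum requirements:", check_host())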

Preparing the Ubuntu System

#!/bin/bash
# ubuntu_ollama_setup.sh

# Update the system
sudo apt update && sudo apt upgrade -y

# Install essential packages
sudo apt install -y \
    curl \
    wget \
    git \
    build-essential \
    software-properties-common \
    apt-transport-https \
    ca-certificates \
    gnupg \
    lsb-release

# Install Docker (optional, for container deployment)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io

# Add the current user to the docker group
sudo usermod -aG docker $USER

# Install the NVIDIA driver (only if a GPU is present and no driver is installed yet)
if command -v nvidia-smi &> /dev/null; then
    echo "NVIDIA driver already installed"
elif lspci | grep -qi nvidia; then
    echo "NVIDIA GPU detected - installing driver..."
    sudo apt install -y nvidia-driver-535
else
    echo "No NVIDIA GPU detected - skipping driver installation"
fi

# Install the CUDA toolkit (only needed for GPU acceleration)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit

# Reboot so the new driver is loaded
echo "Rebooting the system..."
sudo reboot

Ollama Installation and Configuration

Standard Installation

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify the installation
ollama --version

# Download a first model (Llama 2 7B)
ollama pull llama2:7b

# Test the model
ollama run llama2:7b "Hello, how are you?"
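
Besides the CLI, the service exposes a REST API on port 11434. A quick smoke test from Python, assuming the llama2:7b model pulled above is available:

# api_smoke_test.py - quick check that the local Ollama API answers
# (assumes the service is running on the default port 11434 and llama2:7b is pulled)
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2:7b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=300,
)
response.raise_for_status()
print(response.json()["response"])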

Docker-Based Installation

# Dockerfile for Ollama
FROM ubuntu:22.04

# System dependencies
RUN apt-get update && apt-get install -y \
    curl \
    wget \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN curl -fsSL https://ollama.ai/install.sh | sh

# Working directory
WORKDIR /app

# Expose the Ollama port
EXPOSE 11434

# Start Ollama
CMD ["ollama", "serve"]

# docker-compose.yml for Ollama
version: '3.8'
services:
  ollama:
    build: .
    ports:
      - '11434:11434'
    volumes:
      - ollama_data:/root/.ollama
      - ./models:/app/models
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama_data:
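
Note that GPU access from inside the container additionally requires the NVIDIA Container Toolkit on the host. After `docker compose up -d`, it is worth confirming that the GPU is actually visible in the container. A small sketch (run from the directory containing the compose file; the service name ollama matches the file above):

# check_container_gpu.py - verify GPU passthrough into the Ollama container
import subprocess

result = subprocess.run(
    ["docker", "compose", "exec", "-T", "ollama", "nvidia-smi"],
    capture_output=True, text=True,
)
if result.returncode == 0:
    print("GPU is visible inside the container:")
    print(result.stdout)
else:
    print("nvidia-smi failed inside the container - check the NVIDIA Container Toolkit setup")
    print(result.stderr)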

systemd Service Configuration

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama AI Service
After=network.target

[Service]
Type=simple
User=ollama
Group=ollama
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=10
Environment=OLLAMA_HOST=0.0.0.0
Environment=OLLAMA_ORIGINS=*
WorkingDirectory=/home/ollama

# GPU support
Environment=CUDA_VISIBLE_DEVICES=0
Environment=NVIDIA_VISIBLE_DEVICES=all

# Resource limits
LimitNOFILE=65536
LimitNPROC=32768

[Install]
WantedBy=multi-user.target

# Apply the service configuration
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl status ollama

Model Management and Optimization

Model Selection and Download

#!/bin/bash
# model_download.sh

# Show locally installed models
ollama list

# Download models of different sizes
echo "Downloading models..."

# Small models for fast inference
ollama pull llama2:7b
ollama pull codellama:7b
ollama pull mistral:7b

# Medium-sized models for better quality
ollama pull llama2:13b
ollama pull codellama:13b
ollama pull mixtral:8x7b

# Large models for the highest quality (GPU required)
ollama pull llama2:70b
ollama pull codellama:34b
ollama pull mixtral:8x22b

# Specialized models
ollama pull llama2:7b-chat
ollama pull codellama:7b-instruct
ollama pull mistral:7b-instruct

echo "Model download finished"

Model Configuration and Optimization

# ollama_config.yaml - parameter presets consumed by the custom tooling below
# (Ollama itself does not read this file)
models:
  llama2_7b:
    name: 'llama2:7b'
    parameters:
      temperature: 0.7
      top_p: 0.9
      top_k: 40
      repeat_penalty: 1.1
    context_length: 4096
    gpu_layers: 35

  codellama_7b:
    name: 'codellama:7b'
    parameters:
      temperature: 0.2
      top_p: 0.95
      top_k: 50
      repeat_penalty: 1.1
    context_length: 8192
    gpu_layers: 35

  mistral_7b:
    name: 'mistral:7b'
    parameters:
      temperature: 0.5
      top_p: 0.9
      top_k: 40
      repeat_penalty: 1.05
    context_length: 8192
    gpu_layers: 32

# model_optimizer.py
import subprocess
import json
import os
import time

class OllamaModelOptimizer:
    def __init__(self):
        self.models_dir = "/root/.ollama/models"

    def optimize_model_config(self, model_name, config):
        """Optimize the model configuration"""
        config_path = f"{self.models_dir}/{model_name}/config.json"
        os.makedirs(os.path.dirname(config_path), exist_ok=True)

        # Load the existing configuration
        if os.path.exists(config_path):
            with open(config_path, 'r') as f:
                current_config = json.load(f)
        else:
            current_config = {}

        # Apply the new configuration
        current_config.update(config)

        # Save the configuration
        with open(config_path, 'w') as f:
            json.dump(current_config, f, indent=2)

        print(f"Model {model_name} optimized")

    def benchmark_model(self, model_name, prompt, iterations=10):
        """Benchmark model performance"""
        times = []

        for i in range(iterations):
            start_time = time.time()

            result = subprocess.run([
                'ollama', 'run', model_name, prompt
            ], capture_output=True, text=True)

            end_time = time.time()
            times.append(end_time - start_time)

        avg_time = sum(times) / len(times)
        min_time = min(times)
        max_time = max(times)

        return {
            "model": model_name,
            "average_time": avg_time,
            "min_time": min_time,
            "max_time": max_time,
            "iterations": iterations
        }

    def optimize_gpu_usage(self, model_name):
        """Optimize GPU usage"""
        # Adjust the number of GPU layers to the available VRAM
        gpu_memory = self.get_gpu_memory()

        if gpu_memory >= 24:    # 24 GB+ VRAM
            gpu_layers = 50
        elif gpu_memory >= 12:  # 12-24 GB VRAM
            gpu_layers = 35
        elif gpu_memory >= 8:   # 8-12 GB VRAM
            gpu_layers = 25
        else:                   # <8 GB VRAM
            gpu_layers = 15

        config = {
            "gpu_layers": gpu_layers,
            "num_ctx": 4096,
            "num_thread": 8
        }

        self.optimize_model_config(model_name, config)

    def get_gpu_memory(self):
        """Return the available GPU memory in GB (0 if no GPU is found)"""
        try:
            # nvidia-smi reports memory.total in MiB; convert to GB
            result = subprocess.run(['nvidia-smi', '--query-gpu=memory.total', '--format=csv,noheader,nounits'],
                                    capture_output=True, text=True)
            return int(result.stdout.strip().splitlines()[0]) // 1024
        except (OSError, ValueError, IndexError):
            return 0
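
The config.json files written above are a convention of this article's tooling; the mechanism Ollama itself supports for persisting parameters is a Modelfile combined with `ollama create`. The following sketch turns the presets from ollama_config.yaml into named model variants (it assumes PyYAML is installed and the YAML format shown above):

# create_model_variants.py - persist parameters the way Ollama supports natively:
# via a Modelfile and `ollama create` (a sketch, assumes the ollama_config.yaml above)
import subprocess
import tempfile
import yaml

def create_variant(base_model: str, variant_name: str, params: dict, context_length: int):
    """Write a Modelfile for the base model and register it as a new local model."""
    modelfile = f"FROM {base_model}\n"
    modelfile += f"PARAMETER num_ctx {context_length}\n"
    for key, value in params.items():
        modelfile += f"PARAMETER {key} {value}\n"
    # (the gpu_layers value from the YAML would correspond to the num_gpu parameter)

    with tempfile.NamedTemporaryFile("w", suffix=".Modelfile", delete=False) as f:
        f.write(modelfile)
        path = f.name

    # Equivalent to: ollama create <variant_name> -f <Modelfile>
    subprocess.run(["ollama", "create", variant_name, "-f", path], check=True)

if __name__ == "__main__":
    with open("ollama_config.yaml") as f:
        config = yaml.safe_load(f)

    for variant, spec in config["models"].items():
        create_variant(spec["name"], variant, spec["parameters"], spec["context_length"])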

API Integration and Web Interface

REST API Server

# ollama_api_server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import requests
import json
import asyncio
from typing import List, Optional

app = FastAPI(title="Ollama API Server")

class ChatRequest(BaseModel):
    model: str
    message: str
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 0.9
    max_tokens: Optional[int] = 1000

class ChatResponse(BaseModel):
    response: str
    model: str
    tokens_used: int
    processing_time: float

class OllamaAPIClient:
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    async def chat(self, request: ChatRequest) -> ChatResponse:
        """Chat-Anfrage an Ollama senden"""
        import time
        start_time = time.time()

        payload = {
            "model": request.model,
            "prompt": request.message,
            "stream": False,
            "options": {
                "temperature": request.temperature,
                "top_p": request.top_p,
                "num_predict": request.max_tokens
            }
        }

        try:
            response = requests.post(
                f"{self.base_url}/api/generate",
                json=payload,
                timeout=300
            )
            response.raise_for_status()

            result = response.json()
            processing_time = time.time() - start_time

            return ChatResponse(
                response=result["response"],
                model=request.model,
                tokens_used=result.get("eval_count", 0),
                processing_time=processing_time
            )

        except requests.exceptions.RequestException as e:
            raise HTTPException(status_code=500, detail=f"Ollama API Error: {str(e)}")

    async def list_models(self) -> List[dict]:
        """Verfügbare Modelle auflisten"""
        try:
            response = requests.get(f"{self.base_url}/api/tags")
            response.raise_for_status()
            return response.json()["models"]
        except requests.exceptions.RequestException as e:
            raise HTTPException(status_code=500, detail=f"Ollama API Error: {str(e)}")

client = OllamaAPIClient()

@app.post("/chat", response_model=ChatResponse)
async def chat_endpoint(request: ChatRequest):
    """Chat-Endpunkt"""
    return await client.chat(request)

@app.get("/models")
async def list_models():
    """Modelle auflisten"""
    return await client.list_models()

@app.get("/health")
async def health_check():
    """Health-Check"""
    try:
        models = await client.list_models()
        return {"status": "healthy", "models_count": len(models)}
    except:
        return {"status": "unhealthy"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
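
Once the server is running under uvicorn, it can be exercised with a few lines of client code. A usage sketch (assumes the server above listens on localhost:8000 and llama2:7b is available):

# api_client_example.py - exercise the FastAPI wrapper started above
import requests

# List the models known to Ollama
print(requests.get("http://localhost:8000/models", timeout=10).json())

# Send a chat request through the wrapper
payload = {
    "model": "llama2:7b",
    "message": "Explain in two sentences what a reverse proxy does.",
    "temperature": 0.7,
    "max_tokens": 200,
}
result = requests.post("http://localhost:8000/chat", json=payload, timeout=300).json()
print(result["response"])
print(f"Tokens: {result['tokens_used']}, time: {result['processing_time']:.2f}s")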

Web Interface with Streamlit

# ollama_web_interface.py
import streamlit as st
import requests
import json
from typing import List

class OllamaWebInterface:
    def __init__(self, api_url="http://localhost:8000"):
        self.api_url = api_url

    def get_models(self) -> List[str]:
        """Fetch the available models"""
        try:
            response = requests.get(f"{self.api_url}/models")
            models = response.json()
            return [model["name"] for model in models]
        except Exception:
            return ["llama2:7b", "codellama:7b", "mistral:7b"]

    def chat(self, model: str, message: str, temperature: float = 0.7):
        """Chat-Anfrage senden"""
        payload = {
            "model": model,
            "message": message,
            "temperature": temperature
        }

        try:
            response = requests.post(f"{self.api_url}/chat", json=payload)
            return response.json()
        except Exception as e:
            return {"error": str(e)}

def main():
    st.set_page_config(
        page_title="Ollama Web Interface",
        page_icon="🤖",
        layout="wide"
    )

    st.title("🤖 Ollama Private AI Interface")

    # Sidebar for configuration
    with st.sidebar:
        st.header("Configuration")

        # Model selection
        interface = OllamaWebInterface()
        models = interface.get_models()
        selected_model = st.selectbox("Select model:", models)

        # Parameters
        temperature = st.slider("Temperature:", 0.0, 1.0, 0.7, 0.1)
        max_tokens = st.slider("Max Tokens:", 100, 2000, 1000, 100)

        # System status
        st.header("System Status")
        try:
            health = requests.get("http://localhost:8000/health").json()
            st.success(f"Status: {health['status']}")
            st.info(f"Models: {health.get('models_count', 'N/A')}")
        except Exception:
            st.error("API not reachable")

    # Main area
    col1, col2 = st.columns([1, 1])

    with col1:
        st.header("💬 Chat")

        # Chat history
        if "messages" not in st.session_state:
            st.session_state.messages = []

        # Display the messages
        for message in st.session_state.messages:
            with st.chat_message(message["role"]):
                st.markdown(message["content"])

        # New message
        if prompt := st.chat_input("Enter a message..."):
            st.session_state.messages.append({"role": "user", "content": prompt})
            with st.chat_message("user"):
                st.markdown(prompt)

            # Generate the AI response
            with st.chat_message("assistant"):
                with st.spinner("Generating response..."):
                    response = interface.chat(selected_model, prompt, temperature)

                    if "error" in response:
                        st.error(f"Fehler: {response['error']}")
                    else:
                        st.markdown(response["response"])
                        st.session_state.messages.append({
                            "role": "assistant",
                            "content": response["response"]
                        })

    with col2:
        st.header("📊 Model Information")

        # Model details
        st.subheader(f"Model: {selected_model}")

        # Performance metrics
        if st.button("Test performance"):
            with st.spinner("Testing performance..."):
                test_prompt = "Briefly explain what Kubernetes is."
                response = interface.chat(selected_model, test_prompt, 0.5)

                if "error" not in response:
                    st.metric("Processing time", f"{response['processing_time']:.2f}s")
                    st.metric("Tokens used", response['tokens_used'])

        # Model list
        st.subheader("Available models")
        for model in models:
            st.write(f"• {model}")

        # Reset the chat
        if st.button("Reset chat"):
            st.session_state.messages = []
            st.rerun()

if __name__ == "__main__":
    main()

Production Deployment

Nginx Reverse Proxy

# /etc/nginx/sites-available/ollama
server {
    listen 80;
    server_name ollama.yourdomain.com;

    # SSL configuration (recommended)
    # listen 443 ssl;
    # ssl_certificate /path/to/cert.pem;
    # ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Timeouts (consider raising proxy_read_timeout for long-running generations)
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }

    # Direct access to the Ollama API
    location /ollama/ {
        proxy_pass http://localhost:11434/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Monitoring and Logging

# monitoring.py
import psutil
import GPUtil
import time
import logging
from datetime import datetime
import json

class OllamaMonitor:
    def __init__(self):
        self.logger = self.setup_logging()

    def setup_logging(self):
        """Logging konfigurieren"""
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler('/var/log/ollama-monitor.log'),
                logging.StreamHandler()
            ]
        )
        return logging.getLogger(__name__)

    def get_system_metrics(self):
        """System-Metriken sammeln"""
        metrics = {
            "timestamp": datetime.now().isoformat(),
            "cpu": {
                "usage_percent": psutil.cpu_percent(interval=1),
                "count": psutil.cpu_count(),
                "frequency": psutil.cpu_freq()._asdict() if psutil.cpu_freq() else None
            },
            "memory": {
                "total": psutil.virtual_memory().total,
                "available": psutil.virtual_memory().available,
                "used": psutil.virtual_memory().used,
                "percent": psutil.virtual_memory().percent
            },
            "disk": {
                "total": psutil.disk_usage('/').total,
                "used": psutil.disk_usage('/').used,
                "free": psutil.disk_usage('/').free,
                "percent": psutil.disk_usage('/').percent
            }
        }

        # GPU metrics (if available)
        try:
            gpus = GPUtil.getGPUs()
            metrics["gpu"] = []
            for gpu in gpus:
                metrics["gpu"].append({
                    "id": gpu.id,
                    "name": gpu.name,
                    "memory_total": gpu.memoryTotal,
                    "memory_used": gpu.memoryUsed,
                    "memory_free": gpu.memoryFree,
                    "temperature": gpu.temperature,
                    "load": gpu.load
                })
        except Exception:
            metrics["gpu"] = None

        return metrics

    def check_ollama_status(self):
        """Check whether the Ollama service responds"""
        try:
            import requests
            response = requests.get("http://localhost:11434/api/tags", timeout=5)
            return response.status_code == 200
        except Exception:
            return False

    def log_metrics(self, metrics):
        """Log the collected metrics"""
        self.logger.info(f"System Metrics: {json.dumps(metrics, indent=2)}")

        # Warn on high utilization
        if metrics["cpu"]["usage_percent"] > 80:
            self.logger.warning(f"High CPU usage: {metrics['cpu']['usage_percent']}%")

        if metrics["memory"]["percent"] > 85:
            self.logger.warning(f"High memory usage: {metrics['memory']['percent']}%")

        if metrics["disk"]["percent"] > 90:
            self.logger.warning(f"High disk usage: {metrics['disk']['percent']}%")

    def run_monitoring(self, interval=60):
        """Monitoring loop"""
        self.logger.info("Ollama monitoring started")

        while True:
            try:
                # Collect system metrics
                metrics = self.get_system_metrics()

                # Check the Ollama status
                ollama_status = self.check_ollama_status()
                metrics["ollama_status"] = ollama_status

                # Log the metrics
                self.log_metrics(metrics)

                # Append the metrics to a JSON-lines file
                with open("/var/log/ollama-metrics.json", "a") as f:
                    f.write(json.dumps(metrics) + "\n")

                time.sleep(interval)

            except Exception as e:
                self.logger.error(f"Monitoring error: {str(e)}")
                time.sleep(interval)

if __name__ == "__main__":
    monitor = OllamaMonitor()
    monitor.run_monitoring()
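
The JSON-lines file written by the monitor can later be turned into a simple report. A sketch, assuming the metrics file path used above:

# metrics_report.py - summarize the JSON-lines metrics written by monitoring.py
import json

cpu, mem, down = [], [], 0

with open("/var/log/ollama-metrics.json") as f:
    for line in f:
        entry = json.loads(line)
        cpu.append(entry["cpu"]["usage_percent"])
        mem.append(entry["memory"]["percent"])
        if not entry.get("ollama_status", True):
            down += 1

if cpu:
    print(f"Samples: {len(cpu)}")
    print(f"Average CPU usage: {sum(cpu) / len(cpu):.1f}%")
    print(f"Average memory usage: {sum(mem) / len(mem):.1f}%")
    print(f"Samples with Ollama unreachable: {down}")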

Backup and Recovery

#!/bin/bash
# ollama_backup.sh

# Backup configuration
BACKUP_DIR="/backup/ollama"
DATE=$(date +%Y%m%d_%H%M%S)
OLLAMA_DIR="/root/.ollama"

# Create the backup directory
mkdir -p $BACKUP_DIR

# Back up the models
echo "Creating model backup..."
tar -czf $BACKUP_DIR/models_$DATE.tar.gz -C $OLLAMA_DIR models/

# Back up the configuration (if present)
echo "Creating configuration backup..."
[ -d "$OLLAMA_DIR/config" ] && tar -czf $BACKUP_DIR/config_$DATE.tar.gz -C $OLLAMA_DIR config/

# Full backup
echo "Creating full backup..."
tar -czf $BACKUP_DIR/full_$DATE.tar.gz -C $OLLAMA_DIR .

# Remove old backups (older than 30 days)
find $BACKUP_DIR -name "*.tar.gz" -mtime +30 -delete

echo "Backup finished: $BACKUP_DIR"

#!/bin/bash
# ollama_restore.sh

# Restore configuration
BACKUP_DIR="/backup/ollama"
OLLAMA_DIR="/root/.ollama"

# Find the most recent backup
LATEST_BACKUP=$(ls -t $BACKUP_DIR/full_*.tar.gz | head -1)

if [ -z "$LATEST_BACKUP" ]; then
    echo "No backup found!"
    exit 1
fi

echo "Restoring from backup: $LATEST_BACKUP"

# Stop the Ollama service
sudo systemctl stop ollama

# Unpack the backup into the Ollama directory
# (the archive was created relative to $OLLAMA_DIR, so extract it there)
mkdir -p $OLLAMA_DIR
tar -xzf $LATEST_BACKUP -C $OLLAMA_DIR

# Start the Ollama service
sudo systemctl start ollama

echo "Restore finished"

Security and Best Practices

Security Configuration

#!/bin/bash
# firewall_setup.sh

# Configure the UFW firewall
sudo ufw enable

# Allow the standard ports
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

# Ollama API reachable only from internal networks
sudo ufw allow from 192.168.1.0/24 to any port 11434
sudo ufw allow from 10.0.0.0/8 to any port 11434

# Web interface only for specific IPs
sudo ufw allow from 192.168.1.100 to any port 8000

# Show the firewall status
sudo ufw status verbose

# security_middleware.py
from fastapi import Request, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import jwt
import time
from typing import Optional

class SecurityMiddleware:
    def __init__(self, secret_key: str):
        self.secret_key = secret_key
        self.security = HTTPBearer()

    def create_token(self, user_id: str, expires_in: int = 3600) -> str:
        """JWT-Token erstellen"""
        payload = {
            "user_id": user_id,
            "exp": time.time() + expires_in,
            "iat": time.time()
        }
        return jwt.encode(payload, self.secret_key, algorithm="HS256")

    def verify_token(self, token: str) -> Optional[str]:
        """JWT-Token verifizieren"""
        try:
            payload = jwt.decode(token, self.secret_key, algorithms=["HS256"])
            return payload["user_id"]
        except jwt.ExpiredSignatureError:
            raise HTTPException(status_code=401, detail="Token expired")
        except jwt.InvalidTokenError:
            raise HTTPException(status_code=401, detail="Invalid token")

    async def authenticate(self, request: Request):
        """Authentifizierung durchführen"""
        try:
            credentials: HTTPAuthorizationCredentials = await self.security(request)
            user_id = self.verify_token(credentials.credentials)
            request.state.user_id = user_id
        except HTTPException:
            raise HTTPException(status_code=401, detail="Authentifizierung erforderlich")

Rate Limiting

# rate_limiter.py
import time
from collections import defaultdict
from fastapi import HTTPException

class RateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.requests = defaultdict(list)

    def is_allowed(self, client_id: str) -> bool:
        """Check whether the request is allowed"""
        now = time.time()
        minute_ago = now - 60

        # Remove requests older than one minute
        self.requests[client_id] = [
            req_time for req_time in self.requests[client_id]
            if req_time > minute_ago
        ]

        # Record the new request if the client is under the limit
        if len(self.requests[client_id]) < self.requests_per_minute:
            self.requests[client_id].append(now)
            return True

        return False

    def check_rate_limit(self, client_id: str):
        """Raise an HTTP 429 error when the rate limit is exceeded"""
        if not self.is_allowed(client_id):
            raise HTTPException(
                status_code=429,
                detail="Rate limit exceeded. Please wait a minute."
            )
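
The limiter can then be applied per client IP inside an endpoint. A usage sketch (assumes the app and client objects from ollama_api_server.py; the /limited-chat route is illustrative):

# limited_api.py - rate-limit chat requests per client IP (a sketch)
from fastapi import Request

from ollama_api_server import app, client, ChatRequest, ChatResponse
from rate_limiter import RateLimiter

limiter = RateLimiter(requests_per_minute=30)

@app.post("/limited-chat", response_model=ChatResponse)
async def limited_chat(request: ChatRequest, http_request: Request):
    # Use the caller's IP address as the rate-limiting key
    limiter.check_rate_limit(http_request.client.host)
    return await client.chat(request)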

Conclusion: Private AI for Businesses

Ollama on Ubuntu offers a powerful solution for private AI deployment:

Technical advantages:

  • Simple installation and configuration
  • GPU acceleration for optimal performance
  • Modular architecture for flexible deployment options
  • A comprehensive API for integration into existing systems

Business advantages:

  • Cost savings because API fees no longer apply
  • Data protection compliance through local processing
  • Full control over AI models and data
  • Scalability from development to production

Next steps:

  1. Pilot installation on a test system
  2. Model selection based on your use cases
  3. Performance tuning for specific workloads
  4. Integration into existing workflows

Ollama makes private AI accessible and affordable for businesses, without compromising on data protection or control.
