OLLAMA LOCAL: LLMs WITHOUT INTERNET, WITHOUT COST, WITHOUT CONTEXT LIMITS

TAGS: LLMs / PRIVACY / INFRASTRUCTURE / PYTHON READ_TIME: 12 MIN

Groq is fast. Gemini is capable. Claude is brilliant. But they all share one thing: they require internet, have usage limits, and your conversations pass through third-party servers. Ollama is different — it is a runtime that runs language models directly on your machine. No internet. No API key. No per-token cost. Your data never leaves your disk.

PROJECT_STATUS: STABLE

Stack: Ollama · Python · any OS
Reference project: FocOs — local LLM provider
Goal: Full technological sovereignty — the LLM runs on your machine, data never leaves your disk

01. THE PROBLEM IT SOLVES

Cloud LLM APIs carry three limitations that directly impact real workflow: they require stable internet — no connection, no LLM; they impose rate and context limits that interrupt long sessions; and the code and ideas you share go to external servers. For an independent developer building their own ecosystem, these are not theoretical concerns — they are real blockers.

For long projects, late-night work sessions, sensitive code, or anyone who values operational independence — Ollama is the answer.

// Non-technical explanation

Imagine that every time you want to ask an expert a question, you have to call them by phone, wait for availability, and pay per minute. That is what cloud LLM APIs are. Ollama is like having that expert living in your house. Always available. No phone. No bill. No one else listening to the conversation.

02. INSTALLATION — 3 MINUTES ON ANY OS

Ollama installs with a single command on Linux, an installer on Windows, and Homebrew on macOS. Once installed, it runs as a background service and exposes a REST API on localhost:11434.

# LINUX (single line):
curl -fsSL https://ollama.com/install.sh | sh

# WINDOWS:
# Download installer from: https://ollama.com/download
# Run OllamaSetup.exe — runs as background service automatically

# MACOS:
brew install ollama

# Verify installation:
ollama --version
# ollama version 0.5.x

# Verify server is running:
curl http://localhost:11434/api/tags
# Returns list of installed models (empty on first run)

03. RECOMMENDED MODELS — WHICH ONE FOR WHAT

Model selection depends on available RAM and task type. The rule is simple: the largest model your hardware can run without hitting swap. Swap kills inference performance completely.

Model           Min RAM   Disk     Best for
llama3.2:3b     8 GB      ~2 GB    Simple tasks, short code completion, quick queries.
llama3.2        8 GB      ~5 GB    General development, debugging, technical explanations. The sweet spot.
llama3.3:70b    32 GB     ~40 GB   Complex architecture, deep reasoning, analysis.
gemma2:9b       16 GB     ~6 GB    Code. Excellent capability-to-size ratio. From Google.
codellama       8 GB      ~4 GB    Function completion, technical debugging, refactoring. Code-specialized.
mistral         8 GB      ~4 GB    Writing and synthesis. Fast and efficient.
deepseek-r1:8b  8 GB      ~5 GB    Step-by-step reasoning. Excellent for complex logical problems.
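The selection rule above can be sketched as a small helper. This is a hypothetical function (not part of Ollama) that maps available RAM to the largest general-purpose model in the table; the thresholds are the "Min RAM" column, treated as a hard floor to stay out of swap:

```python
# Hypothetical helper: pick the largest model from the table above
# that fits in available RAM. Thresholds mirror the "Min RAM" column.
MODELS_BY_MIN_RAM = [
    (32, 'llama3.3:70b'),
    (16, 'gemma2:9b'),
    (8, 'llama3.2'),
]

def pick_model(available_gb):
    """Return the largest model whose minimum RAM fits available_gb."""
    for min_ram, name in MODELS_BY_MIN_RAM:
        if available_gb >= min_ram:
            return name
    return 'llama3.2:3b'  # smallest fallback for constrained machines

print(pick_model(16))  # gemma2:9b
```

The point of the hard floor: a model that technically loads but spills into swap is slower than a smaller model running entirely in RAM.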

04. DOWNLOADING AND RUNNING A MODEL

The download happens once — the model is stored locally and available offline permanently. Management commands are minimal and intuitive.

# Download a model (first time only):
ollama pull llama3.2
# Downloads ~5GB — stored locally, works offline from this point

# Run in interactive chat mode:
ollama run llama3.2
# >>> Type your message here
# Ctrl+D or /bye to exit

# Run with inline prompt:
ollama run llama3.2 "explain what this does: def fib(n): return n if n<=1 else fib(n-1)+fib(n-2)"

# Model management:
ollama list # list installed models
ollama ps # list running models
ollama rm llama3.2 # remove a model
ollama pull llama3.2 # update a model

05. PYTHON INTEGRATION — THE REST API

Ollama exposes its native REST API on localhost:11434, and also serves an OpenAI-compatible endpoint under /v1 for tools that expect that format. It integrates into FocOs exactly like Groq or Gemini — the AAL abstraction layer does not distinguish between a cloud model and one running on local disk.

# Direct call with urllib — zero external dependencies
import urllib.request, json

def call_ollama(model, messages, base_url='http://localhost:11434'):
    url = f'{base_url}/api/chat'
    payload = json.dumps({
        'model': model,
        'messages': messages,
        'stream': False,
        'options': {
            'temperature': 0.7,
            'num_ctx': 4096, # Context window — tune per available RAM
        }
    }).encode('utf-8')

    req = urllib.request.Request(
        url, data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST'
    )
    with urllib.request.urlopen(req, timeout=120) as res:
        data = json.loads(res.read())
    return data.get('message', {}).get('content', '')

# Health check before calling:
def ollama_available(base_url='http://localhost:11434'):
    try:
        urllib.request.urlopen(f'{base_url}/api/tags', timeout=2)
        return True
    except Exception:
        return False
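The num_ctx option above caps how much conversation history the model actually sees, so long sessions need trimming before each call. A minimal sketch, assuming a rough 3-4 characters-per-token heuristic (trim_history and max_chars are illustrative names, not part of Ollama's API):

```python
def trim_history(messages, max_chars=12000):
    """Keep the system message plus the most recent turns that fit
    a rough character budget (~12k chars approximates num_ctx=4096
    at 3-4 chars per token). Oldest turns are dropped first."""
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']
    kept, total = [], 0
    for m in reversed(rest):          # walk from newest to oldest
        total += len(m['content'])
        if total > max_chars:
            break                     # budget exhausted — drop the rest
        kept.append(m)
    return system + list(reversed(kept))
```

Called right before call_ollama, this keeps the prompt inside the context window instead of letting Ollama silently truncate from the front.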

06. STREAMING — REAL-TIME RESPONSES

For long responses, streaming dramatically improves the experience — the user sees text appearing as the model generates, instead of waiting for the full inference to complete before seeing anything.

def call_ollama_stream(model, messages, on_token):
    '''
    on_token: callback receiving each text fragment
    Example: on_token = lambda t: print(t, end='', flush=True)
    '''
    import json, urllib.request

    url = 'http://localhost:11434/api/chat'
    payload = json.dumps({
        'model': model,
        'messages': messages,
        'stream': True,
    }).encode('utf-8')

    req = urllib.request.Request(
        url, data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST'
    )
    full_response = ''
    with urllib.request.urlopen(req, timeout=120) as res:
        for line in res:
            if line.strip():
                chunk = json.loads(line.decode('utf-8'))
                token = chunk.get('message', {}).get('content', '')
                if token:
                    on_token(token)
                    full_response += token
                if chunk.get('done', False):
                    break
    return full_response
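The stream above is newline-delimited JSON: one object per line, with done set to true on the final chunk. To see the parsing logic in isolation (no server needed), here is a sketch over canned chunks — collect_stream is a hypothetical name for the accumulation step inside the loop:

```python
import json

def collect_stream(lines):
    """Assemble the full response from Ollama-style NDJSON chunks:
    one JSON object per line, 'done': true marks the last one."""
    out = []
    for raw in lines:
        if not raw.strip():
            continue                  # skip blank keep-alive lines
        chunk = json.loads(raw)
        token = chunk.get('message', {}).get('content', '')
        if token:
            out.append(token)
        if chunk.get('done', False):
            break                     # final chunk — stop reading
    return ''.join(out)

chunks = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo"}, "done": true}',
]
print(collect_stream(chunks))  # Hello
```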

07. CUSTOM MODELS — MODELFILE

Ollama lets you create custom models with a fixed system prompt via a Modelfile. This allows packaging the Chronos assistant from FocOs as a standalone model — once created, the full ecosystem context is available without manual injection on every call.

# Modelfile

FROM llama3.2

SYSTEM """
You are Chronos — Frank's development assistant.
You operate inside FocOs, the window manager of the being who builds.
You know the ecosystem: FocOs, TeliOs, KayrOs, ChronOs, OruX.
Methodology: AAL — LLM-Agnostic Architecture.
Preferred stack: Python + HTML/CSS/JS vanilla. Zero dependencies.
Principle: the contract is the truth. Code is regenerable.
Always respond in the user's language.
Prioritize practical solutions over theory.
Maximum 3 options when alternatives exist.
"""

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Create the custom model:
ollama create chronos -f Modelfile

# Run from terminal:
ollama run chronos

# Call from Python:
response = call_ollama('chronos', [{'role': 'user', 'content': 'hello'}])

-- CONCLUSION

Ollama turns technological independence from a principle into a practical reality. A developer with Ollama installed can build software with LLMs at 3am, without internet, without spending a cent, without any external server seeing their code. The Ollama + FocOs + AAL combination is the most sovereign stack available today for AI-assisted development: the LLM runs on your machine, the work environment is yours, the methodology is yours, and the data never leaves your disk.
