OLLAMA LOCAL: LLMs WITHOUT INTERNET, WITHOUT COST, WITHOUT USAGE LIMITS
Groq is fast. Gemini is capable. Claude is brilliant. But they all share the same constraints: they require internet, impose usage limits, and route your conversations through third-party servers. Ollama is different — it is a runtime that runs language models directly on your machine. No internet. No API key. No per-token cost. Your data never leaves your disk.
Stack: Ollama · Python · any OS
Reference project: FocOs — local LLM provider
Goal: Full technological sovereignty — the LLM runs on your machine, data never leaves your disk
01. THE PROBLEM IT SOLVES
Cloud LLM APIs carry three limitations that directly impact real workflows: they require stable internet — no connection, no LLM; they impose rate and context limits that interrupt long sessions; and the code and ideas you share go to external servers. For an independent developer building their own ecosystem, these are not theoretical concerns — they are real blockers.
For long projects, late-night work sessions, sensitive code, or anyone who values operational independence — Ollama is the answer.
// Non-technical explanation
Imagine that every time you want to ask an expert a question, you have to call them by phone, wait for availability, and pay per minute. That is what cloud LLM APIs are. Ollama is like having that expert living in your house. Always available. No phone. No bill. No one else listening to the conversation.
02. INSTALLATION — 3 MINUTES ON ANY OS
Ollama installs with a single command on Linux, an installer on Windows, and Homebrew on macOS. Once installed, it runs as a background service and exposes a REST API on localhost:11434.
curl -fsSL https://ollama.com/install.sh | sh
# WINDOWS:
# Download installer from: https://ollama.com/download
# Run OllamaSetup.exe — runs as background service automatically
# MACOS:
brew install ollama
# Verify installation:
ollama --version
# ollama version 0.5.x (your version may differ)
# Verify server is running:
curl http://localhost:11434/api/tags
# Returns list of installed models (empty on first run)
03. RECOMMENDED MODELS — WHICH ONE FOR WHAT
Model selection depends on available RAM and task type. The rule is simple: the largest model your hardware can run without hitting swap. Swap kills inference performance completely.
| Model | Min RAM | Disk | Best for |
|---|---|---|---|
| llama3.2:3b | 8 GB | ~2 GB | Simple tasks, short code completion, quick queries. |
| llama3.2 | 8 GB | ~5 GB | General development, debugging, technical explanations. The sweet spot. |
| llama3.3:70b | 32 GB | ~40 GB | Complex architecture, deep reasoning, analysis. |
| gemma2:9b | 16 GB | ~6 GB | Code. Excellent capability-to-size ratio. From Google. |
| codellama | 8 GB | ~4 GB | Function completion, technical debugging, refactoring. Code-specialized. |
| mistral | 8 GB | ~4 GB | Writing and synthesis. Fast and efficient. |
| deepseek-r1:8b | 8 GB | ~5 GB | Step-by-step reasoning. Excellent for complex logical problems. |
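The rule above can be sketched as a tiny selection helper. This is illustrative only — `pick_model` and the `RECOMMENDED` list are not part of Ollama; the entries simply mirror the table, ordered largest first:

```python
# Illustrative helper: pick the largest model from the table above
# that fits in the given amount of RAM without touching swap.
RECOMMENDED = [
    # (min RAM in GB, model tag), ordered largest first
    (32, 'llama3.3:70b'),
    (16, 'gemma2:9b'),
    (8,  'llama3.2'),
    (8,  'llama3.2:3b'),
]

def pick_model(ram_gb):
    """Return the first (largest) model whose minimum RAM fits."""
    for min_ram, model in RECOMMENDED:
        if ram_gb >= min_ram:
            return model
    return None  # below 8 GB, none of the listed models is a safe fit

pick_model(16)  # gemma2:9b fits; llama3.3:70b does not
```

Pass the machine's physical RAM in GB; the helper returns a tag you can feed straight to `ollama pull`.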
04. DOWNLOADING AND RUNNING A MODEL
The download happens once — the model is stored locally and available offline permanently. Management commands are minimal and intuitive.
ollama pull llama3.2
# Downloads ~5GB — stored locally, works offline from this point
# Run in interactive chat mode:
ollama run llama3.2
# >>> Type your message here
# Ctrl+D or /bye to exit
# Run with inline prompt:
ollama run llama3.2 "explain what this does: def fib(n): return n if n<=1 else fib(n-1)+fib(n-2)"
# Model management:
ollama list # list installed models
ollama ps # list running models
ollama rm llama3.2 # remove a model
ollama pull llama3.2 # update a model
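The same inventory is available programmatically: a GET on /api/tags returns JSON with a models array, which is what `ollama list` reads. A minimal sketch using only the standard library (`parse_tags` and `list_models` are hypothetical helper names, not part of Ollama):

```python
import json
import urllib.request

def parse_tags(raw):
    """Extract model names from a raw /api/tags JSON payload."""
    data = json.loads(raw)
    return [m['name'] for m in data.get('models', [])]

def list_models(base_url='http://localhost:11434'):
    # GET /api/tags returns {"models": [{"name": "...", "size": ...}, ...]}
    with urllib.request.urlopen(f'{base_url}/api/tags', timeout=5) as res:
        return parse_tags(res.read())
```

Splitting parsing from fetching keeps the JSON handling testable without a running server.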
05. PYTHON INTEGRATION — THE REST API
Ollama exposes a REST API on localhost:11434. Its native /api/chat endpoint uses the familiar role/content message format, and an OpenAI-compatible endpoint is also available under /v1. It integrates into FocOs exactly like Groq or Gemini — the AAL abstraction layer does not distinguish between a cloud model and one running on local disk.
import urllib.request, json

def call_ollama(model, messages, base_url='http://localhost:11434'):
    url = f'{base_url}/api/chat'
    payload = json.dumps({
        'model': model,
        'messages': messages,
        'stream': False,
        'options': {
            'temperature': 0.7,
            'num_ctx': 4096,  # Context window — tune per available RAM
        }
    }).encode('utf-8')
    req = urllib.request.Request(
        url, data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST'
    )
    with urllib.request.urlopen(req, timeout=120) as res:
        data = json.loads(res.read())
    return data.get('message', {}).get('content', '')

# Health check before calling:
def ollama_available(base_url='http://localhost:11434'):
    try:
        urllib.request.urlopen(f'{base_url}/api/tags', timeout=2)
        return True
    except Exception:
        return False
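The messages argument is a plain list of role/content dicts. A small illustrative helper (`build_messages` is not part of any API here, just a sketch of the shape) assembles one with an optional system prompt and prior history:

```python
def build_messages(user_text, system=None, history=None):
    """Assemble a role/content messages list for /api/chat."""
    messages = []
    if system:
        # System prompt goes first, if any
        messages.append({'role': 'system', 'content': system})
    # Prior turns, e.g. [{'role': 'user', ...}, {'role': 'assistant', ...}]
    messages.extend(history or [])
    # The new user turn always goes last
    messages.append({'role': 'user', 'content': user_text})
    return messages
```

Typical use: `call_ollama('llama3.2', build_messages('explain closures', system='Be concise.'))` — guarded, of course, by `ollama_available()`.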
06. STREAMING — REAL-TIME RESPONSES
For long responses, streaming dramatically improves the experience — the user sees text appearing as the model generates, instead of waiting for the full inference to complete before seeing anything.
import json, urllib.request

def stream_ollama(model, messages, on_token, base_url='http://localhost:11434'):
    '''
    on_token: callback receiving each text fragment
    Example: on_token = lambda t: print(t, end='', flush=True)
    '''
    url = f'{base_url}/api/chat'
    payload = json.dumps({
        'model': model,
        'messages': messages,
        'stream': True,
    }).encode('utf-8')
    req = urllib.request.Request(
        url, data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST'
    )
    full_response = ''
    with urllib.request.urlopen(req, timeout=120) as res:
        for line in res:  # Ollama streams NDJSON: one JSON object per line
            if line.strip():
                chunk = json.loads(line.decode('utf-8'))
                token = chunk.get('message', {}).get('content', '')
                if token:
                    on_token(token)
                    full_response += token
                if chunk.get('done', False):
                    break
    return full_response
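To make the chunk format concrete: each line of the streamed body is a standalone JSON object, and "done": true marks the end. A toy assembly over hand-written sample chunks (the sample_stream contents are invented for illustration; only the field names match the real API):

```python
import json

# Invented sample of what a streamed /api/chat body looks like, line by line:
sample_stream = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo!"}, "done": false}',
    '{"message": {"content": ""}, "done": true}',
]

def assemble(lines):
    """Concatenate content fragments until a chunk reports done."""
    text = ''
    for line in lines:
        chunk = json.loads(line)
        text += chunk.get('message', {}).get('content', '')
        if chunk.get('done', False):
            break
    return text

print(assemble(sample_stream))  # → Hello!
```

This is exactly the accumulation the streaming function performs, minus the HTTP plumbing.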
07. CUSTOM MODELS — MODELFILE
Ollama lets you create custom models with a fixed system prompt via a Modelfile. This allows packaging the Chronos assistant from FocOs as a standalone model — once created, the full ecosystem context is available without manual injection on every call.
FROM llama3.2
SYSTEM """
You are Chronos — Frank's development assistant.
You operate inside FocOs, the window manager of the being who builds.
You know the ecosystem: FocOs, TeliOs, KayrOs, ChronOs, OruX.
Methodology: AAL — LLM-Agnostic Architecture.
Preferred stack: Python + HTML/CSS/JS vanilla. Zero dependencies.
Principle: the contract is the truth. Code is regenerable.
Always respond in the user's language.
Prioritize practical solutions over theory.
Maximum 3 options when alternatives exist.
"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
# Create the custom model:
ollama create chronos -f Modelfile
# Run from terminal:
ollama run chronos
# Call from Python:
response = call_ollama('chronos', [{'role': 'user', 'content': 'hello'}])
-- CONCLUSION
Ollama turns technological independence from a principle into a practical reality. A developer with Ollama installed can build software with LLMs at 3am, without internet, without spending a cent, without any external server seeing their code. The Ollama + FocOs + AAL combination is the most sovereign stack available today for AI-assisted development: the LLM runs on your machine, the work environment is yours, the methodology is yours, and the data never leaves your disk.
> SYSTEM_READY > NODE_ONLINE