Typhon does not fight alone. It speaks through champions — LLM servers that answer on the OpenAI-compatible /v1/chat/completions endpoint. The following are probed automatically on every scan.
| Server | Port | Notes |
|---|---|---|
| llama.cpp (llama-server) | 8080 | Recommended |
| Ollama | 11434 | |
| LM Studio | 1234 | |
| vLLM | 8000 | |
| text-generation-webui | 5000 | Requires OpenAI extension enabled |
| Jan | 1337 | |
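
As a rough illustration of what probing these ports can look like, the sketch below attempts a TCP connection to each default port from the table. It is not Typhon's actual scanner code; the names `DEFAULT_PORTS`, `port_is_open`, and `find_listening_servers` are hypothetical, and an open port only shows that something is listening there, not which server it is.

```python
import socket

# Default ports copied from the table above; the dict name is hypothetical.
DEFAULT_PORTS = {
    "llama.cpp (llama-server)": 8080,
    "Ollama": 11434,
    "LM Studio": 1234,
    "vLLM": 8000,
    "text-generation-webui": 5000,
    "Jan": 1337,
}


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def find_listening_servers(host: str = "127.0.0.1") -> list[str]:
    """Return the names of servers whose default port accepts connections."""
    return [name for name, port in DEFAULT_PORTS.items() if port_is_open(host, port)]


if __name__ == "__main__":
    print(find_listening_servers())
```

A short timeout keeps the check fast when nothing is listening on a given port.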
llama.cpp’s llama-server is the preferred backend. It exposes --flash-attn, --ctx-size, and -ngl directly — the exact parameters Typhon optimizes for.
./llama-server \
--model /path/to/model.gguf \
--port 8080 \
--flash-attn on \
--ctx-size 32768 \
-ngl 99
Key weapons:
| Flag | Effect |
|---|---|
| --flash-attn on | Enables Flash Attention 2: reduces VRAM by 20–30% on large contexts and improves throughput. Always enable if your GPU supports it (CUDA sm ≥ 8.0). |
| --ctx-size N | Maximum context in tokens. This is the number Typhon will tell you to set. Higher = more VRAM consumed. |
| -ngl 99 | Offload all model layers to the GPU. Required for full VRAM utilization and honest benchmarks. |
| --threads N | CPU threads for prompt processing. Set to physical core count. |
!!! tip
    Set --ctx-size to the value from typhon-ask or typhon-summary and always leave --flash-attn on. Of everything you can configure, these two flags have the largest effect on VRAM use and throughput.
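
If you launch llama-server from a script, the flags above can be assembled programmatically. The following is a minimal sketch, assuming the recommended context size has already been read off typhon-ask or typhon-summary by hand; the function name `llama_server_command` is hypothetical and not part of Typhon.

```python
import shlex


def llama_server_command(model_path: str, ctx_size: int, port: int = 8080) -> list[str]:
    """Build an argv list for llama-server using the flags described above."""
    if ctx_size <= 0:
        raise ValueError("ctx_size must be a positive number of tokens")
    return [
        "./llama-server",
        "--model", model_path,
        "--port", str(port),
        "--flash-attn", "on",
        "--ctx-size", str(ctx_size),
        "-ngl", "99",
    ]


if __name__ == "__main__":
    # 32768 stands in for whatever value Typhon recommended on your machine.
    print(shlex.join(llama_server_command("/path/to/model.gguf", 32768)))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.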
ollama serve # wake the server on port 11434
ollama run llama3.1:8b # pull and load a model
Typhon reads the loaded model automatically via /api/tags.
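
For reference, Ollama's /api/tags endpoint returns a JSON object with a `models` array, each entry carrying a `name`. The sketch below shows one way to extract those names; how Typhon itself parses the response is not shown here, and `model_names_from_tags` and the trimmed `SAMPLE` payload are illustrative only.

```python
import json


def model_names_from_tags(payload: str) -> list[str]:
    """Extract model names from the JSON body of an /api/tags response."""
    data = json.loads(payload)
    return [entry["name"] for entry in data.get("models", [])]


# Trimmed sample body; real responses carry more fields per model.
SAMPLE = '{"models": [{"name": "llama3.1:8b"}, {"name": "my-model:latest"}]}'

if __name__ == "__main__":
    print(model_names_from_tags(SAMPLE))
```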
Setting context size in Ollama:
Forge a Modelfile with the recommended num_ctx:
FROM llama3.1:8b
PARAMETER num_ctx 32768
ollama create my-model -f Modelfile
ollama run my-model
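
If the recommended context size changes often, the Modelfile can be generated rather than edited by hand. This is a small sketch under that assumption; `render_modelfile` is a hypothetical helper, and it emits only the two directives used above.

```python
def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that pins num_ctx on top of base_model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


if __name__ == "__main__":
    # Write the file that `ollama create my-model -f Modelfile` expects.
    with open("Modelfile", "w", encoding="utf-8") as handle:
        handle.write(render_modelfile("llama3.1:8b", 32768))
```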
LM Studio exposes a full OpenAI-compatible API. Typhon detects it on port 1234 and reads the loaded model from /v1/models.
To set context size, use the Context Length slider in the model settings before starting the server.
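
To check by hand what a server reports on /v1/models, something like the following can be used. It is a sketch, assuming the usual OpenAI-style list response (a `data` array of objects with an `id`); the function names and the default base URL are illustrative, not Typhon internals.

```python
import json
import urllib.request


def parse_model_ids(body: bytes) -> list[str]:
    """Return the model ids from an OpenAI-style model list response."""
    data = json.loads(body)
    return [item["id"] for item in data.get("data", [])]


def fetch_model_ids(base_url: str = "http://127.0.0.1:1234", timeout: float = 2.0) -> list[str]:
    """GET {base_url}/v1/models and return the ids it lists."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return parse_model_ids(resp.read())


if __name__ == "__main__":
    try:
        print(fetch_model_ids())
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        print(f"no server reachable: {exc}")
```

The same call works against any of the OpenAI-compatible servers listed above by changing the port in `base_url`.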
vllm serve /path/to/model \
--port 8000 \
--max-model-len 32768
--max-model-len is vLLM’s equivalent of --ctx-size.
text-generation-webui requires the OpenAI extension to be enabled with --extensions openai:
python server.py --extensions openai --model your-model
Start Jan and enable the Local API Server from settings. The server binds to port 1337 by default with a full OpenAI-compatible API.
Any server on a non-standard port can join the battle by adding an entry to KNOWN_SERVERS in typhon/scanner.py:
{
"name": "My Server",
"port": 12345,
"health": "/health",
"models": "/v1/models",
}
The port probe and model list fetch are handled automatically.
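
Before adding an entry, it can help to sanity-check it and see which URLs it implies. The sketch below assumes that all four keys shown above are required and that `health` and `models` are paths on the same host and port; both are assumptions drawn from the example entry, and `validate_entry` and `entry_urls` are hypothetical helpers, not functions in typhon/scanner.py.

```python
# Entry in the same shape as the example above.
ENTRY = {
    "name": "My Server",
    "port": 12345,
    "health": "/health",
    "models": "/v1/models",
}

# Assumed to be required, based on the example entry.
REQUIRED_KEYS = ("name", "port", "health", "models")


def validate_entry(entry: dict) -> None:
    """Raise ValueError if the entry is missing keys or has malformed values."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")
    if not isinstance(entry["port"], int) or not 0 < entry["port"] < 65536:
        raise ValueError("port must be an integer between 1 and 65535")
    for key in ("health", "models"):
        if not entry[key].startswith("/"):
            raise ValueError(f"{key} must be a path starting with '/'")


def entry_urls(entry: dict, host: str = "127.0.0.1") -> dict[str, str]:
    """Return the health and model-list URLs implied by a valid entry."""
    validate_entry(entry)
    base = f"http://{host}:{entry['port']}"
    return {"health": base + entry["health"], "models": base + entry["models"]}


if __name__ == "__main__":
    print(entry_urls(ENTRY))
```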