typhon-stress-test

Supported Servers

Typhon does not fight alone. It speaks through champions — LLM servers that answer on the OpenAI-compatible /v1/chat/completions endpoint. The following are probed automatically on every scan.

Server	Port	Notes
llama.cpp (`llama-server`)	8080	Recommended
Ollama	11434
LM Studio	1234
vLLM	8000
text-generation-webui	5000	Requires OpenAI extension enabled
Jan	1337

llama.cpp — the recommended champion

llama.cpp’s llama-server is the preferred backend. It exposes --flash-attn, --ctx-size, and -ngl directly — the exact parameters Typhon optimizes for.

./llama-server \
  --model /path/to/model.gguf \
  --port 8080 \
  --flash-attn on \
  --ctx-size 32768 \
  -ngl 99

Key weapons:

Flag	Effect
`--flash-attn on`	Enables Flash Attention 2 — reduces VRAM by 20–30% on large contexts, improves throughput. Always enable if your GPU supports it (CUDA sm ≥ 8.0).
`--ctx-size N`	Maximum context in tokens. This is the number Typhon will tell you to set. Higher = more VRAM consumed.
`-ngl 99`	Offload all model layers to the GPU. Required for full VRAM utilization and honest benchmarks.
`--threads N`	CPU threads for prompt processing. Set to physical core count.

!!! tip Set --ctx-size to the value from typhon-ask or typhon-summary and leave --flash-attn on always enabled. These two flags have the highest return on VRAM and throughput of anything you can configure.

Ollama

ollama serve             # wake the server on port 11434
ollama run llama3.1:8b   # pull and load a model

Typhon reads the loaded model automatically via /api/tags.

Setting context size in Ollama:

Forge a Modelfile with the recommended num_ctx:

FROM llama3.1:8b
PARAMETER num_ctx 32768

ollama create my-model -f Modelfile
ollama run my-model

LM Studio

Open LM Studio and load a model from the My Models tab
Go to the Local Server tab
Click Start Server (default port 1234)

LM Studio exposes a full OpenAI-compatible API. Typhon detects it on port 1234 and reads the loaded model from /v1/models.

To set context size, use the Context Length slider in the model settings before starting the server.

vLLM

vllm serve /path/to/model \
  --port 8000 \
  --max-model-len 32768

--max-model-len is vLLM’s equivalent of --ctx-size.

text-generation-webui

Requires the OpenAI extension enabled:

Launch with --extensions openai
The OpenAI-compatible API will answer on port 5000

python server.py --extensions openai --model your-model

Jan

Start Jan and enable the Local API Server from settings. The server binds to port 1337 by default with a full OpenAI-compatible API.

Custom servers

Any server on a non-standard port can join the battle by adding an entry to KNOWN_SERVERS in typhon/scanner.py:

{
    "name": "My Server",
    "port": 12345,
    "health": "/health",
    "models": "/v1/models",
}

The port probe and model list fetch are handled automatically.

This site is open source. Improve this page.