Inference gateway

models.nuscale.io OpenAI-compatible API · vLLM

Production LLM endpoints aligned with the NuStudio stack. Use the base URLs below from browsers, IDEs, and automation.

Base URLs

Role	Base URL
General instruct (prod-01)	`https://models.nuscale.io/v1`
Coder / agents (prod-02, vLLM)	`https://models.nuscale.io/coder/v1`

Set the OPENAI_BASE_URL environment variable, or your client's base_url option, to one of the URLs above. Keep the /v1 segment.
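The /v1 segment matters because OpenAI-style clients append endpoint paths such as models or chat/completions directly to the base URL. A minimal Python sketch of that joining follows. The helper name endpoint_url is hypothetical (it is not part of any client library), and falling back to the general instruct URL when OPENAI_BASE_URL is unset is an assumption made for illustration.

```python
import os
from typing import Optional

# Fallback is the general instruct base URL from the table above
# (choosing it as the default is an assumption of this sketch).
DEFAULT_BASE_URL = "https://models.nuscale.io/v1"


def endpoint_url(path: str, base_url: Optional[str] = None) -> str:
    """Join a base URL (including its /v1 segment) with an endpoint path.

    Hypothetical helper for illustration. Plain string joining is used
    because urllib.parse.urljoin would replace the trailing "v1" segment
    when the base URL has no trailing slash.
    """
    base = base_url or os.environ.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    return base.rstrip("/") + "/" + path.lstrip("/")
```

For example, endpoint_url("models", "https://models.nuscale.io/coder/v1") yields https://models.nuscale.io/coder/v1/models, the same URL used to list the coder models.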

OpenAI API (coder): Set OPENAI_BASE_URL=https://models.nuscale.io/coder/v1 and use the model qwen2.5-coder:7b. vLLM returns structured tool_calls in the response message, which agents such as Qwen Code and Cursor rely on. Ollama often prints the tool JSON as plain text in content instead.
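To make the difference between structured tool_calls and tool JSON in content concrete, here is a rough Python sketch of the payload and the parsing side. The read_file tool, the helper names, and the choice to offer exactly one tool are all hypothetical. The field layout follows the OpenAI chat-completions format, which this endpoint is described as compatible with.

```python
import json
from typing import Any, Dict, List

# Hypothetical tool definition in the OpenAI "tools" format.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_tool_request(prompt: str) -> Dict[str, Any]:
    """Build a chat-completions payload that offers one tool to the coder model."""
    return {
        "model": "qwen2.5-coder:7b",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
    }


def extract_tool_calls(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return structured tool calls as {"name", "arguments"} dicts.

    Returns an empty list when the message has no tool_calls field, which
    is also the case when a backend writes the tool JSON into "content".
    """
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        {
            "name": call["function"]["name"],
            # In the OpenAI format, arguments arrive as a JSON-encoded string.
            "arguments": json.loads(call["function"]["arguments"]),
        }
        for call in calls
    ]
```

An agent loop would POST the result of build_tool_request to the coder chat/completions endpoint and dispatch on what extract_tool_calls returns. An empty list means the reply should be treated as ordinary text.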

Models

Served name (`model` field)	Weights
`qwen2.5-7b-instruct`	Qwen2.5-7B-Instruct-AWQ
`qwen2.5-coder:7b`	Qwen2.5-Coder-7B-Instruct-AWQ (vLLM, tool parser enabled)

Run GET …/v1/models on each base URL for live limits and metadata.

List models

Instruct pool

curl -sS https://models.nuscale.io/v1/models | python3 -m json.tool

Coder (prod-02)

curl -sS https://models.nuscale.io/coder/v1/models | python3 -m json.tool
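The same listing can be done from Python with only the standard library. This sketch assumes the response follows the OpenAI model-list shape (a data array whose entries carry an id), and the helper names are hypothetical. No authentication header is sent, mirroring the curl commands above.

```python
import json
import urllib.request
from typing import Any, Dict, List


def model_ids(payload: Dict[str, Any]) -> List[str]:
    """Extract served model names from an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """GET <base_url>/models and return the decoded JSON body."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling model_ids(fetch_models("https://models.nuscale.io/coder/v1")) should return the served names to use in the model field, provided the endpoint is reachable from where the script runs.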

Chat completions

General instruct

curl -sS https://models.nuscale.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7
  }'

Coder (prod-02)

curl -sS https://models.nuscale.io/coder/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "Write a Python function that returns the sum of two integers."}],
    "max_tokens": 512,
    "temperature": 0.3
  }'
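For automation, the curl calls above translate to Python roughly as follows. This is a sketch using only the standard library. The function names are hypothetical, the defaults copy the general instruct example, and reading the reply from choices[0].message.content assumes the standard OpenAI response shape.

```python
import json
import urllib.request
from typing import Any, Dict


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 128,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build the same POST request as the curl examples, without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_text(response: Dict[str, Any]) -> str:
    """Return the assistant text from the first choice (empty if absent)."""
    return response["choices"][0]["message"].get("content") or ""


def chat(base_url: str, model: str, prompt: str,
         timeout: float = 60.0, **options: Any) -> str:
    """Send one user message and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt, **options)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return reply_text(json.load(resp))
```

The coder example corresponds to chat("https://models.nuscale.io/coder/v1", "qwen2.5-coder:7b", "Write a Python function that returns the sum of two integers.", max_tokens=512, temperature=0.3).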

Notes

The instruct backend also serves /docs, /health, and its OpenAPI schema on this host.
The coder backend is vLLM (OpenAI-compatible /v1/*), so tools and agents receive structured tool_calls in responses.
On the LAN, the same paths are available on prod-01 port 80, provided nginx is configured there.