Production LLM endpoints aligned with the NuStudio stack. Use the base URLs below from browsers, IDEs, and automation.
| Role | Base URL |
|---|---|
| General instruct (prod-01) | https://models.nuscale.io/v1 |
| Coder / agents (prod-02, vLLM) | https://models.nuscale.io/coder/v1 |
Set OPENAI_BASE_URL or client base_url to one of the above (keep the /v1 segment).
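As a minimal sketch of how a script might resolve the base URL, the snippet below reads OPENAI_BASE_URL from the environment and appends an endpoint path while keeping the /v1 segment. The helper names `resolve_base_url` and `endpoint_url`, the fallback default, and the strict /v1 check are illustrative assumptions, not part of the NuStudio stack.

```python
import os

# Assumed fallback for illustration: the general instruct base URL listed above.
DEFAULT_BASE_URL = "https://models.nuscale.io/v1"


def resolve_base_url(env=None):
    """Return the base URL from OPENAI_BASE_URL, or the assumed default."""
    env = os.environ if env is None else env
    base = env.get("OPENAI_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    # Guard against a base URL that dropped the /v1 segment.
    if not base.endswith("/v1"):
        raise ValueError(f"base URL must end with /v1, got {base!r}")
    return base


def endpoint_url(base_url, path):
    """Join a base URL and an endpoint path such as 'chat/completions'."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"
```

For example, `endpoint_url(resolve_base_url(), "models")` would give the models listing URL for whichever pool the environment variable points at.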
OpenAI-compatible API (coder): set OPENAI_BASE_URL=https://models.nuscale.io/coder/v1 and use model qwen2.5-coder:7b. vLLM returns structured tool_calls for Qwen Code / Cursor agents, whereas Ollama often prints the tool JSON in content instead.
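To illustrate the difference between structured tool_calls and tool JSON printed in content, here is a rough sketch of a request payload that offers one tool and a parser for the structured reply. It assumes the standard OpenAI chat completions shape. The `read_file` tool and both helper names are hypothetical, chosen only for illustration.

```python
import json


def build_tool_request(model, prompt):
    """Build a chat completions payload that offers one hypothetical tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "read_file",  # hypothetical tool, for illustration only
                "description": "Read a text file and return its contents.",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }],
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from a chat completions response dict."""
    message = response["choices"][0]["message"]
    calls = []
    # tool_calls is absent or null when the model answers in plain content.
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```

An empty list from `extract_tool_calls` means the reply carried no structured calls, which is the case an agent would see if the tool JSON had been printed in content instead.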
| Served name (model field) | Weights |
|---|---|
| qwen2.5-7b-instruct | Qwen2.5-7B-Instruct-AWQ |
| qwen2.5-coder:7b | Qwen2.5-Coder-7B-Instruct-AWQ (vLLM, tool parser enabled) |
Run GET …/v1/models on each base URL for live limits and metadata.
Instruct pool
curl -sS https://models.nuscale.io/v1/models | python3 -m json.tool
Coder (prod-02)
curl -sS https://models.nuscale.io/coder/v1/models | python3 -m json.tool
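The same listing can be read from a script. The sketch below assumes the OpenAI-style list response (`{"object": "list", "data": [...]}`). It also assumes that a vLLM-backed pool may report a `max_model_len` field per model, so that field is read with a fallback of `None`. The function names are illustrative.

```python
import json
import urllib.request


def fetch_models(base_url, timeout=30):
    """GET <base_url>/models and return the decoded JSON payload."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_models(payload):
    """Map each served model id to its max_model_len, or None if not reported."""
    return {m["id"]: m.get("max_model_len") for m in payload.get("data", [])}
```

Calling `summarize_models(fetch_models("https://models.nuscale.io/coder/v1"))` would return a dictionary keyed by served name, which is a quick way to confirm the model field to send.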
General instruct
curl -sS https://models.nuscale.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7
  }'
Coder (prod-02)
curl -sS https://models.nuscale.io/coder/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "Write a Python function that returns the sum of two integers."}],
    "max_tokens": 512,
    "temperature": 0.3
  }'
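For automation, the curl calls above can be mirrored in Python using only the standard library. This is a sketch under the assumption that the endpoints return the usual OpenAI chat completions shape and need no extra headers beyond Content-Type, as in the curl examples. The helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=512, temperature=0.3):
    """Build a POST request for <base_url>/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_text(response):
    """Extract the assistant text from a chat completions response dict."""
    return response["choices"][0]["message"]["content"]


def chat(base_url, model, prompt, timeout=60):
    """Send one user message and return the assistant reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return reply_text(json.load(resp))
```

A call such as `chat("https://models.nuscale.io/coder/v1", "qwen2.5-coder:7b", "Write a Python function that returns the sum of two integers.")` corresponds to the coder curl example.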
/docs, /health, and OpenAPI are available on this host. The coder route is served by vLLM (/v1/*) so tools/agents receive proper tool_calls responses.