UnifiedAI Documentation
UnifiedAI is a universal AI proxy that provides a single, unified API to access multiple AI providers. It supports both OpenAI v1 and Ollama API formats, making it compatible with existing tools and SDKs.
- Single API endpoint for multiple providers
- Fully OpenAI v1 compatible (all parameters supported)
- Ollama API support
- Real-time streaming (SSE)
- Prompt caching for cost & latency optimization
- 200+ AI models from various providers
- Function calling & tool use
Quick Start
Get started with UnifiedAI in under 2 minutes using your preferred programming language.
Python
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="sk-test"  # Any key starting with 'sk-' works
)

response = client.chat.completions.create(
    model="openrouter-deepseek/deepseek-chat",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```
JavaScript/TypeScript
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:9000/v1',
  apiKey: 'sk-test'
});

const response = await client.chat.completions.create({
  model: 'openrouter-qwen/qwen-2.5-72b-instruct:free',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
```
cURL
```bash
curl -X POST http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test" \
  -d '{
    "model": "openrouter-deepseek/deepseek-chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
OpenAI v1 API
UnifiedAI implements the OpenAI v1 API specification, making it compatible with any tool or SDK that supports OpenAI.
Base URL
http://localhost:9000/v1
Endpoints
List Models
GET /v1/models
Returns a list of all available models from all providers.
Response:
```json
{
  "object": "list",
  "data": [
    {
      "id": "openrouter-deepseek/deepseek-chat",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openrouter"
    },
    {
      "id": "zai-glm-4.6",
      "object": "model",
      "created": 1234567890,
      "owned_by": "zai"
    }
  ]
}
```
Create Chat Completion
POST /v1/chat/completions
Creates a chat completion (streaming or non-streaming).
Headers:
| Header | Value | Required |
|---|---|---|
| Authorization | Bearer sk-* | Yes |
| Content-Type | application/json | Yes |
Request Body:
```json
{
  "model": "openrouter-deepseek/deepseek-chat",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 1000
}
```
Parameters:
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID in format: provider-modelname |
| messages | array | Array of message objects |
| stream | boolean | Enable streaming (default: false) |
| temperature | number | Sampling temperature (0.0 - 2.0) |
| max_tokens | integer | Maximum tokens to generate |
| top_p | number | Nucleus sampling parameter |
| frequency_penalty | number | Penalize repetitions (-2.0 to 2.0) |
| presence_penalty | number | Encourage new topics (-2.0 to 2.0) |
| tools | array | Tools/functions for function calling |
| tool_choice | string/object | Control tool selection behavior |
| n | integer | Number of completions to generate |
| stop | string/array | Stop sequences |
| response_format | object | Output format (text/json_object/json_schema) |
UnifiedAI supports all OpenAI v1 API parameters including advanced features like function calling, JSON mode, and prompt caching. Additional parameters like logprobs, seed, and logit_bias are also supported.
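The `tools` parameter follows the OpenAI function-calling schema. A minimal sketch of such a request body, in which `get_weather` and its JSON-schema parameters are hypothetical examples rather than anything UnifiedAI ships:

```python
# Illustrative function-calling request body. The get_weather tool and its
# schema are hypothetical examples, not part of UnifiedAI.
payload = {
    "model": "openrouter-deepseek/deepseek-chat",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
```

POST this payload to `/v1/chat/completions` with the usual `Authorization: Bearer sk-...` header; if the model decides to call the tool, the arguments arrive in the assistant message's `tool_calls` field.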
Ollama API
UnifiedAI also provides Ollama-compatible endpoints, allowing you to use it with tools like Continue.dev, Cursor, and other Ollama clients.
Base URL
http://localhost:9000/api
Endpoints
List Models (Tags)
GET /api/tags
Returns available models in Ollama format.
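Exact fields depend on the deployment, but a response may resemble Ollama's native tags format (the `deepseek-chat` entry is illustrative):

```json
{
  "models": [
    {
      "name": "deepseek-chat",
      "model": "deepseek-chat"
    }
  ]
}
```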
Chat
POST /api/chat
Generate a chat completion in Ollama format.
Request:
```json
{
  "model": "deepseek-chat",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": false
}
```
Generate
POST /api/generate
Generate a completion from a prompt.
Request:
```json
{
  "model": "qwen-2.5-72b",
  "prompt": "Tell me a joke",
  "stream": false
}
```
Version
GET /api/version
Returns the Ollama API version.
Authentication
UnifiedAI uses API key authentication for the OpenAI v1 API.
Any API key starting with `sk-` will be accepted (e.g., `sk-test`, `sk-anything`); this permissive check is intentional for development and testing purposes.
How to Authenticate
Include the API key in the Authorization header:
Authorization: Bearer sk-test
The Ollama API does not require authentication.
OpenRouter Provider
OpenRouter is the primary provider, giving access to 200+ AI models from various vendors.
Provider Name
openrouter
Available Models
OpenRouter provides access to models from:
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
- Meta (Llama)
- Mistral AI
- Qwen
- DeepSeek
- And many more...
Free Models
UnifiedAI pre-configures several free models:
- `openrouter-qwen/qwen-2.5-72b-instruct:free`
- `openrouter-deepseek/deepseek-chat-v3.1:free`
- `openrouter-nvidia/nemotron-nano-9b-v2:free`
- `openrouter-mistralai/devstral-small-2505:free`
- `openrouter-moonshotai/kimi-dev-72b:free`
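Assuming the `/v1/models` response shape shown earlier, the free tier can also be discovered client-side by filtering on the `:free` suffix; `free_models` is an illustrative helper, not part of any SDK:

```python
def free_models(models_response):
    """Return IDs of free-tier models from a /v1/models response."""
    return [m["id"] for m in models_response["data"] if m["id"].endswith(":free")]

# Canned sample in the /v1/models response shape.
sample = {
    "object": "list",
    "data": [
        {"id": "openrouter-qwen/qwen-2.5-72b-instruct:free", "object": "model"},
        {"id": "zai-glm-4.6", "object": "model"},
    ],
}
print(free_models(sample))  # ['openrouter-qwen/qwen-2.5-72b-instruct:free']
```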
Model Verification
UnifiedAI can verify that a model exists on OpenRouter before routing a request, preventing errors caused by non-existent model names.
GLM (Z.ai) Provider
GLM (also known as Z.ai) provides ChatGLM models including GLM-4.6, GLM-4.5, and the Z1 series.
Provider Name
zai
Available Models
- `zai-glm-4.6`
- `zai-glm-4.5`
- `zai-glm-z1-series`
Features
- Real-time streaming support
- Automatic authentication handling
- X-Signature generation for API requests
Streaming
UnifiedAI supports real-time streaming using Server-Sent Events (SSE) for both OpenAI and Ollama APIs.
OpenAI Streaming
Set stream: true in your request:
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:9000/v1',
  apiKey: 'sk-test'
});

const stream = await client.chat.completions.create({
  model: 'openrouter-deepseek/deepseek-chat',
  messages: [{ role: 'user', content: 'Write a poem' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
```
Ollama Streaming
```json
{
  "model": "deepseek-chat",
  "messages": [{"role": "user", "content": "Write a story"}],
  "stream": true
}
```
The response will be sent as newline-delimited JSON (NDJSON).
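A minimal sketch of consuming such an NDJSON stream in Python, assuming each line carries an Ollama-style `message.content` delta and a `done` flag (here fed from canned sample lines rather than a live HTTP response):

```python
import json

def collect_ndjson(lines):
    """Accumulate content deltas from an Ollama-style NDJSON stream."""
    text = []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        text.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# In practice the lines would come from the streaming HTTP response body.
sample = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo!"}, "done": true}',
]
print(collect_ndjson(sample))  # Hello!
```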
Prompt Caching
UnifiedAI supports Prompt Caching, a powerful optimization technique that caches repeated prompt segments to reduce latency and costs.
- ⚡ Lower Latency: Cached content is processed instantly
- 💰 Cost Savings: Up to 90% discount on cached tokens
- 🔋 Better Performance: Process only new content
Supported Providers
| Provider | Support | Cache Duration | Discount |
|---|---|---|---|
| Anthropic Claude (via OpenRouter) | ✅ Excellent | 5 minutes | 90% |
| Google Gemini (via OpenRouter) | ✅ Good | Up to 1 hour | ~80% |
| OpenAI GPT (via OpenRouter) | ❌ Not available | N/A | N/A |
| GLM (Z.ai) | ❌ Not available | N/A | N/A |
UnifiedAI currently supports OpenRouter and GLM (Z.ai) as providers. When using OpenRouter, you get access to models from Anthropic, Google, OpenAI, and many others.
Prompt caching availability depends on the underlying model provider, not OpenRouter itself.
How to Use
Add the cache_control field to messages you want to cache:
```json
{
  "model": "openrouter-anthropic/claude-3.5-sonnet",
  "messages": [
    {
      "role": "system",
      "content": "You are an expert programmer. Here is the complete codebase... [10,000 tokens]",
      "cache_control": {"type": "ephemeral"}
    },
    {
      "role": "user",
      "content": "Explain this function"
    }
  ]
}
```
UnifiedAI accesses Anthropic Claude models through OpenRouter; use the format `openrouter-anthropic/claude-3.5-sonnet`. Prompt caching then works automatically for Claude models served via OpenRouter.
- First request: processes all tokens and creates the cache (normal cost)
- Second request (within 5 min): cache hit! Only the new query is processed (90% discount on cached tokens)
Ideal Use Cases
- Code Assistants: Cache file contents while asking multiple questions
- RAG Systems: Cache large documentation once, query multiple times
- Chatbots: Cache system instructions and policies
Cache Expiration Handling
If the cache expires, the provider automatically recreates it on the next request. No error is raised; recreating the cache simply incurs the normal, uncached price.
- Caching requires an exact content match; even a single-space difference creates a new cache entry
- Minimum cacheable prompt size: 1024 tokens for Anthropic, 32K for Gemini
- Monitor cache hits via `usage.cache_read_input_tokens` in the response
Response with Cache Stats
```json
{
  "id": "msg_123",
  "choices": [...],
  "usage": {
    "prompt_tokens": 1000,
    "completion_tokens": 200,
    "cache_creation_input_tokens": 1000,  // Cache created (miss)
    "cache_read_input_tokens": 0          // Tokens from cache (hit)
  }
}
```
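As an illustration of the 90% discount arithmetic, a rough estimate of effective billed input tokens can be derived from these usage fields; `billed_input_tokens` is a hypothetical helper, and it assumes `prompt_tokens` includes the cached tokens:

```python
def billed_input_tokens(usage, cached_discount=0.9):
    """Estimate effective billed input tokens, assuming cached tokens are
    discounted by `cached_discount` (90% for Anthropic via OpenRouter)."""
    cached = usage.get("cache_read_input_tokens", 0)
    uncached = usage["prompt_tokens"] - cached
    return uncached + cached * (1 - cached_discount)

# Cache hit: 900 of 1000 prompt tokens were served from cache.
usage = {"prompt_tokens": 1000, "completion_tokens": 200,
         "cache_read_input_tokens": 900}
print(round(billed_input_tokens(usage), 2))  # 190.0 (100 uncached + 90 discounted)
```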
For more details, see the Prompt Caching Guide.
Model Naming Convention
UnifiedAI uses a specific naming convention to route requests to the correct provider.
Format
provider-modelname
Examples
| Model ID | Provider | Description |
|---|---|---|
| `openrouter-deepseek/deepseek-chat` | OpenRouter | DeepSeek Chat via OpenRouter |
| `zai-glm-4.6` | GLM (Z.ai) | GLM 4.6 model |
| `openrouter-anthropic/claude-3.5-sonnet` | OpenRouter | Claude 3.5 Sonnet via OpenRouter |
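The convention can be split client-side at the first hyphen; `parse_model_id` is an illustrative helper (UnifiedAI performs this routing server-side):

```python
def parse_model_id(model_id):
    """Split a UnifiedAI model ID into (provider, model name) at the
    first hyphen, per the provider-modelname convention."""
    provider, _, name = model_id.partition("-")
    return provider, name

print(parse_model_id("openrouter-deepseek/deepseek-chat"))
# ('openrouter', 'deepseek/deepseek-chat')
print(parse_model_id("zai-glm-4.6"))
# ('zai', 'glm-4.6')
```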
Ollama Model Names
When using the Ollama API, you can use simplified model names without the provider prefix:
```
# Ollama format
POST /api/chat
{
  "model": "deepseek-chat",  # Automatically routes to provider
  "messages": [...]
}
```
Error Handling
UnifiedAI provides detailed error messages following OpenAI's error format.
Error Response Format
```json
{
  "error": {
    "message": "Error description",
    "type": "error_type"
  }
}
```
Common Error Types
| Error Type | HTTP Status | Description |
|---|---|---|
| invalid_request_error | 400 | Invalid request parameters or missing required fields |
| invalid_request_error | 401 | Missing or invalid API key |
| provider_error | 502 | Error from the upstream provider API |
| server_error | 500 | Internal server error |
Examples
Missing API Key
```json
{
  "error": {
    "message": "Missing or invalid Authorization header",
    "type": "invalid_request_error"
  }
}
```
Invalid Model
```json
{
  "error": {
    "message": "Provider 'unknown' is not supported",
    "type": "invalid_request_error"
  }
}
```
Provider API Error
```json
{
  "error": {
    "message": "OpenRouter API error: 429 - Rate limit exceeded",
    "type": "provider_error"
  }
}
```
- Always check the HTTP status code before parsing the response
- Implement retry logic for 5xx errors
- Handle rate limits (429) with exponential backoff
- Validate model names before making requests
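The retry guidance above can be sketched as a small wrapper. `with_retries`, the delay schedule, and `APIStatusError` are illustrative, assuming the caller's `send` function performs the HTTP call and raises on failing statuses:

```python
import time

RETRYABLE = {429, 500, 502}  # rate limits and server-side errors

class APIStatusError(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def with_retries(send, max_attempts=4, base_delay=1.0):
    """Call send(), retrying retryable statuses with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except APIStatusError as err:
            if err.status not in RETRYABLE or attempt == max_attempts - 1:
                raise  # non-retryable, or out of attempts
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

On the final attempt the exception propagates, so callers can still surface the `error.message` from the response body.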
Support
For issues, feature requests, or questions, please visit our GitHub repository or contact support.
UnifiedAI Documentation • Version 1.1.0 • Last updated: January 2025