UnifiedAI Documentation
UnifiedAI is a universal AI proxy that provides a single, unified API to access multiple AI providers. It supports both OpenAI v1 and Ollama API formats, making it compatible with existing tools and SDKs.
- Single API endpoint for multiple providers
- Fully OpenAI v1 compatible (all parameters supported)
- Ollama API support
- Real-time streaming (SSE)
- Prompt caching for cost & latency optimization
- 200+ AI models from various providers
- Function calling & tool use
Quick Start
Get started with UnifiedAI in under 2 minutes using your preferred programming language.
Python
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",
    api_key="sk-test"  # Any key starting with 'sk-' works
)

response = client.chat.completions.create(
    model="openrouter-deepseek/deepseek-chat",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```
JavaScript/TypeScript
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:9000/v1',
  apiKey: 'sk-test'
});

const response = await client.chat.completions.create({
  model: 'openrouter-qwen/qwen-2.5-72b-instruct:free',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
```
cURL
```bash
curl -X POST http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-test" \
  -d '{
    "model": "openrouter-deepseek/deepseek-chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
OpenAI v1 API
UnifiedAI implements the OpenAI v1 API specification, making it compatible with any tool or SDK that supports OpenAI.
Base URL
http://localhost:9000/v1
Endpoints
List Models
GET /v1/models
Returns a list of all available models from all providers.
Response:
```json
{
  "object": "list",
  "data": [
    {
      "id": "openrouter-deepseek/deepseek-chat",
      "object": "model",
      "created": 1234567890,
      "owned_by": "openrouter"
    },
    {
      "id": "zai-glm-4.6",
      "object": "model",
      "created": 1234567890,
      "owned_by": "zai"
    }
  ]
}
```
Create Chat Completion
POST /v1/chat/completions
Creates a chat completion (streaming or non-streaming).
Headers:
| Header | Value | Required |
|---|---|---|
| Authorization | Bearer sk-* | Yes |
| Content-Type | application/json | Yes |
Request Body:
```json
{
  "model": "openrouter-deepseek/deepseek-chat",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 1000
}
```
Parameters:
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID in format: provider-modelname |
| messages | array | Array of message objects |
| stream | boolean | Enable streaming (default: false) |
| temperature | number | Sampling temperature (0.0 - 2.0) |
| max_tokens | integer | Maximum tokens to generate |
| top_p | number | Nucleus sampling parameter |
| frequency_penalty | number | Penalize repetitions (-2.0 to 2.0) |
| presence_penalty | number | Encourage new topics (-2.0 to 2.0) |
| tools | array | Tools/functions for function calling |
| tool_choice | string/object | Control tool selection behavior |
| n | integer | Number of completions to generate |
| stop | string/array | Stop sequences |
| response_format | object | Output format (text/json_object/json_schema) |
UnifiedAI supports all OpenAI v1 API parameters including advanced features like function calling, JSON mode, and prompt caching. Additional parameters like logprobs, seed, and logit_bias are also supported.
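The `tools` parameter follows the OpenAI function-calling schema. A minimal sketch of such a request body, in which `get_weather` and its JSON-schema parameters are hypothetical examples rather than anything UnifiedAI ships:

```python
# Illustrative function-calling request body. The get_weather tool and its
# schema are hypothetical examples, not part of UnifiedAI.
payload = {
    "model": "openrouter-deepseek/deepseek-chat",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
```

POST this payload to `/v1/chat/completions` with the usual `Authorization: Bearer sk-...` header; if the model decides to call the tool, the arguments arrive in the assistant message's `tool_calls` field.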
Ollama API
UnifiedAI also provides Ollama-compatible endpoints, allowing you to use it with tools like Continue.dev, Cursor, and other Ollama clients.
Base URL
http://localhost:9000/api
Endpoints
List Models (Tags)
GET /api/tags
Returns available models in Ollama format.
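Exact fields depend on the deployment, but a response may resemble Ollama's native tags format (the `deepseek-chat` entry is illustrative):

```json
{
  "models": [
    {
      "name": "deepseek-chat",
      "model": "deepseek-chat"
    }
  ]
}
```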
Chat
POST /api/chat
Generate a chat completion in Ollama format.
Request:
```json
{
  "model": "deepseek-chat",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": false
}
```
Generate
POST /api/generate
Generate a completion from a prompt.
Request:
```json
{
  "model": "qwen-2.5-72b",
  "prompt": "Tell me a joke",
  "stream": false
}
```
Version
GET /api/version
Returns the Ollama API version.
Authentication
UnifiedAI uses API key authentication for the OpenAI v1 API.
Any API key starting with `sk-` will be accepted (e.g., `sk-test`, `sk-anything`); this permissive check is intentional for development and testing purposes.
How to Authenticate
Include the API key in the Authorization header:
Authorization: Bearer sk-test
The Ollama API does not require authentication.
OpenRouter Provider
OpenRouter is the primary provider, giving access to 200+ AI models from various vendors.
Provider Name
openrouter
Available Models
OpenRouter provides access to models from:
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
- Meta (Llama)
- Mistral AI
- Qwen
- DeepSeek
- And many more...
Free Models
UnifiedAI pre-configures several free models:
- `openrouter-qwen/qwen-2.5-72b-instruct:free`
- `openrouter-deepseek/deepseek-chat-v3.1:free`
- `openrouter-nvidia/nemotron-nano-9b-v2:free`
- `openrouter-mistralai/devstral-small-2505:free`
- `openrouter-moonshotai/kimi-dev-72b:free`
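Assuming the `/v1/models` response shape shown earlier, the free tier can also be discovered client-side by filtering on the `:free` suffix; `free_models` is an illustrative helper, not part of any SDK:

```python
def free_models(models_response):
    """Return IDs of free-tier models from a /v1/models response."""
    return [m["id"] for m in models_response["data"] if m["id"].endswith(":free")]

# Canned sample in the /v1/models response shape.
sample = {
    "object": "list",
    "data": [
        {"id": "openrouter-qwen/qwen-2.5-72b-instruct:free", "object": "model"},
        {"id": "zai-glm-4.6", "object": "model"},
    ],
}
print(free_models(sample))  # ['openrouter-qwen/qwen-2.5-72b-instruct:free']
```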
Model Verification
UnifiedAI can verify that a model exists on OpenRouter before routing a request, preventing errors caused by non-existent model names.
GLM (Z.ai) Provider
GLM (also known as Z.ai) provides ChatGLM models including GLM-4.6, GLM-4.5, and the Z1 series.
Provider Name
zai
Available Models
- `zai-glm-4.6`
- `zai-glm-4.5`
- `zai-glm-z1-series`
Features
- Real-time streaming support
- Automatic authentication handling
- X-Signature generation for API requests
Streaming
UnifiedAI supports real-time streaming using Server-Sent Events (SSE) for both OpenAI and Ollama APIs.
OpenAI Streaming
Set stream: true in your request:
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:9000/v1',
  apiKey: 'sk-test'
});

const stream = await client.chat.completions.create({
  model: 'openrouter-deepseek/deepseek-chat',
  messages: [{ role: 'user', content: 'Write a poem' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
```
Ollama Streaming
```json
{
  "model": "deepseek-chat",
  "messages": [{"role": "user", "content": "Write a story"}],
  "stream": true
}
```
The response will be sent as newline-delimited JSON (NDJSON).
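A minimal sketch of consuming such an NDJSON stream in Python, assuming each line carries an Ollama-style `message.content` delta and a `done` flag (here fed from canned sample lines rather than a live HTTP response):

```python
import json

def collect_ndjson(lines):
    """Accumulate content deltas from an Ollama-style NDJSON stream."""
    text = []
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        text.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# In practice the lines would come from the streaming HTTP response body.
sample = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo!"}, "done": true}',
]
print(collect_ndjson(sample))  # Hello!
```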
Prompt Caching
UnifiedAI supports Prompt Caching, a powerful optimization technique that caches repeated prompt segments to reduce latency and costs.
- ⚡ Lower Latency: Cached content is processed instantly
- 💰 Cost Savings: Up to 90% discount on cached tokens
- 🔋 Better Performance: Process only new content
Supported Providers
| Provider | Support | Cache Duration | Discount |
|---|---|---|---|
| Anthropic Claude (via OpenRouter) | ✅ Excellent | 5 minutes | 90% |
| Google Gemini (via OpenRouter) | ✅ Good | Up to 1 hour | ~80% |
| OpenAI GPT (via OpenRouter) | ❌ Not available | N/A | N/A |
| GLM (Z.ai) | ❌ Not available | N/A | N/A |
UnifiedAI currently supports OpenRouter and GLM (Z.ai) as providers. When using OpenRouter, you get access to models from Anthropic, Google, OpenAI, and many others.
Prompt caching availability depends on the underlying model provider, not OpenRouter itself.
How to Use
Add the cache_control field to messages you want to cache:
```json
{
  "model": "openrouter-anthropic/claude-3.5-sonnet",
  "messages": [
    {
      "role": "system",
      "content": "You are an expert programmer. Here is the complete codebase... [10,000 tokens]",
      "cache_control": {"type": "ephemeral"}
    },
    {
      "role": "user",
      "content": "Explain this function"
    }
  ]
}
```
UnifiedAI accesses Anthropic Claude models through OpenRouter; use the format `openrouter-anthropic/claude-3.5-sonnet`. Prompt caching then works automatically for Claude models served via OpenRouter.
- First request: processes all tokens and creates the cache (normal cost)
- Second request (within 5 min): cache hit! Only the new query is processed (90% discount on cached tokens)
Ideal Use Cases
- Code Assistants: Cache file contents while asking multiple questions
- RAG Systems: Cache large documentation once, query multiple times
- Chatbots: Cache system instructions and policies
Cache Expiration Handling
If the cache expires, the provider automatically recreates it on the next request. No error is raised; recreating the cache simply incurs the normal, uncached price.
- Caching requires an exact content match; even a single-space difference creates a new cache entry
- Minimum cacheable prompt size: 1024 tokens for Anthropic, 32K for Gemini
- Monitor cache hits via `usage.cache_read_input_tokens` in the response
Response with Cache Stats
```json
{
  "id": "msg_123",
  "choices": [...],
  "usage": {
    "prompt_tokens": 1000,
    "completion_tokens": 200,
    "cache_creation_input_tokens": 1000,  // Cache created (miss)
    "cache_read_input_tokens": 0          // Tokens from cache (hit)
  }
}
```
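As an illustration of the 90% discount arithmetic, a rough estimate of effective billed input tokens can be derived from these usage fields; `billed_input_tokens` is a hypothetical helper, and it assumes `prompt_tokens` includes the cached tokens:

```python
def billed_input_tokens(usage, cached_discount=0.9):
    """Estimate effective billed input tokens, assuming cached tokens are
    discounted by `cached_discount` (90% for Anthropic via OpenRouter)."""
    cached = usage.get("cache_read_input_tokens", 0)
    uncached = usage["prompt_tokens"] - cached
    return uncached + cached * (1 - cached_discount)

# Cache hit: 900 of 1000 prompt tokens were served from cache.
usage = {"prompt_tokens": 1000, "completion_tokens": 200,
         "cache_read_input_tokens": 900}
print(round(billed_input_tokens(usage), 2))  # 190.0 (100 uncached + 90 discounted)
```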
For more details, see the Prompt Caching Guide.
Model Naming Convention
UnifiedAI uses a specific naming convention to route requests to the correct provider.
Format
provider-modelname
Examples
| Model ID | Provider | Description |
|---|---|---|
| `openrouter-deepseek/deepseek-chat` | OpenRouter | DeepSeek Chat via OpenRouter |
| `zai-glm-4.6` | GLM (Z.ai) | GLM 4.6 model |
| `openrouter-anthropic/claude-3.5-sonnet` | OpenRouter | Claude 3.5 Sonnet via OpenRouter |
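The convention can be split client-side at the first hyphen; `parse_model_id` is an illustrative helper (UnifiedAI performs this routing server-side):

```python
def parse_model_id(model_id):
    """Split a UnifiedAI model ID into (provider, model name) at the
    first hyphen, per the provider-modelname convention."""
    provider, _, name = model_id.partition("-")
    return provider, name

print(parse_model_id("openrouter-deepseek/deepseek-chat"))
# ('openrouter', 'deepseek/deepseek-chat')
print(parse_model_id("zai-glm-4.6"))
# ('zai', 'glm-4.6')
```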
Ollama Model Names
When using the Ollama API, you can use simplified model names without the provider prefix:
```
# Ollama format
POST /api/chat
{
  "model": "deepseek-chat",  # Automatically routes to provider
  "messages": [...]
}
```
Error Handling
UnifiedAI provides detailed error messages following OpenAI's error format.
Error Response Format
```json
{
  "error": {
    "message": "Error description",
    "type": "error_type"
  }
}
```
Common Error Types
| Error Type | HTTP Status | Description |
|---|---|---|
| invalid_request_error | 400 | Invalid request parameters or missing required fields |
| invalid_request_error | 401 | Missing or invalid API key |
| provider_error | 502 | Error from the upstream provider API |
| server_error | 500 | Internal server error |
Examples
Missing API Key
```json
{
  "error": {
    "message": "Missing or invalid Authorization header",
    "type": "invalid_request_error"
  }
}
```
Invalid Model
```json
{
  "error": {
    "message": "Provider 'unknown' is not supported",
    "type": "invalid_request_error"
  }
}
```
Provider API Error
```json
{
  "error": {
    "message": "OpenRouter API error: 429 - Rate limit exceeded",
    "type": "provider_error"
  }
}
```
- Always check the HTTP status code before parsing the response
- Implement retry logic for 5xx errors
- Handle rate limits (429) with exponential backoff
- Validate model names before making requests
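The retry guidance above can be sketched as a small wrapper. `with_retries`, the delay schedule, and `APIStatusError` are illustrative, assuming the caller's `send` function performs the HTTP call and raises on failing statuses:

```python
import time

RETRYABLE = {429, 500, 502}  # rate limits and server-side errors

class APIStatusError(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def with_retries(send, max_attempts=4, base_delay=1.0):
    """Call send(), retrying retryable statuses with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except APIStatusError as err:
            if err.status not in RETRYABLE or attempt == max_attempts - 1:
                raise  # non-retryable, or out of attempts
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

On the final attempt the exception propagates, so callers can still surface the `error.message` from the response body.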
Support
For issues, feature requests, or questions, please visit our GitHub repository or contact support.
UnifiedAI Documentation • Version 1.1.0 • Last updated: January 2025