Why it fits
Built for real conversations
The pieces that make chat feel fast and reliable, out of the box.
Long context
Hold extended, multi-turn conversations with large context windows on open LLMs.
Multi-region routing
Requests are routed across regions to a healthy GPU automatically, so conversations stay responsive.
Streaming responses
Stream tokens over Server-Sent Events (SSE) for a responsive, typewriter-style experience.
Prefix caching
Repeated prompt prefixes are cached, lowering cost on long multi-turn chats.
OpenAI-compatible
Build on the same OpenAI SDK and message format you already use.
Multilingual
Serve users in many languages with models like Qwen2.5.
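For multi-turn chats, a stable prefix means each request starts with the same messages as the one before it. Here is a minimal sketch of one way to build history so earlier turns are never rewritten. The helper names and the system prompt are illustrative assumptions, not part of the service, and whether a given request actually hits the cache depends on the backend.

```python
# Hypothetical helpers: names and the system prompt are illustrative only.
SYSTEM_PROMPT = "You are a helpful assistant."


def new_conversation():
    """Start a history with a fixed system message as the shared prefix."""
    return [{"role": "system", "content": SYSTEM_PROMPT}]


def add_turn(history, role, content):
    """Return a new list with one message appended; earlier messages are untouched."""
    return history + [{"role": role, "content": content}]


def shares_prefix(previous, current):
    """True if `current` begins with every message in `previous`."""
    return current[:len(previous)] == previous


history = new_conversation()
turn1 = add_turn(history, "user", "Hello")
turn1_done = add_turn(turn1, "assistant", "Hi, how can I help?")
turn2 = add_turn(turn1_done, "user", "Tell me about EcoHash")

# The second request repeats the first request's messages verbatim at the start.
print(shares_prefix(turn1, turn2))
```

Appending only, rather than editing or reordering earlier messages, keeps the repeated part of the prompt identical from turn to turn.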
Streaming
Token-by-token responses
Set stream=True and render replies as they arrive.
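from openai import OpenAI

# Assumption: OPENAI_API_KEY and OPENAI_BASE_URL are set to this service's key and endpoint.
client = OpenAI()

stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Tell me about EcoHash"}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; print only actual text.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)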
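If you also want the full reply once streaming finishes, for example to append it to the conversation history, the deltas can be joined as they arrive. The sketch below assumes chunks shaped like the ones in the loop above (`chunk.choices[0].delta.content`). `collect_reply` and `fake_chunk` are hypothetical helpers, and the stand-in chunks let it run without a network call.

```python
from types import SimpleNamespace


def collect_reply(chunks):
    """Join the text deltas from a stream of chat completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, used only for this demonstration."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Eco"), fake_chunk(None), fake_chunk("Hash"), SimpleNamespace(choices=[])]
reply = collect_reply(demo)
print(reply)  # EcoHash
```

With a real stream, pass the `stream` object in place of `demo`.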