Use case · Conversational AI

Chatbots and assistants that stay responsive

Build copilots and customer-facing assistants on open LLMs, with streaming responses, long context, and automatic failover to keep conversations from stalling.

Why it fits

Built for real conversations

The pieces that make chat feel fast and reliable, out of the box.

Long context

Hold extended, multi-turn conversations with large context windows on open LLMs.

Multi-region routing

Requests are routed across regions to a healthy GPU automatically, so conversations stay responsive.

Streaming responses

Stream tokens over SSE for a responsive, typewriter-style experience.

Prefix caching

Repeated prompt prefixes are cached, lowering cost on long multi-turn chats.

OpenAI-compatible

Build on the same SDK and message format you already use.

Multilingual

Serve users in many languages with models like Qwen2.5.
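
As a rough illustration of how the long-context and prefix-caching points above fit together, here is a minimal sketch of an append-only chat history in the OpenAI message format. It assumes the cache matches on an identical leading portion of the prompt, which this page does not spell out; the `Conversation` class and `shares_prefix` helper are hypothetical names for illustration, not part of any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Append-only chat history in the OpenAI message format (illustrative)."""

    system_prompt: str
    messages: list = field(init=False)

    def __post_init__(self):
        # The system prompt stays fixed at the front, so every request shares it.
        self.messages = [{"role": "system", "content": self.system_prompt}]

    def add(self, role: str, content: str) -> None:
        # Only ever append; editing or reordering earlier turns changes the prefix.
        self.messages.append({"role": role, "content": content})

    def request_messages(self) -> list:
        # Return a copy so callers cannot reorder the stored history.
        return list(self.messages)


def shares_prefix(earlier: list, later: list) -> bool:
    """True if `later` starts with every message in `earlier`, unchanged."""
    return later[: len(earlier)] == earlier


chat = Conversation("You are a concise support assistant.")
chat.add("user", "How do I reset my password?")
first_request = chat.request_messages()

chat.add("assistant", "Open Settings, then choose Reset password.")
chat.add("user", "What if I no longer have access to my email?")
second_request = chat.request_messages()

# The second request begins with the first one verbatim.
print(shares_prefix(first_request, second_request))
```

Keeping earlier turns byte-for-byte identical is the design choice that matters here: under the stated assumption, rewriting or summarizing old messages would change the prefix and likely forfeit the cached work.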

Streaming

Token-by-token responses

Set stream=True and render replies as they arrive.

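from openai import OpenAI

# Placeholder endpoint and key: substitute the values from your account.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Tell me about EcoHash"}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)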
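
To carry a streamed reply into the next turn of a conversation, the pieces usually need to be joined back into one assistant message. The sketch below shows one way to do that; `collect_reply` and `fake_chunk` are hypothetical helpers, and the fake chunks only mimic the `choices[0].delta.content` shape of the SDK's streaming objects so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_reply(stream, on_token=None) -> str:
    """Join streamed deltas into the full reply, optionally rendering each piece."""
    parts = []
    for chunk in stream:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        # The delta content can be None or empty, e.g. on the final chunk.
        if text:
            parts.append(text)
            if on_token is not None:
                on_token(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streaming chunk, used only for this demonstration."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [
    fake_chunk("Hel"),
    fake_chunk(None),
    fake_chunk("lo"),
    SimpleNamespace(choices=[]),
]

history = [{"role": "user", "content": "Hi"}]
reply = collect_reply(demo_stream, on_token=lambda t: print(t, end="", flush=True))
print()

# Store the complete reply so the next request includes it as context.
history.append({"role": "assistant", "content": reply})
```

With a real stream, the same function can be passed the object returned by a `stream=True` request, and `on_token` can write to a UI instead of the terminal.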

Build your assistant

Start with an open model and one API key.