
Dedicated Inference

Deploy any open model as a dedicated endpoint

Bring a HuggingFace model or your own container and serve it on dedicated GPUs. Your endpoint is reachable through the same OpenAI-compatible API as platform models and is addressed by a simple model ID.

What you get

Endpoints built for production

Dedicated capacity with the resilience and flexibility real workloads need.

Same unified API

Call your instance through the same OpenAI-compatible API as platform models — just address it with the model ID served_name:instance_id. No new base URL.

HuggingFace or custom containers

Deploy a managed vLLM server from any HuggingFace model, or bring your own container image and start command.

Multi-region routing

Run across regions and let the unified API route each request to a healthy region. Your endpoint is usable as soon as the first region is up.

Multiple LoRA adapters

Serve many fine-tuned adapters on one instance and address each by its own alias (e.g. ft10:1) — no restart to add or swap them.

Sandboxed & isolated

Every instance runs in a security-hardened, network-isolated sandbox.

Adaptive scaling

Autoscaling adjusts capacity to live request volume. Pay per GPU-hour.
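
Because capacity changes with autoscaling, a per-GPU-hour bill is the sum of GPU count times duration over each interval. The following is a rough sketch of that arithmetic only: the function name, the interval format, and the rate are all hypothetical placeholders, not actual pricing or a platform API.

```python
from typing import Iterable, Tuple


def estimate_cost(
    intervals: Iterable[Tuple[int, float]],
    rate_per_gpu_hour: float,
) -> float:
    """Estimate spend from (gpu_count, hours) intervals.

    Each interval is a stretch of time during which the GPU count
    stayed constant, e.g. between two autoscaling events.
    """
    gpu_hours = 0.0
    for gpu_count, hours in intervals:
        if gpu_count < 0 or hours < 0:
            raise ValueError("gpu_count and hours must be non-negative")
        gpu_hours += gpu_count * hours
    return gpu_hours * rate_per_gpu_hour


# Hypothetical usage: 1 GPU for 10 hours, then scaled up to 4 GPUs for 2.5 hours.
usage = [(1, 10.0), (4, 2.5)]
PLACEHOLDER_RATE = 2.0  # placeholder value, not a real price
print(estimate_cost(usage, PLACEHOLDER_RATE))
```

With these placeholder numbers the usage adds up to 20 GPU-hours; substitute the rate shown for your own deployment.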

How you call it

One API, addressed by model ID

Your instance is reachable through the same endpoint as platform models. Set the model field to served_name:instance_id and keep your existing SDK.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ecohash.com/v1",
    api_key="eco_your_api_key",
)

# Address your instance with served_name:instance_id
response = client.chat.completions.create(
    model="qwen3.5-35b-a3b:144",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
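
If you build model IDs programmatically, a small helper can keep the served_name:instance_id format consistent. This is a minimal sketch, not part of any SDK: the helper names are hypothetical, and it assumes the instance ID is a non-negative integer and the served name contains no colon, as the examples on this page suggest.

```python
from typing import Tuple


def build_model_id(served_name: str, instance_id: int) -> str:
    """Join a served name and instance ID into 'served_name:instance_id'."""
    if not served_name or ":" in served_name:
        raise ValueError("served_name must be non-empty and contain no ':'")
    if instance_id < 0:
        raise ValueError("instance_id must be non-negative")
    return f"{served_name}:{instance_id}"


def split_model_id(model_id: str) -> Tuple[str, int]:
    """Split 'served_name:instance_id' back into its two parts."""
    # rpartition splits on the last colon, so the ID is always the suffix.
    served_name, sep, instance = model_id.rpartition(":")
    if not sep or not served_name or not instance.isdecimal():
        raise ValueError(f"not a served_name:instance_id model ID: {model_id!r}")
    return served_name, int(instance)


print(build_model_id("qwen3.5-35b-a3b", 144))
print(split_model_id("ft10:1"))
```

The string returned by build_model_id can be passed as the model argument in the call above. A LoRA adapter alias such as ft10:1 is assumed here to follow the same shape and go in the same field.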

At a glance

The essentials

Base URL: api.ecohash.com/v1
Model ID: served_name:instance_id
Billing: Per GPU-hour

Ship your own models

Deploy a dedicated endpoint and keep the SDK you already use.