AI Endpoints - Using Batch mode

Introduction

AI Endpoints is a serverless platform provided by OVHcloud that offers easy access to a selection of world-renowned, pre-trained AI models.

The Batch API (/v1/batches) is an OpenAI-compatible route that lets you submit a large number of inference requests in a single asynchronous job, instead of sending them one by one through synchronous endpoints such as /v1/chat/completions or /v1/responses.

Batch mode is ideal when you do not need an immediate answer but want to process a high volume of prompts (evaluations, offline labelling, content generation at scale, dataset preparation, etc.) in a cost-efficient, throughput-oriented way. Each batch job runs within a completion window of 24, 48, or 72 hours (48 hours by default). A job that is not completed within its window expires.

AI Endpoints batch mode workflow diagram

Objective

This guide explains the /v1/batches route on AI Endpoints, including:

  • The typical end-to-end batch workflow
  • Preparing an input JSONL file
  • Usage examples in Python, JavaScript, and cURL
  • Retrieving and parsing batch results
  • Known limitations on the platform

Requirements

The examples provided in this guide can be used with one of the following environments:

  • Python
  • JavaScript
  • cURL

For Python, you need an environment with the openai client installed:

pip install openai

Authentication

Examples provided in this guide use the authenticated mode and expect the AI_ENDPOINT_API_KEY environment variable to be set. The anonymous mode is not available with the batch and files endpoints.

To specify your own API key, set it in the environment (export AI_ENDPOINT_API_KEY='your_api_key').

See the AI Endpoints - Getting started guide for authentication details.
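
Because anonymous mode is not available for the batch and files endpoints, it can help to fail early when the key is missing. The following is a minimal sketch: load_api_key is a hypothetical helper name, not part of any SDK, and it only checks that the variable is set, not that the key is valid.

```python
import os


def load_api_key(env_var: str = "AI_ENDPOINT_API_KEY") -> str:
    """Return the API key from the environment or raise a descriptive error."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Hypothetical error message; the platform itself would answer with an HTTP error.
        raise RuntimeError(f"{env_var} is not set; the batch and files endpoints require an API key")
    return api_key
```

The returned value can be passed as api_key when constructing the client.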

How batch mode works

A batch job is processed in four stages:

  1. Prepare a JSONL file where each line describes a single request (model, endpoint, body).
  2. Upload this file to AI Endpoints through the Files API (/v1/files) with purpose="batch".
  3. Create a batch (/v1/batches) referencing the uploaded file and the target endpoint.
  4. Poll the batch status until it is completed, then download the output file (and the error file, if any).

Diagram of the four batch job lifecycle stages

Each line of the input file is processed independently. Successful responses are written to the output file, failed ones to the error file. Both files are retrieved through the Files API.

Preparing the input file (JSONL)

The input file must be in JSON Lines format (.jsonl): one JSON object per line, with no trailing comma and no wrapping array.

Warning

Your file must not exceed 200 MB or contain more than 50,000 entries.
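
A quick local check before uploading can catch oversized inputs early. The sketch below rests on two assumptions that this guide does not confirm: that 200 MB is read conservatively as 200,000,000 bytes, and that every non-empty line counts as one entry. check_batch_file_limits is a hypothetical helper name.

```python
import os

MAX_BYTES = 200_000_000  # assumption: "200 MB" read as decimal megabytes (the stricter reading)
MAX_ENTRIES = 50_000


def check_batch_file_limits(path: str) -> tuple[int, int]:
    """Return (size_in_bytes, entry_count) or raise ValueError if a limit is exceeded."""
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"{path} is {size} bytes, above the limit of {MAX_BYTES} bytes")

    # Count non-empty lines; each one is expected to hold exactly one request.
    with open(path, "r", encoding="utf-8") as f:
        entries = sum(1 for line in f if line.strip())
    if entries > MAX_ENTRIES:
        raise ValueError(f"{path} has {entries} entries, above the limit of {MAX_ENTRIES}")
    return size, entries
```

If a dataset exceeds either limit, it can be split into several input files and submitted as separate batches.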

Each line represents one independent request and must contain the following fields:

Field | Description
custom_id | A unique string you choose. It is echoed back in the output so you can correlate each result to its input.
method | HTTP method of the targeted endpoint, typically POST.
url | The relative path of the inference endpoint to call, for example /v1/chat/completions or /v1/responses.
body | The JSON body that would normally be sent to the synchronous endpoint (same schema as /v1/chat/completions or /v1/responses).
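
Input lines are usually generated rather than written by hand. The following sketch builds one line per prompt with the four fields described above. The helper names (build_request_line, write_requests) and the request-N numbering scheme are illustrative choices, not platform requirements, and the body assumes a chat completion request with a single user message.

```python
import json


def build_request_line(custom_id: str, model: str, prompt: str,
                       url: str = "/v1/chat/completions") -> str:
    """Serialise one batch request as a single JSONL line (no trailing newline)."""
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": url,
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    # Without an indent argument, json.dumps escapes newlines, so the object stays on one line.
    return json.dumps(request, ensure_ascii=False)


def write_requests(path: str, model: str, prompts: list[str]) -> int:
    """Write one request per prompt and return the number of lines written."""
    with open(path, "w", encoding="utf-8") as f:
        for i, prompt in enumerate(prompts, start=1):
            f.write(build_request_line(f"request-{i}", model, prompt) + "\n")
    return len(prompts)
```

For example, write_requests("requests.jsonl", "gpt-oss-20b", prompts) would produce a file with the same structure as the sample shown in this section.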

Example requests.jsonl with two requests:

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Summarise the plot of Hamlet in two sentences."}]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Translate 'Good morning' into French, Spanish and German."}]}}

Info

custom_id values must be unique within a batch. They are the only reliable way to map outputs back to your original inputs, since the order of the output file is not guaranteed.
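
Uniqueness can be verified locally before uploading. The sketch below is one possible check, with find_duplicate_custom_ids as a hypothetical helper name. It assumes every non-empty line is valid JSON containing a custom_id field.

```python
import json
from collections import Counter


def find_duplicate_custom_ids(lines) -> list[str]:
    """Return the custom_id values that appear more than once, in sorted order."""
    counts = Counter(json.loads(line)["custom_id"] for line in lines if line.strip())
    return sorted(cid for cid, n in counts.items() if n > 1)
```

An empty list means every custom_id in the input is unique. The function accepts any iterable of lines, including an open file object.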

Quickstart

The following examples walk through the full lifecycle of a batch: uploading the input file, creating the batch, checking its status, and downloading the results.

1. Upload the input file

Upload the .jsonl file to the Files API with purpose="batch". The response contains a file identifier (e.g. file-abc123) that you will reference when creating the batch.

Python
import os
from openai import OpenAI

api_key = os.environ["AI_ENDPOINT_API_KEY"]  # export AI_ENDPOINT_API_KEY='your_api_key'

client = OpenAI(
    base_url="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1",
    api_key=api_key,
)

# Upload the JSONL input file; the context manager closes the file handle afterwards.
with open("requests.jsonl", "rb") as f:
    batch_input_file = client.files.create(
        file=f,
        purpose="batch",
    )

print(batch_input_file.id)

2. Create the batch

Create a batch by referencing the uploaded file identifier, the target endpoint and the completion window.

Python
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"description": "Evaluation run - April 2026"},
)

print(batch.id, batch.status)

The response contains the batch object with its identifier and an initial status (typically validating).

3. Check the batch status

Batches are asynchronous. Poll the batch object until it reaches a terminal state (completed, failed, expired or cancelled).

Python
import time

# Poll every 30 seconds until the batch reaches a terminal state.
while True:
    current = client.batches.retrieve(batch.id)
    print(current.status, current.request_counts)
    if current.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(30)

A batch object progresses through the following states:

Status | Meaning
validating | The input file is being validated before processing starts.
failed | The input file did not pass validation.
in_progress | The batch is currently being processed.
finalizing | Processing is done. The results are being compiled into the output file.
completed | The batch finished successfully. Output file (and error file if any) are ready.
expired | The batch could not be completed within the requested completion window.
cancelling | A cancellation has been requested and is being applied.
cancelled | The batch has been cancelled by the user.

The request_counts field reports the total, completed, and failed counts for individual requests. It is the easiest way to monitor progress.
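
For unattended scripts, the polling loop can be wrapped in a function that reports progress and gives up after a client-side timeout. This is a minimal sketch under stated assumptions: wait_for_batch is a hypothetical helper, the one-hour default timeout is arbitrary and unrelated to the completion window, and the batch object is assumed to expose status and request_counts (with total, completed, and failed) as described above.

```python
import time
from typing import Any, Callable

TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve: Callable[[], Any], interval: float = 30.0,
                   timeout: float = 3600.0,
                   sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call retrieve() until the batch is in a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        batch = retrieve()
        if batch.status in TERMINAL_STATES:
            return batch
        counts = batch.request_counts
        if counts is not None and counts.total:
            done = counts.completed + counts.failed
            print(f"{batch.status}: {done}/{counts.total} requests processed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch still {batch.status} after {timeout} seconds")
        sleep(interval)
```

With the client from the previous steps, a call could look like wait_for_batch(lambda: client.batches.retrieve(batch.id)). Passing the retrieval as a callable keeps the helper independent of the client and easy to test. A timeout here only stops the local wait; it does not cancel the batch.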

4. Download the results

Once the batch reaches the completed state, the batch object exposes two file identifiers:

  • output_file_id: JSONL file containing the successful responses.
  • error_file_id: JSONL file containing the failed requests (present only when at least one request failed).

Retrieve their content through the Files API:

Python
final = client.batches.retrieve(batch.id)

if final.output_file_id:
    output = client.files.content(final.output_file_id)
    with open("results.jsonl", "wb") as f:
        f.write(output.read())

if final.error_file_id:
    errors = client.files.content(final.error_file_id)
    with open("errors.jsonl", "wb") as f:
        f.write(errors.read())

Info

Output and error files are automatically deleted after 15 days.

Output file format

Each line of the output file is a JSON object matching one input line. It has the following shape (pretty-printed here for readability; in the file, each object occupies a single line):

{
  "id": "batch_req_abc123",
  "custom_id": "request-1",
  "response": {
    "status_code": 200,
    "request_id": "req_...",
    "body": {
      "id": "chatcmpl-...",
      "object": "chat.completion",
      "model": "gpt-oss-20b",
      "choices": [
        {
          "index": 0,
          "message": {"role": "assistant", "content": "..."},
          "finish_reason": "stop"
        }
      ],
      "usage": {"prompt_tokens": 42, "completion_tokens": 128, "total_tokens": 170}
    }
  },
  "error": null
}

The body field mirrors exactly what the synchronous endpoint would have returned for the corresponding request, which means you can reuse the same parsing code you already use for /v1/chat/completions or /v1/responses.

Use the custom_id field to map each response back to your original input, as the order of the output file is not guaranteed.

Failed requests are written to the error file with a populated error object instead of response.body.
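
Since output order is not guaranteed, results are typically loaded into a dictionary keyed by custom_id. The sketch below assumes chat completion responses shaped like the example above. index_results is a hypothetical helper, and because the exact structure of the error object is not documented here, failures are stored as-is rather than parsed.

```python
import json


def index_results(lines):
    """Map custom_id -> assistant text (successes) and custom_id -> error (failures)."""
    answers, failures = {}, {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        if record.get("error") is None and response.get("status_code") == 200:
            message = response["body"]["choices"][0]["message"]
            answers[custom_id] = message["content"]
        else:
            # Keep whatever was reported; no particular error shape is assumed.
            failures[custom_id] = record.get("error") or response.get("body")
    return answers, failures
```

The function accepts any iterable of lines, so an open results.jsonl or errors.jsonl file can be passed directly. Looking up answers by the custom_id values used in the input file then restores the original pairing.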

Listing and cancelling batches

List your batches

Python
for b in client.batches.list(limit=20):
    print(b.id, b.status, b.created_at)

Cancel a batch

A batch can be cancelled while it is in the validating or in_progress state. Already-processed requests remain available in the output file.

Python
client.batches.cancel(batch.id)
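
Because cancellation only applies in the validating and in_progress states, a script may want to check the status first. The following sketch shows one way to do so; cancel_if_possible is a hypothetical helper, and client is assumed to be the OpenAI client created earlier in this guide.

```python
CANCELLABLE_STATES = {"validating", "in_progress"}


def cancel_if_possible(client, batch_id: str) -> bool:
    """Cancel the batch when its state allows it; return True if a cancellation was requested."""
    batch = client.batches.retrieve(batch_id)
    if batch.status not in CANCELLABLE_STATES:
        return False
    client.batches.cancel(batch_id)
    return True
```

The status can change between the check and the cancel call, so the cancel request may still be rejected; callers should be prepared to handle an API error in that case.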

When to use batch mode

Batch mode is a good fit when:

  • You have a large volume of prompts to process (thousands to millions).
  • You do not need a real-time answer: results delivered within the chosen completion window (24 to 72 hours) are acceptable.
  • Your workload is embarrassingly parallel: each request is independent of the others.

Typical use cases include:

  • Dataset annotation and labelling (classification, tagging, summarisation).
  • Offline evaluation of a model or a prompt on a benchmark dataset.
  • Bulk content generation (product descriptions, SEO content, translations).
  • Retrospective enrichment of logs, tickets, or any historical corpus.

For interactive workloads (chat UIs, low-latency tools, real-time agents), prefer the synchronous /v1/chat/completions or /v1/responses routes.

Endpoint limitations

The /v1/batches endpoint is still under development, and some features may not be available yet. If there are specific features you would like us to prioritise, let us know on the OVHcloud Discord server.

  • All requests inside a single batch must target the same endpoint (the one declared at batch creation time).
  • A batch can only reference models that are available in the AI Endpoints catalog.
  • Currently, batch requests are only accepted for LLM and embedding models.
  • Input files must be valid JSONL with unique custom_id values; malformed lines cause the batch to move to the failed state during validation.
  • The completion_window accepts 24h, 48h, and 72h. Batches that cannot be completed within this window transition to expired.
  • Output and error files are subject to the Files API retention policy and are automatically deleted after 15 days. Download them as soon as possible once the batch is completed.
  • Model-specific limitations (context length, structured outputs, function calling, etc.) documented for the synchronous route also apply to the corresponding requests inside a batch.
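
Some of these constraints can be checked locally before a batch is created. The sketch below covers the single-endpoint rule, the accepted completion_window values, and basic JSON validity. check_batch_inputs is a hypothetical helper; it does not replace server-side validation and does not verify model availability.

```python
import json

ALLOWED_WINDOWS = {"24h", "48h", "72h"}


def check_batch_inputs(lines, endpoint: str, completion_window: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if completion_window not in ALLOWED_WINDOWS:
        problems.append(f"unsupported completion_window: {completion_window!r}")
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            request = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(request, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        # Every request must target the endpoint declared at batch creation time.
        if request.get("url") != endpoint:
            problems.append(f"line {number}: url {request.get('url')!r} differs from {endpoint!r}")
    return problems
```

Running this on the input file with the same endpoint and completion_window values that will be passed at batch creation helps avoid a batch moving to the failed state during validation.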

Conclusion

The Batch API provides a cost-efficient, asynchronous way to run large volumes of inference requests on OVHcloud AI Endpoints. By reusing the same request body as the synchronous endpoints, it fits naturally into existing integrations built on top of /v1/chat/completions or /v1/responses.

To maximise success rate, verify supported features for your chosen model in the AI Endpoints catalog, keep custom_id values unique, and always correlate results through custom_id rather than relying on file ordering.

Go further

Explore the full AI Endpoints documentation for more guides and tutorials.

If you need training or technical assistance to implement our solutions, contact your sales representative or visit the Professional Services page to get a quote and ask our Professional Services experts for a custom analysis of your project.

Feedback

Please send us your questions, feedback, and suggestions to improve the service on the OVHcloud Discord server.