OpenAI Python Streaming Tutorial: Real-Time AI Responses

Introduction

Streaming transforms the user experience of AI applications by delivering tokens as they're generated. Instead of waiting 5-10 seconds for a complete response, users see text appear token by token, creating a responsive feel similar to ChatGPT. This guide covers streaming with the OpenAI Python SDK through AI API Hub.

What is Streaming?

Standard API calls return the complete response in one block. Streaming uses Server-Sent Events (SSE) to deliver tokens incrementally:

Standard:  Request → Wait → Full Response (5-10s)
Streaming: Request → T-o-k-e-n---b-y---t-o-k-e-n--- (instant first token)
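
To make the wire format concrete, here is a minimal sketch of roughly what a client library does when it reads a streamed chat completion. It assumes the usual OpenAI-compatible framing, where each event is a `data: <json>` line and the stream ends with `data: [DONE]`. The function name `iter_sse_content` and the sample payload are illustrative and not part of the SDK.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from raw SSE lines of a streamed chat completion."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel that marks the end of the stream
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                yield content


# Hand-written sample of what arrives on the wire (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```

In practice the SDK handles this parsing for you; the sketch is only meant to show why the first token can be displayed before the response is complete.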

Quick Setup

import openai

client = openai.OpenAI(
    api_key="your-api-key",
    base_url="https://api.apiyihe.org/v1"
)

Basic Streaming Example

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in detail."}
    ],
    stream=True,
    stream_options={"include_usage": True}
)

collected_content = []
usage = None

for chunk in stream:
    # Content chunks carry a text fragment in choices[0].delta.content
    if chunk.choices and chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        print(content, end="", flush=True)
        collected_content.append(content)
    # With include_usage, the final chunk has empty choices and a usage object
    if chunk.usage:
        usage = chunk.usage

full_response = "".join(collected_content)
if usage:
    print(f"\n\nTokens used: {usage.total_tokens}")

Building a Streaming Chat Application

import openai

class StreamingChat:
    def __init__(self, api_key: str, model: str = "gpt-4o"):
        self.client = openai.OpenAI(
            api_key=api_key,
            base_url="https://api.apiyihe.org/v1"
        )
        self.model = model
        self.messages = []

    def chat(self, user_input: str):
        self.messages.append({"role": "user", "content": user_input})
        response_text = []

        stream = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=True
        )

        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                yield content
                response_text.append(content)

        full_response = "".join(response_text)
        self.messages.append({"role": "assistant", "content": full_response})

# Usage
chat = StreamingChat("your-api-key")
for token in chat.chat("Write a Python function to sort a list"):
    print(token, end="", flush=True)

Handling Streaming Errors

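import time

import openai

# Uses the `client` created in Quick Setup
def stream_with_retry(messages, max_retries=3):
    """Stream a completion, retrying only if it fails before any text was yielded."""
    for attempt in range(max_retries):
        yielded = False
        try:
            stream = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                stream=True
            )
            for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    yielded = True
                    yield chunk.choices[0].delta.content
            return  # Success
        except openai.APIError as e:
            # Retrying after partial output would restart the response and
            # duplicate text the caller has already received, so re-raise.
            if yielded or attempt == max_retries - 1:
                raise
            print(f"Retry {attempt + 1}: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...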

Streaming with Function Calling

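import json

# Uses the current "tools" format; the older "functions" parameter is deprecated
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
}]

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    stream=True
)

# The function name arrives once, but the JSON arguments arrive as string
# fragments spread over many chunks, so accumulate them per tool-call index.
calls = {}
for chunk in stream:
    if not chunk.choices:
        continue
    for tool_call in chunk.choices[0].delta.tool_calls or []:
        call = calls.setdefault(tool_call.index, {"name": "", "arguments": ""})
        if tool_call.function and tool_call.function.name:
            call["name"] = tool_call.function.name
        if tool_call.function and tool_call.function.arguments:
            call["arguments"] += tool_call.function.arguments

# Parse the arguments only after the stream has finished
for call in calls.values():
    print(f"Function: {call['name']}")
    print(f"Args: {json.loads(call['arguments'] or '{}')}")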

Node.js Streaming Example

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "your-api-key",
  baseURL: "https://api.apiyihe.org/v1",
});

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Tell me about AI" }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || "";
  process.stdout.write(content);
}

Common Streaming Issues

| Issue | Cause | Fix |
|---|---|---|
| Chunks arrive delayed or in bursts | Buffering by a proxy or web server | Disable response buffering so SSE events pass through as they are sent |
| Missing final chunk | Connection drops | Set a timeout; check finish_reason |
| stream_options not working | Old SDK | Upgrade to a recent openai release (stream_options was added after 1.0) |
| Stream stops mid-response | Token limit | Increase max_tokens |
| Empty chunks | content_filter | Check finish_reason for "content_filter" |

Pricing Impact

Streaming uses the same pricing as standard requests. Input and output tokens are metered identically. The include_usage option helps track exact token consumption per stream.
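
As a rough illustration of tracking consumption, the sketch below sums the usage objects collected from several streams. It assumes each usage object exposes `prompt_tokens` and `completion_tokens`, as the OpenAI chat completions API does. The `UsageTotals` class is hypothetical and the token counts are made up.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    """Running token totals across several streamed requests."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, usage) -> None:
        # `usage` is the object taken from the final chunk; it is None when
        # include_usage was not requested or the stream ended early.
        if usage is None:
            return
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


totals = UsageTotals()
# Stand-ins for chunk.usage values; the numbers are invented for illustration.
totals.add(SimpleNamespace(prompt_tokens=25, completion_tokens=310))
totals.add(SimpleNamespace(prompt_tokens=40, completion_tokens=120))
totals.add(None)  # a stream that returned no usage is skipped
print(totals.total_tokens)
```

Keeping input and output counts separate matters because the two are typically priced differently.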

FAQ

Q: Does streaming reduce latency? A: It reduces perceived latency. Time-to-first-token (TTFT) is typically 200-500ms, versus a 5-10s wait for a complete response. The total generation time is about the same.

Q: Can I stream with DeepSeek via AI API Hub? A: Yes. All models support streaming through the same OpenAI-compatible API.

Q: What's the maximum stream duration? A: No hard limit. Streams stay open until the response completes or the connection closes.

Q: Is streaming more expensive? A: No. Token pricing is identical. Streaming just delivers tokens incrementally.
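
To check the time-to-first-token figure against your own setup, a small timing wrapper is enough. This is a sketch only: `timed_tokens` is a hypothetical helper, and `_slow_tokens` simulates a stream with short sleeps so the example runs offline.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_tokens(tokens: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Yield (token, seconds_since_start) for each token in a stream."""
    start = time.perf_counter()
    for token in tokens:
        yield token, time.perf_counter() - start


def _slow_tokens() -> Iterator[str]:
    """Simulated token stream (illustration only)."""
    for word in ["Streaming ", "feels ", "fast"]:
        time.sleep(0.05)
        yield word


ttft = None
pieces = []
for token, elapsed in timed_tokens(_slow_tokens()):
    if ttft is None:
        ttft = elapsed  # time-to-first-token
    pieces.append(token)
total = elapsed  # time until the last token arrived

print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s")
```

The generators defined earlier in this guide (`StreamingChat.chat` and `stream_with_retry`) send the request on first iteration, so wrapping them with `timed_tokens` includes the request time in the measurement.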

Alternatives

For batch processing, or when one-way token streaming is not the right fit, consider:

Standard async API calls for offline processing
Batch API for large-scale jobs
WebSocket connections for bidirectional streaming