OpenAI Python Streaming Tutorial: Real-Time AI Responses
Introduction
Streaming improves the user experience of AI applications by delivering tokens as they are generated. Instead of waiting 5-10 seconds for a complete response, users see text appear token by token, which gives the responsive feel familiar from ChatGPT. This guide covers streaming with the OpenAI Python SDK through AI API Hub.
What is Streaming?
Standard API calls return the complete response in one block. Streaming uses Server-Sent Events (SSE) to deliver tokens incrementally:
Standard: Request → Wait → Full Response (5-10s)
Streaming: Request → T-o-k-e-n---b-y---t-o-k-e-n--- (instant first token)
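To make the SSE framing concrete, here is a minimal sketch of how a client could decode the raw event lines by hand. It assumes the OpenAI-style wire format, in which each event is a `data:` line holding one JSON chunk and the stream ends with `data: [DONE]`. The helper name `parse_sse_lines` and the sample payloads are illustrative only; the SDK does this parsing for you, and this sketch ignores multi-line `data:` fields.

```python
import json

def parse_sse_lines(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    for raw in lines:
        line = raw.strip()
        # Skip blank event separators and SSE comments (lines starting with ":")
        if not line or line.startswith(":"):
            continue
        # Ignore other SSE fields such as "event:" or "id:"
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # OpenAI-style streams signal the end with a literal [DONE] sentinel
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# Hypothetical sample of what arrives on the wire
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]

text = "".join(
    event["choices"][0]["delta"].get("content", "")
    for event in parse_sse_lines(sample)
)
print(text)  # Hello
```

Each decoded event corresponds to one `chunk` object in the SDK examples that follow.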
Quick Setup
import openai

client = openai.OpenAI(
    api_key="your-api-key",
    base_url="https://api.apiyihe.org/v1"
)
Basic Streaming Example
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in detail."}
    ],
    stream=True,
    # Request a final chunk that carries the token usage
    stream_options={"include_usage": True}
)

collected_content = []
usage = None
for chunk in stream:
    # The usage chunk has an empty choices list, so check before indexing
    if chunk.choices and chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        print(content, end="", flush=True)
        collected_content.append(content)
    if getattr(chunk, "usage", None):
        usage = chunk.usage

full_response = "".join(collected_content)
print(f"\n\nTokens used: {usage}")
Building a Streaming Chat Application
import openai

class StreamingChat:
    def __init__(self, api_key: str, model: str = "gpt-4o"):
        self.client = openai.OpenAI(
            api_key=api_key,
            base_url="https://api.apiyihe.org/v1"
        )
        self.model = model
        self.messages = []  # full conversation history

    def chat(self, user_input: str):
        """Yield response tokens as they arrive.

        The assistant reply is added to the history only after the
        caller has consumed the whole stream.
        """
        self.messages.append({"role": "user", "content": user_input})
        response_text = []
        stream = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=True
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                yield content
                response_text.append(content)
        full_response = "".join(response_text)
        self.messages.append({"role": "assistant", "content": full_response})

# Usage
chat = StreamingChat("your-api-key")
for token in chat.chat("Write a Python function to sort a list"):
    print(token, end="", flush=True)
Handling Streaming Errors
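import time
import openai

def stream_with_retry(messages, max_retries=3):
    # Uses the client created in Quick Setup
    for attempt in range(max_retries):
        started = False  # True once any text has been yielded
        try:
            stream = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                stream=True
            )
            for chunk in stream:
                if chunk.choices and chunk.choices[0].delta.content:
                    started = True
                    yield chunk.choices[0].delta.content
            return  # Success
        # Retry only transient failures; auth or bad-request errors will not recover
        except (openai.APIConnectionError, openai.RateLimitError,
                openai.InternalServerError) as e:
            # Retrying after partial output would repeat text already shown
            if started or attempt == max_retries - 1:
                raise
            print(f"Retry {attempt + 1}: {e}")
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...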
Streaming with Function Calling
functions = [{
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["location"]
    }
}]

# Note: `functions` is the legacy parameter; the current API prefers `tools`
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    functions=functions,
    stream=True
)

name = ""
arguments = ""
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.function_call:
        # The name arrives once; the JSON arguments arrive in fragments
        if delta.function_call.name:
            name = delta.function_call.name
        arguments += delta.function_call.arguments or ""

print(f"Function: {name}")
print(f"Args: {arguments}")
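The `functions` parameter is deprecated in the OpenAI API in favor of `tools`, where each entry has the shape `{"type": "function", "function": {...}}` and streamed deltas carry a `tool_calls` list instead of `function_call`. The sketch below shows one way to reassemble those fragments, keyed by each fragment's `index`. The helper `collect_tool_calls` and the fake chunks are hypothetical stand-ins so the logic can run offline; with a real stream you would pass the object returned by `client.chat.completions.create(..., tools=..., stream=True)`.

```python
import json
from types import SimpleNamespace as NS

def collect_tool_calls(chunks):
    """Merge streamed tool-call fragments into complete calls, ordered by index."""
    calls = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        for tc in chunk.choices[0].delta.tool_calls or []:
            entry = calls.setdefault(tc.index, {"id": None, "name": "", "arguments": ""})
            if tc.id:
                entry["id"] = tc.id
            if tc.function and tc.function.name:
                # The name normally arrives whole in the first fragment
                entry["name"] += tc.function.name
            if tc.function and tc.function.arguments:
                # The JSON arguments arrive as string fragments to concatenate
                entry["arguments"] += tc.function.arguments
    return [calls[i] for i in sorted(calls)]

def fake_chunk(index, call_id=None, name=None, arguments=None):
    """Build an object shaped like a streamed chunk (illustration only)."""
    fn = NS(name=name, arguments=arguments)
    tc = NS(index=index, id=call_id, function=fn)
    return NS(choices=[NS(delta=NS(tool_calls=[tc]))])

chunks = [
    fake_chunk(0, call_id="call_1", name="get_weather", arguments=""),
    fake_chunk(0, arguments='{"loca'),
    fake_chunk(0, arguments='tion": "Tokyo"}'),
    NS(choices=[]),  # e.g. a trailing usage-only chunk
]

calls = collect_tool_calls(chunks)
args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args)  # get_weather {'location': 'Tokyo'}
```

Parse the arguments with `json.loads` only after the stream has finished, since any single fragment is usually not valid JSON on its own.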
Node.js Streaming Example
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "your-api-key",
  baseURL: "https://api.apiyihe.org/v1",
});

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Tell me about AI" }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || "";
  process.stdout.write(content);
}
Common Streaming Issues
| Issue | Cause | Fix |
|---|---|---|
| Output arrives late or in bursts | A proxy or web server buffers the response | Disable response buffering along the path; forward each chunk as it arrives |
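| Missing final chunk | Connection drops | Set a client timeout; treat a stream that ends without a finish_reason as incomplete |
| stream_options not working | SDK predates stream_options (added in mid-2024, after 1.0) | Upgrade to the latest openai release |
| Stream stops mid-response | Output hit the token limit (finish_reason is "length") | Increase max_tokens |
| Empty chunks | Role-only first chunk, usage-only final chunk, or content filtering | Skip chunks without content; check finish_reason for "content_filter" |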
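Several of the fixes above depend on inspecting `finish_reason`. As a rough sketch, the helper below walks a stream, skips empty chunks, and returns the text together with the last `finish_reason` it saw. The names `read_stream` and `MESSAGES` and the fake chunks are illustrative assumptions, not part of the SDK; the `finish_reason` values are the ones the Chat Completions API documents.

```python
from types import SimpleNamespace as NS

def read_stream(chunks):
    """Return (text, finish_reason) for a chat completion stream."""
    parts, finish_reason = [], None
    for chunk in chunks:
        if not chunk.choices:
            continue  # usage-only chunks carry no choices
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason

# Human-readable interpretation of each outcome
MESSAGES = {
    "stop": "completed normally",
    "length": "truncated by the token limit; consider raising max_tokens",
    "content_filter": "cut off by content filtering",
    None: "no finish_reason seen; the connection probably dropped",
}

def chunk(content=None, finish_reason=None):
    """Build an object shaped like a streamed chunk (illustration only)."""
    return NS(choices=[NS(delta=NS(content=content), finish_reason=finish_reason)])

chunks = [chunk("Hi"), chunk(" there"), chunk(finish_reason="length"), NS(choices=[])]
text, reason = read_stream(chunks)
print(repr(text), "->", MESSAGES.get(reason, reason))
```

A `finish_reason` of `None` after the loop ends is the signal that the final chunk never arrived.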
Pricing Impact
Streaming uses the same pricing as standard requests. Input and output tokens are metered identically. Setting stream_options={"include_usage": True} adds a final chunk that reports the exact token counts for the stream.
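Once the usage chunk is captured, the cost of a stream is simple arithmetic. The sketch below assumes per-million-token pricing; the function `estimate_cost` is hypothetical and the prices in the example are placeholders, not real rates, so substitute the values from your provider's price list. The `prompt_tokens` and `completion_tokens` arguments correspond to the fields of the same name on the `usage` object.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate request cost from token counts and per-million-token prices."""
    input_cost = prompt_tokens * input_price_per_million
    output_cost = completion_tokens * output_price_per_million
    return (input_cost + output_cost) / 1_000_000

# Placeholder prices (2.00 and 8.00 per million tokens), for illustration only
cost = estimate_cost(prompt_tokens=1200, completion_tokens=800,
                     input_price_per_million=2.0, output_price_per_million=8.0)
print(f"{cost:.4f}")  # 0.0088
```

With a real stream you would call it as `estimate_cost(usage.prompt_tokens, usage.completion_tokens, ...)` after the loop finishes.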
FAQ
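Q: Does streaming reduce latency? A: It reduces perceived latency. Time-to-first-token (TTFT) is typically 200-500ms, compared with 5-10s of waiting for a complete non-streamed response. The total generation time stays roughly the same.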
Q: Can I stream with DeepSeek via AI API Hub? A: Yes. All models support streaming through the same OpenAI-compatible API.
Q: What's the maximum stream duration? A: There is no hard limit. A stream stays open until the response completes or the connection closes.
Q: Is streaming more expensive? A: No. Token pricing is identical. Streaming just delivers tokens incrementally.
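To check the time-to-first-token figures against your own setup, you can wrap any token iterator in a small timer. This is a sketch under assumptions: `StreamTimer` is a hypothetical helper, and the demo uses a fake clock and fake tokens so the output is deterministic. The timer starts when it is constructed, so build it before the request is sent. For a generator such as `StreamingChat.chat(...)` that holds automatically, because the request is only issued on the first iteration.

```python
import time

class StreamTimer:
    """Wrap a token iterator and record time to first token and total time."""

    def __init__(self, tokens, clock=time.perf_counter):
        self._tokens = tokens
        self._clock = clock
        self.start = clock()          # measurement begins at construction
        self.first_token_at = None
        self.end = None

    def __iter__(self):
        for token in self._tokens:
            if self.first_token_at is None:
                self.first_token_at = self._clock()
            yield token
        self.end = self._clock()

    @property
    def ttft(self):
        """Seconds from start to the first token, or None if none arrived."""
        if self.first_token_at is None:
            return None
        return self.first_token_at - self.start

    @property
    def total(self):
        """Seconds from start to the end of the stream, or None if unfinished."""
        if self.end is None:
            return None
        return self.end - self.start

# Deterministic demo: a fake clock that returns 0.0, 0.3, then 1.5 seconds
ticks = iter([0.0, 0.3, 1.5])
timer = StreamTimer(iter(["Hel", "lo"]), clock=lambda: next(ticks))
text = "".join(timer)
print(text, timer.ttft, timer.total)  # Hello 0.3 1.5
```

In real use, drop the `clock` argument and wrap the stream directly, for example `timer = StreamTimer(chat.chat("Hello"))`.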
Alternatives
For non-streaming use cases or batch processing, consider:
- Standard async API calls for offline processing
- Batch API for large-scale jobs
- WebSocket connections for bidirectional streaming