OpenAI Rate Limit Errors: Complete Fix Guide
Introduction
Rate limit errors (HTTP 429) are among the most common production issues when using OpenAI-compatible APIs. Whether you're hitting RPM (requests per minute), TPM (tokens per minute), or concurrent request limits, your client needs to detect the error, back off, and retry rather than fail the request outright.
Understanding Rate Limits
| Limit Type | Free Tier | Tier 1 | Tier 3 | Tier 5 |
|---|---|---|---|---|
| RPM (GPT-4o) | 500 | 10,000 | 50,000 | 200,000 |
| TPM (GPT-4o) | 200K | 2M | 10M | 50M |
| Concurrent | 10 | 500 | 2,000 | 5,000 |
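RPM and TPM apply at the same time, so whichever is tighter for your workload sets the pace. As a rough illustration, the sketch below computes the minimum spacing between requests that stays under both limits. The function name `min_request_interval` and the figure of 2,000 tokens per request are assumptions for illustration; the limits are the Tier 1 values from the table.

```python
def min_request_interval(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> float:
    """Return the minimum number of seconds between requests that respects both limits."""
    if rpm_limit <= 0 or tpm_limit <= 0 or tokens_per_request <= 0:
        raise ValueError("limits and tokens_per_request must be positive")
    # Spacing required by the request-count limit alone
    interval_for_rpm = 60.0 / rpm_limit
    # Spacing required by the token limit alone
    requests_per_minute_by_tokens = tpm_limit / tokens_per_request
    interval_for_tpm = 60.0 / requests_per_minute_by_tokens
    # The stricter (larger) spacing wins
    return max(interval_for_rpm, interval_for_tpm)


if __name__ == "__main__":
    # Tier 1 values from the table; 2,000 tokens per request is an assumed workload
    interval = min_request_interval(rpm_limit=10_000, tpm_limit=2_000_000, tokens_per_request=2_000)
    print(f"Send at most one request every {interval:.3f}s")
```

With these numbers the token limit is the binding one: 2M TPM at 2,000 tokens each allows about 1,000 requests per minute, well below the 10,000 RPM ceiling.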
Python: Exponential Backoff with Retry
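import random
import time

import openai

# Note: the SDK client also retries some errors on its own (max_retries defaults to 2)
client = openai.OpenAI(
    api_key="your-api-key",
    base_url="https://api.apiyihe.org/v1",
)

def api_call_with_retry(messages, max_retries=5, base_delay=1):
    """Call the chat completions endpoint, retrying on rate limits and transient errors."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                max_tokens=500,
            )
        except openai.RateLimitError:
            # HTTP 429: back off exponentially (1s, 2s, 4s, ...) plus up to 1s of jitter
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)
        except (openai.APIConnectionError, openai.InternalServerError):
            # Transient network and 5xx errors: retry with a linear delay.
            # Client errors such as 400 or 401 are not retried; repeating them cannot succeed.
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (attempt + 1))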
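The delay in the loop above doubles on every attempt with no upper bound. A common variant, sketched below, caps the exponential step, spreads retries with "full jitter" (a random wait between zero and the capped step), and prefers a server-provided `Retry-After` value when one is available. The helper name `backoff_delay`, the 60-second cap, and the injectable `rng` parameter are assumptions for illustration and are not part of the OpenAI SDK.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    retry_after: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # If the server said how long to wait, trust it over our own schedule
    if retry_after is not None and retry_after >= 0:
        return retry_after
    source = rng if rng is not None else random
    # Exponential step, capped so late attempts do not wait unboundedly long
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly between 0 and the capped step
    return source.uniform(0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only so the demo is repeatable
    for attempt in range(8):
        print(f"attempt {attempt}: wait {backoff_delay(attempt, rng=rng):.2f}s")
```

Passing the value of a `Retry-After` header (converted to seconds) as `retry_after` lets the server's hint override the computed schedule.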
Python: Tenacity Library (Recommended)
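import openai
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

# Reuses the `client` created in the previous example
@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_random_exponential(multiplier=1, min=1, max=60),  # exponential backoff with jitter, capped at 60s
    stop=stop_after_attempt(5),
    reraise=True,  # surface the original RateLimitError instead of tenacity.RetryError
)
def robust_api_call(messages):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )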
Concurrent Request Limiting
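import asyncio

import openai

# Awaiting requires the async client; the synchronous openai.OpenAI client cannot be awaited
async_client = openai.AsyncOpenAI(
    api_key="your-api-key",
    base_url="https://api.apiyihe.org/v1",
)
sem = asyncio.Semaphore(10)  # Max 10 concurrent requests

async def limited_api_call(messages):
    async with sem:  # waits here while 10 requests are already in flight
        return await async_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )

async def process_batch(prompts):
    tasks = [limited_api_call([{"role": "user", "content": p}]) for p in prompts]
    # return_exceptions=True returns failures as values, so one error does not abort the batch
    return await asyncio.gather(*tasks, return_exceptions=True)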
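To check that a semaphore really caps in-flight work without spending API quota, the sketch below replaces the API call with a stand-in coroutine and records the peak number of calls running at once. `run_demo`, `fake_api_call`, and the 10 ms sleep are hypothetical names and values chosen for illustration.

```python
import asyncio


async def run_demo(num_tasks: int = 25, limit: int = 5) -> int:
    """Run `num_tasks` fake calls under a semaphore and return the peak number in flight."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_api_call(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network latency
            in_flight -= 1
            return i

    results = await asyncio.gather(*(fake_api_call(i) for i in range(num_tasks)))
    assert results == list(range(num_tasks))  # gather preserves input order
    return peak


if __name__ == "__main__":
    print("peak concurrency:", asyncio.run(run_demo()))
```

The same pattern works as a unit test for a real batch function: inject the stand-in in place of the client call and assert that the peak never exceeds the configured limit.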
Token Budget Management
import time

class TokenBudget:
    """Fixed-window token counter: usage resets once the 60-second window has passed."""

    def __init__(self, tpm_limit: int = 100000):
        self.tpm_limit = tpm_limit
        self.used = 0
        self.window_start = time.time()

    def can_request(self, estimated_tokens: int) -> bool:
        if time.time() - self.window_start > 60:
            # Window expired: start a new one
            self.used = 0
            self.window_start = time.time()
        return (self.used + estimated_tokens) <= self.tpm_limit

    def record_usage(self, tokens: int):
        self.used += tokens

budget = TokenBudget(tpm_limit=100000)

def call_with_budget(messages, max_tokens=500):
    # Rough estimate: about 4 characters per token, plus the completion allowance
    estimated = len(str(messages)) // 4 + max_tokens
    if estimated > budget.tpm_limit:
        # Without this check the wait loop below would never exit
        raise ValueError("Request exceeds the per-minute token budget")
    while not budget.can_request(estimated):
        time.sleep(1)
    response = client.chat.completions.create(model="gpt-4o", messages=messages, max_tokens=max_tokens)
    budget.record_usage(response.usage.total_tokens)
    return response
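A fixed window like the one above can admit close to twice the limit in a short burst that straddles a window boundary. One alternative is a sliding window, which is a different technique from the class above: it remembers each usage event and only counts those from the last 60 seconds. The sketch below is a minimal version under stated assumptions; the class name `SlidingTokenBudget` and the injectable `clock` parameter (used here to make the demo deterministic) are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingTokenBudget:
    """Track token usage over a rolling window instead of a fixed one."""

    def __init__(self, tpm_limit: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.tpm_limit = tpm_limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.used = 0

    def _expire(self) -> None:
        # Drop events that are at least one full window old
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def can_request(self, estimated_tokens: int) -> bool:
        self._expire()
        return self.used + estimated_tokens <= self.tpm_limit

    def record_usage(self, tokens: int) -> None:
        self.events.append((self.clock(), tokens))
        self.used += tokens


if __name__ == "__main__":
    now = [0.0]  # fake clock so the demo does not have to sleep
    budget = SlidingTokenBudget(tpm_limit=1000, clock=lambda: now[0])
    budget.record_usage(800)
    print("right after 800 tokens:", budget.can_request(300))
    now[0] = 61.0
    print("61 seconds later:", budget.can_request(300))
```

Like the fixed-window class, this sketch is not thread-safe; guarding `can_request` and `record_usage` with a lock would be needed if several threads share one budget.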
Node.js Rate Limit Handling
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "your-api-key",
  baseURL: "https://api.apiyihe.org/v1",
});

async function withRetry(messages, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await client.chat.completions.create({
        model: "gpt-4o",
        messages,
      });
    } catch (error) {
      if (error.status === 429 && i < maxRetries - 1) {
        const delay = Math.pow(2, i) * 1000 + Math.random() * 1000;
        await new Promise((r) => setTimeout(r, delay));
      } else {
        throw error;
      }
    }
  }
}
Common Error Patterns
| Error Code | Meaning | Fix |
|---|---|---|
| 429 "Rate limit reached for requests" | RPM exceeded | Add delay between requests |
| 429 "Rate limit reached for tokens" | TPM exceeded | Reduce max_tokens or batch size |
| 429 "Too many concurrent requests" | Concurrent limit hit | Use semaphore/queue |
| 429 "You exceeded your current quota" | Account quota | Check billing; upgrade tier |
| 503 "Service Unavailable" | OpenAI overload | Retry with longer backoff |
| 500 "Internal Server Error" | Server error | Retry; if persistent, contact support |
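The table maps naturally onto a small decision function that tells a retry loop whether another attempt is worthwhile. The sketch below is one way to encode it. `classify_error` and the action strings it returns are hypothetical, and the substring checks assume error messages worded like those in the table; a real provider may phrase them differently.

```python
def classify_error(status: int, message: str = "") -> str:
    """Map an HTTP status and error message to a suggested action.

    Returns one of: "check_billing", "limit_concurrency", "reduce_tokens",
    "backoff", "retry", "fail".
    """
    text = message.lower()
    if status == 429:
        if "quota" in text:
            # Quota errors persist until billing changes, so retrying does not help
            return "check_billing"
        if "concurrent" in text:
            return "limit_concurrency"
        if "tokens" in text:
            return "reduce_tokens"
        return "backoff"  # plain RPM limit: wait, then retry
    if status == 503:
        return "backoff"  # overload: retry with a longer delay
    if status == 500:
        return "retry"
    return "fail"  # anything else is treated as non-retryable in this sketch


if __name__ == "__main__":
    print(classify_error(429, "You exceeded your current quota"))
    print(classify_error(429, "Rate limit reached for tokens"))
    print(classify_error(503, "Service Unavailable"))
```

Separating the quota case matters most: a loop that treats every 429 the same will keep retrying a request that cannot succeed until the account is topped up.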
Monitoring Rate Limit Usage
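# with_raw_response exposes the HTTP headers; the parsed completion object does not carry them
raw_response = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=messages,
)
response = raw_response.parse()  # the usual ChatCompletion object

# Check remaining limits from headers (.get returns None if the provider omits a header)
headers = raw_response.headers
print(f"Requests remaining: {headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens remaining: {headers.get('x-ratelimit-remaining-tokens')}")
print(f"Reset time: {headers.get('x-ratelimit-reset-requests')}")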
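The reset headers are strings, so they need converting before they can drive a sleep or a scheduler. The sketch below assumes the compact duration format that OpenAI's documentation shows for these headers (for example `1s` or `6m0s`); other OpenAI-compatible providers may use a different format or omit the header, in which case the function returns `None`. The name `parse_reset` is hypothetical.

```python
import re
from typing import Optional

# Seconds per unit; "ms" is listed before "m" in the pattern so "250ms" is not read as minutes
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_reset(value: Optional[str]) -> Optional[float]:
    """Convert a duration such as "6m0s" or "250ms" to seconds.

    Returns None when the header is missing or not in the expected format.
    """
    if value is None:
        return None
    value = value.strip()
    if not value:
        return None
    pos = 0
    total = 0.0
    while pos < len(value):
        match = _PART.match(value, pos)
        if match is None:
            return None  # unexpected characters: do not guess
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return total


if __name__ == "__main__":
    for sample in ["1s", "6m0s", "250ms", None, "soon"]:
        print(repr(sample), "->", parse_reset(sample))
```

A caller might sleep for the parsed value when the remaining-requests header reaches zero, and fall back to exponential backoff when the function returns `None`.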
FAQ
Q: What's the fastest way to reduce rate limit errors? A: Upgrade your OpenAI tier (Tier 1→3→5) or switch to AI API Hub for higher aggregate limits across models.
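Q: Can I use multiple API keys to bypass limits? A: Not recommended. OpenAI applies rate limits at the organization and project level, so extra keys under the same account do not raise them. Use proper retry handling instead. AI API Hub provides built-in failover across models.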
Q: What's a good retry strategy for production? A: Exponential backoff with jitter, max 5 retries, 60s max delay.
Q: Does DeepSeek have the same rate limits? A: DeepSeek via AI API Hub has generous limits. 429s are rare for most use cases.