
LLM API Controls & Integration Patterns: Technical Research

Research Date: March 10, 2026
Focus: API-level controls and integration patterns for improving LLM instruction following


Executive Summary

This research consolidates technical documentation and implementation patterns for LLM API parameters across OpenAI, Anthropic, and Google. Key findings:

  1. Sampling parameters (temperature, top_p) have provider-specific optimal ranges
  2. Structured outputs are now production-ready across all major providers with different implementation approaches
  3. Context window management requires active strategies despite larger advertised limits
  4. Rate limiting requires exponential backoff with jitter for production resilience

1. API Parameters Affecting Instruction Following

1.1 Temperature

Purpose: Controls randomness in token selection

| Provider | Range | Default | Recommended Values |
|---|---|---|---|
| OpenAI | 0-2 | 1.0 | 0.2-0.5 (deterministic), 0.7-1.0 (creative) |
| Anthropic | 0-1 | 1.0 | 0.3-0.7 (coding), 0.7-1.0 (creative) |
| Google | 0-1 | 0.7 | 0.2-0.5 (extraction), 0.7+ (generation) |

Technical Details:

  • Lower temperature sharpens the probability distribution toward the most likely tokens
  • Temperature = 0 produces deterministic, repeatable outputs
  • High temperatures (>1.5) may produce nonsensical output

Source: Towards Data Science
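The mechanics are easy to see in the softmax itself: logits are divided by temperature before normalization. A minimal pure-Python sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then normalize with softmax."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits

cold = softmax_with_temperature(logits, 0.2)  # near-greedy
hot = softmax_with_temperature(logits, 1.5)   # flatter, more diverse

# Low temperature concentrates mass on the top token; high spreads it out
assert cold[0] > hot[0]
```

Temperature = 0 would divide by zero here; real samplers special-case it as greedy argmax, which is why it yields repeatable outputs.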

1.2 Top P (Nucleus Sampling)

Purpose: Samples from cumulative probability distribution

| Provider | Range | Default | Notes |
|---|---|---|---|
| OpenAI | 0-1 | 1.0 | OpenAI recommends adjusting temperature OR top_p, not both |
| Anthropic | 0-1 | - | Works well at 0.9-0.95 for balanced outputs |
| Google | 0-1 | 0.95 | Default works for most use cases |

Technical Details:

  • Top P = 0.1 means sampling from tokens comprising top 10% probability mass
  • Model finds smallest set of tokens whose cumulative probability exceeds Top P value
  • Lower values = more focused, less diverse outputs

Source: OpenAI Community
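The selection rule is straightforward to sketch (illustrative probabilities, not a real model's):

```python
def nucleus(token_probs, top_p):
    """Return the smallest set of tokens whose cumulative
    probability reaches top_p, highest-probability first."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "xyzzy": 0.05}

assert nucleus(probs, 0.5) == ["the"]             # top 50% mass: one token
assert nucleus(probs, 0.9) == ["the", "a", "an"]  # more mass, more candidates
```

The model then renormalizes and samples only within the kept set, which is why lower values give more focused output.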

1.3 Frequency Penalty

Purpose: Reduces token repetition based on frequency

| Provider | Range | Default | Recommended |
|---|---|---|---|
| OpenAI | -2 to 2 | 0 | 0.1-1.0 (reduce repetition) |
| Anthropic | - | - | Not directly exposed |
| Google | - | - | Not directly exposed |

Technical Details:

  • Positive values penalize tokens based on existing frequency in text
  • Proportional penalty (higher frequency = higher penalty)
  • Values >1.0 may degrade output quality

Source: Towards Data Science

1.4 Presence Penalty

Purpose: Reduces repetition based on whether token has appeared

| Provider | Range | Default | Recommended |
|---|---|---|---|
| OpenAI | -2 to 2 | 0 | 0.1-0.5 (encourage new topics) |

Technical Details:

  • One-off penalty: applies once a token has appeared at all, unlike the proportional frequency penalty
  • Positive values increase likelihood of discussing new topics
  • Formula: μ_j = μ_j - α_presence * 1[c[j]>0] - α_frequency * c[j]
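A direct transcription of that formula as a logit adjustment, with hypothetical tokens and counts:

```python
def apply_penalties(logits, counts, alpha_presence=0.0, alpha_frequency=0.0):
    """Apply the penalty formula above to per-token logits.
    counts[token] is how often the token appears in the text so far."""
    adjusted = {}
    for token, mu in logits.items():
        c = counts.get(token, 0)
        presence = alpha_presence if c > 0 else 0.0  # one-off penalty
        frequency = alpha_frequency * c              # proportional penalty
        adjusted[token] = mu - presence - frequency
    return adjusted

logits = {"cat": 2.0, "dog": 2.0}
counts = {"cat": 3}  # "cat" generated three times already

out = apply_penalties(logits, counts, alpha_presence=0.5, alpha_frequency=0.2)
# "cat" loses 0.5 (presence) + 0.6 (frequency); "dog" is untouched
```

This makes the difference concrete: the presence term fires once per distinct token, while the frequency term grows with every repeat.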

2. Response Format Controls

2.1 OpenAI Structured Outputs

Approach: Constrained decoding with native Pydantic support

Guarantees: 100% schema adherence

Code Example (Python):

from pydantic import BaseModel
from openai import OpenAI

class MovieReview(BaseModel):
    title: str
    year: int
    rating: float
    pros: list[str]
    cons: list[str]
    recommendation: str

client = OpenAI()

response = client.beta.chat.completions.parse(
    model="gpt-5",
    response_format=MovieReview,
    messages=[
        {"role": "system", "content": "You are a movie critic."},
        {"role": "user", "content": "Review The Matrix (1999)"}
    ]
)

review = response.choices[0].message.parsed

Code Example (TypeScript with Zod):

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const MovieReview = z.object({
  title: z.string(),
  year: z.number().int(),
  rating: z.number().min(0).max(10),
  pros: z.array(z.string()),
  cons: z.array(z.string()),
  recommendation: z.enum(["must-watch", "recommended", "skip"]),
});

const client = new OpenAI();

const response = await client.beta.chat.completions.parse({
  model: "gpt-5",
  response_format: zodResponseFormat(MovieReview, "movie_review"),
  messages: [...]
});

Key Features:

  • .parse() method handles schema conversion, API call, and response parsing
  • Streaming supported via client.beta.chat.completions.stream()
  • Check for refusal before accessing .parsed
  • Max schema depth: 5 levels

Source: DevTk.AI Structured Output Guide

2.2 Anthropic Structured Outputs

Approach: JSON schema via output_config.format (GA as of 2026)

Reliability: ~99%+ (not 100% guaranteed)

Code Example (Python):

import anthropic
from pydantic import BaseModel

class ContactInfo(BaseModel):
    name: str
    email: str

client = anthropic.Anthropic()

response = client.messages.parse(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Extract contact info from email"}
    ],
    output_format=ContactInfo,  # Pydantic model
)

contact = response.parsed_output

Alternative: Tool Use Pattern (Legacy but still valid):

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=[{
        "name": "extract_contact",
        "description": "Extract contact information",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"}
            },
            "required": ["name", "email"]
        }
    }],
    tool_choice={"type": "tool", "name": "extract_contact"},
    messages=[{"role": "user", "content": "..."}]
)

Key Features:

  • output_config.format moved from beta (output_format still supported in SDKs)
  • Pydantic integration via client.messages.parse()
  • TypeScript support via zodOutputFormat()
  • Zero Data Retention (ZDR) processing
  • Schema cached up to 24 hours for optimization

Source: Anthropic Structured Outputs Documentation

2.3 Google Gemini Structured Outputs

Approach: response_schema in GenerationConfig

Guarantees: Schema-valid output via constrained decoding

Code Example (Python):

import google.generativeai as genai
from pydantic import BaseModel

class MovieReview(BaseModel):
    title: str
    year: int
    rating: float

model = genai.GenerativeModel("gemini-2.5-pro")

response = model.generate_content(
    "Review The Matrix (1999)",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=MovieReview,
    )
)

review = MovieReview.model_validate_json(response.text)

Alternative: JSON Schema Format:

response = model.generate_content(
    "Review The Matrix (1999)",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "rating": {"type": "number"}
            },
            "required": ["title", "rating"]
        }
    )
)

Key Features:

  • Support for anyOf, $ref, and advanced JSON Schema keywords (since Nov 2025)
  • Property ordering preserved from schema
  • Pydantic and Zod work out-of-the-box
  • additionalProperties supported since November 2025

Source: Google AI Structured Outputs, Google Dev Blog

2.4 OpenAI JSON Mode (Simpler Alternative)

When strict schema enforcement isn’t needed:

response = client.chat.completions.create(
    model="gpt-5",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Return JSON with title, rating, summary"}
    ]
)

Note: Guarantees valid JSON but not schema compliance
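Because JSON mode guarantees only parseable JSON, shape checks remain the caller's job. A minimal sketch, assuming the three fields requested in the system message above:

```python
import json

REQUIRED_KEYS = {"title", "rating", "summary"}

def parse_json_mode(raw: str) -> dict:
    """json.loads will succeed (JSON mode guarantees validity),
    but key presence and types still need manual checks."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["rating"], (int, float)):
        raise ValueError("rating must be numeric")
    return data

review = parse_json_mode('{"title": "The Matrix", "rating": 9, "summary": "..."}')
```

If these checks fail often, that is the signal to move to structured outputs rather than patching prompts.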


3. Token Management Strategies

3.1 Context Window Limits (2026)

| Model | Context Window | Effective Performance |
|---|---|---|
| GPT-4.1 | 1M tokens | Breaks ~30-40% earlier |
| Claude Sonnet 4.6 | 200K-1M (beta) | Strong performance to 150K |
| Gemini 1.5 Pro | 2M tokens | Requires careful management |
| Llama 4 Scout | 10M tokens | Experimental |

Critical Finding: “Lost in the Middle” phenomenon persists - models struggle with information in middle of long contexts, showing U-shaped performance curves.

Source: Zylos Research

3.2 Context Management Strategies

Strategy 1: Long Document Placement

Best Practice: Place long documents at the TOP of prompts, queries at the bottom

# Optimal structure
messages = [
    {"role": "user", "content": f"""
<documents>
{long_document_content}
</documents>

Based on the documents above, answer: {query}
"""}
]

Impact: Up to 30% improvement in response quality

Source: Anthropic Prompting Best Practices

Strategy 2: Sliding Window with Summarization

def manage_context(messages, max_tokens=100000):
    current_tokens = count_tokens(messages)
    
    if current_tokens > max_tokens:
        # Summarize oldest messages
        summary = summarize(messages[:len(messages)//2])
        messages = [
            {"role": "system", "content": f"Previous conversation summary: {summary}"},
            *messages[len(messages)//2:]
        ]
    return messages

Strategy 3: Chunking for RAG

# Recommended chunk sizes
CHUNK_SIZE = 512  # tokens
CHUNK_OVERLAP = 50  # tokens

# Process long documents
# Note: this slices by characters as an approximation; swap in a
# tokenizer (e.g. tiktoken) when exact token budgets matter
def chunk_document(text, chunk_size=512, overlap=50):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)
    return chunks

Source: Redis Context Window Guide

Strategy 4: Token Budgeting

# Reserve tokens for response
def calculate_available_tokens(context_limit, input_tokens, response_buffer=1000):
    return context_limit - input_tokens - response_buffer

# Example for Claude (200K context)
available = calculate_available_tokens(200000, 150000)
# 150K tokens of input leaves 49K for the response (after the 1K buffer)

3.3 System vs User Messages

Best Practices:

  1. System messages for:

    • Role definition
    • Output format requirements
    • Behavioral constraints
    • Persistent instructions
  2. User messages for:

    • Variable content
    • Task-specific instructions
    • Long documents (place at top)

Anthropic Example:

message = client.messages.create(
    model="claude-opus-4-6",
    system="You are a helpful coding assistant specializing in Python.",
    messages=[
        {"role": "user", "content": "How do I sort a list of dictionaries?"}
    ],
)

Source: Anthropic System Prompts


4. Error Handling & Retry Patterns

4.1 Rate Limit Error Handling

OpenAI Rate Limit Headers:

x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 59
x-ratelimit-reset-requests: 1s
x-ratelimit-limit-tokens: 150000
x-ratelimit-remaining-tokens: 149984
x-ratelimit-reset-tokens: 6m0s

Source: OpenAI Rate Limits
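The reset headers use compact duration strings ("1s", "6m0s"). A small parser, handling only the h/m/s/ms units seen in practice (the format here is an observed convention, not a documented grammar):

```python
import re

DURATION_RE = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")
FACTORS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}

def parse_reset(duration: str) -> float:
    """Convert a '6m0s'-style string to seconds; reject anything else."""
    parts = DURATION_RE.findall(duration)
    # Require the matched components to reconstruct the whole string
    if not parts or "".join(v + u for v, u in parts) != duration:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return sum(float(value) * FACTORS[unit] for value, unit in parts)

seconds = parse_reset("6m0s")  # 360.0
```

Feeding the parsed value into a sleep or scheduler lets you wait out the reset instead of burning retries.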

4.2 Exponential Backoff Implementation

Pattern 1: Tenacity Library (Python)

from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)
from openai import OpenAI, RateLimitError

client = OpenAI()

@retry(
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
    retry=retry_if_exception_type(RateLimitError)
)
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

# Usage
response = completion_with_backoff(
    model="gpt-5",
    messages=[...]
)

Pattern 2: Backoff Library (Python)

import backoff
from openai import OpenAI, RateLimitError

client = OpenAI()

@backoff.on_exception(backoff.expo, RateLimitError, max_time=60)
def completions_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

Pattern 3: Manual Implementation

import random
import time
from openai import RateLimitError

def retry_with_exponential_backoff(func, max_retries=10):
    def wrapper(*args, **kwargs):
        num_retries = 0
        delay = 1.0
        
        while True:
            try:
                return func(*args, **kwargs)
            except RateLimitError:
                num_retries += 1
                if num_retries > max_retries:
                    raise
                
                delay *= 2 * (1 + random.random())  # Add jitter
                time.sleep(delay)
    return wrapper

Source: OpenAI Cookbook

4.3 Retry Pattern for Structured Output Validation

from pydantic import ValidationError

def generate_with_retry(client, model, messages, schema, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Use the parse endpoint so .parsed is populated
            response = client.beta.chat.completions.parse(
                model=model,
                messages=messages,
                response_format=schema
            )
            
            # Validate response semantics (not just structure);
            # validate_semantics is an application-supplied check
            data = response.choices[0].message.parsed
            if validate_semantics(data):
                return data
            messages.append({
                "role": "user",
                "content": "Invalid response: missing required fields. Try again."
            })
        except ValidationError as e:
            messages.append({
                "role": "user",
                "content": f"Schema validation failed: {e}. Ensure all fields are present."
            })
    
    raise Exception(f"Failed after {max_retries} attempts")

4.4 Rate Limit Best Practices

  1. Monitor headers: Track x-ratelimit-remaining-* to proactively throttle
  2. Batch API: Use for non-real-time workloads (separate rate limits)
  3. Reduce max_tokens: Set close to expected response size
  4. Request batching: Combine multiple tasks into single requests when possible
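Proactive throttling (point 1) can be as simple as a client-side token bucket sized to your tokens-per-minute limit. A single-threaded sketch; the limit below is illustrative:

```python
import time

class TokenBucket:
    """Client-side budget: spend estimated tokens before each request,
    refilling continuously at the account's per-minute rate."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # tokens refilled per second
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.available = min(self.capacity, self.available + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, tokens: int):
        """Block until the budget covers this request's estimated tokens."""
        self._refill()
        while self.available < tokens:
            deficit = tokens - self.available
            time.sleep(deficit / self.rate)
            self._refill()
        self.available -= tokens

# Size to your account's TPM limit; call before each API request
bucket = TokenBucket(tokens_per_minute=150_000)
bucket.acquire(2_000)  # estimated prompt tokens + max_tokens for this call
```

This shifts rate limiting from reactive (retry on 429) to proactive, which keeps latency predictable under load.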

5. Rate Limiting & Cost Control

5.1 OpenAI Usage Tiers

| Tier | Qualification | Usage Limit |
|---|---|---|
| Free | - | $100/month |
| Tier 1 | $5 paid | $100/month |
| Tier 2 | $50 paid + 7 days | $500/month |
| Tier 3 | $100 paid + 7 days | $1,000/month |
| Tier 4 | $250 paid + 14 days | $5,000/month |
| Tier 5 | $1,000 paid + 30 days | $200,000/month |

Note: Rate limits increase automatically with tier progression

5.2 Cost Optimization Strategies

Strategy 1: Structured Outputs Reduce Tokens

Example: Movie review generation

  • Unstructured: 85-95 tokens
  • Structured: 60-70 tokens
  • Savings: 30-40%

Scale Impact (1M requests/month, Claude Sonnet 4.5 @ $15/M tokens):

| Scenario | Unstructured | Structured | Monthly Savings |
|---|---|---|---|
| Short extraction | 80 tokens | 35 tokens | $675 |
| Medium analysis | 200 tokens | 100 tokens | $1,500 |
| Long report | 500 tokens | 250 tokens | $3,750 |

Source: DevTk.AI
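The savings column follows directly from tokens saved per request, times price, times volume; a quick check of the arithmetic:

```python
# Assumptions from the table: 1M requests/month, $15 per million output tokens
REQUESTS_PER_MONTH = 1_000_000
PRICE_PER_M_TOKENS = 15  # dollars

def monthly_savings(unstructured_tokens: int, structured_tokens: int) -> float:
    saved = unstructured_tokens - structured_tokens
    return saved * REQUESTS_PER_MONTH * PRICE_PER_M_TOKENS / 1_000_000

# Reproduces the table's savings column
assert monthly_savings(80, 35) == 675.0
assert monthly_savings(200, 100) == 1500.0
assert monthly_savings(500, 250) == 3750.0
```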

Strategy 2: Prompt Caching

Implementation:

# Anthropic prompt caching
response = client.messages.create(
    model="claude-opus-4-6",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": long_system_prompt,
                    "cache_control": {"type": "ephemeral"}
                },
                {"type": "text", "text": user_query}
            ]
        }
    ]
)

Savings: Up to 90% on repeated content

Strategy 3: Model Selection

Decision Framework:

def select_model(task_complexity, latency_requirement, budget):
    if task_complexity == "simple" and budget == "low":
        return "claude-haiku-4-5"  # or gpt-4o-mini
    if latency_requirement == "realtime":
        return "claude-sonnet-4-6"  # balanced
    if task_complexity == "complex":
        return "claude-opus-4-6"  # or gpt-5
    return "claude-sonnet-4-6"  # sensible default for everything else

Strategy 4: Batch API for Non-Realtime

# OpenAI Batch API
batch_input_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch"
)

batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

Benefits:

  • Lower cost per token
  • Separate rate limit bucket
  • Process thousands of requests overnight
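Each line of the batch input file is a standalone request object (custom_id, method, url, body). A sketch of building requests.jsonl; the custom_id scheme and prompts here are placeholders:

```python
import json

def build_batch_line(custom_id: str, prompt: str, model: str = "gpt-5") -> str:
    """One JSONL line for the Batch API: a caller-chosen custom_id plus
    the method/url/body of an ordinary chat completions request."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
    })

# Produces the requests.jsonl uploaded in the snippet above
with open("requests.jsonl", "w") as f:
    for i, prompt in enumerate(["Summarize doc A", "Summarize doc B"]):
        f.write(build_batch_line(f"task-{i}", prompt) + "\n")
```

The custom_id is what lets you match results back to inputs, since batch output order is not guaranteed.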

6. Common Pitfalls & Solutions

Pitfall 1: Trusting Schema Too Much

Problem: Schema guarantees structure, not semantic quality

Solution:

def validate_review(review: MovieReview) -> bool:
    if not review.title.strip():
        return False
    if review.rating < 0 or review.rating > 10:
        return False
    if len(review.pros) == 0 and len(review.cons) == 0:
        return False
    return True

# Always validate
review = generate_review()
if not validate_review(review):
    ...  # retry with clearer instructions

Pitfall 2: Ignoring Refusal Responses

Problem: OpenAI may return refusal instead of parsed data

Solution:

response = client.beta.chat.completions.parse(
    model="gpt-5",
    response_format=MySchema,
    messages=[...]
)

if response.choices[0].message.refusal:
    handle_refusal(response.choices[0].message.refusal)
else:
    result = response.choices[0].message.parsed

Pitfall 3: Oversized Schemas

Problem: Large schemas increase latency and token consumption

Solution:

  • Split large extraction into multiple focused calls
  • Keep schemas flat (max 5 levels deep)
  • Use enums for categorical fields

Pitfall 4: Incorrect Streaming with Structured Outputs

Problem: Partial JSON invalid until stream completes

Solution:

stream = client.beta.chat.completions.stream(
    model="gpt-5",
    response_format=MovieReview,
    messages=[...]
)

with stream as response:
    for event in response:
        # Don't parse partial JSON
        pass
    # Parse only after stream completes
    final = response.get_final_completion()
    review = final.choices[0].message.parsed

Pitfall 5: Context Window Overflow

Problem: Hitting context limits mid-conversation

Solution:

def check_context_window(messages, model_limit=200000):
    tokens = count_tokens(messages)
    buffer = 5000  # Reserve for response
    
    if tokens > model_limit - buffer:
        # Trigger compaction or summarization
        return compact_context(messages)
    return messages

Pitfall 6: Not Handling Rate Limit Headers

Problem: Blindly retrying without checking limits

Solution:

def make_request_with_header_check(client, **kwargs):
    # The standard SDK response object hides HTTP headers;
    # with_raw_response exposes them alongside the parsed body
    raw = client.chat.completions.with_raw_response.create(**kwargs)
    
    # Log rate limit status
    remaining_tokens = raw.headers.get('x-ratelimit-remaining-tokens')
    reset_time = raw.headers.get('x-ratelimit-reset-tokens')
    
    if remaining_tokens is not None and int(remaining_tokens) < 1000:
        logger.warning(f"Rate limit approaching: {reset_time} until reset")
    
    return raw.parse()  # the usual ChatCompletion object

7. Code Examples Repository

7.1 Production-Ready LLM Client (Python)

from openai import OpenAI, RateLimitError
from pydantic import BaseModel, ValidationError
from tenacity import retry, stop_after_attempt, wait_random_exponential
import logging

logger = logging.getLogger(__name__)

class LLMClient:
    def __init__(self, api_key: str, model: str = "gpt-5"):
        self.client = OpenAI(api_key=api_key)
        self.model = model
    
    @retry(
        wait=wait_random_exponential(min=1, max=60),
        stop=stop_after_attempt(6)
    )
    def generate_structured(
        self,
        messages: list,
        response_format: type[BaseModel],
        temperature: float = 0.3,
        max_retries: int = 3
    ):
        """Generate structured output with retry logic"""
        
        for attempt in range(max_retries):
            try:
                response = self.client.beta.chat.completions.parse(
                    model=self.model,
                    messages=messages,
                    response_format=response_format,
                    temperature=temperature
                )
                
                if response.choices[0].message.refusal:
                    raise ValueError(f"Request refused: {response.choices[0].message.refusal}")
                
                parsed = response.choices[0].message.parsed
                
                # Semantic validation
                if not self.validate_response(parsed):
                    messages.append({
                        "role": "user",
                        "content": "Response missing required fields. Please provide complete data."
                    })
                    continue
                
                return parsed
                
            except ValidationError as e:
                logger.warning(f"Validation error (attempt {attempt + 1}): {e}")
                if attempt == max_retries - 1:
                    raise
                messages.append({
                    "role": "user",
                    "content": f"Schema validation failed: {str(e)}"
                })
        
        raise Exception(f"Failed after {max_retries} attempts")
    
    def validate_response(self, response: BaseModel) -> bool:
        """Override for custom validation logic"""
        return True

7.2 Context Manager for Long Conversations

import logging

import tiktoken
from openai import OpenAI

logger = logging.getLogger(__name__)

class ConversationManager:
    def __init__(self, model: str = "gpt-5", max_context_tokens: int = 100000):
        self.client = OpenAI()  # used by _summarize for compaction
        self.model = model
        self.max_context_tokens = max_context_tokens
        self.messages = []
        self.encoder = tiktoken.encoding_for_model(model)
    
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._manage_context()
    
    def _manage_context(self):
        """Compact context if approaching limit"""
        tokens = self._count_tokens()
        
        if tokens > self.max_context_tokens * 0.8:
            logger.info(f"Context at {tokens} tokens, compacting...")
            self._compact()
    
    def _count_tokens(self) -> int:
        total = 0
        for msg in self.messages:
            total += len(self.encoder.encode(msg["content"]))
        return total
    
    def _compact(self):
        """Summarize old messages"""
        if len(self.messages) < 4:
            return
        
        # Keep system message and last 2 exchanges
        system_msg = next((m for m in self.messages if m["role"] == "system"), None)
        recent = self.messages[-4:]
        
        # Summarize middle messages
        to_summarize = self.messages[1:-4] if system_msg else self.messages[:-4]
        if to_summarize:
            summary = self._summarize(to_summarize)
            self.messages = [
                system_msg,
                {"role": "system", "content": f"Previous conversation summary: {summary}"},
                *recent
            ] if system_msg else [
                {"role": "system", "content": f"Previous conversation summary: {summary}"},
                *recent
            ]
    
    def _summarize(self, messages: list) -> str:
        # Use a cheap model for summarization
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "user", "content": f"Summarize this conversation in 3 sentences:\n{messages}"}
            ],
            max_tokens=200
        )
        return response.choices[0].message.content

8. Parameter Quick Reference by Use Case

| Use Case | Temperature | Top P | Max Tokens | Frequency Penalty | Notes |
|---|---|---|---|---|---|
| Code Generation | 0.2-0.3 | 0.9 | 2048-4096 | 0.0 | Deterministic, accurate |
| Data Extraction | 0.0-0.2 | 0.5 | 512-1024 | 0.0 | Use structured outputs |
| Creative Writing | 0.7-1.0 | 0.95 | 1024-2048 | 0.3-0.5 | High creativity |
| Customer Support | 0.5-0.7 | 0.9 | 512-1024 | 0.2 | Balanced tone |
| Analysis/Reasoning | 0.3-0.5 | 0.9 | 2048+ | 0.1 | Thoughtful responses |
| Classification | 0.0-0.2 | 0.5 | 256 | 0.0 | Consistent outputs |
| Translation | 0.2-0.3 | 0.9 | 1024 | 0.0 | Accurate, fluent |
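For convenience, the table can live in code as a preset lookup; the values below are representative points chosen from the ranges above:

```python
# Representative values drawn from the ranges in the table
PRESETS = {
    "code_generation": {"temperature": 0.2, "top_p": 0.9, "frequency_penalty": 0.0},
    "data_extraction": {"temperature": 0.0, "top_p": 0.5, "frequency_penalty": 0.0},
    "creative_writing": {"temperature": 0.9, "top_p": 0.95, "frequency_penalty": 0.4},
    "customer_support": {"temperature": 0.6, "top_p": 0.9, "frequency_penalty": 0.2},
    "analysis": {"temperature": 0.4, "top_p": 0.9, "frequency_penalty": 0.1},
    "classification": {"temperature": 0.0, "top_p": 0.5, "frequency_penalty": 0.0},
    "translation": {"temperature": 0.2, "top_p": 0.9, "frequency_penalty": 0.0},
}

def sampling_params(use_case: str) -> dict:
    """Return a copy so callers can tweak without mutating the table."""
    return dict(PRESETS[use_case])
```

The presets can be splatted straight into a completion call, e.g. `client.chat.completions.create(model=..., messages=..., **sampling_params("data_extraction"))`.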

9. Source URLs

Official Documentation

  1. OpenAI API Reference: https://developers.openai.com/api/reference/
  2. OpenAI Rate Limits: https://developers.openai.com/api/docs/guides/rate-limits
  3. Anthropic Prompting Best Practices: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/claude-prompting-best-practices
  4. Anthropic Structured Outputs: https://platform.claude.com/docs/en/build-with-claude/structured-outputs
  5. Anthropic System Prompts: https://docs.anthropic.com/en/docs/system-prompts
  6. Google Gemini Structured Outputs: https://ai.google.dev/gemini-api/docs/structured-output
  7. Google GenerationConfig: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/reference/rest/v1beta1/GenerationConfig

Technical Guides

  1. Structured Output Guide (2026): https://devtk.ai/en/blog/ai-structured-output-guide-2026/
  2. ChatGPT Advanced Settings: https://towardsdatascience.com/guide-to-chatgpts-advanced-settings-top-p-frequency-penalties-temperature-and-more-b70bae848069/
  3. Context Window Management: https://redis.io/blog/context-window-management-llm-apps-developer-guide/
  4. LLM Context Management (Zylos): https://zylos.ai/research/2026-01-19-llm-context-management
  5. Google Structured Outputs Announcement: https://blog.google/technology/developers/gemini-api-structured-outputs

Code Examples

  1. OpenAI Cookbook (Rate Limits): https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py
  2. Grizzly Peak Context Strategies: https://www.grizzlypeaksoftware.com/library/context-window-management-strategies-uy5sbwgf

10. Key Takeaways

  1. Use structured outputs for production - all providers now support them with different tradeoffs
  2. Temperature 0.2-0.5 works for most deterministic tasks; use 0.7+ only for creative work
  3. Place long documents at top of prompts for 30% better performance
  4. Implement exponential backoff with jitter for rate limit resilience
  5. Monitor rate limit headers proactively, not just on errors
  6. Validate semantically even with guaranteed schema adherence
  7. Context window ≠ effective context - plan for 30-40% headroom
  8. Structured outputs reduce costs 30-60% on output tokens

Research compiled from official documentation, engineering blogs, and production code examples as of March 2026.