Best Practices and Code Examples
Best Practices · Code Examples · Cache Optimization · Prompt Engineering
Best practices for avoiding cache invalidation, plus reusable code examples covering common scenarios.
1. Core Principles for Avoiding Cache Invalidation
1.1 The Golden Rule: Static Content First
Principle: place static content (system prompt, long documents, knowledge base) first, and dynamic content (user input, timestamps, UUIDs) last.
Bad example (invalidates the cache):
{
"messages": [
{"role": "system", "content": "Current time: 2026-03-08 10:30:00. You are a helpful assistant."},
{"role": "user", "content": "[LONG_DOCUMENT]"},
{"role": "user", "content": "What is the main topic?"}
]
}
Good example (can hit the cache):
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "[LONG_DOCUMENT]\n\nCurrent time: 2026-03-08 10:30:00\n\nWhat is the main topic?"}
]
}
1.2 Checklist of Common Cache-Invalidation Causes
| Cause | Example | Fix |
|---|---|---|
| Leading timestamp | "Current time: ${Date.now()}" | Move the timestamp to the end |
| UUID / random value | "Request ID: ${uuidv4()}" | Move to the end of the message, or into metadata |
| Dynamic counter | "Query #123" | Remove it from the prompt, or use a separate field |
| Leading session ID | "Session: ${sessionId}" | Put it last, or use an HTTP header |
| Whitespace/format drift | "Hello " vs "Hello" | Route all text through one shared formatting function |
| JSON key order | {"b": 1, "a": 2} | Use ordered/sorted serialization |
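The last two rows (whitespace drift and JSON key order) are both solved by routing every serialization through one shared helper. A minimal sketch using Python's standard `json` module:

```python
import json

def normalize_payload(data: dict) -> str:
    """Serialize with sorted keys and fixed separators so that
    logically identical content always produces identical bytes."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

# Key order no longer matters: both calls yield '{"a":2,"b":1}'
print(normalize_payload({"b": 1, "a": 2}))
print(normalize_payload({"a": 2, "b": 1}))
```

Any content embedded in a prompt this way yields a byte-identical prefix across requests, which is exactly what prefix caching matches on.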
1.3 Prompt Structure Design Template
Standard template (recommended):
[System prompt - static]
[Role definition - static]
[Knowledge base / documents - static]
[Tool definitions - static]
[Few-shot examples - static]
---
[Dynamic context]
[User question]
Code implementation:
def build_prompt(system_prompt, knowledge_base, user_question, dynamic_context=None):
    """
    Build a cache-friendly prompt:
    static content first, dynamic content last.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{knowledge_base}\n\n{dynamic_context or ''}\n\nQuestion: {user_question}"}
    ]
    return messages
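A quick property check (restating `build_prompt` so the sketch runs standalone) confirms that two different questions still share the entire static prefix, which is what prefix caching keys on:

```python
def build_prompt(system_prompt, knowledge_base, user_question, dynamic_context=None):
    # Same helper as above: static content first, dynamic content last
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{knowledge_base}\n\n{dynamic_context or ''}\n\nQuestion: {user_question}"},
    ]

kb = "[LONG_DOCUMENT]"
m1 = build_prompt("You are a helpful assistant.", kb, "What is the main topic?")
m2 = build_prompt("You are a helpful assistant.", kb, "Who is the author?")

# The system message and the document portion are byte-identical across requests
assert m1[0] == m2[0]
prefix = f"{kb}\n\n\n\nQuestion: "
assert m1[1]["content"].startswith(prefix) and m2[1]["content"].startswith(prefix)
```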
2. Per-Provider Code Examples
2.1 DeepSeek Best Practices
DeepSeek's caching is fully automatic, so the effort goes into prompt structure design:
import os
from openai import OpenAI

# Initialize the client
deepseek_client = OpenAI(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com"
)

class RAGSystem:
    def __init__(self, system_prompt, document_corpus):
        self.system_prompt = system_prompt
        # Static content: the document corpus
        self.document_corpus = document_corpus

    def query(self, user_question, timestamp=None):
        """
        Query method - dynamic content goes last.
        """
        # Build messages: static content first, dynamic content last
        dynamic_part = ""
        if timestamp:
            dynamic_part += f"Current time: {timestamp}\n\n"
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Document Context:\n{self.document_corpus}\n\n{dynamic_part}Question: {user_question}"}
        ]
        response = deepseek_client.chat.completions.create(
            model="deepseek-chat",
            messages=messages,
            max_tokens=1000
        )
        # Inspect cache-hit statistics
        usage = response.usage
        if hasattr(usage, 'prompt_tokens_details'):
            cached = usage.prompt_tokens_details.cached_tokens
            print(f"Cache hit: {cached} tokens")
        return response.choices[0].message.content

# Usage example
system_prompt = "You are an expert financial analyst. Analyze the following documents and answer questions."
document = """
## Financial Report 2025
[Very long financial report content... 50K tokens]
"""
rag = RAGSystem(system_prompt, document)

# First query - likely a cache miss
response1 = rag.query("What is the revenue growth?")

# Second query - cache hit on the shared prefix
response2 = rag.query("What are the key risks?")

# Even with a timestamp, the cache still hits as long as it comes last
response3 = rag.query("Summarize the report.", timestamp="2026-03-08 10:30:00")
2.2 Qwen Explicit Caching in Practice
Qwen's explicit cache suits scenarios that require a guaranteed hit rate:
import os
from openai import OpenAI

qwen_client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

class QwenCachedRAG:
    def __init__(self):
        self.cache_id = None

    def create_cache(self, system_prompt, documents):
        """
        Explicitly create a cache - for scenarios that need a guaranteed hit rate.
        """
        # Merge the static content
        cache_content = f"{system_prompt}\n\nDocuments:\n{documents}"
        response = qwen_client.chat.completions.create(
            model="qwen-plus",
            messages=[
                {"role": "system", "content": "cache-create"},
                {"role": "user", "content": cache_content}
            ],
            extra_body={
                "context_cache": {
                    "enable": True,
                    "content": cache_content
                }
            }
        )
        # Extract the cache ID (the exact field depends on the API response)
        self.cache_id = response.context_cache_id if hasattr(response, 'context_cache_id') else None
        print(f"Cache created with ID: {self.cache_id}")
        return self.cache_id

    def query_with_cache(self, user_question):
        """
        Query using the cache.
        """
        messages = [
            {"role": "user", "content": f"Question: {user_question}"}
        ]
        # If a cache ID exists, attach it to the request
        extra_body = {}
        if self.cache_id:
            extra_body["context_cache"] = {
                "cache_id": self.cache_id
            }
        response = qwen_client.chat.completions.create(
            model="qwen-plus",
            messages=messages,
            extra_body=extra_body,
            max_tokens=1000
        )
        # Inspect cache usage
        usage = response.usage
        print(f"Total tokens: {usage.prompt_tokens}")
        print(f"Cached tokens: {getattr(usage.prompt_tokens_details, 'cached_tokens', 0)}")
        return response.choices[0].message.content

# Usage example
rag = QwenCachedRAG()
cache_id = rag.create_cache(
    system_prompt="You are a helpful assistant.",
    documents="[Long document content...]"
)

# All subsequent queries should hit the cache
for question in ["What is X?", "How does Y work?", "Explain Z"]:
    answer = rag.query_with_cache(question)
    print(f"Q: {question}\nA: {answer}\n")
2.3 Qwen Session Caching in Practice
Suited to multi-turn conversation scenarios:
import os
import requests

class QwenConversation:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://dashscope.aliyuncs.com/api/v1/responses"
        self.conversation_history = []

    def send_message(self, user_message):
        """
        Send a message using the session cache.
        """
        self.conversation_history.append(
            {"role": "user", "content": user_message}
        )
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "x-dashscope-session-cache": "enable"  # Enable session caching
        }
        data = {
            "model": "qwen-plus",
            "messages": self.conversation_history
        }
        response = requests.post(
            self.base_url,
            headers=headers,
            json=data
        )
        result = response.json()
        assistant_message = result["choices"][0]["message"]["content"]
        # Append to the history
        self.conversation_history.append(
            {"role": "assistant", "content": assistant_message}
        )
        # Show cache usage for this turn
        usage = result.get("usage", {})
        cached_tokens = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        print(f"Cached tokens in this turn: {cached_tokens}")
        return assistant_message

# Usage example
conv = QwenConversation(api_key=os.getenv("DASHSCOPE_API_KEY"))

# Multi-turn conversation - the shared context is reused automatically
print(conv.send_message("What is machine learning?"))
print(conv.send_message("What are the main types?"))
print(conv.send_message("Can you give me an example of supervised learning?"))
2.4 GLM Cache Monitoring in Practice
GLM's caching is fully automatic; the focus is on monitoring cache usage:
import os
from openai import OpenAI

glm_client = OpenAI(
    api_key=os.getenv("ZAI_API_KEY"),
    base_url="https://api.z.ai/v1"
)

class GLMCacheMonitor:
    def __init__(self):
        self.total_requests = 0
        self.total_cached_tokens = 0
        self.total_fresh_tokens = 0

    def query_with_monitoring(self, messages):
        """
        Query with cache monitoring.
        """
        response = glm_client.chat.completions.create(
            model="glm-4.5",
            messages=messages,
            max_tokens=1000
        )
        self.total_requests += 1
        usage = response.usage
        # Extract cache-usage details
        if hasattr(usage, 'prompt_tokens_details'):
            cached = usage.prompt_tokens_details.cached_tokens
            # Fresh tokens = total prompt tokens minus cached tokens
            fresh = usage.prompt_tokens - cached
            self.total_cached_tokens += cached
            self.total_fresh_tokens += fresh
            cache_rate = cached / (cached + fresh) * 100 if (cached + fresh) > 0 else 0
            print(f"Request #{self.total_requests}:")
            print(f"  Total: {usage.prompt_tokens} tokens")
            print(f"  Cached: {cached} tokens ({cache_rate:.1f}%)")
            print(f"  Fresh: {fresh} tokens")
        return response.choices[0].message.content

    def print_summary(self):
        """
        Print a summary of cache usage.
        """
        total = self.total_cached_tokens + self.total_fresh_tokens
        if total > 0:
            overall_cache_rate = self.total_cached_tokens / total * 100
            print(f"\n=== Cache Usage Summary ===")
            print(f"Total requests: {self.total_requests}")
            print(f"Total tokens: {total}")
            print(f"Cached tokens: {self.total_cached_tokens} ({overall_cache_rate:.1f}%)")
            print(f"Fresh tokens: {self.total_fresh_tokens}")
            print(f"Cache hit rate: {overall_cache_rate:.1f}% (actual savings depend on the provider's cached-token discount)")

# Usage example
monitor = GLMCacheMonitor()
system_prompt = "You are a helpful assistant."
document = "[Long document...]"
questions = [
    "What is the summary?",
    "What are the key points?",
    "What are the limitations?",
    "Can you explain section 3?"
]
for q in questions:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{document}\n\nQuestion: {q}"}
    ]
    answer = monitor.query_with_monitoring(messages)
    print(f"A: {answer[:100]}...\n")

monitor.print_summary()
3. Optimization Recipes for Common Scenarios
3.1 RAG System Optimization
Problem: the knowledge base is fixed, but user queries vary.
Optimization:
class OptimizedRAG:
    def __init__(self):
        self.cache_stats = {"hits": 0, "misses": 0}

    def format_documents(self, docs):
        """
        Format documents consistently.
        """
        formatted = []
        for i, doc in enumerate(docs, 1):
            # Use one consistent format to avoid whitespace/newline drift
            formatted.append(f"Document {i}:\n{doc.strip()}")
        return "\n\n".join(formatted)

    def query(self, query, documents):
        """
        Optimized RAG query.
        """
        # Format the documents (static content)
        formatted_docs = self.format_documents(documents)
        # Build messages: documents first, query last
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Use the provided documents to answer questions."},
            {"role": "user", "content": f"Documents:\n{formatted_docs}\n\nQuestion: {query}"}
        ]
        # Send the request (client: any OpenAI-compatible client, e.g. the one from section 2.1)
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=messages
        )
        # Update statistics
        if hasattr(response.usage, 'prompt_tokens_details'):
            cached = response.usage.prompt_tokens_details.cached_tokens
            total = response.usage.prompt_tokens
            if cached / total > 0.5:
                self.cache_stats["hits"] += 1
            else:
                self.cache_stats["misses"] += 1
        return response.choices[0].message.content
3.2 Agent System Optimization
Problem: multi-turn tool calls keep growing the context.
Optimization:
class OptimizedAgent:
    def __init__(self):
        # Static tool definitions, created once so every request shares them
        self.tools = [
            {"name": "search", "description": "Search the web"},
            {"name": "calculator", "description": "Perform calculations"},
            # ...
        ]
        self.tools_description = self._format_tools()
        self.system_prompt = "You are an AI agent with access to tools."

    def _format_tools(self):
        """
        Format the tool definitions - static content.
        """
        return "\n".join([f"- {t['name']}: {t['description']}" for t in self.tools])

    def run(self, user_task, conversation_history=None):
        """
        Run the agent with cache-friendly message ordering.
        """
        # The history goes last
        history_str = ""
        if conversation_history:
            history_str = "\n\nPrevious steps:\n" + "\n".join(conversation_history)
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Available tools:\n{self.tools_description}{history_str}\n\nTask: {user_task}"}
        ]
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=messages,
            tools=self.tools  # Tool definitions (defined once in __init__)
        )
        return response.choices[0].message
3.3 Multi-Turn Conversation Optimization
Problem: the conversation history keeps changing; how do we still reuse the cache?
Optimization:
class OptimizedChat:
    def __init__(self):
        self.system_prompt = "You are a helpful assistant."
        self.document_context = ""  # Optional fixed knowledge base

    def build_messages(self, conversation_history, new_message):
        """
        Build a cache-friendly message list.
        """
        messages = [{"role": "system", "content": self.system_prompt}]
        # A fixed knowledge base, if any, goes here
        if self.document_context:
            messages.append({
                "role": "user",
                "content": f"Context:\n{self.document_context}"
            })
            messages.append({
                "role": "assistant",
                "content": "I'll keep this context in mind for our conversation."
            })
        # Append the conversation history
        messages.extend(conversation_history)
        # The new message goes last
        messages.append({"role": "user", "content": new_message})
        return messages

    def chat(self, conversation_history, user_message):
        messages = self.build_messages(conversation_history, user_message)
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=messages
        )
        return response.choices[0].message.content
4. Debugging and Monitoring
4.1 Cache Hit-Rate Monitoring
from collections import defaultdict

class CacheMonitor:
    def __init__(self):
        self.stats = defaultdict(lambda: {"cached": 0, "fresh": 0, "requests": 0})

    def log_request(self, endpoint, usage):
        """
        Record statistics for one request.
        """
        self.stats[endpoint]["requests"] += 1
        if hasattr(usage, 'prompt_tokens_details'):
            cached = usage.prompt_tokens_details.cached_tokens
            # Derive fresh tokens from the total, which every provider reports
            fresh = usage.prompt_tokens - cached
            self.stats[endpoint]["cached"] += cached
            self.stats[endpoint]["fresh"] += fresh

    def get_report(self):
        """
        Generate a cache-usage report.
        """
        report = []
        for endpoint, data in self.stats.items():
            total = data["cached"] + data["fresh"]
            hit_rate = data["cached"] / total * 100 if total > 0 else 0
            report.append({
                "endpoint": endpoint,
                "requests": data["requests"],
                "hit_rate": f"{hit_rate:.1f}%",
                "cached_tokens": data["cached"],
                "fresh_tokens": data["fresh"]
            })
        return report

    def print_report(self):
        """
        Print the report.
        """
        print("\n=== Cache Performance Report ===")
        for r in self.get_report():
            print(f"\nEndpoint: {r['endpoint']}")
            print(f"  Requests: {r['requests']}")
            print(f"  Hit Rate: {r['hit_rate']}")
            print(f"  Cached Tokens: {r['cached_tokens']:,}")
            print(f"  Fresh Tokens: {r['fresh_tokens']:,}")
4.2 Cache-Miss Diagnosis
def diagnose_cache_miss(current_prompt, previous_prompts):
    """
    Diagnose why a cache miss occurred by locating the divergence point.
    """
    if not previous_prompts:
        return "No previous prompts to compare"
    # Find the longest matching prefix
    for i, prev in enumerate(previous_prompts):
        min_len = min(len(current_prompt), len(prev))
        match_len = 0
        for j in range(min_len):
            if current_prompt[j] == prev[j]:
                match_len += 1
            else:
                break
        if match_len > 100:  # Require at least 100 matching characters
            print(f"Compared with prompt #{i+1}:")
            print(f"  Match length: {match_len} chars")
            print(f"  Current length: {len(current_prompt)}")
            print(f"  Divergence at: '{current_prompt[match_len:match_len+50]}...'")
            # Check for common invalidation causes
            if current_prompt[match_len:match_len+10].isdigit():
                print("  ⚠️ Detected: Possible timestamp/ID at divergence point")
            if "uuid" in current_prompt[match_len:match_len+50].lower():
                print("  ⚠️ Detected: Possible UUID at divergence point")
            return
    print("No significant prefix match found")
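For a quick standalone check of the same idea, `os.path.commonprefix` (which works on arbitrary strings, not just paths) locates the divergence point between two prompts; the prompts below are hypothetical:

```python
import os.path

prev = "You are a helpful assistant.\n[LONG_DOCUMENT]\nCurrent time: 2026-03-08 10:30:00"
curr = "You are a helpful assistant.\n[LONG_DOCUMENT]\nCurrent time: 2026-03-08 10:31:00"

# The shared prefix ends right where the timestamp starts to differ
shared = os.path.commonprefix([prev, curr])
print(f"Shared prefix: {len(shared)} chars")
print(f"Divergence at: {curr[len(shared):len(shared) + 20]!r}")
```

Here the divergence lands inside the timestamp, which is the table in section 1.2's first failure mode: the fix is to move the timestamp after the static content.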
References
- How to Implement Prompt Caching - implementation guide for prompt caching
- Prompt Caching: A Guide With Code Implementation - DataCamp tutorial with code
- Cache the prompt, not the response - best practices for caching strategy
- Prompt Caching: 5 Production Patterns - patterns for production deployments