Best Practices and Code Examples
Best Practices · Code Examples · Cache Optimization · Prompt Engineering
Best practices for avoiding cache invalidation, plus reusable code examples covering common scenarios.
1. Core Principles for Avoiding Cache Invalidation
1.1 The Golden Rule: Static Content First
Principle: place static content (system prompt, long documents, knowledge base) first, and dynamic content (user input, timestamps, UUIDs) last.
Bad example (invalidates the cache):
{
"messages": [
{"role": "system", "content": "Current time: 2026-03-08 10:30:00. You are a helpful assistant."},
{"role": "user", "content": "[LONG_DOCUMENT]"},
{"role": "user", "content": "What is the main topic?"}
]
}
Good example (can hit the cache):
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "[LONG_DOCUMENT]\n\nCurrent time: 2026-03-08 10:30:00\n\nWhat is the main topic?"}
]
}
1.2 Checklist of Common Cache-Invalidation Causes
| Cause | Example | Fix |
|---|---|---|
| Leading timestamp | "Current time: ${Date.now()}" | Move the timestamp to the end |
| UUID / random value | "Request ID: ${uuidv4()}" | Move to the end of the message, or into metadata |
| Dynamic counter | "Query #123" | Remove it from the prompt, or use a separate field |
| Leading session ID | "Session: ${sessionId}" | Put it last, or use an HTTP header |
| Whitespace/format drift | "Hello " vs "Hello" | Route all text through one shared formatting function |
| JSON key order | {"b": 1, "a": 2} | Use ordered/sorted serialization |
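The last two rows (whitespace drift and JSON key order) are both solved by routing every serialization through one shared helper. A minimal sketch using Python's standard `json` module:

```python
import json

def normalize_payload(data: dict) -> str:
    """Serialize with sorted keys and fixed separators so that
    logically identical content always produces identical bytes."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

# Key order no longer matters: both calls yield '{"a":2,"b":1}'
print(normalize_payload({"b": 1, "a": 2}))
print(normalize_payload({"a": 2, "b": 1}))
```

Any content embedded in a prompt this way yields a byte-identical prefix across requests, which is exactly what prefix caching matches on.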
1.3 Prompt Structure Design Template
Standard template (recommended):
[System prompt - static]
[Role definition - static]
[Knowledge base / documents - static]
[Tool definitions - static]
[Few-shot examples - static]
---
[Dynamic context]
[User question]
Code implementation:
def build_prompt(system_prompt, knowledge_base, user_question, dynamic_context=None):
    """
    Build a cache-friendly prompt:
    static content first, dynamic content last.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{knowledge_base}\n\n{dynamic_context or ''}\n\nQuestion: {user_question}"}
    ]
    return messages
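A quick property check (restating `build_prompt` so the sketch runs standalone) confirms that two different questions still share the entire static prefix, which is what prefix caching keys on:

```python
def build_prompt(system_prompt, knowledge_base, user_question, dynamic_context=None):
    # Same helper as above: static content first, dynamic content last
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{knowledge_base}\n\n{dynamic_context or ''}\n\nQuestion: {user_question}"},
    ]

kb = "[LONG_DOCUMENT]"
m1 = build_prompt("You are a helpful assistant.", kb, "What is the main topic?")
m2 = build_prompt("You are a helpful assistant.", kb, "Who is the author?")

# The system message and the document portion are byte-identical across requests
assert m1[0] == m2[0]
prefix = f"{kb}\n\n\n\nQuestion: "
assert m1[1]["content"].startswith(prefix) and m2[1]["content"].startswith(prefix)
```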
2. Per-Provider Code Examples
2.1 DeepSeek Best Practices
DeepSeek's caching is fully automatic, so the effort goes into prompt structure design:
import os
from openai import OpenAI

# Initialize the client
deepseek_client = OpenAI(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com"
)

class RAGSystem:
    def __init__(self, system_prompt, document_corpus):
        self.system_prompt = system_prompt
        # Static content: the document corpus
        self.document_corpus = document_corpus

    def query(self, user_question, timestamp=None):
        """
        Query method - dynamic content goes last.
        """
        # Build messages: static content first, dynamic content last
        dynamic_part = ""
        if timestamp:
            dynamic_part += f"Current time: {timestamp}\n\n"
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Document Context:\n{self.document_corpus}\n\n{dynamic_part}Question: {user_question}"}
        ]
        response = deepseek_client.chat.completions.create(
            model="deepseek-chat",
            messages=messages,
            max_tokens=1000
        )
        # Inspect cache-hit statistics
        usage = response.usage
        if hasattr(usage, 'prompt_tokens_details'):
            cached = usage.prompt_tokens_details.cached_tokens
            print(f"Cache hit: {cached} tokens")
        return response.choices[0].message.content

# Usage example
system_prompt = "You are an expert financial analyst. Analyze the following documents and answer questions."
document = """
## Financial Report 2025
[Very long financial report content... 50K tokens]
"""
rag = RAGSystem(system_prompt, document)

# First query - likely a cache miss
response1 = rag.query("What is the revenue growth?")

# Second query - cache hit on the shared prefix
response2 = rag.query("What are the key risks?")

# Even with a timestamp, the cache still hits as long as it comes last
response3 = rag.query("Summarize the report.", timestamp="2026-03-08 10:30:00")
2.2 Qwen Explicit Caching in Practice
Qwen's explicit cache suits scenarios that require a guaranteed hit rate:
import os
from openai import OpenAI

qwen_client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

class QwenCachedRAG:
    def __init__(self):
        self.cache_id = None

    def create_cache(self, system_prompt, documents):
        """
        Explicitly create a cache - for scenarios that need a guaranteed hit rate.
        """
        # Merge the static content
        cache_content = f"{system_prompt}\n\nDocuments:\n{documents}"
        response = qwen_client.chat.completions.create(
            model="qwen-plus",
            messages=[
                {"role": "system", "content": "cache-create"},
                {"role": "user", "content": cache_content}
            ],
            extra_body={
                "context_cache": {
                    "enable": True,
                    "content": cache_content
                }
            }
        )
        # Extract the cache ID (the exact field depends on the API response)
        self.cache_id = response.context_cache_id if hasattr(response, 'context_cache_id') else None
        print(f"Cache created with ID: {self.cache_id}")
        return self.cache_id

    def query_with_cache(self, user_question):
        """
        Query using the cache.
        """
        messages = [
            {"role": "user", "content": f"Question: {user_question}"}
        ]
        # If a cache ID exists, attach it to the request
        extra_body = {}
        if self.cache_id:
            extra_body["context_cache"] = {
                "cache_id": self.cache_id
            }
        response = qwen_client.chat.completions.create(
            model="qwen-plus",
            messages=messages,
            extra_body=extra_body,
            max_tokens=1000
        )
        # Inspect cache usage
        usage = response.usage
        print(f"Total tokens: {usage.prompt_tokens}")
        print(f"Cached tokens: {getattr(usage.prompt_tokens_details, 'cached_tokens', 0)}")
        return response.choices[0].message.content

# Usage example
rag = QwenCachedRAG()
cache_id = rag.create_cache(
    system_prompt="You are a helpful assistant.",
    documents="[Long document content...]"
)

# All subsequent queries should hit the cache
for question in ["What is X?", "How does Y work?", "Explain Z"]:
    answer = rag.query_with_cache(question)
    print(f"Q: {question}\nA: {answer}\n")
2.3 Qwen Session Caching in Practice
Suited to multi-turn conversation scenarios:
import os
import requests

class QwenConversation:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://dashscope.aliyuncs.com/api/v1/responses"
        self.conversation_history = []

    def send_message(self, user_message):
        """
        Send a message using the session cache.
        """
        self.conversation_history.append(
            {"role": "user", "content": user_message}
        )
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "x-dashscope-session-cache": "enable"  # Enable session caching
        }
        data = {
            "model": "qwen-plus",
            "messages": self.conversation_history
        }
        response = requests.post(
            self.base_url,
            headers=headers,
            json=data
        )
        result = response.json()
        assistant_message = result["choices"][0]["message"]["content"]
        # Append to the history
        self.conversation_history.append(
            {"role": "assistant", "content": assistant_message}
        )
        # Show cache usage for this turn
        usage = result.get("usage", {})
        cached_tokens = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        print(f"Cached tokens in this turn: {cached_tokens}")
        return assistant_message

# Usage example
conv = QwenConversation(api_key=os.getenv("DASHSCOPE_API_KEY"))

# Multi-turn conversation - the shared context is reused automatically
print(conv.send_message("What is machine learning?"))
print(conv.send_message("What are the main types?"))
print(conv.send_message("Can you give me an example of supervised learning?"))
2.4 GLM Cache Monitoring in Practice
GLM's caching is fully automatic; the focus is on monitoring cache usage:
import os
from openai import OpenAI

glm_client = OpenAI(
    api_key=os.getenv("ZAI_API_KEY"),
    base_url="https://api.z.ai/v1"
)

class GLMCacheMonitor:
    def __init__(self):
        self.total_requests = 0
        self.total_cached_tokens = 0
        self.total_fresh_tokens = 0

    def query_with_monitoring(self, messages):
        """
        Query with cache monitoring.
        """
        response = glm_client.chat.completions.create(
            model="glm-4.5",
            messages=messages,
            max_tokens=1000
        )
        self.total_requests += 1
        usage = response.usage
        # Extract cache-usage details
        if hasattr(usage, 'prompt_tokens_details'):
            cached = usage.prompt_tokens_details.cached_tokens
            # Fresh tokens = total prompt tokens minus cached tokens
            fresh = usage.prompt_tokens - cached
            self.total_cached_tokens += cached
            self.total_fresh_tokens += fresh
            cache_rate = cached / (cached + fresh) * 100 if (cached + fresh) > 0 else 0
            print(f"Request #{self.total_requests}:")
            print(f"  Total: {usage.prompt_tokens} tokens")
            print(f"  Cached: {cached} tokens ({cache_rate:.1f}%)")
            print(f"  Fresh: {fresh} tokens")
        return response.choices[0].message.content

    def print_summary(self):
        """
        Print a summary of cache usage.
        """
        total = self.total_cached_tokens + self.total_fresh_tokens
        if total > 0:
            overall_cache_rate = self.total_cached_tokens / total * 100
            print(f"\n=== Cache Usage Summary ===")
            print(f"Total requests: {self.total_requests}")
            print(f"Total tokens: {total}")
            print(f"Cached tokens: {self.total_cached_tokens} ({overall_cache_rate:.1f}%)")
            print(f"Fresh tokens: {self.total_fresh_tokens}")
            print(f"Cache hit rate: {overall_cache_rate:.1f}% (actual savings depend on the provider's cached-token discount)")

# Usage example
monitor = GLMCacheMonitor()
system_prompt = "You are a helpful assistant."
document = "[Long document...]"
questions = [
    "What is the summary?",
    "What are the key points?",
    "What are the limitations?",
    "Can you explain section 3?"
]
for q in questions:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{document}\n\nQuestion: {q}"}
    ]
    answer = monitor.query_with_monitoring(messages)
    print(f"A: {answer[:100]}...\n")

monitor.print_summary()
3. Optimization Recipes for Common Scenarios
3.1 RAG System Optimization
Problem: the knowledge base is fixed, but user queries vary.
Optimization:
class OptimizedRAG:
    def __init__(self):
        self.cache_stats = {"hits": 0, "misses": 0}

    def format_documents(self, docs):
        """
        Format documents consistently.
        """
        formatted = []
        for i, doc in enumerate(docs, 1):
            # Use one consistent format to avoid whitespace/newline drift
            formatted.append(f"Document {i}:\n{doc.strip()}")
        return "\n\n".join(formatted)

    def query(self, query, documents):
        """
        Optimized RAG query.
        """
        # Format the documents (static content)
        formatted_docs = self.format_documents(documents)
        # Build messages: documents first, query last
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Use the provided documents to answer questions."},
            {"role": "user", "content": f"Documents:\n{formatted_docs}\n\nQuestion: {query}"}
        ]
        # Send the request (client: any OpenAI-compatible client, e.g. the one from section 2.1)
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=messages
        )
        # Update statistics
        if hasattr(response.usage, 'prompt_tokens_details'):
            cached = response.usage.prompt_tokens_details.cached_tokens
            total = response.usage.prompt_tokens
            if cached / total > 0.5:
                self.cache_stats["hits"] += 1
            else:
                self.cache_stats["misses"] += 1
        return response.choices[0].message.content
3.2 Agent System Optimization
Problem: multi-turn tool calls keep growing the context.
Optimization:
class OptimizedAgent:
    def __init__(self):
        # Static tool definitions, created once so every request shares them
        self.tools = [
            {"name": "search", "description": "Search the web"},
            {"name": "calculator", "description": "Perform calculations"},
            # ...
        ]
        self.tools_description = self._format_tools()
        self.system_prompt = "You are an AI agent with access to tools."

    def _format_tools(self):
        """
        Format the tool definitions - static content.
        """
        return "\n".join([f"- {t['name']}: {t['description']}" for t in self.tools])

    def run(self, user_task, conversation_history=None):
        """
        Run the agent with cache-friendly message ordering.
        """
        # The history goes last
        history_str = ""
        if conversation_history:
            history_str = "\n\nPrevious steps:\n" + "\n".join(conversation_history)
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Available tools:\n{self.tools_description}{history_str}\n\nTask: {user_task}"}
        ]
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=messages,
            tools=self.tools  # Tool definitions (defined once in __init__)
        )
        return response.choices[0].message
3.3 Multi-Turn Conversation Optimization
Problem: the conversation history keeps changing; how do we still reuse the cache?
Optimization:
class OptimizedChat:
    def __init__(self):
        self.system_prompt = "You are a helpful assistant."
        self.document_context = ""  # Optional fixed knowledge base

    def build_messages(self, conversation_history, new_message):
        """
        Build a cache-friendly message list.
        """
        messages = [{"role": "system", "content": self.system_prompt}]
        # A fixed knowledge base, if any, goes here
        if self.document_context:
            messages.append({
                "role": "user",
                "content": f"Context:\n{self.document_context}"
            })
            messages.append({
                "role": "assistant",
                "content": "I'll keep this context in mind for our conversation."
            })
        # Append the conversation history
        messages.extend(conversation_history)
        # The new message goes last
        messages.append({"role": "user", "content": new_message})
        return messages

    def chat(self, conversation_history, user_message):
        messages = self.build_messages(conversation_history, user_message)
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=messages
        )
        return response.choices[0].message.content
4. Debugging and Monitoring
4.1 Cache Hit-Rate Monitoring
from collections import defaultdict

class CacheMonitor:
    def __init__(self):
        self.stats = defaultdict(lambda: {"cached": 0, "fresh": 0, "requests": 0})

    def log_request(self, endpoint, usage):
        """
        Record statistics for one request.
        """
        self.stats[endpoint]["requests"] += 1
        if hasattr(usage, 'prompt_tokens_details'):
            cached = usage.prompt_tokens_details.cached_tokens
            # Derive fresh tokens from the total, which every provider reports
            fresh = usage.prompt_tokens - cached
            self.stats[endpoint]["cached"] += cached
            self.stats[endpoint]["fresh"] += fresh

    def get_report(self):
        """
        Generate a cache-usage report.
        """
        report = []
        for endpoint, data in self.stats.items():
            total = data["cached"] + data["fresh"]
            hit_rate = data["cached"] / total * 100 if total > 0 else 0
            report.append({
                "endpoint": endpoint,
                "requests": data["requests"],
                "hit_rate": f"{hit_rate:.1f}%",
                "cached_tokens": data["cached"],
                "fresh_tokens": data["fresh"]
            })
        return report

    def print_report(self):
        """
        Print the report.
        """
        print("\n=== Cache Performance Report ===")
        for r in self.get_report():
            print(f"\nEndpoint: {r['endpoint']}")
            print(f"  Requests: {r['requests']}")
            print(f"  Hit Rate: {r['hit_rate']}")
            print(f"  Cached Tokens: {r['cached_tokens']:,}")
            print(f"  Fresh Tokens: {r['fresh_tokens']:,}")
4.2 Cache-Miss Diagnosis
def diagnose_cache_miss(current_prompt, previous_prompts):
    """
    Diagnose why a cache miss occurred by locating the divergence point.
    """
    if not previous_prompts:
        return "No previous prompts to compare"
    # Find the longest matching prefix
    for i, prev in enumerate(previous_prompts):
        min_len = min(len(current_prompt), len(prev))
        match_len = 0
        for j in range(min_len):
            if current_prompt[j] == prev[j]:
                match_len += 1
            else:
                break
        if match_len > 100:  # Require at least 100 matching characters
            print(f"Compared with prompt #{i+1}:")
            print(f"  Match length: {match_len} chars")
            print(f"  Current length: {len(current_prompt)}")
            print(f"  Divergence at: '{current_prompt[match_len:match_len+50]}...'")
            # Check for common invalidation causes
            if current_prompt[match_len:match_len+10].isdigit():
                print("  ⚠️ Detected: Possible timestamp/ID at divergence point")
            if "uuid" in current_prompt[match_len:match_len+50].lower():
                print("  ⚠️ Detected: Possible UUID at divergence point")
            return
    print("No significant prefix match found")
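For a quick standalone check of the same idea, `os.path.commonprefix` (which works on arbitrary strings, not just paths) locates the divergence point between two prompts; the prompts below are hypothetical:

```python
import os.path

prev = "You are a helpful assistant.\n[LONG_DOCUMENT]\nCurrent time: 2026-03-08 10:30:00"
curr = "You are a helpful assistant.\n[LONG_DOCUMENT]\nCurrent time: 2026-03-08 10:31:00"

# The shared prefix ends right where the timestamp starts to differ
shared = os.path.commonprefix([prev, curr])
print(f"Shared prefix: {len(shared)} chars")
print(f"Divergence at: {curr[len(shared):len(shared) + 20]!r}")
```

Here the divergence lands inside the timestamp, which is the table in section 1.2's first failure mode: the fix is to move the timestamp after the static content.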
References
- How to Implement Prompt Caching - implementation guide for prompt caching
- Prompt Caching: A Guide With Code Implementation - DataCamp tutorial with code
- Cache the prompt, not the response - best practices for caching strategy
- Prompt Caching: 5 Production Patterns - patterns for production deployments