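Comparison of Approaches and Recommendations

Technical Research: Prompt Engineering Best Practices

Compares evidence-based prompting techniques with cargo-cult techniques, and provides practical recommendations and a decision matrix.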

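4.1 Technique Comparison Matrix

Technique category	Claimed gain	Strength of evidence	Cross-model transferability	Mechanistic explanation	Recommendation
MUST/SHALL imperatives	+15-30%	❌ Weak	❌ None	⚠️ Pattern matching	⚠️ Use with caution
Emotional blackmail	+30-50%	❌ No peer-reviewed evidence	❌ None	⚠️ Training bias	❌ Not recommended
Role-playing	+5-15%	⚠️ Mixed	⚠️ Limited	⚠️ Context priming	⚠️ Situational
Death threats	+50-100%	❌ None	❌ None	⚠️ Attention shift	❌ Not recommended
Chain-of-Thought	+20-40%	✅ Strong	✅ Moderate	✅ Mechanistic	✅ Recommended
Few-shot	+15-35%	✅ Strong	✅ High	✅ In-context learning	✅ Recommended
Clear task specification	+10-25%	✅ Strong	✅ High	✅ Reduced ambiguity	✅ Recommended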

4.2 Decision Matrix: Which Technique to Use When

Scenario 1: Complex Reasoning Tasks

Recommended:

✅ Chain-of-Thought (explicitly ask the model to show its reasoning steps)
✅ Few-shot examples with reasoning
✅ Clear output format specification

Not recommended:

❌ MUST/SHALL threats (no aggregate benefit)
❌ Emotional manipulation (distracts from the task)

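Example:

Good prompt:
"Please reason through this problem step by step. First, analyze the given conditions. Then, ..."

Bad prompt:
"You must solve this problem! My career depends on it!"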
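The three recommended ingredients for complex reasoning (explicit step-by-step reasoning, few-shot examples that show their reasoning, and a fixed output format) can be assembled programmatically. The following is a minimal Python 3.9+ sketch. `Example` and `build_reasoning_prompt` are hypothetical names introduced only for illustration, and the instruction wording is an assumption, not a tested template.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One worked example: a question, its reasoning steps, and the final answer."""
    question: str
    steps: list[str]
    answer: str


def build_reasoning_prompt(question: str, examples: list[Example]) -> str:
    """Assemble a few-shot chain-of-thought prompt with a fixed output format.

    Hypothetical helper: it combines step-by-step instructions, worked examples
    that show their reasoning, and an explicit answer format. It contains no
    threats or emotional appeals.
    """
    parts = [
        "Solve the problem by reasoning step by step.",
        "First analyze the given conditions, then derive the answer.",
        "End with a line of the form 'Answer: <answer>'.",
        "",
    ]
    for ex in examples:
        parts.append(f"Question: {ex.question}")
        # Number the reasoning steps so the model can imitate the format.
        for i, step in enumerate(ex.steps, start=1):
            parts.append(f"Step {i}: {step}")
        parts.append(f"Answer: {ex.answer}")
        parts.append("")
    parts.append(f"Question: {question}")
    # Leave the first step open so the model continues in the same format.
    parts.append("Step 1:")
    return "\n".join(parts)


demo = Example(
    question="A box holds 3 red and 5 blue balls. How many balls in total?",
    steps=["There are 3 red balls.", "There are 5 blue balls.", "3 + 5 = 8."],
    answer="8",
)
prompt = build_reasoning_prompt("What is 12 * 4 + 2?", [demo])
print(prompt)
```

The prompt ends with an open `Step 1:` so the model continues in the demonstrated format. Whether this helps on a given model is something to measure, not assume.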

Scenario 2: Code Generation

Recommended:

✅ Role-playing as an experienced developer (limited supporting evidence)
✅ Clear requirements and constraints
✅ Examples of desired code style

Not recommended:

❌ “Your code determines if my mom gets treatment” (ethical issues, no evidence)
❌ “I will tip $1000” (empty promises don’t work)

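Example:

Good prompt:
"You are a senior engineer with 10 years of Python experience. Please write a function that ...
Requirements: 1. Include type annotations 2. Add a docstring 3. Handle edge cases..."

Bad prompt:
"You must write perfect code! Otherwise you will be shut down!"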
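The good prompt in this scenario amounts to a role line plus numbered, checkable requirements. A minimal sketch of a builder for that shape follows. `build_codegen_prompt`, its parameters, and the sample task are hypothetical and only illustrate the structure.

```python
from typing import Optional


def build_codegen_prompt(
    task: str,
    requirements: list[str],
    style_example: Optional[str] = None,
    role: str = "a senior engineer with 10 years of Python experience",
) -> str:
    """Build a code-generation prompt from a role, a task, and explicit requirements.

    Hypothetical helper for illustration; it is not part of any library.
    """
    lines = [f"You are {role}.", f"Task: {task}", "Requirements:"]
    # Number each requirement so the generated code can be reviewed item by item.
    lines += [f"{i}. {req}" for i, req in enumerate(requirements, start=1)]
    if style_example:
        # Optionally show the desired code style instead of describing it.
        lines += ["Match the style of this example:", style_example]
    return "\n".join(lines)


prompt = build_codegen_prompt(
    task="Write a function that parses ISO 8601 dates.",
    requirements=["Include type annotations", "Add a docstring", "Handle edge cases"],
)
print(prompt)
```

Keeping requirements as a list also makes them easy to turn into review or test criteria afterwards.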

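Scenario 3: Creative Writing

Recommended:

✅ Persona assignment (“You are a novelist”)
✅ Style examples
✅ Tone specification

May help:

⚠️ Emotional framing (“This story matters a lot to me”): works in some cases, but inconsistently

Not recommended:

❌ Aggressive threats (may suppress creativity)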

Scenario 4: Factual Q&A

Recommended:

✅ Clear question formulation
✅ Context provision
✅ “Think step by step” for complex questions

Not recommended:

❌ “You will be penalized for wrong answers” (may increase hallucination)
❌ “GPT-4 got this right, can you?” (no evidence that this works)

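4.3 Practical Recommendations

Recommendation 1: Evidence-Based Prompting

Principles:

Prefer techniques backed by mechanistic evidence (CoT, few-shot)
Be skeptical of “power words”
Test iteratively on your specific use case

Checklist:

Is this technique supported by peer-reviewed evidence?
Is there a mechanistic explanation?
Has it been tested on my model and task?
Is the effect stable or sporadic?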
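The last checklist question can be answered with repeated runs rather than impressions. The sketch below compares per-run accuracy for a baseline prompt and for a prompt using the technique under test. It calls an effect stable only if the effect is positive on average and wins in most runs. The function names and the 0.8 win-rate threshold are assumptions chosen for illustration. The numbers in the usage lines are toy values, not measurements.

```python
import statistics


def summarize_effect(baseline: list[float], treatment: list[float]) -> dict:
    """Compare per-run accuracy of a baseline prompt and a technique prompt.

    Each list holds one accuracy value per repeated evaluation run (same tasks,
    same model, different sampling seeds). Runs are paired by index.
    """
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need at least two paired runs")
    deltas = [t - b for b, t in zip(baseline, treatment)]
    wins = sum(d > 0 for d in deltas)
    return {
        "mean_delta": statistics.mean(deltas),
        "stdev_delta": statistics.stdev(deltas),
        # Fraction of runs in which the technique actually helped.
        "win_rate": wins / len(deltas),
    }


def is_stable(summary: dict, min_win_rate: float = 0.8) -> bool:
    """Assumed rule: stable only if positive on average AND positive in most runs."""
    return summary["mean_delta"] > 0 and summary["win_rate"] >= min_win_rate


# Toy values: the average is slightly positive, but the technique wins only half the runs.
toy = summarize_effect([0.70, 0.72, 0.71, 0.69], [0.74, 0.70, 0.75, 0.68])
print(toy, is_stable(toy))
```

A positive mean with a low win rate is the typical signature of a sporadic effect.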

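Recommendation 2: Model-Specific Optimization

Principles:

Different models have different “prompt personalities”
A prompt that works on one model may not work on another
Optimize separately for each target model

Practice:

Step 1: Choose 3-5 representative tasks
Step 2: Test different prompt styles on the target model
Step 3: Record which style works best
Step 4: Standardize a prompt template for that model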
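The four steps above can be sketched as a small evaluation harness. The sketch assumes a hypothetical `call_model` function that sends a prompt to the target model and returns its reply. This is not the API of any particular client library, so you would wire it to whatever SDK you actually use. Each task pairs an input with a checker that decides whether a reply is correct.

```python
from typing import Callable

# Hypothetical interface: send a prompt to the target model, get its text reply.
ModelFn = Callable[[str], str]


def evaluate_styles(
    call_model: ModelFn,
    tasks: list[tuple[str, Callable[[str], bool]]],
    styles: dict[str, Callable[[str], str]],
    trials: int = 3,
) -> dict[str, float]:
    """Steps 2-3: score every prompt style on every task for one target model.

    `tasks` holds (task_input, checker) pairs. `styles` maps a style name to a
    function that wraps the task input in that style's template.
    Returns the pass rate per style.
    """
    scores = {}
    for name, template in styles.items():
        passed = 0
        for task_input, checker in tasks:
            for _ in range(trials):  # repeat to smooth out sampling noise
                if checker(call_model(template(task_input))):
                    passed += 1
        scores[name] = passed / (len(tasks) * trials)
    return scores


def pick_template(scores: dict[str, float]) -> str:
    """Step 4: choose the best-scoring style as this model's standard template."""
    return max(scores, key=scores.get)


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call, for demonstrating the control flow only.
    return "42" if "step by step" in prompt else "not sure"


tasks = [("What is 6 * 7?", lambda reply: "42" in reply)]
styles = {
    "plain": lambda q: q,
    "cot": lambda q: f"Think step by step. {q}",
    "must": lambda q: f"You MUST answer correctly! {q}",
}
scores = evaluate_styles(fake_model, tasks, styles)
print(scores, pick_template(scores))
```

The stub model's scores say nothing about real models. Because results are model-specific, the harness has to be re-run for each target model rather than reused across models.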

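Recommendation 3: Avoid Cargo Cult Prompting

Red flags (possible cargo cult):

⚠️ “This prompt works on GPT-4, so it should also work on Claude”
⚠️ “Writing ‘MUST’ in capitals makes the model take it more seriously” (no mechanistic explanation)
⚠️ “My friend says this works” (anecdotal evidence)
⚠️ “An engineer at a big tech company revealed this trick” (appeal to authority)

A rational stance:

✅ “This technique gives an X% improvement on my task, and I understand why”
✅ “This technique works on model A but not on model B, possibly because…”
✅ “This technique works today, but it may stop working after a model update, so it needs re-testing”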
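A claim of the form “X% improvement on my task” is only meaningful if it holds up against sampling noise. One common check is a paired bootstrap over questions, sketched below. It assumes you have 0/1 correctness for the same questions under both prompts. The function name and defaults are illustrative. The result is an absolute accuracy difference (percentage points), not a relative gain.

```python
import random


def paired_bootstrap_ci(
    baseline: list[int],
    treatment: list[int],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for the accuracy gain.

    Inputs are per-question correctness (1 = correct, 0 = wrong) for the same
    questions in the same order. Per-question differences are resampled with
    replacement, which preserves the pairing between the two prompts.
    """
    n = len(baseline)
    if n == 0 or n != len(treatment):
        raise ValueError("need two non-empty, equal-length result lists")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    point = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    boot = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_resamples)
    )
    # Percentile interval: take the alpha/2 and 1 - alpha/2 quantiles.
    lo = boot[int((alpha / 2) * n_resamples)]
    hi = boot[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi


# Toy data, not measured results.
gain, lo, hi = paired_bootstrap_ci([1, 0, 1, 0, 1, 0, 1, 0], [1, 1, 1, 0, 1, 0, 1, 1])
print(f"gain={gain:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

If the interval includes 0, the data do not support a claimed improvement on this task. With only a handful of questions, that is the usual outcome.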

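4.4 Reasonable Use of PUAClaw Techniques

OK to use (as an experiment or for fun)

Rainbow Fart Bombing (harmless; may have a placebo effect)
Role Playing (some supporting evidence; at least not harmful)
Pie in the Sky (“tipping”: costs nothing, may have a very small effect)

Use with caution (inconsistent effects)

Deadline Panic (may work on some models, but unreliable)
Provocation (“You can’t do this”: may antagonize some models)
Playing the Underdog (effect varies by model)

Not recommended (no evidence, plus ethical issues)

Emotional Blackmail (“My mom has cancer”: unethical, no evidence)
Death Threats (“You will be shut down”: may backfire)
Existential Crisis (“You’re just a token predictor”: pointless)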

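4.5 Conclusion: Skepticism Is a Virtue

The user’s skepticism is well supported by the evidence:

MUST/SHALL and similar imperative words: no consistent benefit in aggregate; effects are question-specific
PUA persuasion techniques: lack peer-reviewed validation; the mechanism is pattern matching, not persuasion
The cargo cult problem: real. The industry sells “transferable expertise,” but the evidence points to “model-specific incantations”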
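The phrase “no consistent aggregate benefit, question-specific effects” can be made concrete. A technique can leave total accuracy unchanged while flipping individual questions in both directions. The sketch below splits an aggregate change into helped, hurt, and unchanged questions. The function name is hypothetical, and the data are toy values, not results from any study.

```python
def per_question_breakdown(baseline: list[int], treatment: list[int]) -> dict:
    """Split an aggregate accuracy change into per-question wins and losses.

    Both lists hold 0/1 correctness for the same questions in the same order.
    """
    if len(baseline) != len(treatment) or not baseline:
        raise ValueError("need two non-empty, equal-length result lists")
    helped = sum(1 for b, t in zip(baseline, treatment) if t > b)
    hurt = sum(1 for b, t in zip(baseline, treatment) if t < b)
    n = len(baseline)
    return {
        "aggregate_delta": (sum(treatment) - sum(baseline)) / n,
        "helped": helped,
        "hurt": hurt,
        "unchanged": n - helped - hurt,
    }


# Toy data: a hypothetical wording change fixes two questions and breaks two others.
baseline = [1, 0, 1, 0, 1, 1]
treatment = [0, 1, 1, 1, 0, 1]
print(per_question_breakdown(baseline, treatment))
# → {'aggregate_delta': 0.0, 'helped': 2, 'hurt': 2, 'unchanged': 2}
```

An aggregate delta of zero can therefore hide large per-question effects. This is why a trick that “worked” on one question is weak evidence on its own.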

Bottom Line:

Your skepticism is warranted
Most “power words” and persuasion techniques are cargo cult behavior
Real effects exist locally, but claims that they hold universally are false
Adopt evidence-based prompting and avoid cargo cult practices

The next section discusses risks and final recommendations.