Experimental Evidence and Mechanism Analysis

Based primarily on peer-reviewed research from 2025-2026, this section systematically evaluates how well imperative constraints and persuasion techniques actually work.

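3.1 Effect of Imperative Constraints (MUST/SHALL)

Wharton Generative AI Labs direct experiment (March 2025)

Experimental design:

Directly compared "Please" (polite request) with "I order you to answer" (command)
Ran controlled experiments on multiple question-answering datasets
Measured the difference in accuracy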

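Key findings:

Question type	"Please" advantage	"I order" advantage
Some questions	+60 percentage points	-
Other questions	-	+60 percentage points
Aggregate	Differences cancel out	No consistent winner

Conclusion (quoted from the source):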

“Prompt modifications, like politeness [or commanding], influence individual responses but have minimal overall effect. Aggregate model characteristics dominate over specific prompting strategies.”

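Interpretation:

Imperative wording (MUST/SHALL) may help on individual questions
At the aggregate level (averaged across tasks) there is no consistent gain
The effect is question-specific rather than universal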
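
To make the cancellation concrete, here is a minimal Python sketch. The per-question accuracies are invented for illustration and are not the Wharton data; the names `polite`, `command`, and `per_question_delta` are likewise hypothetical. It shows how 60-point swings on individual questions can coexist with an aggregate difference of zero.

```python
from statistics import mean

# Hypothetical per-question accuracy (fraction of correct trials) for two
# phrasings of the same prompt. These numbers are made up for illustration.
polite = {"q1": 0.90, "q2": 0.30, "q3": 0.55, "q4": 0.60}
command = {"q1": 0.30, "q2": 0.90, "q3": 0.60, "q4": 0.55}


def per_question_delta(a, b):
    """Return accuracy(a) - accuracy(b) for every question, rounded to 2 places."""
    return {q: round(a[q] - b[q], 2) for q in a}


deltas = per_question_delta(polite, command)

# Aggregate effect: difference between the mean accuracies of the two phrasings.
aggregate = round(mean(polite.values()) - mean(command.values()), 2)

# Largest swing observed on any single question, in either direction.
largest_swing = max(abs(d) for d in deltas.values())

print("per-question deltas:", deltas)
print("aggregate delta:", aggregate)
print("largest single-question swing:", largest_swing)
```

Reporting both the per-question deltas and the aggregate is the point of the sketch: either number alone would give a misleading picture of what the phrasing change does.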

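Sclar et al. study (EMNLP 2024)

Experiment: tested 26 prompting principles, including:

"You will be penalized" (threat)
"Ensure your answer is unbiased" (constraint)
Various imperative formulations

Findings:

Effects "varied wildly across models"
A technique that works on one model may not work on another
The impact of formatting changes (such as an uppercase "MUST") differs by model

Conclusions:

Subtle formatting changes produced differences of up to 76 accuracy points
Yet format performance "only weakly correlates between models"
Implication: an uppercase "MUST" may help on GPT-4 and be ineffective or even harmful on Claude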
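
One way to check whether format performance carries over between two models is a rank correlation over the same set of prompt formats. The sketch below is an assumption-laden illustration, not the method used in the study: the format names and accuracies are invented, and `ranks` and `spearman_rho` are hypothetical helpers that use the tie-free Spearman formula.

```python
def ranks(values):
    """Rank values from 1 (lowest) to n (highest). Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)), no ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical accuracy of five prompt formats on two models (invented numbers).
formats = ["plain", "UPPERCASE MUST", "bulleted", "numbered", "XML tags"]
model_a_acc = [0.62, 0.48, 0.71, 0.55, 0.40]
model_b_acc = [0.51, 0.60, 0.58, 0.66, 0.45]

rho = spearman_rho(model_a_acc, model_b_acc)
best_a = formats[model_a_acc.index(max(model_a_acc))]
best_b = formats[model_b_acc.index(max(model_b_acc))]

print("Spearman rho between models:", round(rho, 2))
print("best format on model A:", best_a)
print("best format on model B:", best_b)
```

A rho close to 1 would mean the format ranking transfers; a value near 0, as in this made-up data, means the best format for one model tells you little about the other.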

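Research conclusions: Imperative Constraints

Claim	Evidence support
"MUST/SHALL always works"	❌ No support
"MUST/SHALL works on some questions"	✅ Supported
"Effects transfer across models"	❌ Contradicted by evidence
"Significant gains at the aggregate level"	❌ Contradicted by evidence

Verdict: Imperative wording such as MUST/SHALL is a locally effective, non-generalizable technique. It may be useful on a specific task or model, but there is no evidence for it as a "general principle".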

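3.2 Effect of Persuasion Techniques

EmotionPrompt study (Li et al., 2023)

Claim: psychological stimuli ("This is very important to my career") can improve performance

Status:

Preprint, not peer-reviewed at the time of release
Shows some improvement, but the mechanism is unclear
Sample size and experimental design have been questioned by later studies

"The Neurolinguistic Architecture of LLM Performance" (Dec 2025)

Core argument:

Psychological framing works through statistical pattern matching, not genuine persuasion
LLMs have learned the correlation between emotional language and high-quality training data
Not magic: it exploits training distribution biases rather than "understanding"

Mechanism:

In training data: serious, formal, or emotionally intense contexts → high-quality answers
What the model learns: emotional language → switch to "serious mode"
Actual mechanism: pattern matching, not persuasion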

Key research gap

Research gap:

No peer-reviewed study directly tests PUA techniques (negging, scarcity, authority claims)
Most "persuasion" research focuses on:
- AI generating personality-matched messages (Xu & Zhao, 2025)
- AI persuading humans (Nature Human Behaviour, 2025)
- Not on humans using persuasion to improve AI output

3.3 Effect of Persona Prompting

"The Prompt Makes the Person(a)" (Lutz et al., 2025)

Experiment: systematic evaluation of sociodemographic persona prompting

Findings:

“How a persona prompt is formulated can significantly affect outcomes”
But the effect is inconsistent across models
Personas help with generating diverse perspectives; their effect on improving task accuracy is limited

Conclusions:

Persona effects exist but are overstated
Not suitable for accuracy-critical tasks

3.4 Mechanisms of Jailbreak Techniques

"Do Anything Now" (DAN) analysis (CCS 2024)

Findings:

Jailbreaks exploit specific model vulnerabilities rather than universal "persuasion"
Success rates vary widely by model and patch level

JailbreakRadar (CISPA, 2025)

Evaluation: 17 jailbreak attacks across 9 LLMs

Key findings:

No single technique works universally
What works depends on model architecture, training, and specific guardrails

"What Features in Prompts Jailbreak LLMs?" (Kirch et al., 2024)

Mechanistic insight:

Certain token patterns trigger different attention pathways
Not persuasion: pattern-based exploitation

3.5 Chain-of-Thought: The Exception

Why CoT works (multiple studies)

Mechanistic evidence:

Sparse autoencoder studies show that CoT activates actual reasoning circuits
Counterfactual studies: even invalid reasoning steps help, so structure matters more than content
2026 update: know when it backfires, since CoT can hurt on simple tasks

Key distinction:

CoT has a mechanistic explanation backed by interpretability research
Not cargo cult: it is supported by evidence at the level of neural mechanisms

3.6 The Cargo Cult Problem

"Prompt Engineering Is Mostly Cargo Cult Behaviour" (Golev, Jan 2026)

Core argument:

Techniques are model-specific and time-bound
Model updates break prompts that used to work
The industry sells "transferable expertise", but the evidence points to "model-specific incantation"

Wharton's conclusion

“Being polite or commanding yields question-specific differences rather than global improvements, and these effects often diminish when aggregated”

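3.7 Overall Assessment of the Evidence

Answering the user's questions

Question 1: Do imperative constraints (MUST/SHALL) improve LLM output quality?

Answer: Inconsistently.

The Wharton study tested this directly
The effect is question-dependent (it can help or hurt by 60+ points)
It is model-dependent (weak correlation across models)
Net effect: negligible at the aggregate level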

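Question 2: What does the research say about "jailbreak"-style prompts?

Answer: They work, but the mechanism is not persuasion.

They exploit specific model vulnerabilities
Success rates vary widely by model and guardrail version
Pattern exploitation, not psychological manipulation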

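Question 3: Do PUA-style techniques work, or are they a placebo?

Answer: Probably a placebo, with occasional local effects.

No peer-reviewed evidence shows that PUA techniques improve output quality
Emotional framing (EmotionPrompt) shows some effect, but the mechanism is pattern matching
Claims exceed evidence: most "persuasion" research is about AI persuading humans, not the reverse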

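Question 4: Is this cargo culting?

Answer: Partly.

Real effects exist locally (your prompt may work on your task and model today)
Effects don't transfer reliably (it may fail tomorrow or on a different model)
The industry presents this as "engineering", but the evidence suggests "probabilistic pattern-matching"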

What Actually Works (Evidence-Backed)

✅ Chain-of-thought for complex reasoning (mechanistically validated)
✅ Few-shot examples with correct reasoning steps
✅ Clear task specification (formatting, output structure)
✅ Model-appropriate prompting (GPT-4 ≠ Claude ≠ LLaMA)
✅ Iterative testing on your specific use case
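
For the last item, a paired comparison on your own evaluation set is one reasonable way to test a prompt change. The sketch below is only an illustration under stated assumptions: the outcome lists are invented, `discordant_counts` and `mcnemar_exact_p` are hypothetical helper names, and the exact McNemar (sign) test is one of several defensible choices for paired right/wrong outcomes.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count items that exactly one of the two prompt variants answered correctly."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    return only_a, only_b


def mcnemar_exact_p(only_a, only_b):
    """Two-sided exact McNemar p-value, computed from the discordant pairs only."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the variants never disagree, so there is nothing to test
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-item outcomes for the same 20 test items (invented data).
# Variant A could be a prompt with "MUST"; variant B the same prompt without it.
variant_a = [True] * 8 + [False] * 2 + [True] * 6 + [False] * 4
variant_b = [False] * 8 + [True] * 2 + [True] * 6 + [False] * 4

only_a, only_b = discordant_counts(variant_a, variant_b)
p_value = mcnemar_exact_p(only_a, only_b)
acc_a = sum(variant_a) / len(variant_a)
acc_b = sum(variant_b) / len(variant_b)

print(f"accuracy A={acc_a:.2f}, B={acc_b:.2f}, p={p_value:.4f}")
```

In this made-up run, variant A leads by 30 accuracy points, yet with only 20 items the p-value stays above 0.05. Small evaluation sets can easily make a phrasing trick look like it works.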

What May Be a Placebo

❌ “MUST/SHALL” imperative language
❌ PUA-style emotional manipulation
❌ “You will be penalized” threats
❌ Arbitrary persona assignments for accuracy tasks
❌ “I’m going to tip $100” incentives

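3.8 Mechanism: Why "Locally Effective"?

Training Distribution Bias

During training, an LLM encounters:

Formal or serious contexts → high-quality answers (academic papers, technical documentation)
Casual or informal contexts → low-quality answers (forum posts, social media)

When a prompt uses:

Formal language (MUST/SHALL)
Emotional framing ("this is important")
Role assignment ("you are an expert")

the model detects that these patterns correlate with "high-quality answer" contexts in its training data and shifts to a more careful generation mode.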

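Not Persuasion, but Pattern Matching

Key points:

An LLM has no mental state that can be "persuaded"
An LLM has no emotions that can be moved
The so-called "effect" is statistical correlation, not causation

This explains why:

Effects do not transfer across models (different models have different training distributions)
Effects are not stable over time (model updates change the distributions)
Effects are not universal (they fail on some questions and contexts)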