We'll use one running example throughout: a single sentence describing a USDA report. Every metric is computed on this same example, so nothing gets confusing.
The government publishes stacks of reports every year ("nursing home abuse investigations", "USDA administrative reform", and so on), dozens of pages of dense text. Nobody wants to read it all, but the information inside is genuinely useful. This project trains an AI to distill a summary for you.
But AI has a problem: it hallucinates. It will confidently write things into the summary that the source never said, which is dangerous, especially for policy documents. So the highlight of this project is that after the AI generates a summary, it checks its own work for fabrications. That's Self-Checking.
There's a second problem: the documents are too long for the AI to read in one pass. Our model takes at most 16,000 tokens at a time, but government reports often run tens of thousands of words. Everything past the limit is simply truncated and never read.
The fix is Divide & Conquer (D&C): split the document into chunks, summarize each chunk, then merge. We built three merge strategies and ran experiments to see which works best.
Quick setup: our USDA report is long, so it was split into 2 chunks. I'll walk through how each of the three methods handles these 2 chunks, using their actual content.
Chunk 1 (first half, ~3,000 words):
USDA has 13 staff offices and 8 mission areas comprising 18 agencies. In 2017, the Secretary of Agriculture issued a memorandum requiring each mission area to establish a business center to consolidate administrative services, aiming to improve efficiency and cross-department collaboration...
Chunk 2 (second half, ~3,000 words):
By the time the report was written, all 8 mission areas had established business centers. GAO audited the effort, found uneven progress across agencies, and recommended that USDA set up a unified tracking mechanism and report regularly to Congress...
MapReduce. How it works: summarize the two chunks independently to get two sub-summaries, then feed both sub-summaries to the model together and have it write the final summary.
Where it hurts: the sub-summaries repeat each other in places, so the model has to decide what to keep and what to drop, rewriting many words in the process. More chunks and heavier compression make the rewriting worse. This one-shot merge is the bottleneck.
Map-Cluster-Reduce. How it works: tries to improve on MapReduce by first clustering the sub-summaries by topic, merging within each cluster, then merging across clusters. In theory, grouping same-topic content makes each merge cleaner.
Theory vs. reality: on paper, clustering puts same-topic content together and merging gets tidier. In practice, an extra merge layer is an extra round of rewriting. Two rewrite passes stack up, ROUGE-2 loses even more than MapReduce, and this method ranks last. A more complex design does not mean better results.
The first two methods share one fundamental issue: the merge step handles multiple pieces of content at once, which forces heavy rewriting. Map-Refine takes a different approach.
Map-Refine. How it works: read chunk 1 and write a draft summary. Then feed chunk 2 together with the existing draft to the model and have it update the draft slightly, producing the final summary. Each step only appends one new piece of information; nothing is merged all at once.
Why it wins: each merge handles only "1 new chunk + the existing summary", so the load is minimal, rewriting is minimal, and errors accumulate slowest. It is closest to how humans read: take notes as you go and keep updating them. The more chunks there are, the bigger this advantage gets.
→ Experimental result: highest on all three positive metrics (ROUGE-1, BERTScore, coverage); best overall.
Exactly. That's the progression across the three methods. The key insight: don't try to merge more cleverly; try to merge less at a time. Map-Refine gets that right.
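The progression above can be sketched in a few lines of Python. This is an illustrative skeleton, not the project's actual code: `llm` is a hypothetical stand-in for the summarization model (here a deterministic stub so the control flow runs), and the prompt strings are placeholders.

```python
# Sketch of the MapReduce and Map-Refine merge strategies.
# `llm` is a hypothetical stand-in for the real model call; here it is a
# deterministic stub so the control flow can actually be executed.
def llm(prompt: str) -> str:
    return "<summary of: " + prompt.splitlines()[0][:40] + ">"

def map_reduce(chunks):
    # Map: summarize every chunk independently.
    subs = [llm("Summarize this chunk:\n" + c) for c in chunks]
    # Reduce: one-shot merge of all sub-summaries (the rewriting bottleneck).
    return llm("Merge these summaries into one:\n" + "\n---\n".join(subs))

def map_refine(chunks):
    # Draft from chunk 1, then fold in one chunk at a time.
    draft = llm("Summarize this chunk:\n" + chunks[0])
    for chunk in chunks[1:]:
        draft = llm("Update the draft with the new material, changing as little "
                    "as possible.\nDraft:\n" + draft + "\nNew material:\n" + chunk)
    return draft
```

Map-Cluster-Reduce would insert a clustering pass between the map and reduce steps, which is exactly the extra rewriting layer discussed above.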
Every metric below is computed on the same example, a real sentence from the project data (the USDA report). Remember these two sentences:
Reference summary (human-written ground truth):
"USDA established business centers in eight mission areas"
AI-generated summary:
"USDA created business centers across mission areas and staff offices"
At a glance the two mean roughly the same thing: USDA established business centers across its mission areas. But some words differ: the AI says "created" where the reference says "established", adds "staff offices", and drops "eight".
Next, we'll walk every evaluation metric through these two sentences, end to end.
ROUGE-1 is the simplest: count how many of the same words appear in both texts. But "how many words overlap" isn't enough; you also need to distinguish whether the AI said too much or too little. That's where precision and recall come in.
The Venn diagram makes it obvious at a glance:
From the diagram, three questions are answered immediately:
① How much of what the AI said is correct? → right circle: 5 overlapping words ÷ 10 AI words = 50%. That's Precision.
② How much of the reference did the AI cover? → left circle: 5 overlapping words ÷ 8 reference words = 62.5%. That's Recall.
③ Combined score? → the harmonic mean of the two, called F1. That's what we report as the ROUGE-1 score.
Why not use just one? Because either one alone can be gamed.
If you only look at precision, the AI could say a single word; as long as it's correct, precision is 100%, but the summary is useless.
If you only look at recall, the AI could copy-paste the entire source into the summary; every reference word appears, recall is 100%, but that's not a summary.
F1 is the harmonic mean: if either side goes extreme, F1 gets dragged down. Both must be good for F1 to be good, which makes F1 the more reliable combined metric.
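The whole ROUGE-1 calculation fits in a few lines. A minimal sketch using whitespace tokenization and multiset overlap (the official ROUGE implementation adds details such as stemming, so treat this as an approximation):

```python
# ROUGE-1 on the running example, via simple whitespace tokens and
# multiset overlap (a simplification of the official ROUGE script).
from collections import Counter

def rouge1(candidate: str, reference: str):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())      # words in both, with counts
    precision = overlap / sum(cand.values())  # of what the AI said, how much is right
    recall = overlap / sum(ref.values())      # of the reference, how much is covered
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

ref = "USDA established business centers in eight mission areas"
cand = "USDA created business centers across mission areas and staff offices"
p, r, f1 = rouge1(cand, ref)
print(round(p, 3), round(r, 3), round(f1, 3))   # 0.5 0.625 0.556
```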
ROUGE-1 matches single words, which is too loose. If the reference says "business centers" and the AI says "business cafeteria", ROUGE-1 still counts "business" as an overlap, even though the two phrases clearly mean different things.
ROUGE-2's improvement: treat every pair of adjacent words as a unit (a bigram), and count a match only when the whole pair is identical. Run the same example through it:
Only 2 bigrams appear on both sides: (business, centers) and (mission, areas).
"established business" vs. "created business": ROUGE-1 counted "business" as a match, but ROUGE-2 looks at the bigram as a whole, so it doesn't count.
Same example: ROUGE-1 = 0.556 versus ROUGE-2 = 0.250, more than a 2× gap.
The reason: the two sentences swap many words ("established" → "created", "in eight" → "across"). Word-level overlap is still substantial, but bigram-level overlap collapses. ROUGE-2 is highly sensitive to paraphrasing: change the phrasing and the score drops.
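The same sketch moved up to bigrams (again whitespace tokens, an approximation of the official implementation):

```python
from collections import Counter

def bigrams(text: str) -> Counter:
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))

def rouge2(candidate: str, reference: str) -> float:
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())   # bigrams present in both
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())       # 2 of 9 candidate bigrams
    r = overlap / sum(ref.values())        # 2 of 7 reference bigrams
    return 2 * p * r / (p + r)

ref = "USDA established business centers in eight mission areas"
cand = "USDA created business centers across mission areas and staff offices"
print(round(rouge2(cand, ref), 3))   # 0.25
```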
ROUGE-1 only asks: did this word appear? Order is ignored. Even if the AI says all the right words in scrambled order, ROUGE-1 still scores high.
ROUGE-L adds a requirement: the overlapping words must also appear in the same relative order in both texts. It finds the longest common subsequence (LCS): you may skip intermediate words, but you may not reorder them.
Reference summary:
Version A (normal order): LCS = 5 words
✓ The highlighted words appear in order: USDA → business → centers → mission → areas
Version B (scrambled): the same words, shuffled
✗ Same words, reversed order → very short LCS → ROUGE-L drops sharply
So ROUGE-L checks that the AI not only said the right words but also arranged them in an order close to the reference. A good summary isn't just correct information; it has to be organized in a sensible order.
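A quick way to see the order sensitivity is to compute the LCS directly. A standard dynamic-programming sketch on the running example:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = "USDA established business centers in eight mission areas".split()
good = "USDA created business centers across mission areas and staff offices".split()
scrambled = "offices staff and areas mission across centers business created USDA".split()
print(lcs_len(ref, good))       # 5: USDA, business, centers, mission, areas stay in order
print(lcs_len(ref, scrambled))  # 1: same words reversed, almost no common order is left
```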
Right, and you've spotted ROUGE's biggest weakness: it only sees surface forms, not meaning. To ROUGE, "established" and "created" are completely different tokens and score 0, yet anyone can tell they mean the same thing in this context.
BERTScore uses a neural network to understand word meaning. Think of each word as a point in space: semantically similar words sit close together. BERTScore compares distances in that space, not surface spellings.
For each word in the AI summary, find the most semantically similar word in the reference and compute a similarity (0 to 1):
| AI word | Closest reference word | Semantic similarity | ROUGE-1 |
|---|---|---|---|
| USDA | USDA | 0.99 | ✓ match |
| created | established | 0.92 | ✗ no match |
| business centers | business centers | 0.99 | ✓ match |
| across | in | 0.78 | ✗ no match |
| mission areas | mission areas | 0.99 | ✓ match |
| staff offices | mission areas | 0.71 | ✗ no match |
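To make the geometry concrete, here is a toy sketch with hand-picked 2-D vectors. Real BERTScore uses high-dimensional contextual embeddings from a BERT-family model, so both the vectors and the resulting numbers here are purely illustrative:

```python
import math

# Toy 2-D word vectors, hand-picked purely for illustration; real BERTScore
# uses contextual embeddings from a BERT-family model.
emb = {
    "created":     (0.50, 0.85),
    "established": (0.65, 0.75),
    "cafeteria":   (0.95, -0.30),
}

def cos(u, v):
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))

# ROUGE scores "created" vs. "established" as a flat mismatch (0 points);
# in embedding space they are almost the same point, while an unrelated
# word sits far away.
print(round(cos(emb["created"], emb["established"]), 2))  # 0.98
print(round(cos(emb["created"], emb["cafeteria"]), 2))    # 0.22
```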
Our D&C pipeline has a merge step in which the model rewrites several sub-summaries into one unified text, inevitably swapping words and phrasings.
That rewriting pushes ROUGE-2 down, but rewriting does not equal lower quality; it's just different phrasing. BERTScore recognizes synonymous rewrites, so it reflects the semantic quality of the merged summary more honestly.
ROUGE and BERTScore have to be read together: one measures literal match, the other semantic quality. Drop either and you miss important information.
Yes, at its core it's a Y or N, but there's a protocol behind it. Here's the full flow.
First, why we need it: ROUGE and BERTScore both compare against the reference summary, but the reference is human-written and is not the source document. An AI can produce a summary with high ROUGE that still fabricates content. So we need a metric that compares directly against the original document and checks whether the AI made things up.
Instead of checking every claim by hand, we write a prompt that asks another LLM to act as judge. Roughly:
The judge model reads the two texts and returns its verdict.
It compares the summary claim by claim, checking for three kinds of problems:
Each report yields a Faithful or Unfaithful verdict plus a short explanation. For example:
Across 100 reports, say 73 Faithful and 27 Unfaithful, the Faithfulness pass rate = 73%. That's the final number we report.
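The flow can be sketched like this. The prompt wording and the `llm` stub are hypothetical placeholders, not the project's actual prompt or model call:

```python
# Sketch of the LLM-as-Judge faithfulness check. The prompt wording is
# illustrative (not the project's exact prompt), and `llm` is a deterministic
# stub standing in for a real judge-model call.
JUDGE_PROMPT = """You are a fact-checker.
SOURCE DOCUMENT:
{document}

SUMMARY:
{summary}

Compare every claim in the summary against the source. Reply with a first
line of exactly "Faithful" or "Unfaithful", then a brief explanation."""

def llm(prompt: str) -> str:
    # Stub verdict so the sketch runs end to end.
    return "Faithful\nAll claims are supported by the source."

def judge(document: str, summary: str) -> bool:
    verdict = llm(JUDGE_PROMPT.format(document=document, summary=summary))
    return verdict.splitlines()[0].strip().lower() == "faithful"

def pass_rate(pairs) -> float:
    # Fraction of (document, summary) pairs judged Faithful,
    # e.g. 73 Faithful out of 100 reports -> 0.73.
    results = [judge(doc, summ) for doc, summ in pairs]
    return sum(results) / len(results)
```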
Good question; this is one of the method's limitations. The judge is also an LLM: it can misjudge, and complex reasoning can trip it up.
This approach is called LLM-as-Judge, and it's currently the mainstream way to evaluate factual faithfulness in NLP, because there's no better automated option; having humans fact-check every report is too expensive. A stronger setup uses a more capable judge (for example GPT-4); we use the same model to check itself, which has known limitations, and we say so in the limitations section.
Even so, the Unfaithful cases it flags are meaningful: at minimum they show the summary doesn't hold up even within the model's own reasoning framework, which is a valuable signal in itself.
A summary with high ROUGE can absolutely fail Faithfulness, and vice versa. The two dimensions are independent; drop either and you have a blind spot.
Because ROUGE measures quality (is what you said correct) but not quantity (did you say enough).
Imagine: the reference is 10 sentences, the AI writes 2, but those 2 happen to be the most important ones and every word appears in the reference. ROUGE-1 scores high. Would you call that a good summary? It skipped 80% of the content.
Baseline: 374 generated words on average, 587 reference words on average → coverage = 374/587 = 63.7%
Map-Refine: 495 generated words on average, 574 reference words on average → coverage = 495/574 = 86.2%
That 22.5-point jump directly shows that D&C solved the truncation problem: before, only about 64% of the information got mentioned; now about 86% does. It's the single most important number in the project.
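Coverage as used here is just a length ratio, reproduced from the numbers above:

```python
# Coverage as defined here: average generated length divided by average
# reference length, using the project's reported numbers.
def coverage(gen_words: float, ref_words: float) -> float:
    return gen_words / ref_words

print(round(coverage(374, 587), 3))  # 0.637 -> Baseline: 63.7%
print(round(coverage(495, 574), 3))  # 0.862 -> Map-Refine: 86.2%
```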
So the metrics together form the complete evaluation framework:
ROUGE-1: did your words appear in the reference? (word-level quality)
ROUGE-2: did your word pairs appear? (phrase-level quality, stricter)
ROUGE-L: did your words appear in the right order? (structure)
BERTScore: how close is the meaning? (semantic; recognizes synonymous rewrites)
Faithfulness: did you make anything up? (checked against the source)
Coverage: did you say enough? (the quantity dimension)
Miss any one of them and you only see part of the story.
Numbers first, then we unpack them row by row.

| Method | ROUGE-1 ↑ | ROUGE-2 ↑ | BERTScore ↑ | Coverage ↑ | Faithfulness ↑ |
|---|---|---|---|---|---|
| Baseline (truncation only, no D&C) | 0.4945 | 0.1818 | 0.0774 | 63.7% | 98.0% |
| MapReduce | 0.505 ↑ | 0.154 ↓ | 0.089 ↑ | 82.9% ↑ | N/A |
| Map-Refine ★ (recommended) | 0.5129 ↑ | 0.1696 ↓ | 0.0918 ↑ | 86.2% ↑ | 83.5% ↓ |
| Map-Cluster-Reduce | 0.502 ↑ | 0.160 ↓ | 0.088 ↑ | 83.4% ↑ | N/A |
Baseline's Faithfulness is as high as 98%, Map-Refine's only 83.5%, a drop of 14.5 percentage points. That looks like a regression, but there's a sound explanation:
Because of truncation, Baseline processed only the first half of each document. That first half is relatively simple and clear, so the AI fabricates little and Faithfulness stays high. But this is survivorship bias from truncation: the complex back half was never processed, so it could never go wrong.
Map-Refine reads the whole document, including the back half with more detail and complexity. The more content you process, the higher the chance of omissions or slight drift, so Faithfulness falls. That's the price of covering more information.
In other words: Baseline's 98% is "do less, err less"; Map-Refine's 83.5% is "do more, so some slips are inevitable". The latter covers 86.2% of the content; the former only 63.7%. In the government-report setting, coverage matters more, so this tradeoff is acceptable.
Now row by row; don't just look at the numbers, read what they mean.
Baseline doesn't use D&C at all: it truncates the document to 16,000 tokens and feeds that to the model. Government reports easily run tens of thousands of words, so the back half is simply never read.
The 63.7% coverage means the AI's summaries average only 374 words while the human references average 587. About 36% of the content was never processed because of truncation; that's the core problem to solve.
ROUGE-2 = 0.1818 is the highest of any method, not because Baseline is best, but because it only handles the front content it can actually read, rewrites little, and therefore gets a high literal match rate. That number is a trap: looked at alone, it suggests Baseline wins.
MapReduce: coverage jumps from 63.7% to 82.9%, showing that D&C really does fix truncation; the back half of the document is now read. ROUGE-1 and BERTScore also rise.
But ROUGE-2 falls from 0.1818 to 0.154, the biggest drop. The reason: MapReduce's final step merges all sub-summaries in one shot. The compression is heavy, the model must rewrite extensively to keep the text fluent, and more rewriting means worse phrase matching.
Map-Refine: highest ROUGE-1 (0.5129), highest BERTScore (0.0918), highest coverage (86.2%). Best across the board among the three D&C methods.
Why? Because each merge absorbs only one small new chunk: minimum merge pressure, minimum rewriting, slowest error accumulation.
ROUGE-2 also dips (0.1696), but that's a problem shared by all the D&C methods, not a flaw specific to Map-Refine. The next section covers the cause.
Map-Cluster-Reduce: in theory clustering helps the model organize content by topic, but the data shows all its metrics land between MapReduce and Map-Refine, never beating Map-Refine.
The likely reason: clustering adds an extra merge layer (sub-summaries → group summaries → final summary), so the content is rewritten twice and drifts further. An extra step does not mean a better result; sometimes the side effects of complexity outweigh the gains.
The numbers are done; now the conclusions, in three layers: what we solved, what we didn't, and what comes next.
With the conclusions in place, what are the best next steps? We have three directions for improvement:
First, the merge prompt currently doesn't ask the model to preserve source wording. The fix is to add an explicit line: "When merging, preserve the original key terms as much as possible; do not substitute your own phrasing."
Expected effect: ROUGE-2 recovers while coverage and BERTScore roughly hold. This is the cheapest improvement; changing one line of prompt is enough to test it.
Second, the pipeline is currently fully generative: the AI writes its own words. An alternative is to first extract key sentences verbatim from the source (no rewriting), then use the model only to lightly tidy the format.
This is a hybrid extractive + generative approach. The extracted sentences score very high on ROUGE-2 because they are the original wording. The cost is prose that may be less fluent, but information accuracy is higher and Faithfulness improves too.
Third, Baseline ran on 100 reports and Map-Refine on 400, and the two runs used slightly different report sets (reference lengths differ: 587 vs. 574 words). That means the comparison isn't perfectly fair.
The ideal setup runs all four methods (Baseline plus the three D&C variants) on the exact same 400 reports, compared on one dataset. Compute and time didn't allow it, but it's the experiment to add next.
One-sentence wrap-up: D&C succeeded in solving the truncation problem, Map-Refine is the current best option, and the next priority is suppressing merge-step rewriting so that ROUGE-2 and BERTScore rise together instead of trading off.
That's a sharp question, and it shows real understanding. The answer has four layers:
ROUGE has been the standard metric in text summarization for 20 years; every paper reports it. If you don't, your results can't be compared with any other work. Even if you think it's limited, you still report it and explain the limitation. That's academic convention, not self-sabotage.
We didn't hide the ROUGE-2 drop. We displayed it in full and explained the cause. That's honesty, not failure.
D&C used rewriting to buy a 22.5-point coverage gain and an 18.6% BERTScore gain, at the cost of a 6.7% ROUGE-2 drop. Whether that tradeoff is worth it depends on what you use the tool for.
For government reports, missing 36% of the content is the real disaster; reworded vocabulary is a minor issue. So this tradeoff pays off.
The field has long known that ROUGE is unfair to paraphrasing; BERTScore was proposed in 2019 specifically to address it. We use the two metrics in combination precisely because we know ROUGE has this blind spot.
If we reported only BERTScore and not ROUGE, we'd face the opposite criticism: "why only one metric?" Reporting both and letting readers see the whole picture is the right approach.
Someone who looks only at ROUGE-2 will conclude that D&C made things worse. That's an incomplete reading, but it will happen.
So in the talk we can't just throw up a table and walk away. We proactively explain: "ROUGE-2 drops because the merge step introduces paraphrasing, and ROUGE is sensitive to paraphrasing. BERTScore recognizes synonymous rewrites; its rise shows the semantic quality actually improved. Reading the two together gives the complete story."
Future improvement: explicitly require in the merge prompt that the model "preserve the original key terms; don't substitute phrasing", or introduce extractive summarization to reduce rewriting, so ROUGE-2 can rise in step as well.
One line: the ROUGE-2 drop isn't experimental failure; it's a side effect of the merge step. We know its cause, we compensated with BERTScore, and we pointed out the fix. That is good science.
These are questions that were actually asked in the defense. From start to finish, the professor's core thread was a single issue: chunking is lossy compression, so how do you handle the information loss?
The model's context window is about 16,000 tokens, but government reports often run 30,000 to 50,000 tokens. So we split by a fixed token budget, about 8,000 tokens per chunk, in document order. A typical USDA report becomes 2 to 4 chunks.
Yes: adjacent chunks overlap by about 200 tokens, roughly 5% of each chunk, in sliding-window fashion. The reason is that a hard cut at exactly token 8,000 could land in the middle of a sentence. The overlap keeps boundary context intact: the tail of chunk 1 and the head of chunk 2 share a short strip of identical text.
Too little overlap risks splitting sentences; too much wastes the context window. 200 tokens is the empirical sweet spot.
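The splitting scheme above can be sketched as a sliding-window chunker. Real chunking counts model tokens; here a list of word-like tokens stands in so the example is self-contained:

```python
# Sliding-window chunker sketch: fixed budget per chunk, with a fixed
# overlap shared between adjacent chunks.
def chunk(tokens, size=8000, overlap=200):
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += size - overlap   # step back by the overlap so boundaries repeat
    return chunks

doc = [f"tok{i}" for i in range(20_000)]   # a 20k-token report
parts = chunk(doc)
print(len(parts))                          # 3
print(parts[0][-200:] == parts[1][:200])   # True: shared boundary strip
```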
"摘要本质上是有损压缩——你们有没有注意到质量下降?块与块之间的上下文断了,跨块的信息怎么办?"
这是分治法的核心 tradeoff,我们认真对待过。分块确实是有损的——每块独立总结,看不到其他块,跨块上下文确实会断。我们用三层手段应对:
第一,边界重叠——相邻块共享 200 token,句子不会被切断。
第二,合并策略的选择至关重要。MapReduce 每块独立总结再一次性合并——上下文丢失最多。Map-Refine 解决了这个问题:它把草稿带到下一块——处理第 2 块时,手里已有第 1 块的摘要做上下文。跨块信息被逐步保留。这就是 Map-Refine 各项指标都最好的原因。
第三,数据能证明。覆盖率从 63.7% 升到 86.2%——说明分治法恢复了 Baseline 因截断丢失的大部分信息。BERTScore 上涨 18.6%,说明语义质量在提升而不是下降。如果分块真的严重破坏上下文,这些数字应该往下走,不应该往上走。
老师不是让你消除 tradeoff——那不可能。他想知道的是:(1)你知不知道有这个代价,(2)你做了什么来缓解,(3)有没有证据说明它有效。三层都答到了,就过关了。
Currently, no: the pipeline has no per-chunk quality check, and every chunk summary goes straight into the merge step. That's a real limitation.
The next step would be to add a self-check at the chunk level too, not only on the final summary: if a chunk summary's faithfulness falls below a threshold, regenerate it before it enters the merge. We didn't build it, but we know where the direction lies.
Never just say "we didn't do that" and stop. Always follow with: "here's how we'd do it, and here's why we didn't this time (time, compute, or not the core hypothesis)." That shows you understand the issue; you ran out of time, not out of ideas.
It would be better, but we deliberately didn't. Generation and judging use the same model family. The reason: with GPT-4 as the judge, the Faithfulness score would look better, but we couldn't tell whether the improvement came from the pipeline design or simply from the stronger judge.
This is a proof-of-concept project. What we want to isolate is the effect of the D&C method itself. If the idea works with a 7B model checking itself, a stronger model will only do better; that's a stronger claim than "it only works if you throw GPT-4 at it."
Larger windows do exist: GPT-4 offers 128K. But we run Qwen 2.5-7B on an A100 with 4-bit quantization, and the effective window is about 16K before quality degrades noticeably.
More importantly, research shows that even with larger windows, models lose content in the middle of very long inputs, the "lost in the middle" problem: they remember the beginning and the end and neglect the middle. Chunking plus D&C actually forces the model to process every part of the document carefully, which is exactly why coverage goes up.
BLEU was designed for machine translation, not summarization: it is precision-oriented and doesn't compute recall, but recall is critical for summaries because we need to know how much of the reference was covered. METEOR is closer, but published papers rarely report it, so the results couldn't be compared with other work.
ROUGE has been the field's common standard for 20 years; nearly every paper reports it. We chose it for comparability, and we added BERTScore specifically to patch ROUGE's blind spot for paraphrasing.
No, the numbers aren't wrong; they look low because BERTScore applies baseline rescaling. Raw scores cluster between 0.85 and 1.0, where differences are too small to see. The library subtracts a corpus-level baseline and rescales, so 0.09 doesn't mean "9% similar"; it means 0.09 on a scale where 0 is the corpus baseline and 1 is a perfect match.
What matters is the relative difference: Map-Refine is 18.6% above Baseline, which is a substantive improvement.
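For intuition, the rescaling applied by the bert-score library (its `rescale_with_baseline` option) is a simple linear map; the baseline value used below is invented for illustration:

```python
# BERTScore baseline rescaling: rescaled = (raw - b) / (1 - b), so the
# corpus baseline b maps to 0 and a perfect raw score of 1.0 stays 1.0.
# The baseline value below is made up for illustration.
def rescale(raw: float, baseline: float) -> float:
    return (raw - baseline) / (1 - baseline)

print(round(rescale(0.915, 0.907), 3))  # 0.086: small raw gaps become visible
```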
Human evaluation. All of our metrics are automated proxies. The gold standard is domain experts, the people who actually read government reports, scoring the summaries for information completeness, fluency, and factual accuracy on a Likert scale. We didn't do it because of time and cost, but it is the strongest form of validation.
Step one: acknowledge the challenge head-on; don't dodge.
Step two: explain what you did and why, even if the answer is "we didn't, because..."
Step three: point to evidence: metrics, data, or literature.
Step four: name the limitation and the next step; that shows maturity.
The professor doesn't expect perfection. They expect understanding.
The problem: government reports are too long for the AI to read in one pass, and the truncated content simply disappears. On top of that, the AI hallucinates, putting fabricated content into summaries.
Three D&C methods: MapReduce (merge everything at once), Map-Refine ★ (update as you read), Map-Cluster-Reduce (cluster first, then merge). They differ in how they merge; Map-Refine merges one chunk at a time, with the least pressure and the best results.
ROUGE-1: count overlapping words; split into precision (how much was said correctly) and recall (how much was covered); F1 combines the two.
ROUGE-2: count overlapping bigrams; stricter, scores 0 as soon as the phrasing changes, so it is sensitive to rewriting.
ROUGE-L: the overlapping words must also appear in the same order; scramble the order and ROUGE-L drops.
BERTScore: similarity computed with semantic vectors; recognizes synonymous rewrites and patches ROUGE's literal blind spot.
Faithfulness: another LLM checks the summary against the source for fabrication; a binary verdict, and the project's highlight.
Coverage: generated length ÷ reference length. ROUGE measures quality, coverage measures quantity; you need both.
Conclusion: D&C lifts coverage from 63.7% → 86.2% (+22.5 pp) and BERTScore +18.6%; the small ROUGE-2 dip is a systematic side effect of merge rewriting, not a quality loss. Map-Refine is the overall best.
Next steps: constrain rewriting in the merge prompt, introduce extractive summarization, and run a fair head-to-head comparison on a single dataset.
We'll use one running example — a single sentence from a USDA report — throughout the whole piece. Every metric will be computed on this same example, so nothing gets confusing.
The US government publishes stacks of reports every year — nursing home abuse investigations, USDA administrative reform, things like that. Dozens of pages of dense text. Nobody wants to read every word, but the information is genuinely useful. This project trains an AI to pull out a summary for you.
But AI has a problem — it hallucinates. It will confidently write things into the summary that never appeared in the source document. That's dangerous, especially for policy documents. So the key contribution of this project is: after the AI generates a summary, it checks its own work for fabrications. That's the "Self-Checking" part.
There's a second problem — the documents are too long for the AI to read in one pass. Our model can take in at most 16,000 tokens at a time, but government reports often run into tens of thousands of words. Everything past that limit just gets cut off and never processed.
The fix is called Divide & Conquer: split the document into chunks, summarize each chunk, then merge. We tried three different ways of doing the merge and ran experiments to see which works best.
Quick setup: our USDA report is long enough that we split it into 2 chunks. I'll walk you through each method using the actual content of those 2 chunks so you can see exactly what happens.
Chunk 1 (first half, ~3000 words):
USDA has 13 staff offices and 8 mission areas — 18 agencies in total. In 2017, the Secretary of Agriculture issued a memorandum requiring each mission area to establish a business center to consolidate administrative services, with the goal of improving efficiency and cross-agency collaboration...
Chunk 2 (second half, ~3000 words):
By the time the report was written, all 8 mission areas had established their business centers. GAO audited the process and found uneven implementation across agencies, recommending that USDA set up a unified tracking mechanism and report regularly to Congress...
How it works: summarize each chunk independently, producing two sub-summaries. Then feed both sub-summaries to the model in one shot and ask it to write the final summary.
Where it hurts: the two sub-summaries repeat each other in places. The model has to decide what to keep and what to drop, and it rewrites a lot in the process. More chunks and heavier compression make the rewriting worse. That "one-shot merge" is the bottleneck.
How it works: tries to improve MapReduce by clustering sub-summaries by topic first, merging within each cluster, and then merging across clusters. The theory is that grouping similar content together makes each merge cleaner.
Theory vs. reality: on paper, clustering should make each merge tidier. In practice, adding a layer of merging is just adding another round of rewriting. Two rewrite passes stack up and ROUGE-2 drops even further than MapReduce — this method scores lowest. A more complex pipeline doesn't automatically mean better results.
The first two methods share one fundamental issue: the merge step has to handle multiple pieces of content at once, which forces a lot of rewriting. Map-Refine rethinks this.
How it works: read chunk 1 and write a draft summary. Then feed chunk 2 together with the existing draft back to the model, and let it update the draft slightly to produce the final version. Each step only absorbs "one new piece of information" — it never merges everything at once.
Why this is the best approach: each merge only has to process "one new chunk + existing summary" — minimum load, minimum rewriting, minimum error accumulation. It mirrors how humans actually read: take notes as you go, update them chapter by chapter. The more chunks you have, the bigger this advantage gets.
→ Experimental result: ROUGE-1, BERTScore, and coverage are all the highest — best overall among the three D&C methods.
Exactly — that's the progression across the three methods. The key insight: don't try to make the big merge smarter, try to make each merge smaller. Map-Refine nails that principle.
Every metric below is computed on the same example — a sentence from the actual project data (the USDA report). Burn these two into your memory:
Reference summary (human-written ground truth):
"USDA established business centers in eight mission areas"
AI-generated summary:
"USDA created business centers across mission areas and staff offices"
At a glance, both sentences carry similar meaning — USDA established business centers across mission areas. But some words differ: the AI says "created" instead of "established", adds "staff offices", and drops "eight".
We'll take these two sentences and walk every evaluation metric through them, end to end.
ROUGE-1 is the simplest: count how many words appear in both texts. But "how many words overlap" isn't enough — you also need to distinguish "the AI said too much" from "the AI said too little." That's where precision and recall come in.
The Venn diagram makes this obvious in one look:
From this diagram, three questions are answered immediately:
① How much of what the AI said is correct? → right circle: 5 overlap words ÷ 10 AI words = 50%. That's Precision.
② How much of the reference did the AI cover? → left circle: 5 overlap ÷ 8 reference words = 62.5%. That's Recall.
③ Combined score? → harmonic mean of those two — that's F1, which we call the ROUGE-1 score.
Because either one alone can be gamed.
If you only look at precision, the AI could say just one word — as long as it's correct — and precision is 100%. Useless summary.
If you only look at recall, the AI could copy-paste the entire source document for 100% recall. Not a summary either.
F1 is the harmonic mean — whenever either side goes extreme, F1 gets dragged down. Both have to be good for F1 to be good. That's why it's the more reliable combined metric.
ROUGE-1 only matches single words, which is too loose. If the reference says "business centers" and the AI says "business cafeteria", ROUGE-1 still counts "business" as overlap — but clearly the two phrases mean different things.
ROUGE-2's upgrade: take every pair of adjacent words as a unit (a bigram), and only count a match if the whole pair is identical. Same example, run through it:
Only 2 bigrams match: (business, centers) and (mission, areas).
"established business" and "created business" don't match at the bigram level, even though ROUGE-1 would have counted "business" as a match.
Same example: ROUGE-1 = 0.556 vs ROUGE-2 = 0.250. More than 2× difference.
The reason: the AI swapped a lot of words ("established" → "created", "in eight" → "across"). Single-word overlap is still decent, but bigram overlap collapses. ROUGE-2 is extremely sensitive to paraphrasing — change the phrasing, and the score drops.
ROUGE-1 only asks: did this word appear? Order is ignored. So even if the AI says all the right words in a scrambled order, ROUGE-1 would still score high.
ROUGE-L adds: overlapping words also have to appear in the same relative order in both texts. It finds the "longest common subsequence" (LCS) — you can skip intermediate words, but you can't reorder them.
Reference:
Version A (normal order): LCS = 5 words
✓ Blue words appear in order: USDA → business → centers → mission → areas
Version B (order scrambled): same words, different order
✗ Same words, order reversed → LCS is short → ROUGE-L drops sharply
ROUGE-L checks: did the AI not only use the right words but also put them in a reasonable order? A good summary isn't just about having the right information — it has to be organized coherently.
Right — you just spotted ROUGE's biggest weakness: it only sees the surface form of words, not their meaning. "Established" and "created" are two completely different tokens to ROUGE — 0 points. Any human would say they mean the same thing in context.
BERTScore uses a neural network to understand meaning. Think of each word as a point in high-dimensional space — semantically similar words end up close together. BERTScore compares distances in that space, not surface spellings.
For each word in the AI summary, find the most semantically similar word in the reference and compute a similarity score (0–1):
| AI word | Closest reference word | Semantic similarity | ROUGE-1 |
|---|---|---|---|
| USDA | USDA | 0.99 | ✓ match |
| created | established | 0.92 | ✗ no match |
| business centers | business centers | 0.99 | ✓ match |
| across | in | 0.78 | ✗ no match |
| mission areas | mission areas | 0.99 | ✓ match |
| staff offices | mission areas | 0.71 | ✗ no match |
Our D&C pipeline has a merge step where the model rewrites multiple sub-summaries into a unified text. That inevitably swaps words and phrasings.
Rewriting makes ROUGE-2 drop — but rewriting ≠ quality degradation. BERTScore recognizes synonymous rewrites, so it reflects the actual semantic quality of merged summaries more honestly.
ROUGE and BERTScore have to be read together: one measures literal match, the other measures semantic match. Drop either one and you miss half the picture.
Yes, at the core it's Y or N — but there's a whole protocol behind it. Let me walk through the full flow.
First, why we need it: ROUGE and BERTScore both compare against the reference summary, but the reference is human-written and isn't the same as the source document. An AI could produce a high-ROUGE summary that still fabricates content. So we need a metric that compares directly against the original document.
Instead of manually checking each claim, we write a prompt asking another LLM to judge. Roughly:
The judge model reads both inputs and returns its verdict.
The judge walks the summary claim by claim and looks for three kinds of problems:
Each report yields one Faithful/Unfaithful verdict plus a short explanation. For example:
Run 100 reports, get 73 Faithful + 27 Unfaithful → Faithfulness pass rate = 73%. That's the final number reported.
Sharp question — it's one of the real limitations. The judge is also an LLM; it can misjudge, and complex reasoning can throw it off.
This approach is called LLM-as-Judge, and it's the mainstream way of evaluating faithfulness in NLP right now — human fact-checking every summary is too expensive. A stronger setup would use a more powerful model (like GPT-4) as the judge; our project uses the same model as both generator and checker, which has known limitations. We call this out in the limitations section.
Even so, the Unfaithful cases it flags are meaningful — at minimum, they show the summary doesn't hold up within the model's own reasoning framework, which is a useful signal.
A high-ROUGE summary can totally fail Faithfulness, and vice versa. Two independent dimensions — drop either one and you have a blind spot.
ROUGE measures quality — whether what you said is correct — but not quantity — whether you said enough.
Picture this: reference has 10 sentences, AI writes only 2, but those 2 are the most important ones and every word is in the reference. ROUGE-1 scores high. Would you call that a good summary? It skipped 80%.
Baseline: 374 words on average, reference averages 587 → coverage = 374/587 = 63.7%
Map-Refine: 495 words on average, reference averages 574 → coverage = 495/574 = 86.2%
That +22% jump directly shows D&C solved the truncation problem — from ~6/10 info pieces caught to ~8.6/10. Arguably the single most important number in the project.
All five put together gives you the full evaluation framework:
ROUGE-1: did the words show up in the reference? (word-level quality)
ROUGE-2: did the word pairs show up? (phrase-level quality, stricter)
ROUGE-L: did the words appear in the right order? (structure)
BERTScore: how close is the meaning? (semantic, catches paraphrases)
Faithfulness: did the AI make anything up? (vs. source)
Coverage: did it say enough? (quantity)
Miss any one and you only see part of the story.
Numbers first, then we unpack them row by row.
| Method | ROUGE-1 ↑ | ROUGE-2 ↑ | BERTScore ↑ | Coverage ↑ | Faithfulness ↑ |
|---|---|---|---|---|---|
| Baseline Truncation only, no D&C |
0.4945 | 0.1818 | 0.0774 | 63.7% | 98.0% |
| MapReduce | 0.505 ↑ | 0.154 ↓ | 0.089 ↑ | 82.9% ↑ | N/A |
| Map-Refine ★ Recommended |
0.5129 ↑ | 0.1696 ↓ | 0.0918 ↑ | 86.2% ↑ | 83.5% ↓ |
| Map-Cluster-Reduce | 0.502 ↑ | 0.160 ↓ | 0.088 ↑ | 83.4% ↑ | N/A |
Baseline scores 98%, Map-Refine only 83.5% — a 14.5-point drop. Looks like regression, but there's a clean explanation:
Baseline only processed the first half of the document due to truncation. That first half is relatively simple, so the AI doesn't fabricate much — Faithfulness stays high. But this is "survivorship bias from truncation" — the complex back half was never processed, so it couldn't go wrong.
Map-Refine reads the entire document, including the denser back half. More content → more opportunity for minor omissions or drift → Faithfulness drops. That's the cost of covering more.
Baseline's 98% means "you can't fail what you don't attempt"; Map-Refine's 83.5% means "you made mistakes because you actually tried." The latter covers 86.2% of content vs. the former's 63.7%. For government reports, coverage matters more.
Now row by row — can't just stare at numbers, have to read what they mean.
Baseline doesn't use D&C — truncates the document to 16,000 tokens and hands it to the model. Government reports are often tens of thousands of words; the back half simply never gets read.
63.7% coverage means: AI's summary averages 374 words while the reference averages 587. 36% of the content is lost to truncation before anything else happens. That's the core problem this project exists to solve.
ROUGE-2 = 0.1818 is the highest of any method — not because Baseline is best, but because it processes only the front half, does less rewriting, and gets higher literal word-pair overlap. That number is a trap: glance at it and you'd think Baseline wins.
Coverage leaps from 63.7% to 82.9% — D&C fixes the truncation problem, the back half is now being read. ROUGE-1 and BERTScore also rise.
But ROUGE-2 drops from 0.1818 to 0.154 — the biggest drop. Reason: MapReduce's final step fuses all sub-summaries in one pass, forcing heavy rewriting for fluency. More rewriting → lower phrase-level match.
Highest ROUGE-1 (0.5129), highest BERTScore (0.0918), highest Coverage (86.2%). Best across the board among the three D&C methods.
Why? Each merge only absorbs one new chunk — minimum merging pressure, minimum rewriting, minimum error accumulation.
ROUGE-2 also drops (0.1696). That's a systemic issue across all D&C methods, not a Map-Refine flaw. We unpack that next.
Theory said clustering would help the model organize by topic. Data shows Map-Cluster-Reduce lands between MapReduce and Map-Refine on every metric — never beating Map-Refine.
Likely reason: clustering added an extra merge layer (sub-summary → group summary → final summary) — two rounds of rewriting, more content drift. An extra step isn't automatically an improvement — sometimes complexity costs more than it helps.
Numbers done. Now the conclusions. Three layers: what we solved, what we didn't, what comes next.
Conclusions in. What would the best next steps be? Three directions:
The merge prompt currently doesn't ask the model to preserve source wording. Fix: add one line — "When merging, preserve the original document's key terminology; do not substitute alternative phrasings."
Expected effect: ROUGE-2 recovers, Coverage and BERTScore hold steady. Lowest-cost fix — change one line of prompt and run it.
Currently everything is generative — the AI writes the words itself. Alternative: first extract key sentences directly from the source (unchanged), then use the model only to organize the output.
This "extractive + generative" hybrid gets very high ROUGE-2 on extracted sentences (verbatim). Tradeoff: slightly less fluent prose, but stronger factual accuracy and better Faithfulness.
Baseline ran on 100 reports, Map-Refine on 400, and the two runs used slightly different reports (reference lengths: 587 vs 574 words). Not perfectly controlled.
Ideal setup: all four methods on the exact same 400 reports, head-to-head. Didn't do it due to compute/time limits, but it's the next experiment we'd add.
One-sentence wrap-up: D&C successfully solved the truncation problem, Map-Refine is the current best option, and the next priority is suppressing merge-step rewriting so ROUGE-2 and BERTScore can improve together instead of trading off.
Sharp question — shows you actually got it. Four layers of response:
ROUGE has been the standard metric in text summarization for 20 years. Every paper reports it. If you don't, your results can't be compared to any prior work. Even knowing its limitations, you report it — and explain the limitations. That's academic rigor, not self-sabotage.
We didn't hide the drop. We displayed it in full and explained why it happens. That's honest reporting, not failure.
D&C trades some rewriting for +22.5 points of coverage and +18.6% BERTScore at a cost of −6.7% ROUGE-2. Whether that tradeoff is worth it depends on what you're using the tool for.
For government reports — missing 36% of the content is the real disaster; some word substitutions are a minor issue. So the tradeoff is clearly worth it here.
The field has known for years that ROUGE is unfair to paraphrasing — BERTScore was introduced in 2019 specifically to address this. We use both together because we know ROUGE has this blind spot.
If we only reported BERTScore without ROUGE, we'd get the opposite criticism: "why only one metric?" Reporting both and letting the reader see the full picture is the correct approach.
Someone who only looks at ROUGE-2 would conclude D&C made things worse. That's an incomplete read, but it will happen.
In the talk, we can't just throw up a table and walk off. We have to proactively explain: "ROUGE-2 drops because merge introduces paraphrasing, and ROUGE is sensitive to paraphrasing. BERTScore recognizes synonymous rewrites — its rise tells us semantic quality actually improved. Read both together for the full story."
Future direction: update the merge prompt to explicitly "preserve source terminology; do not substitute phrasing," or introduce extractive elements to reduce rewriting, so ROUGE-2 can rise alongside the others.
One line: the ROUGE-2 drop isn't experimental failure — it's a known side effect of the merge step. We understand the cause, compensated with BERTScore, and pointed to a clear fix. That's good science.
These are real questions that came up during presentations of this project. If you're presenting this work, you need solid answers to all of them. The core theme is one thing: chunking is lossy compression — how do you handle the information loss?
Our model has a context window of about 16,000 tokens, but government reports can be 30,000 to 50,000 tokens. So we split sequentially with a fixed token budget — about 8,000 tokens per chunk. A typical USDA report ends up as 2 to 4 chunks.
Yes — each chunk shares about 200 tokens with the next one, a sliding window. That's roughly 5% of each chunk. The reason: if you hard-cut at exactly token 8,000, you might land in the middle of a sentence. The overlap ensures boundary context isn't lost — the end of chunk 1 and the beginning of chunk 2 share a small strip of text so the model has continuity.
Too little overlap risks splitting sentences. Too much overlap wastes the context window on repeated content. 200 tokens is the empirical sweet spot.
"Summarization is essentially a lossy affair. You're going to let go of some information. Did you notice any reduction in quality because of that? What about cross-chunk context loss?"
Yes — that's the core tradeoff of Divide-and-Conquer, and we take it seriously. Chunking is lossy: each chunk is summarized without seeing the other chunks, so cross-chunk context is lost. We address this in three ways:
First, overlap at chunk boundaries — 200 tokens shared between adjacent chunks, so sentences at the boundary aren't split.
Second, the choice of merge strategy matters enormously. MapReduce summarizes every chunk in isolation and merges once — maximum context loss. Map-Refine fixes this by carrying the draft forward: when it reads chunk 2, it already has the summary from chunk 1 as context. Cross-chunk information is preserved incrementally. This is why Map-Refine outperforms MapReduce on every positive metric.
Third, our metrics actually capture the effect. Coverage went from 63.7% to 86.2% — meaning D&C recovers most of the information that Baseline loses to truncation. BERTScore went up 18.6%, showing semantic quality improved, not degraded. If chunking were destroying too much context, these numbers would go down, not up.
The professor isn't asking you to eliminate the tradeoff — that's impossible. They want to know: (1) are you aware of the tradeoff, (2) did you do anything to mitigate it, and (3) do you have evidence it worked? This answer hits all three.
In the current pipeline, no — we don't have a per-chunk quality gate. Every chunk summary goes directly into the merge step regardless of quality. That's a real limitation.
A clear next step would be to add a self-check at the chunk level too — not just at the final summary. If a chunk summary scores below a threshold on a quick faithfulness check, regenerate it before it enters the merge. We didn't implement this, but we know it's the right direction.
Never just say "we didn't do that" and stop. Always follow up with: "but here's what we'd do if we did, and here's why we didn't this time." That shows you understand the limitation — you just ran out of time or compute, not understanding.
It would, but we deliberately kept it the same. We used the same model family for both generation and judging. The reason: if we used GPT-4 as the judge, the Faithfulness score might look better, but we wouldn't know whether the improvement came from our pipeline design or just from having a stronger judge.
This is a proof-of-concept project. We want to isolate the effect of the D&C method itself. If the idea works even with a modest 7B model judging itself, it'll work even better with a stronger judge later. That's a much stronger claim than "it works, but only if you throw GPT-4 at it."
Larger context windows exist — GPT-4 has 128K tokens. But we're using Qwen 2.5-7B on an A100 with 4-bit quantization. At that model size, the effective context window is about 16K before quality degrades noticeably.
More importantly, research shows that even with larger windows, models tend to lose information in the middle of very long inputs — the "lost in the middle" problem. The model pays attention to the beginning and end but forgets things in the middle. Chunking with D&C actually forces the model to process every part of the document carefully, which is why our coverage goes up rather than relying on a long context window that might silently ignore the middle.
BLEU was designed for machine translation, not summarization. It focuses on n-gram precision and doesn't compute recall — but recall is critical for summaries because we need to know how much of the reference we covered. METEOR is closer but far less commonly reported in summarization papers, so our results wouldn't be comparable to prior work.
ROUGE has been the standard in summarization for 20 years. Every paper reports it. We chose it so our numbers are directly comparable — and we added BERTScore specifically because we knew ROUGE's paraphrasing blind spot would hurt us in a D&C pipeline.
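The precision/recall split can be computed directly on the running USDA example from earlier (reference: "USDA established business centers in eight mission areas"; generated: "USDA created business centers across mission areas and staff offices"). This is a from-scratch sketch of the ROUGE-N idea, not the official scorer, which also applies stemming and other normalization:

```python
from collections import Counter

ref = "USDA established business centers in eight mission areas".lower().split()
gen = "USDA created business centers across mission areas and staff offices".lower().split()

def rouge_n(ref_toks, gen_toks, n=1):
    """Clipped n-gram overlap, reported as (precision, recall, F1)."""
    ref_ngrams = Counter(zip(*(ref_toks[i:] for i in range(n))))
    gen_ngrams = Counter(zip(*(gen_toks[i:] for i in range(n))))
    overlap = sum((ref_ngrams & gen_ngrams).values())   # min-count intersection
    p = overlap / max(sum(gen_ngrams.values()), 1)      # how much of what AI said is right
    r = overlap / max(sum(ref_ngrams.values()), 1)      # how much of the reference is covered
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(rouge_n(ref, gen, 1))   # 5 shared words:   P=0.50,  R=0.625, F1≈0.556
print(rouge_n(ref, gen, 2))   # 2 shared bigrams: P≈0.222, R≈0.286, F1=0.25
```

Note how ROUGE-2 collapses relative to ROUGE-1 on the same pair: "created business" vs "established business" breaks the bigram even though the unigram "business" still matches, which is exactly the paraphrasing sensitivity discussed above.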
No — the raw numbers look low because BERTScore applies baseline rescaling. Without rescaling, most scores cluster between 0.85 and 1.0, which makes everything look identical. The library subtracts a corpus-level baseline (roughly, the similarity that random, unrelated sentence pairs already achieve) and stretches the remaining range so differences become visible. So 0.09 means "slightly above what random pairs would score," not "9% similar."
The important thing is the relative difference: Map-Refine is 18.6% higher than Baseline. That's a substantial and meaningful improvement.
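The rescaling itself is a simple linear stretch; the numbers below are illustrative, since the real baseline is a per-language constant shipped with the library:

```python
def rescale(raw, baseline):
    """BERTScore baseline rescaling: map the usable range [baseline, 1] onto [0, 1]."""
    return (raw - baseline) / (1.0 - baseline)

# Illustrative values: a raw score of 0.88 against a baseline of 0.85
print(round(rescale(0.88, 0.85), 2))   # 0.2
```

This is why small-looking rescaled scores can still separate systems clearly: a raw gap of 0.03 becomes a rescaled gap of 0.2.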
Human evaluation. All our metrics are automated proxies. The gold standard would be having domain experts — people who actually read government reports — rate the summaries on a Likert scale for informativeness, fluency, and factual accuracy. We didn't do it because of time and cost, but it would be the strongest validation of whether D&C actually produces better summaries in practice, not just on automated benchmarks.
Step 1: Acknowledge the challenge directly — don't dodge.
Step 2: Explain what you did and why — even if the answer is "we didn't, here's why."
Step 3: Point to evidence — metrics, data, or literature.
Step 4: Name the limitation and what you'd do next — shows maturity.
Professors don't expect perfection. They expect understanding.
The problem: government reports are too long; AI can't read them in one pass, and truncated content disappears. AI also hallucinates, so summaries can contain fabricated claims.
Three D&C methods: MapReduce (merge all at once), Map-Refine ★ (update as you read), Map-Cluster-Reduce (cluster first, then merge). The difference is how they merge. Map-Refine only absorbs one chunk per merge — minimum pressure, best result.
ROUGE-1: count overlapping words; break into precision (how much of what you said is correct) and recall (how much of the reference you covered); F1 combines them.
ROUGE-2: count overlapping word-pairs; stricter; change phrasing and the score collapses — highly sensitive to paraphrasing.
ROUGE-L: overlapping words also need to keep their order; scramble the order and ROUGE-L drops.
BERTScore: uses semantic embeddings to recognize synonymous rewrites, covering ROUGE's literal-match blind spot.
Faithfulness: another LLM compares the summary against the source and returns Y/N. This is the project's signature contribution.
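The Y/N self-check can be sketched as a per-sentence judge loop; the prompt wording, the naive period-based sentence split, and the `llm` callable are illustrative stand-ins, not the project's exact implementation:

```python
def faithfulness_score(source, summary, llm):
    """Ask a judge LLM to verify each summary sentence against the source.

    Returns the fraction of sentences judged faithful ('Y')."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    verdicts = []
    for sent in sentences:
        reply = llm(
            "Source document:\n" + source +
            "\n\nClaim: " + sent +
            "\n\nIs this claim supported by the source? Answer Y or N."
        )
        verdicts.append(reply.strip().upper().startswith("Y"))
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

Checking sentence by sentence, rather than the whole summary at once, localizes any fabricated claim instead of letting one hallucination sink an otherwise faithful summary.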
Coverage: generated length ÷ reference length. ROUGE measures quality; Coverage measures quantity. Both are needed.
Conclusions: D&C pushes coverage 63.7% → 86.2% (+22.5pp), BERTScore +18.6%. The minor ROUGE-2 drop is a systemic side effect of merge rewriting, not quality regression. Map-Refine wins overall.
Next steps: constrain rewriting in the merge prompt, introduce extractive summarization, run a fair head-to-head on the same dataset.