Understanding This Project · Self-Checking Summarizer · CS6140
From scratch, no detours

Let me walk you through this project in plain language.

We'll use one running example throughout: a single sentence describing a USDA report. Every metric is computed on this same example, so nothing gets confusing.

Part 1

What is this project? The two-minute version

👨‍💻

The US government publishes stacks of reports every year: nursing home abuse investigations, USDA administrative reform, and so on. Dozens of pages of dense text. Nobody wants to read every word, but the information is genuinely useful. This project trains an AI to distill a summary for you.

But AI has a problem: it hallucinates. It will confidently write things into the summary that never appeared in the source document. That is dangerous, especially for policy documents. So the key contribution of this project is: after the AI generates a summary, it checks its own work for fabrications. That is the "Self-Checking" part.

👨‍💻

There is a second problem: the documents are too long for the AI to read in one pass. Our model can take in at most 16,000 tokens at a time, but government reports often run to tens of thousands of words. Everything past that limit simply gets cut off and never processed.

The fix is called Divide & Conquer: split the document into chunks, summarize each chunk, then merge. We tried three ways of doing the merge and ran experiments to see which works best.

Part 2

Three D&C methods: what's actually different?

👨‍💻

Quick setup: our USDA report is long enough that we split it into 2 chunks. I'll walk you through each method using the actual content of those 2 chunks so you can see exactly what happens.

Setup: the report is split into 2 chunks

Chunk 1 (first half, ~3000 words):

USDA has 13 staff offices and 8 mission areas, 18 agencies in total. In 2017, the Secretary of Agriculture issued a memorandum requiring each mission area to establish a business center to consolidate administrative services, with the goal of improving efficiency and cross-agency collaboration...

Chunk 2 (second half, ~3000 words):

By the time the report was written, all 8 mission areas had established their business centers. GAO audited the process, found uneven implementation across agencies, and recommended that USDA set up a unified tracking mechanism and report regularly to Congress...

Method 1: MapReduce, "summarize separately, merge all at once"

How it works: summarize each chunk independently, producing two sub-summaries. Then feed both sub-summaries to the model in one shot and ask it to write the final summary.

Step 1: Map (summarize each)
Chunk 1 → sub-summary ①
"USDA has 13 offices and 8 mission areas; the Secretary asked for business centers to consolidate services"
Chunk 2 → sub-summary ②
"All 8 mission areas have established centers; GAO recommends a tracking mechanism"
Step 2: Reduce (merge in one shot)
sub-summary ① + ② → model
The model is asked to fuse both pieces into one fluent paragraph, which forces a lot of rewriting. The more rewriting, the lower ROUGE-2.
→ final summary

Where it hurts: the two sub-summaries repeat each other in places. The model has to decide what to keep and what to drop, and it rewrites a lot in the process. More chunks and heavier compression make the rewriting worse. That "one-shot merge" is the bottleneck.
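The Map and Reduce steps described above can be sketched in a few lines. This is a structural sketch only; `summarize` and `merge_all` are hypothetical stand-ins for the pipeline's LLM calls, not the project's actual code:

```python
def map_reduce(chunks, summarize, merge_all):
    """Map: summarize each chunk independently.
    Reduce: merge every sub-summary in one shot (the bottleneck)."""
    sub_summaries = [summarize(chunk) for chunk in chunks]
    return merge_all(sub_summaries)

# Toy stand-ins so the data flow is visible without an LLM:
final = map_reduce(
    ["chunk one text", "chunk two text"],
    summarize=lambda c: c.split()[0],      # "summary" = first word
    merge_all=lambda subs: " + ".join(subs),
)
print(final)  # → chunk + chunk
```

The one-shot `merge_all` call is where all sub-summaries collide at once, which is exactly the rewriting pressure described above.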

Method 2: Map-Cluster-Reduce, "group by topic, then merge in stages"

How it works: an attempt to improve on MapReduce by first clustering the sub-summaries by topic, merging within each cluster, and then merging across clusters. In theory, grouping similar content together makes each merge cleaner.

Suppose we have 4 chunks
Step 1 Map: 4 chunks → sub-summaries ① ② ③ ④
Step 2 Cluster: the algorithm finds ① & ③ similar (office structure) and ② & ④ similar (GAO recommendations)
→ grouped into [① ③] and [② ④]
Step 3 Within-group Reduce: [① ③] → group summary A, [② ④] → group summary B  ← rewrite pass 1
Step 4 Final Reduce: A + B → final summary  ← rewrite pass 2

Theory vs. reality: on paper, clustering should make each merge tidier. In practice, adding a layer of merging just adds another round of rewriting. Two rewrite passes stack up, ROUGE-2 loses even more than under MapReduce, and this method ranks lowest. A more complex design does not automatically mean better results.

Method 3: Map-Refine ★, "update as you read, like a human would"

The first two methods share one fundamental issue: the merge step has to handle multiple pieces of content at once, which forces a lot of rewriting. Map-Refine rethinks this.

How it works: read chunk 1 and write a draft summary. Then feed chunk 2 together with the existing draft back to the model, and let it update the draft slightly to produce the final version. Each step only absorbs one new piece of information; it never merges everything at once.

Step-by-step (the 2-chunk USDA report)
Step 1: read Chunk 1 → draft summary
"USDA has 13 offices; the Secretary in 2017 asked each mission area to set up a business center"
Step 2: read Chunk 2 + draft → extend the draft into the final summary
"USDA has 13 offices; the Secretary in 2017 asked for business centers; as of this report, 8 have been established, and GAO recommends a tracking mechanism"
← the second half ("as of this report...") is what Step 2 added; the original draft stays mostly intact

Why this is the best approach: each merge only has to process "1 new chunk + the existing summary": minimum load, minimum rewriting, slowest error accumulation. It mirrors how humans actually read: take notes as you go and update them continuously. The more chunks there are, the bigger this advantage gets.

→ Experimental result: ROUGE-1, BERTScore, and coverage (all the positive metrics) are the highest of the three, making it the best overall.

🙋‍♀️
So the logic is: MapReduce is simplest but its merge is overloaded → Map-Cluster-Reduce tries to improve that but rewrites even more and does worse → Map-Refine flips the approach and cuts the per-merge load at the root, which is why it wins?
👨‍💻

Exactly. That is the progression across the three methods. The core insight: don't try to merge more cleverly, try to make each merge smaller. Map-Refine gets that right.
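The rolling-update idea can be sketched the same way. Again, `summarize` and `refine` are hypothetical stand-ins for the two LLM prompts, not the actual implementation:

```python
def map_refine(chunks, summarize, refine):
    """Start a draft from chunk 1, then fold in one chunk at a time.
    Each step sees only (current draft, one new chunk)."""
    draft = summarize(chunks[0])
    for chunk in chunks[1:]:
        draft = refine(draft, chunk)
    return draft

# Toy stand-ins: the "draft" just accumulates chunk labels in reading order.
result = map_refine(
    ["c1", "c2", "c3"],
    summarize=lambda c: c,
    refine=lambda draft, c: draft + " | " + c,
)
print(result)  # → c1 | c2 | c3
```

Note how `refine` never sees more than one new chunk at a time, which is the whole point: the merge load stays constant no matter how many chunks there are.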

Part 3

The running example: meet the two sentences

👨‍💻

Every metric below is computed on the same example, a sentence from the actual project data (the USDA report). Keep these two sentences in mind:

The running example (just these two sentences)

Reference summary (human-written ground truth):
"USDA established business centers in eight mission areas"

AI-generated summary:
"USDA created business centers across mission areas and staff offices"

At a glance, both sentences carry similar meaning: USDA established business centers across its mission areas. But some words differ: the AI says "created" instead of "established", adds "staff offices", and drops "eight".

Next, we'll walk every evaluation metric through these two sentences, end to end.

Part 4 (key chapter!)

ROUGE-1: precision and recall via a Venn diagram

👨‍💻

ROUGE-1 is the simplest metric: count how many words appear in both texts. But "how many words overlap" isn't enough; you also need to distinguish "the AI said too much" from "the AI said too little." That's where precision and recall come in.

Picture a Venn diagram of the two word sets:

[Venn diagram] Reference (8 words) vs. AI-generated (10 words)
· Overlap (5 words): USDA, business, centers, mission, areas
· Reference only (3 words the AI missed): established, in, eight
· AI only (5 words the AI added): created, across, and, staff, offices
👨‍💻

From this diagram, three questions are answered immediately:

① How much of what the AI said is correct? → right circle: 5 overlap words ÷ 10 AI words = 50%. That's Precision.

② How much of the reference did the AI cover? → left circle: 5 overlap ÷ 8 reference words = 62.5%. That's Recall.

③ Combined score? → the harmonic mean of those two is F1, which is what we report as the ROUGE-1 score.

Precision P = overlap / total AI words = 5 / 10 = 0.500
Recall R = overlap / total reference words = 5 / 8 = 0.625

F1 = 2 × P × R / (P + R) = 2 × 0.500 × 0.625 / (0.500 + 0.625) = 0.556
🙋‍♀️
Why do we need F1? Couldn't we just use precision or recall alone?
👨‍💻

Because either one alone can be gamed.

If you only look at precision, the AI could say just one word (as long as it's correct) and precision is 100%. A useless summary.

If you only look at recall, the AI could copy-paste the entire source document for 100% recall. Not a summary either.

F1 is the harmonic mean: whenever either side goes extreme, F1 gets dragged down. Both have to be good for F1 to be good. That's why it's the more reliable combined metric.
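The whole computation fits in a few lines. A minimal sketch using clipped unigram counts on the running example (not the project's evaluation code, which presumably uses a ROUGE library):

```python
from collections import Counter

def rouge1(reference: str, generated: str):
    """Unigram overlap with clipped counts: precision, recall, F1."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum(min(ref[w], gen[w]) for w in gen)  # shared words, clipped
    p = overlap / sum(gen.values())
    r = overlap / sum(ref.values())
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = rouge1(
    "USDA established business centers in eight mission areas",
    "USDA created business centers across mission areas and staff offices",
)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.5 0.625 0.556
```

The `min(ref[w], gen[w])` clip prevents a repeated word in the generation from being counted more times than it appears in the reference.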

Part 5

ROUGE-2: upgrade the matching unit from words to word pairs, and get stricter

👨‍💻

ROUGE-1 only matches single words, which is too loose. If the reference says "business centers" and the AI says "business cafeteria", ROUGE-1 still counts "business" as overlap, even though the two phrases clearly mean different things.

ROUGE-2's upgrade: take every pair of adjacent words as a unit (a bigram), and only count a match if the whole pair is identical. Same example, run through it:

All reference bigrams
(usda, established)
(established, business)
(business, centers) ✓
(centers, in)
(in, eight)
(eight, mission)
(mission, areas) ✓
7 bigrams total
All AI-generated bigrams
(usda, created)
(created, business)
(business, centers) ✓
(centers, across)
(across, mission)
(mission, areas) ✓
(areas, and) (and, staff) (staff, offices)
9 bigrams total
👨‍💻

Only 2 bigrams match: (business, centers) and (mission, areas).

"established business" and "created business" don't match at the bigram level, even though ROUGE-1 would have counted "business" as a match.

Precision P = matched bigrams / AI bigrams = 2 / 9 = 0.222
Recall R = matched bigrams / reference bigrams = 2 / 7 = 0.286

ROUGE-2 F1 = 2 × 0.222 × 0.286 / (0.222 + 0.286) = 0.250
Compared with ROUGE-1

Same example: ROUGE-1 = 0.556 vs. ROUGE-2 = 0.250, more than a 2× difference.

The reason: the AI swapped a lot of words ("established" → "created", "in eight" → "across"). Single-word overlap is still decent, but bigram overlap collapses. ROUGE-2 is extremely sensitive to paraphrasing: change the phrasing and the score drops.
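The bigram version is a one-line change on top of the unigram logic. A minimal sketch, again not the project's evaluation code:

```python
from collections import Counter

def bigrams(text):
    """Every pair of adjacent lowercase words, as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]

def rouge2(reference, generated):
    ref, gen = Counter(bigrams(reference)), Counter(bigrams(generated))
    overlap = sum(min(ref[b], gen[b]) for b in gen)
    p, r = overlap / sum(gen.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r) if p + r else 0.0

ref = "USDA established business centers in eight mission areas"
gen = "USDA created business centers across mission areas and staff offices"
print(len(bigrams(ref)), len(bigrams(gen)))  # → 7 9
print(round(rouge2(ref, gen), 3))            # → 0.25
```

Swapping the unit from words to word pairs is the entire difference between ROUGE-1 and ROUGE-2; everything else (clipping, P/R/F1) is identical.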

Part 6

ROUGE-L: word order has to match too

🙋‍♀️
I didn't really get the "word order" part of ROUGE-L. Can you explain it again?
👨‍💻

ROUGE-1 only asks: did this word appear? Order is ignored. So even if the AI says all the right words in a scrambled order, ROUGE-1 still scores high.

ROUGE-L adds a requirement: the overlapping words must also appear in the same relative order in both texts. It finds the longest common subsequence (LCS): you may skip intermediate words, but you may not reorder them.

Two AI versions, to see what order does to the score

Reference:

USDA established business centers in eight mission areas

Version A (normal order): LCS = 5 words

USDA created business centers across mission areas and staff offices

✓ The overlapping words appear in order: USDA → business → centers → mission → areas

Version B (order reversed): exactly the reference's eight words, written backwards

areas mission eight in centers business established USDA

✗ Identical words, but the order is fully reversed → no two words keep their relative order → LCS = 1

Version A ROUGE-L F1: LCS = 5, same arithmetic as ROUGE-1, ≈ 0.556
Version B ROUGE-L F1: LCS = 1, so P = R = 1/8 and F1 = 0.125

Version B's ROUGE-1 would be a perfect 1.000 (every single word matches the reference), yet its ROUGE-L is only 0.125. Word counting can't see the scrambling; order matters.
👨‍💻

So ROUGE-L checks that the AI not only used the right words but also arranged them in roughly the same order as the reference. A good summary isn't just the right information; it has to be organized coherently.
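The LCS at the heart of ROUGE-L is the classic dynamic-programming recurrence. A minimal sketch on word sequences (not the project's evaluation code):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two word lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, generated):
    ref, gen = reference.lower().split(), generated.lower().split()
    l = lcs_len(ref, gen)
    p, r = l / len(gen), l / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

ref = "USDA established business centers in eight mission areas"
gen = "USDA created business centers across mission areas and staff offices"
print(round(rouge_l(ref, gen), 3))                              # → 0.556
print(round(rouge_l(ref, " ".join(reversed(ref.split()))), 3))  # → 0.125
```

Because the LCS skips non-matching words but never reorders, scrambling the generation shrinks the LCS even when every word is present.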

Part 7

BERTScore: ROUGE's blind spot, where paraphrases don't count

🙋‍♀️
By the math above, ROUGE-2 is only 0.25, which feels low. But "established" and "created" mean basically the same thing here...
👨‍💻

Right. You just spotted ROUGE's biggest weakness: it only sees the surface form of words, not their meaning. "Established" and "created" are two completely different tokens to ROUGE, so it awards 0. Any human would say they mean the same thing in this context.

BERTScore uses a neural network to understand meaning. Think of each word as a point in a high-dimensional space: semantically similar words end up close together. BERTScore compares distances in that space, not surface spellings.

What BERTScore does on the same example

For each word in the AI summary, find the most semantically similar word in the reference and compute a similarity score (0 to 1):

AI word → closest reference word (similarity; ROUGE-1 verdict)
USDA → USDA (0.99; ✓ match)
created → established (0.92; ✗ no match)
business centers → business centers (0.99; ✓ match)
across → in (0.78; ✗ no match)
mission areas → mission areas (0.99; ✓ match)
staff offices → mission areas (0.71; ✗ no match)
BERTScore Precision = average of each AI word's best similarity ≈ 0.90

ROUGE-2 gave this pair 0.25; BERTScore gave ~0.90. Same sentences, very different numbers, because the two metrics measure different things.
Why BERTScore matters so much for this project

Our D&C pipeline has a merge step where the model rewrites multiple sub-summaries into one unified text. That inevitably swaps words and phrasings.

The rewriting makes ROUGE-2 drop, but rewriting does not equal quality loss; it's just different phrasing. BERTScore recognizes synonymous rewrites, so it reflects the true semantic quality of merged summaries more honestly.

ROUGE and BERTScore have to be read together: one measures literal match, the other semantic match. Drop either one and you miss important information.
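The greedy-matching step can be illustrated with toy vectors. These 2-D "embeddings" are invented purely for illustration; real BERTScore uses contextual BERT embeddings with hundreds of dimensions:

```python
import math

# Made-up 2-D vectors, NOT real BERT embeddings: "created"/"established"
# point roughly the same way, and "across"/"in" form another loose pair.
toy_vectors = {
    "established": (1.00, 0.10), "created": (0.95, 0.20),
    "in": (0.10, 1.00), "across": (0.30, 0.95),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def greedy_precision(ai_words, ref_words, vectors):
    """BERTScore-style precision: each AI word takes its best match
    in the reference; the best similarities are then averaged."""
    best = [max(cosine(vectors[a], vectors[r]) for r in ref_words)
            for a in ai_words]
    return sum(best) / len(best)

print(round(greedy_precision(["created", "across"],
                             ["established", "in"], toy_vectors), 2))  # → 0.99
```

The key point the toy shows: "created" scores near 1.0 against "established" because the vectors are close, exactly the synonym case where ROUGE awards 0.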

Part 8

Faithfulness: how do we check whether the AI made things up?

🙋‍♀️
The earlier metrics all compare against the reference summary. How is Faithfulness computed? Is it just Y or N?
👨‍💻

Yes, at its core it's Y or N, but there's a whole protocol behind it. Let me walk through the full flow.

First, why we need it: ROUGE and BERTScore both compare against the reference summary, but the reference is human-written and is not the source document. An AI can easily produce a high-ROUGE summary that still contains fabrications. So we need a metric that compares directly against the original document and checks for invented claims.

Step 1: build a prompt for the judge model

Instead of fact-checking claim by claim by hand, we write a prompt that asks another LLM to act as the judge. Roughly:

You are a fact-checking assistant.
Here is the original document:
[full source document, thousands of words]

Here is a summary generated by an AI:
[AI-generated summary]

Does the summary make any claims NOT supported by the document?
Answer "Faithful" or "Unfaithful", then explain why.

The judge model reads both inputs and returns its verdict.

Step 2: what is the judge actually checking for?

The judge walks through the summary claim by claim, looking for three kinds of problems:

① Hallucination
The summary claims something the source never said. E.g., the source doesn't say "the reform was completed," but the summary does.
② Factual error
Numbers or descriptions contradict the source. The source says 13 offices; the summary says 15.
③ Critical omission
A core conclusion of the source is missing from the summary, leaving the reader with a misleading impression.
Step 3: output Y/N, then aggregate into a pass rate

Each report yields one Faithful/Unfaithful verdict plus a short explanation. For example:

✓ Faithful case
"Faithful. The summary accurately reflects the document's key findings about USDA's business center initiative."
✗ Unfaithful case (Sample #3)
"Unfaithful. The summary omits several key details about the investigation methods and specific abuse statistics cited in the report."

Run 100 reports and get, say, 73 Faithful and 27 Unfaithful: the Faithfulness pass rate is 73%. That's the final number we report.
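The prompt assembly and the pass-rate aggregation are simple string work. A minimal sketch mirroring the prompt shown above; the helper names are illustrative, not the project's actual code:

```python
def build_judge_prompt(document: str, summary: str) -> str:
    """Assemble the fact-checking prompt shown above."""
    return (
        "You are a fact-checking assistant.\n"
        f"Here is the original document:\n{document}\n\n"
        f"Here is a summary generated by an AI:\n{summary}\n\n"
        "Does the summary make any claims NOT supported by the document?\n"
        'Answer "Faithful" or "Unfaithful", then explain why.'
    )

def pass_rate(verdicts) -> float:
    """Fraction of verdicts that begin with 'Faithful'.
    ('Unfaithful...' does not start with 'faithful', so it is counted as a fail.)"""
    faithful = sum(v.strip().lower().startswith("faithful") for v in verdicts)
    return faithful / len(verdicts)

verdicts = ["Faithful. Accurate."] * 73 + ["Unfaithful. Omits key details."] * 27
print(pass_rate(verdicts))  # → 0.73
```

In the real pipeline each verdict string would come from the judge LLM's response to `build_judge_prompt(...)`; everything after the first word is kept as the explanation.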

🙋‍♀️
But that Y/N comes from a model's own judgment. Is that reliable? Can't it get things wrong?
👨‍💻

Good question; this is one of the method's real limitations. The judge is also an LLM: it can misjudge, and complex reasoning can throw it off.

The approach is called LLM-as-Judge, and it's currently the mainstream way to evaluate factual faithfulness in NLP, because there is no better automated option: having humans fact-check every report is far too expensive. A stronger setup would use a more powerful model (such as GPT-4) as the judge; we used the same model to check itself, which has known limits, and we say so in the limitations section.

Even so, the Unfaithful cases it flags are meaningful: at minimum, they show the summary doesn't hold up even within the model's own frame of understanding, which is a useful signal in itself.

Recap: Faithfulness and ROUGE solve completely different problems
ROUGE / BERTScore
Compare against the reference summary: does the summary resemble the human-written version?
Question answered: is the quality good?
Faithfulness
Compare against the original document: is the summary's content actually in the source?
Question answered: is it trustworthy?

A summary with high ROUGE can absolutely fail Faithfulness, and vice versa. The two dimensions are independent; drop either one and you have a blind spot.

Part 9

Coverage: with ROUGE in hand, why do we still need it?

🙋‍♀️
The metrics so far feel like enough. Why add a coverage metric?
👨‍💻

Because ROUGE measures quality (whether what you said is correct) but not quantity (whether you said enough).

Picture this: the reference has 10 sentences and the AI writes only 2, but those 2 happen to be the most important ones, with every word appearing in the reference. ROUGE-1 scores high. Would you call that a good summary? It skipped 80% of the content.

Coverage = generated length ÷ reference length

Baseline: 374 generated words on average vs. a 587-word average reference → coverage = 374/587 = 63.7%

Map-Refine: 495 generated words on average vs. a 574-word average reference → coverage = 495/574 = 86.2%

That 22.5-point jump directly shows that divide & conquer fixed the truncation problem: before, only about six-tenths of the information was covered; now it's 86%. This is arguably the single most important number in the project.
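The metric itself is a plain length ratio, which is worth seeing precisely because it is so simple. A minimal sketch using the two averages above:

```python
def coverage(avg_generated_words: float, avg_reference_words: float) -> float:
    """Length-ratio coverage: how much of the reference's bulk
    the generated summaries reach, on average."""
    return avg_generated_words / avg_reference_words

print(round(coverage(374, 587), 3))  # Baseline   → 0.637
print(round(coverage(495, 574), 3))  # Map-Refine → 0.862
```

Note this is a proxy: it counts words, not information, so it complements ROUGE rather than replacing it.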

👨‍💻

Put these metrics together and you get the complete evaluation framework:

ROUGE-1: did your words appear in the reference? (word-level quality)

ROUGE-2: did your word pairs appear? (phrase-level quality, stricter)

ROUGE-L: did the words appear in the right order? (structure)

BERTScore: how close is the meaning? (semantics; catches paraphrases)

Faithfulness: was anything made up? (checked against the source)

Coverage: was enough said? (quantity)

Miss any one of them and you only see part of the story.

Part 10

Experimental results: the numbers and what they mean

👨‍💻

Numbers first, then we unpack them row by row.

Method: ROUGE-1 ↑ / ROUGE-2 ↑ / BERTScore ↑ / Coverage ↑ / Faithfulness ↑
Baseline (truncation only, no D&C): 0.4945 / 0.1818 / 0.0774 / 63.7% / 98.0%
MapReduce: 0.505 ↑ / 0.154 ↓ / 0.089 ↑ / 82.9% ↑ / N/A
Map-Refine ★ (recommended): 0.5129 ↑ / 0.1696 ↓ / 0.0918 ↑ / 86.2% ↑ / 83.5% ↓
Map-Cluster-Reduce: 0.502 ↑ / 0.160 ↓ / 0.088 ↑ / 83.4% ↑ / N/A

Wait, Faithfulness went down? What's going on?

Baseline's Faithfulness is a striking 98%, while Map-Refine's is 83.5%, a drop of 14.5 percentage points. It looks like a regression, but there's a sound explanation:

Baseline, because of truncation, only processed the first half of each document. That first half is relatively simple and clear, so the AI rarely fabricates, and Faithfulness stays high. But this is survivorship bias created by truncation: the complex second half was never processed, so it never had a chance to go wrong.

Map-Refine read the whole document, including the denser, more detailed second half. The more content you process, the higher the chance of omissions or small deviations, so Faithfulness falls. That is the price of covering more information.

In other words: Baseline's 98% is "do less, err less"; Map-Refine's 83.5% is "do more, err occasionally." The latter covers 86.2% of the content, the former only 63.7%. In the government-report setting, coverage matters more, so the tradeoff is worth it.

👨‍💻

Now row by row. Don't just read the numbers; read what's behind them.

Baseline: what does 63.7% coverage mean?

Baseline doesn't use divide & conquer at all: it truncates the document to 16,000 tokens and feeds it to the model. Government reports routinely run to tens of thousands of words, so the back half is never read.

63.7% coverage means the AI's summaries averaged only 374 words against 587-word human references. Roughly 36% of the content was never even processed because of truncation. That is the core problem to solve.

ROUGE-2 = 0.1818 is the highest of all methods, not because Baseline is best, but because it only handles the front material it can actually read, rewrites little, and so matches the surface wording well. This number is a trap: read in isolation, it makes Baseline look like the winner.

MapReduce: coverage jumps, but the one-shot merge compresses too hard

Coverage leaps from 63.7% to 82.9%, so divide & conquer really does fix truncation: the back half of the document is finally read. ROUGE-1 and BERTScore rise as well.

But ROUGE-2 falls from 0.1818 to 0.154, the biggest drop. The reason: MapReduce's last step merges all sub-summaries in one shot, with heavy compression. The model must rewrite extensively to make the text flow, and the more it rewrites, the worse the phrase-level match.

Map-Refine ★: best overall, top score on every positive metric

Highest ROUGE-1 (0.5129), highest BERTScore (0.0918), highest coverage (86.2%). Across the three D&C methods, it is uniformly the best.

Why? Because each step merges only one small new chunk: the least merge pressure, the least rewriting, the slowest error accumulation.

ROUGE-2 still drops (0.1696), but that's a cost shared by every D&C method, not a flaw unique to Map-Refine. The next section digs into why.

Map-Cluster-Reduce: added clustering, no added benefit

In theory, clustering helps the model organize content by topic, but the data shows all of its metrics land between MapReduce and Map-Refine, never above Map-Refine.

The likely reason: clustering adds an extra merge layer (sub-summaries → group summaries → final summary), so the content is rewritten twice and drifts further. An extra step does not mean a better result; sometimes the side effects of complexity outweigh the gains.

Part 11

Final conclusions: what did we learn, and what's the best next step?

👨‍💻

With the numbers covered, let's state the conclusions plainly. There are three layers: what we solved, what we didn't, and what to do next.

Conclusion 1: divide & conquer really does fix truncation
Coverage rose from 63.7% to 86.2%, a gain of 22.5 percentage points. This is the project's most important contribution. Before, the AI could only read the first half of a report and everything after was lost; now nearly the whole document is read, and the summaries carry far more information. For documents like government reports that demand complete coverage, this is the key improvement.
Conclusion 2: semantic quality improved too, with BERTScore up 18.6%
BERTScore rose from 0.0774 to 0.0918, meaning the generated content is semantically closer to the ground truth. Divide & conquer isn't just "writing more"; it's "writing more accurately," because the model read more and understood more.
⚠️
Conclusion 3: ROUGE-2 fell, so phrase-level precision has a cost
All three D&C methods score below Baseline on ROUGE-2. This isn't one method failing; it's a systematic cost of the D&C pipeline itself. Any merge step means rewriting, and rewriting drags ROUGE-2 down. The tradeoff is real and shouldn't be waved away.
🏆
Conclusion 4: Map-Refine is the best of the three methods overall
On every "higher is better" metric (ROUGE-1, BERTScore, coverage), Map-Refine is the top scorer. Its rolling-update mechanism processes only one new chunk at a time, minimizing merge pressure and slowing error accumulation. This conclusion is backed by the experimental data, not guesswork.
👨‍💻

So, conclusions in hand, what's the best next move? We see three directions:

Improvement ① Constrain rewriting in the merge step

Today the merge prompt doesn't ask the model to preserve original wording. The fix is an explicit instruction: "When merging, keep the source's key terms; do not substitute your own phrasing."

Expected effect: ROUGE-2 recovers while coverage and BERTScore roughly hold. This is the cheapest improvement: one line of prompt to validate.

Improvement ② Add extractive summarization to reduce rewriting

The current pipeline is fully abstractive: the AI writes its own words. An alternative is to first extract key sentences verbatim from the source, then have the model lightly tidy the formatting.

This is a hybrid extractive + abstractive approach. Verbatim sentences score very high on ROUGE-2 because they are the source's exact words. The cost is slightly less fluent summaries, but factual accuracy rises and Faithfulness should improve as well.

Improvement ③ Run a fair comparison on a larger dataset

This time Baseline ran on 100 reports and Map-Refine on 400, and the two report sets weren't identical (reference lengths differ slightly: 587 vs. 574 words). So the comparison isn't perfectly fair.

Ideally, all four methods (Baseline plus the three D&C variants) would run on the same 400 reports and be compared on one dataset. Compute and time didn't allow it this round, but it's the experiment to complete next.

👨‍💻

One-sentence wrap-up: divide & conquer succeeds at fixing truncation, Map-Refine is the current best method, and the next priority is suppressing rewriting in the merge step so that ROUGE-2 and BERTScore can rise together instead of trading off.

Part 12 (an important reflection)

How should we think about the ROUGE-2 drop?

🙋‍♀️
Wait. You knew D&C has a merge step, that merging rewrites wording, and that ROUGE-2 punishes rewriting. So why use ROUGE-2 at all? Aren't you setting yourselves up? Doesn't this mean the experiment failed?
👨‍💻

Sharp question; it shows you really understand. Let me answer in four layers:

① Not reporting ROUGE isn't an option: it's the field's common currency

ROUGE has been the standard summarization metric for 20 years; every paper reports it. Skip it and your results can't be compared with any other work. Even if you think it's limited, you report it and explain its limits. That's academic convention, not self-sabotage.

② A ROUGE-2 drop isn't failure; it means you found a tradeoff

We didn't hide the ROUGE-2 decline. We show it in full and explain why it happens. That's honesty, not failure.

D&C trades rewriting for a 22.5-point coverage gain and an 18.6% BERTScore gain, at the cost of a 6.7% ROUGE-2 drop. Whether that trade is worth it depends on what the tool is for.

For government reports, missing 36% of the content is the real disaster; reworded phrases are a minor issue. So the tradeoff is a good deal.

③ BERTScore exists precisely to patch this flaw in ROUGE

The field has long known ROUGE is unfair to paraphrasing; BERTScore was proposed in 2019 specifically to fix it. We use the two metrics together exactly because we know about ROUGE's blind spot.

If we reported only BERTScore and dropped ROUGE, we'd be asked "why only one metric?" Reporting both and letting readers see the whole picture is the right move.

④ But your criticism has merit: this is a real limitation worth stating clearly

Someone who looks only at ROUGE-2 will conclude that D&C made things worse. That's an incomplete reading, but it will happen.

So in the talk, we can't just throw up a table and walk away. We explain proactively: "ROUGE-2 drops because the merge step introduces rewriting, and ROUGE is sensitive to rewriting. BERTScore recognizes synonymous rewrites, and its rise shows semantic quality went up. Read together, the two metrics tell the complete story."

Future improvements: instruct the merge prompt to "keep the source's key terms, don't reword," or add extractive summarization to cut rewriting, so ROUGE-2 can rise too.

👨‍💻

One-sentence summary: the ROUGE-2 drop isn't a failed experiment; it's a side effect of the merge step. We know its cause, we compensated with BERTScore, and we've laid out the fix. That's good science.

Part 13 (defense prep)

What will the professor press on, and how do you catch it?

👨‍💻

These are questions that actually came up in defenses. From start to finish, the professor is chasing one thing: chunking is lossy compression, so how do you handle the information loss?

🙋‍♀️
How did you split the chunks? How big is each one?
👨‍💻

The model's context window is roughly 16,000 tokens, but government reports often run 30,000 to 50,000 tokens. So we split by a fixed token count: about 8,000 tokens per chunk, in document order. A USDA report typically becomes 2 to 4 chunks.

🙋‍♀️
Do adjacent chunks overlap? By how much, and how did you choose it?
👨‍💻

Yes: adjacent chunks share about 200 tokens, roughly 5% of a chunk, in sliding-window fashion. The reason: a hard cut at exactly token 8,000 can land mid-sentence. The overlap preserves boundary context: the tail of chunk 1 and the head of chunk 2 contain the same short stretch of text.

Too little overlap breaks sentences; too much wastes context window. 200 tokens is an empirically chosen balance point.
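The sliding-window split described here can be sketched in a few lines. A minimal sketch over a pre-tokenized list; the real pipeline would operate on tokenizer output rather than a plain list:

```python
def chunk_tokens(tokens, size=8000, overlap=200):
    """Split a token list into windows of `size`, where consecutive
    windows share `overlap` tokens at their boundary."""
    step = size - overlap  # advance by size minus the shared tail
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end
    return chunks

# Small numbers to make the overlap visible: 20 tokens, size 8, overlap 2.
demo = chunk_tokens(list(range(20)), size=8, overlap=2)
print(len(demo))        # → 3
print(demo[0][-2:], demo[1][:2])  # → [6, 7] [6, 7]  (shared boundary)
```

The last two tokens of one window reappear as the first two of the next, which is the boundary-context guarantee discussed above.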

The professor's core follow-up

"Summarization is inherently lossy compression. Did you observe quality degradation? Context breaks between chunks; what happens to information that spans chunks?"

👨‍💻

This is the core tradeoff of divide & conquer, and we took it seriously. Chunking is indeed lossy: each chunk is summarized in isolation, blind to the others, so cross-chunk context really does break. We counter it at three levels:

First, boundary overlap: adjacent chunks share 200 tokens, so sentences are never severed.

Second, the choice of merge strategy is crucial. MapReduce summarizes each chunk independently and merges once at the end, losing the most context. Map-Refine addresses exactly this: it carries the draft forward, so when processing chunk 2 the model already holds chunk 1's summary as context. Cross-chunk information is preserved step by step. That's why Map-Refine tops every metric.

Third, the data backs it up. Coverage rising from 63.7% to 86.2% shows D&C recovered most of the information Baseline lost to truncation. BERTScore rising 18.6% shows semantic quality improved rather than degraded. If chunking were seriously destroying context, those numbers should move down, not up.

Why this answer passes

The professor isn't asking you to eliminate the tradeoff; that's impossible. He wants to know: (1) do you realize the cost exists, (2) what did you do to mitigate it, and (3) is there evidence the mitigation works. Hit all three layers and you pass.

🙋‍♀️
If one chunk's sub-summary is especially bad, does it contaminate the final result? Do you have any quality safety net?
👨‍💻

Currently, no: the pipeline has no per-chunk quality check, and every chunk summary goes straight into the merge. That's a genuine limitation.

The next step would be a self-check at the chunk level, not only on the final summary: if a chunk summary's faithfulness falls below a threshold, regenerate it before merging. We didn't build it, but we know where the fix lies.

How to say "we didn't do it" properly

Never stop at "we didn't do it." Always follow with: "but here is how we would, and here is why we didn't this time (time / compute / not the core hypothesis)." That shows you understand the problem; you simply ran out of runway, rather than never having thought of it.

🙋‍♀️
Why not use a stronger model (GPT-4, Claude) as the judge? Wouldn't the Faithfulness scores be better?
👨‍💻

They would, but we skipped that deliberately. Generation and judging use the same model family. The reason: with GPT-4 as judge, the Faithfulness numbers would look better, but we couldn't tell whether the improvement came from the pipeline design or from the stronger judge itself.

This is a proof-of-concept project. What we want to isolate is the effect of divide & conquer itself. If a 7B model checking itself already works end to end, a stronger model can only do better, and that's a more persuasive result than "it only works with GPT-4."

🙋‍♀️
Why not just use a 128K-context model and skip chunking entirely?
👨‍💻

Bigger windows do exist; GPT-4 offers 128K. But we ran Qwen 2.5-7B, 4-bit quantized on an A100, with an effective window of about 16K; beyond that, quality degrades noticeably.

More importantly, research shows that even with larger windows, models drop content from the middle: the "lost in the middle" problem. They remember the beginning and the end and neglect the middle. Chunking plus divide & conquer actually forces the model to engage with every part of the document, which is exactly why coverage goes up.

🙋‍♀️
Why ROUGE instead of BLEU or METEOR?
👨‍💻

BLEU was designed for machine translation, not summarization: it measures only precision, not recall, but recall matters in summarization because we need to know how much of the reference was covered. METEOR is closer, but published papers rarely report it, so results couldn't be compared with other work.

ROUGE has been the summarization field's standard for 20 years; nearly every paper reports it. We chose it for comparability, and we added BERTScore specifically to patch ROUGE's sensitivity to rewriting.

🙋‍♀️
BERTScore is only 0.07 to 0.09, which is nearly zero. Is the calculation wrong?
👨‍💻

No. The numbers look low because BERTScore applies baseline rescaling. Raw scores cluster between 0.85 and 1.0, where differences are hard to see, so the library subtracts a corpus-level baseline to make them visible. So 0.09 doesn't mean "9% similar"; it means the score sits 0.09 above that baseline on the rescaled axis.

What matters is the relative difference: Map-Refine beats Baseline by 18.6%, which is a substantive gain.

🙋‍♀️
If you could add one more evaluation method, what would it be?
👨‍💻

Human evaluation. All of our metrics are automated proxies. The gold standard is domain experts, the people who actually read government reports, rating each summary's informational completeness, fluency, and factual accuracy on a Likert scale. We skipped it for time and cost, but it's the strongest form of validation.

The four-step framework for defense answers

Step 1: acknowledge the challenge head-on; don't dodge.

Step 2: explain what you did and why, even when the answer is "we didn't, because..."

Step 3: point to evidence: metrics, data, or literature.

Step 4: name the limitation and the next step; show maturity.

The professor doesn't expect perfection. He expects understanding.

The whole arc, one pass

The problem: government reports are too long for the AI to read in one pass, and truncated content simply disappears. On top of that, the AI hallucinates, writing fabricated content into summaries.

Three D&C methods: MapReduce (merge everything once), Map-Refine ★ (update as you read), Map-Cluster-Reduce (cluster, then merge). They differ in how they merge; Map-Refine merges only one chunk at a time, carries the least pressure, and performs best.

ROUGE-1: count overlapping words; split into precision (how much of what you said is right) and recall (how much you covered); F1 combines the two.

ROUGE-2: count overlapping bigrams; stricter, scores 0 the moment phrasing changes, sensitive to rewriting.

ROUGE-L: overlapping words must also appear in matching order; scramble the order and ROUGE-L falls.

BERTScore: similarity over semantic vectors; recognizes synonymous rewrites and covers ROUGE's literal-match blind spot.

Faithfulness: a second LLM checks the summary against the source for fabrications; a binary verdict, and the project's highlight.

Coverage: generated length ÷ reference length; ROUGE measures quality, coverage measures quantity, and you need both.

Conclusions: D&C lifts coverage from 63.7% to 86.2% (+22.5 pp) and BERTScore by 18.6%; the modest ROUGE-2 drop is a systematic side effect of merge rewriting, not a quality loss. Map-Refine is the best overall.

Next steps: constrain rewriting in the merge prompt, add extractive summarization, and run a fair comparison on a single shared dataset.

From scratch, no detours

Let me walk you through this project in plain English

We'll use one running example — a single sentence from a USDA report — throughout the whole piece. Every metric will be computed on this same example, so nothing gets confusing.

Part 1

What is this project? Two-minute version

👨‍💻

The US government publishes stacks of reports every year — nursing home abuse investigations, USDA administrative reform, things like that. Dozens of pages of dense text. Nobody wants to read every word, but the information is genuinely useful. This project trains an AI to pull out a summary for you.

But AI has a problem — it hallucinates. It will confidently write things into the summary that never appeared in the source document. That's dangerous, especially for policy documents. So the key contribution of this project is: after the AI generates a summary, it checks its own work for fabrications. That's the "Self-Checking" part.

👨‍💻

There's a second problem — the documents are too long for the AI to read in one pass. Our model can take in at most 16,000 tokens at a time, but government reports often run into tens of thousands of words. Everything past that limit just gets cut off and never processed.

The fix is called Divide & Conquer: split the document into chunks, summarize each chunk, then merge. We tried three different ways of doing the merge and ran experiments to see which works best.

Part 2

Three D&C methods — what's actually different?

👨‍💻

Quick setup: our USDA report is long enough that we split it into 2 chunks. I'll walk you through each method using the actual content of those 2 chunks so you can see exactly what happens.

Setup: the report is split into 2 chunks

Chunk 1 (first half, ~3000 words):

USDA has 13 staff offices and 8 mission areas — 18 agencies in total. In 2017, the Secretary of Agriculture issued a memorandum requiring each mission area to establish a business center to consolidate administrative services, with the goal of improving efficiency and cross-agency collaboration...

Chunk 2 (second half, ~3000 words):

By the time the report was written, all 8 mission areas had established their business centers. GAO audited the process and found uneven implementation across agencies, recommending that USDA set up a unified tracking mechanism and report regularly to Congress...

Method 1: MapReduce — "summarize separately, merge all at once"

How it works: summarize each chunk independently, producing two sub-summaries. Then feed both sub-summaries to the model in one shot and ask it to write the final summary.

Step 1: Map (summarize each)
Chunk 1 → sub-summary ①
"USDA has 13 offices and 8 mission areas; Secretary asked for business centers to consolidate services"
Chunk 2 → sub-summary ②
"All 8 mission areas have established centers; GAO recommends a tracking mechanism"
Step 2: Reduce (merge in one shot)
sub-summary ① + ② → model
The model is asked to fuse both pieces into one fluent paragraph, which forces a lot of rewriting. The more rewriting, the lower ROUGE-2.
→ final summary

Where it hurts: the two sub-summaries repeat each other in places. The model has to decide what to keep and what to drop, and it rewrites a lot in the process. More chunks and heavier compression make the rewriting worse. That "one-shot merge" is the bottleneck.

Method 2: Map-Cluster-Reduce — "group by topic, then merge in stages"

How it works: tries to improve MapReduce by clustering sub-summaries by topic first, merging within each cluster, and then merging across clusters. The theory is that grouping similar content together makes each merge cleaner.

Suppose we have 4 chunks
Step 1 Map: 4 chunks → sub-summaries ① ② ③ ④
Step 2 Cluster: algorithm finds ① & ③ similar (office structure), ② & ④ similar (GAO recommendations)
→ grouped into [① ③] and [② ④]
Step 3 Within-group Reduce: [① ③] → A, [② ④] → B  ← rewrite pass 1
Step 4 Final Reduce: A + B → final summary  ← rewrite pass 2

Theory vs. reality: on paper, clustering should make each merge tidier. In practice, adding a layer of merging is just adding another round of rewriting. Two rewrite passes stack up and ROUGE-2 drops even further than MapReduce — this method scores lowest. A more complex pipeline doesn't automatically mean better results.

Method 3: Map-Refine ★ — "update as you read, like a human would"

The first two methods share one fundamental issue: the merge step has to handle multiple pieces of content at once, which forces a lot of rewriting. Map-Refine rethinks this.

How it works: read chunk 1 and write a draft summary. Then feed chunk 2 together with the existing draft back to the model, and let it update the draft slightly to produce the final version. Each step only absorbs "one new piece of information" — it never merges everything at once.

Step-by-step (the 2-chunk USDA report)
Step 1: read Chunk 1 → draft summary
"USDA has 13 offices; Secretary in 2017 asked each mission area to set up a business center"
Step 2: read Chunk 2 + draftextend into final summary
"USDA has 13 offices; Secretary in 2017 asked for business centers; as of this report, 8 have been established, GAO recommends a tracking mechanism"
← the green portion is what Step 2 added; the original draft stays mostly intact

Why this is the best approach: each merge only has to process "one new chunk + existing summary" — minimum load, minimum rewriting, minimum error accumulation. It mirrors how humans actually read: take notes as you go, update them chapter by chapter. The more chunks you have, the bigger this advantage gets.

→ Experimental result: ROUGE-1, BERTScore, and coverage are all the highest — best overall among the three D&C methods.

🙋‍♀️
So the logic is: MapReduce is simplest but the merge is overloaded → Map-Cluster-Reduce tries to help but ends up rewriting even more → Map-Refine flips the approach and cuts the per-merge load from the ground up, which is why it wins?
👨‍💻

Exactly — that's the progression across the three methods. The key insight: don't try to make the big merge smarter, try to make each merge smaller. Map-Refine nails that principle.

Part 3

The running example — meet the two sentences

👨‍💻

Every metric below is computed on the same example — a sentence from the actual project data (the USDA report). Burn these two into your memory:

The running example (just these two sentences — remember them)

Reference summary (human-written ground truth):
"USDA established business centers in eight mission areas"

AI-generated summary:
"USDA created business centers across mission areas and staff offices"

At a glance, both sentences carry similar meaning — USDA established business centers across mission areas. But some words differ: the AI says "created" instead of "established", adds "staff offices", and drops "eight".

We'll take these two sentences and walk every evaluation metric through them, end to end.

Part 4 (key chapter!)

ROUGE-1: precision and recall via a Venn diagram

👨‍💻

ROUGE-1 is the simplest: count how many words appear in both texts. But "how many words overlap" isn't enough — you also need to distinguish "the AI said too much" from "the AI said too little." That's where precision and recall come in.

The Venn diagram makes this obvious in one look:

REFERENCE 8 words AI GENERATED 10 words OVERLAP 5 words established in eight AI missed these in reference, not in AI USDA business centers mission areas created across and staff offices AI added these in AI, not in reference
👨‍💻

From this diagram, three questions are answered immediately:

① How much of what the AI said is correct? → right circle: 5 overlap words ÷ 10 AI words = 50%. That's Precision.

② How much of the reference did the AI cover? → left circle: 5 overlap ÷ 8 reference words = 62.5%. That's Recall.

③ Combined score? → harmonic mean of those two — that's F1, which we call the ROUGE-1 score.

Precision P= overlap / total AI words= 5 / 10= 0.500
Recall R= overlap / total reference words= 5 / 8= 0.625

F1 score= 2 × P × R / (P + R)= 2 × 0.500 × 0.625 / 1.125= 0.556
🙋‍♀️
Why do we need F1? Couldn't we just use precision or recall alone?
👨‍💻

Because either one alone can be gamed.

If you only look at precision, the AI could say just one word — as long as it's correct — and precision is 100%. Useless summary.

If you only look at recall, the AI could copy-paste the entire source document for 100% recall. Not a summary either.

F1 is the harmonic mean — whenever either side goes extreme, F1 gets dragged down. Both have to be good for F1 to be good. That's why it's the more reliable combined metric.

Part 5

ROUGE-2: upgrade from words to word-pairs — stricter

👨‍💻

ROUGE-1 only matches single words, which is too loose. If the reference says "business centers" and the AI says "business cafeteria", ROUGE-1 still counts "business" as overlap — but clearly the two phrases mean different things.

ROUGE-2's upgrade: take every pair of adjacent words as a unit (a bigram), and only count a match if the whole pair is identical. Same example, run through it:

All reference bigrams
(usda, established)
(established, business)
(business, centers) ✓
(centers, in)
(in, eight)
(eight, mission)
(mission, areas) ✓
7 bigrams total
All AI-generated bigrams
(usda, created)
(created, business)
(business, centers) ✓
(centers, across)
(across, mission)
(mission, areas) ✓
(areas, and) (and, staff) (staff, offices)
9 bigrams total
👨‍💻

Only 2 bigrams match: (business, centers) and (mission, areas).

"established business" and "created business" don't match at the bigram level, even though ROUGE-1 would have counted "business" as a match.

Precision P= matched bigrams / AI bigrams= 2 / 9= 0.222
Recall R= matched bigrams / reference bigrams= 2 / 7= 0.286

ROUGE-2 F1= 2 × 0.222 × 0.286 / (0.222 + 0.286)= 0.250
Compared to ROUGE-1

Same example: ROUGE-1 = 0.556 vs ROUGE-2 = 0.250. More than 2× difference.

The reason: the AI swapped a lot of words ("established" → "created", "in eight" → "across"). Single-word overlap is still decent, but bigram overlap collapses. ROUGE-2 is extremely sensitive to paraphrasing — change the phrasing, and the score drops.

Part 6

ROUGE-L: word order has to match too

🙋‍♀️
I didn't really get the "word order" part of ROUGE-L — can you explain again?
👨‍💻

ROUGE-1 only asks: did this word appear? Order is ignored. So even if the AI says all the right words in a scrambled order, ROUGE-1 would still score high.

ROUGE-L adds: overlapping words also have to appear in the same relative order in both texts. It finds the "longest common subsequence" (LCS) — you can skip intermediate words, but you can't reorder them.

Two AI versions — see what order does to the score

Reference:

USDA established business centers in eight mission areas

Version A (normal order): LCS = 5 words

USDA created business centers across mission areas and staff offices

✓ Blue words appear in order: USDA → business → centers → mission → areas

Version B (order scrambled): same words, different order

mission areas business centers USDA established in eight

✗ Same words, order reversed → LCS is short → ROUGE-L drops sharply

Version A ROUGE-L F1LCS = 5; same math as ROUGE-1≈ 0.556
Version B ROUGE-L F1LCS = 2 (established, in); order broken≈ 0.182

A and B have identical ROUGE-1 (same words), but ROUGE-L differs 3× — order matters.
👨‍💻

ROUGE-L checks: did the AI not only use the right words but also put them in a reasonable order? A good summary isn't just about having the right information — it has to be organized coherently.

Part 7

BERTScore: ROUGE's blind spot — paraphrases don't count

🙋‍♀️
ROUGE-2 came out to only 0.25 — feels low. But "established" and "created" mean basically the same thing here…
👨‍💻

Right — you just spotted ROUGE's biggest weakness: it only sees the surface form of words, not their meaning. "Established" and "created" are two completely different tokens to ROUGE — 0 points. Any human would say they mean the same thing in context.

BERTScore uses a neural network to understand meaning. Think of each word as a point in high-dimensional space — semantically similar words end up close together. BERTScore compares distances in that space, not surface spellings.

What BERTScore does on the same example

For each word in the AI summary, find the most semantically similar word in the reference and compute a similarity score (0–1):

AI word Closest reference word Semantic similarity ROUGE-1
USDAUSDA0.99✓ match
createdestablished0.92✗ no match
business centersbusiness centers0.99✓ match
acrossin0.78✗ no match
mission areasmission areas0.99✓ match
staff officesmission areas0.71✗ no match
BERTScore Precision= average of top similarity scores across all AI words≈ 0.90

ROUGE-2 gave this pair 0.25; BERTScore gave ~0.90. Same sentences, very different numbers — they measure different things.
Why BERTScore matters so much for this project

Our D&C pipeline has a merge step where the model rewrites multiple sub-summaries into a unified text. That inevitably swaps words and phrasings.

Rewriting makes ROUGE-2 drop — but rewriting ≠ quality degradation. BERTScore recognizes synonymous rewrites, so it reflects the actual semantic quality of merged summaries more honestly.

ROUGE and BERTScore have to be read together: one measures literal match, the other measures semantic match. Drop either one and you miss half the picture.

Part 8

Faithfulness: how do we check whether the AI made things up?

🙋‍♀️
The earlier metrics all compare against the reference summary. How is Faithfulness computed? Is it just Y or N?
👨‍💻

Yes, at the core it's Y or N — but there's a whole protocol behind it. Let me walk through the full flow.

First, why we need it: ROUGE and BERTScore both compare against the reference summary, but the reference is human-written and isn't the same as the source document. An AI could produce a high-ROUGE summary that still fabricates content. So we need a metric that compares directly against the original document.

Step 1: build a prompt for the judge model

Instead of manually checking each claim, we write a prompt asking another LLM to judge. Roughly:

You are a fact-checking assistant.
Here is the original document:
[full source document, thousands of words]

Here is a summary generated by an AI:
[AI-generated summary]

Does the summary make any claims NOT supported by the document?
Answer "Faithful" or "Unfaithful", then explain why.

The judge model reads both inputs and returns its verdict.

Step 2: what is the judge actually checking for?

The judge walks the summary claim by claim and looks for three kinds of problems:

① Hallucination
Summary claims something not in the source. E.g., source doesn't say "reform was completed," but summary claims it was.
② Factual error
Numbers or descriptions contradict the source. Source says 13 offices, summary says 15.
③ Critical omission
A core conclusion in the source is missing, leaving the reader with a misleading impression.
Step 3: output Y/N, aggregate into a pass rate

Each report yields one Faithful/Unfaithful verdict plus a short explanation. For example:

✓ Faithful case
"Faithful. The summary accurately reflects the document's key findings about USDA's business center initiative."
✗ Unfaithful case (Sample #3)
"Unfaithful. The summary omits several key details about the investigation methods and specific abuse statistics cited in the report."

Run 100 reports, get 73 Faithful + 27 Unfaithful → Faithfulness pass rate = 73%. That's the final number reported.

🙋‍♀️
But that Y/N comes from another model judging itself. Is that reliable? Can it get things wrong?
👨‍💻

Sharp question — it's one of the real limitations. The judge is also an LLM; it can misjudge, and complex reasoning can throw it off.

This approach is called LLM-as-Judge, and it's the mainstream way of evaluating faithfulness in NLP right now — human fact-checking every summary is too expensive. A stronger setup would use a more powerful model (like GPT-4) as the judge; our project uses the same model as both generator and checker, which has known limitations. We call this out in the limitations section.

Even so, the Unfaithful cases it flags are meaningful — at minimum, they show the summary doesn't hold up within the model's own reasoning framework, which is a useful signal.

Recap: Faithfulness and ROUGE answer different questions
ROUGE / BERTScore
Compare against the reference summary — is the summary shaped like the human version?
Question: is it high-quality?
Faithfulness
Compare against the original document — is the content actually in the source?
Question: is it trustworthy?

A high-ROUGE summary can totally fail Faithfulness, and vice versa. Two independent dimensions — drop either one and you have a blind spot.

Part 9

Coverage: why do we need this on top of ROUGE?

🙋‍♀️
Those metrics already feel like enough — why another one?
👨‍💻

ROUGE measures quality — whether what you said is correct — but not quantity — whether you said enough.

Picture this: the reference has 10 sentences, the AI writes only 2, but those 2 are the most important ones and every word appears in the reference. ROUGE-1 precision is perfect, so the score looks respectable. Would you call that a good summary? It skipped 80% of the content.

Coverage = generated length ÷ reference length

Baseline: 374 words on average, reference averages 587 → coverage = 374/587 = 63.7%

Map-Refine: 495 words on average, reference averages 574 → coverage = 495/574 = 86.2%

That +22.5-point jump directly shows D&C solved the truncation problem — from roughly 6 out of 10 pieces of information captured to about 8.6 out of 10. Arguably the single most important number in the project.
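The metric itself is a one-liner. Here is the arithmetic on the report's own averaged word counts (whitespace splitting stands in for whatever tokenization the pipeline actually used):

```python
# Coverage = generated length / reference length, counted in words here.

def coverage(generated: str, reference: str) -> float:
    return len(generated.split()) / len(reference.split())

# The averaged word counts from the experiments:
print(round(374 / 587 * 100, 1))  # 63.7 (Baseline)
print(round(495 / 574 * 100, 1))  # 86.2 (Map-Refine)
```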

👨‍💻

Put together, these six metrics give you the full evaluation framework:

ROUGE-1: did the words show up in the reference? (word-level quality)

ROUGE-2: did the word pairs show up? (phrase-level quality, stricter)

ROUGE-L: did the words appear in the right order? (structure)

BERTScore: how close is the meaning? (semantic, catches paraphrases)

Faithfulness: did the AI make anything up? (vs. source)

Coverage: did it say enough? (quantity)

Miss any one and you only see part of the story.
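The ROUGE mechanics are easy to see on a toy pair. This is a recall-only sketch (real ROUGE also reports precision and F1), and the two sentences are invented for illustration:

```python
# Toy ROUGE-n recall: fraction of the reference's n-grams that also
# appear in the generated summary.
from collections import Counter

def ngrams(text, n):
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n_recall(generated, reference, n):
    ref, gen = ngrams(reference, n), ngrams(generated, n)
    overlap = sum(min(count, gen[g]) for g, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

ref = "USDA built business centers in all eight mission areas"
gen = "USDA built centers in all eight of its mission areas"
print(round(rouge_n_recall(gen, ref, 1), 2))  # unigrams survive rewording
print(round(rouge_n_recall(gen, ref, 2), 2))  # bigrams drop much faster
```

Notice how the small rewording costs one unigram but breaks several bigrams at once, which is exactly why ROUGE-2 is the metric most punished by merge-step paraphrasing.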

Part 10

Experimental results — the numbers and what they mean

👨‍💻

Numbers first, then we unpack them row by row.

Method                          ROUGE-1 ↑   ROUGE-2 ↑   BERTScore ↑   Coverage ↑   Faithfulness ↑
Baseline (truncation, no D&C)   0.4945      0.1818      0.0774        63.7%        98.0%
MapReduce                       0.505       0.154       0.089         82.9%        N/A
Map-Refine ★ (recommended)      0.5129      0.1696      0.0918        86.2%        83.5%
Map-Cluster-Reduce              0.502       0.160       0.088         83.4%        N/A
Wait — Faithfulness went down? What's going on?

Baseline scores 98%, Map-Refine only 83.5% — a 14.5-point drop. Looks like regression, but there's a clean explanation:

Baseline only processed the first half of the document due to truncation. That first half is relatively simple, so the AI doesn't fabricate much — Faithfulness stays high. But this is "survivorship bias from truncation" — the complex back half was never processed, so it couldn't go wrong.

Map-Refine reads the entire document, including the denser back half. More content → more opportunity for minor omissions or drift → Faithfulness drops. That's the cost of covering more.

Baseline's 98% means "you can't fail what you don't attempt"; Map-Refine's 83.5% means "you made mistakes because you actually tried." The latter covers 86.2% of content vs. the former's 63.7%. For government reports, coverage matters more.

👨‍💻

Now row by row — can't just stare at numbers, have to read what they mean.

Baseline: what does 63.7% coverage really mean?

Baseline doesn't use D&C — truncates the document to 16,000 tokens and hands it to the model. Government reports are often tens of thousands of words; the back half simply never gets read.

63.7% coverage means: AI's summary averages 374 words while the reference averages 587. 36% of the content is lost to truncation before anything else happens. That's the core problem this project exists to solve.

ROUGE-2 = 0.1818 is the highest of any method — not because Baseline is best, but because it processes only the front half, does less rewriting, and gets higher literal word-pair overlap. That number is a trap: glance at it and you'd think Baseline wins.

MapReduce: coverage jumps, but one-shot merge compresses too aggressively

Coverage leaps from 63.7% to 82.9% — D&C fixes the truncation problem, the back half is now being read. ROUGE-1 and BERTScore also rise.

But ROUGE-2 drops from 0.1818 to 0.154 — the biggest drop. Reason: MapReduce's final step fuses all sub-summaries in one pass, forcing heavy rewriting for fluency. More rewriting → lower phrase-level match.

Map-Refine ★: best overall — every "positive" metric is highest

Highest ROUGE-1 (0.5129), highest BERTScore (0.0918), highest Coverage (86.2%). Best across the board among the three D&C methods.

Why? Each merge only absorbs one new chunk — minimum merging pressure, minimum rewriting, minimum error accumulation.
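The rolling update fits in a few lines. Here `llm` is a stand-in for the actual model call, and the prompt wording is hypothetical, not the project's real template:

```python
# Sketch of Map-Refine's rolling update. `llm` is a hypothetical
# prompt-in, text-out model call; prompt wording is illustrative.

def map_refine(chunks, llm):
    draft = llm(f"Summarize this section:\n{chunks[0]}")
    for chunk in chunks[1:]:
        # Each step absorbs exactly one new chunk into the existing draft,
        # so merge pressure (and rewriting) stays minimal.
        draft = llm(
            f"Current draft summary:\n{draft}\n\n"
            f"Minimally update the draft to add this new section:\n{chunk}"
        )
    return draft
```

The key property is that each prompt contains only the running draft plus one new chunk, never all sub-summaries at once.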

ROUGE-2 also drops (0.1696). That's a systemic issue across all D&C methods, not a Map-Refine flaw. We unpack that next.

Map-Cluster-Reduce: clustering added, but didn't pay off

Theory said clustering would help the model organize by topic. Data shows Map-Cluster-Reduce lands between MapReduce and Map-Refine on every metric — never beating Map-Refine.

Likely reason: clustering added an extra merge layer (sub-summary → group summary → final summary) — two rounds of rewriting, more content drift. An extra step isn't automatically an improvement — sometimes complexity costs more than it helps.

Part 11

Final conclusions — what we found, what's next

👨‍💻

Numbers done. Now the conclusions. Three layers: what we solved, what we didn't, what comes next.

Conclusion 1: D&C does solve the truncation problem
Coverage rose 63.7% → 86.2% (+22.5 points). The project's most important contribution. Previously AI could only read the front half; now nearly the entire document is processed and the summary carries substantially more information. For long documents like government reports, this is the critical improvement.
Conclusion 2: semantic quality improved — BERTScore up 18.6%
BERTScore 0.0774 → 0.0918 means the AI's content is semantically closer to the ground truth. D&C isn't just "writing more" — it's "writing more accurately" because more content is read and understood.
⚠️
Conclusion 3: ROUGE-2 drops — phrase-level precision has a cost
All three D&C methods score lower than Baseline on ROUGE-2. Not a method-specific failure; it's a systemic cost of D&C itself — as long as there's a merge step, there's rewriting, and rewriting drags ROUGE-2 down. Real tradeoff, can't be hand-waved away.
🏆
Conclusion 4: Map-Refine is the best of the three
On every "higher is better" metric (ROUGE-1, BERTScore, Coverage), Map-Refine is top. Its rolling-update mechanism only handles one new chunk at a time — lowest merge pressure, slowest error accumulation. Conclusion backed by data, not speculation.
👨‍💻

Conclusions in. What would the best next steps be? Three directions:

Direction ① Constrain the merge step's rewriting

The merge prompt currently doesn't ask the model to preserve source wording. Fix: add one line — "When merging, preserve the original document's key terminology; do not substitute alternative phrasings."

Expected effect: ROUGE-2 recovers, Coverage and BERTScore hold steady. Lowest-cost fix — change one line of prompt and run it.
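What that one-line change might look like in practice. The surrounding prompt wording is hypothetical, not the project's actual template:

```python
# Hypothetical merge prompt with the proposed constraint added.
MERGE_PROMPT = (
    "Combine the following partial summaries into one coherent summary.\n"
    "When merging, preserve the original document's key terminology; "
    "do not substitute alternative phrasings.\n\n"
    "{sub_summaries}"
)

prompt = MERGE_PROMPT.format(sub_summaries="1. ...\n2. ...")
```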

Direction ② Introduce extractive summarization to reduce rewriting

Currently everything is generative — the AI writes the words itself. Alternative: first extract key sentences directly from the source (unchanged), then use the model only to organize the output.

This "extractive + generative" hybrid gets very high ROUGE-2 on extracted sentences (verbatim). Tradeoff: slightly less fluent prose, but stronger factual accuracy and better Faithfulness.
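A minimal version of the extractive step, assuming plain word-frequency scoring. This is the shape of the idea, not a tested design:

```python
# Hypothetical extractive pass: score sentences by average word frequency
# and keep the top k verbatim, in their original order.
import re
from collections import Counter

def extract_top_sentences(text, k=3):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freqs = Counter(w.lower() for w in re.findall(r"\w+", text))

    def score(sentence):
        words = re.findall(r"\w+", sentence)
        return sum(freqs[w.lower()] for w in words) / max(len(words), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:k])
    return [s for s in sentences if s in top]  # verbatim, original order
```

Because the selected sentences are copied unchanged, any ROUGE-2 computed over them against the source is perfect by construction; the generative model then only arranges them.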

Direction ③ Run a fair head-to-head on a larger dataset

Baseline ran on 100 reports, Map-Refine on 400, and the two runs used slightly different reports (reference lengths: 587 vs 574 words). Not perfectly controlled.

Ideal setup: all four methods on the exact same 400 reports, head-to-head. Didn't do it due to compute/time limits, but it's the next experiment we'd add.

👨‍💻

One-sentence wrap-up: D&C successfully solved the truncation problem, Map-Refine is the current best option, and the next priority is suppressing merge-step rewriting so ROUGE-2 and BERTScore can improve together instead of trading off.

Part 12 (important reflection)

How should we interpret the ROUGE-2 drop?

🙋‍♀️
Wait — if you already knew D&C has a merge step, merge rewrites phrasing, and ROUGE-2 drops from rewriting, why use ROUGE-2 at all? Isn't that setting yourself up to fail? Doesn't that mean the experiment failed?
👨‍💻

Sharp question — shows you actually got it. Four layers of response:

① Not reporting ROUGE isn't an option — it's the field's common currency

ROUGE has been the standard metric in text summarization for 20 years. Every paper reports it. If you don't, your results can't be compared to any prior work. Even knowing its limitations, you report it — and explain the limitations. That's academic rigor, not self-sabotage.

② A ROUGE-2 drop isn't failure — it's discovering a tradeoff

We didn't hide the drop. We displayed it in full and explained why it happens. That's honest reporting, not failure.

D&C trades some rewriting for +22.5 points of coverage and +18.6% BERTScore at a cost of −6.7% ROUGE-2. Whether that tradeoff is worth it depends on what you're using the tool for.

For government reports — missing 36% of the content is the real disaster; some word substitutions are a minor issue. So the tradeoff is clearly worth it here.

③ BERTScore exists precisely to address this ROUGE limitation

The field has known for years that ROUGE is unfair to paraphrasing — BERTScore was introduced in 2019 specifically to address this. We use both together because we know ROUGE has this blind spot.

If we only reported BERTScore without ROUGE, we'd get the opposite criticism: "why only one metric?" Reporting both and letting the reader see the full picture is the correct approach.

④ But the critique is valid — this is a real limitation worth naming

Someone who only looks at ROUGE-2 would conclude D&C made things worse. That's an incomplete read, but it will happen.

In the talk, we can't just throw up a table and walk off. We have to proactively explain: "ROUGE-2 drops because merge introduces paraphrasing, and ROUGE is sensitive to paraphrasing. BERTScore recognizes synonymous rewrites — its rise tells us semantic quality actually improved. Read both together for the full story."

Future direction: update the merge prompt to explicitly "preserve source terminology; do not substitute phrasing," or introduce extractive elements to reduce rewriting, so ROUGE-2 can rise alongside the others.

👨‍💻

One line: the ROUGE-2 drop isn't experimental failure — it's a known side effect of the merge step. We understand the cause, compensated with BERTScore, and pointed to a clear fix. That's good science.

Part 13 (Q&A prep)

Tough Questions — what actually gets challenged

👨‍💻

These are real questions that came up during presentations of this project. If you're presenting this work, you need solid answers to all of them. The core theme is one thing: chunking is lossy compression — how do you handle the information loss?

🙋‍♀️
What techniques did you use to chunk the documents? How big are the chunks?
👨‍💻

Our model has a context window of about 16,000 tokens, but government reports can be 30,000 to 50,000 tokens. So we split sequentially with a fixed token budget — about 8,000 tokens per chunk. A typical USDA report ends up as 2 to 4 chunks.

🙋‍♀️
Do the chunks overlap? How much? How did you decide that?
👨‍💻

Yes — each chunk shares about 200 tokens with the next one, a sliding window. For a middle chunk that means 200 tokens on each side, roughly 5% of its 8,000-token budget in total. The reason: if you hard-cut at exactly token 8,000, you might land in the middle of a sentence. The overlap ensures boundary context isn't lost — the end of chunk 1 and the beginning of chunk 2 share a small strip of text so the model has continuity.

Too little overlap risks splitting sentences. Too much overlap wastes the context window on repeated content. 200 tokens is the empirical sweet spot.
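The slicing logic is simple. A sketch with whitespace tokens standing in for the model's tokenizer (the real pipeline would count model tokens):

```python
# Sketch of sequential chunking with a fixed budget and a sliding-window
# overlap. Token counting here is illustrative, not the real tokenizer.

def chunk_tokens(tokens, size=8000, overlap=200):
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += size - overlap  # next chunk re-reads the last `overlap` tokens
    return chunks

# Tiny demo: 20 tokens, chunks of 10 with an overlap of 2.
demo = chunk_tokens(list(range(20)), size=10, overlap=2)
# adjacent chunks share their boundary tokens
```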

The professor's core challenge

"Summarization is essentially a lossy affair. You're going to let go of some information. Did you notice any reduction in quality because of that? What about cross-chunk context loss?"

👨‍💻

Yes — that's the core tradeoff of Divide-and-Conquer, and we take it seriously. Chunking is lossy: each chunk is summarized without seeing the other chunks, so cross-chunk context is lost. We address this in three ways:

First, overlap at chunk boundaries — 200 tokens shared between adjacent chunks, so sentences at the boundary aren't split.

Second, the choice of merge strategy matters enormously. MapReduce summarizes every chunk in isolation and merges once — maximum context loss. Map-Refine fixes this by carrying the draft forward: when it reads chunk 2, it already has the summary from chunk 1 as context. Cross-chunk information is preserved incrementally. This is why Map-Refine outperforms MapReduce on every positive metric.

Third, our metrics actually capture the effect. Coverage went from 63.7% to 86.2% — meaning D&C recovers most of the information that Baseline loses to truncation. BERTScore went up 18.6%, showing semantic quality improved, not degraded. If chunking were destroying too much context, these numbers would go down, not up.

Why this answer works

The professor isn't asking you to eliminate the tradeoff — that's impossible. They want to know: (1) are you aware of the tradeoff, (2) did you do anything to mitigate it, and (3) do you have evidence it worked? This answer hits all three.

🙋‍♀️
Are you penalizing bad summaries? What if one chunk's summary is terrible — does it propagate?
👨‍💻

In the current pipeline, no — we don't have a per-chunk quality gate. Every chunk summary goes directly into the merge step regardless of quality. That's a real limitation.

A clear next step would be to add a self-check at the chunk level too — not just at the final summary. If a chunk summary scores below a threshold on a quick faithfulness check, regenerate it before it enters the merge. We didn't implement this, but we know it's the right direction.

The right way to handle "we didn't do that"

Never just say "we didn't do that" and stop. Always follow up with: "but here's what we'd do if we did, and here's why we didn't this time." That shows you understand the limitation — you just ran out of time or compute, not understanding.

🙋‍♀️
Why not use a stronger model like GPT-4 or Claude as the judge? Wouldn't that give better Faithfulness scores?
👨‍💻

It would — but intentionally, we didn't. We used the same model family for both generation and judging. The reason: if we used GPT-4 as the judge, the Faithfulness score would look better, but we wouldn't know whether the improvement came from our pipeline design or just from having a stronger judge.

This is a proof-of-concept project. We want to isolate the effect of the D&C method itself. If the idea works even with a modest 7B model judging itself, it'll work even better with a stronger judge later. That's a much stronger claim than "it works, but only if you throw GPT-4 at it."

🙋‍♀️
Why not just use a model with a larger context window — like 128K tokens — and skip the chunking entirely?
👨‍💻

Larger context windows exist — GPT-4 has 128K tokens. But we're using Qwen 2.5-7B on an A100 with 4-bit quantization. At that model size, the effective context window is about 16K before quality degrades noticeably.

More importantly, research shows that even with larger windows, models tend to lose information in the middle of very long inputs — the "lost in the middle" problem. The model pays attention to the beginning and end but forgets things in the middle. Chunking with D&C actually forces the model to process every part of the document carefully, which is why our coverage goes up rather than relying on a long context window that might silently ignore the middle.

🙋‍♀️
Why did you choose ROUGE and not BLEU or METEOR?
👨‍💻

BLEU was designed for machine translation, not summarization. It focuses on n-gram precision and doesn't compute recall — but recall is critical for summaries because we need to know how much of the reference we covered. METEOR is closer but far less commonly reported in summarization papers, so our results wouldn't be comparable to prior work.

ROUGE has been the standard in summarization for 20 years. Every paper reports it. We chose it so our numbers are directly comparable — and we added BERTScore specifically because we knew ROUGE's paraphrasing blind spot would hurt us in a D&C pipeline.

🙋‍♀️
BERTScore is 0.07 to 0.09 — that seems almost zero. Is something wrong?
👨‍💻

No — the raw numbers look low because BERTScore applies baseline rescaling. Without rescaling, most scores cluster between 0.85 and 1.0, which makes everything look identical. The library linearly rescales each score against a corpus-level baseline computed from random sentence pairs, roughly (raw − baseline) / (1 − baseline), so the baseline maps to 0 and a perfect match maps to 1, making differences visible. So 0.09 means "modestly above the random-pair baseline," not "9% similar."

The important thing is the relative difference: Map-Refine is 18.6% higher than Baseline. That's a substantial and meaningful improvement.

🙋‍♀️
If you could add one more evaluation method, what would it be?
👨‍💻

Human evaluation. All our metrics are automated proxies. The gold standard would be having domain experts — people who actually read government reports — rate the summaries on a Likert scale for informativeness, fluency, and factual accuracy. We didn't do it because of time and cost, but it would be the strongest validation of whether D&C actually produces better summaries in practice, not just on automated benchmarks.

The pattern behind every good Q&A answer

Step 1: Acknowledge the challenge directly — don't dodge.

Step 2: Explain what you did and why — even if the answer is "we didn't, here's why."

Step 3: Point to evidence — metrics, data, or literature.

Step 4: Name the limitation and what you'd do next — shows maturity.

Professors don't expect perfection. They expect understanding.

Full story in one sweep

The problem: government reports are too long; AI can't read them in one pass, and truncated content disappears. AI also hallucinates, so summaries can contain fabricated claims.

Three D&C methods: MapReduce (merge all at once), Map-Refine ★ (update as you read), Map-Cluster-Reduce (cluster first, then merge). The difference is how they merge. Map-Refine only absorbs one chunk per merge — minimum pressure, best result.

ROUGE-1: count overlapping words; break into precision (how much of what you said is correct) and recall (how much of the reference you covered); F1 combines them.

ROUGE-2: count overlapping word-pairs; stricter; change phrasing and the score collapses — highly sensitive to paraphrasing.

ROUGE-L: overlapping words also need to keep their order; scramble the order and ROUGE-L drops.

BERTScore: uses semantic embeddings to recognize synonymous rewrites, covering ROUGE's literal-match blind spot.

Faithfulness: another LLM compares the summary against the source and returns Y/N. This is the project's signature contribution.

Coverage: generated length ÷ reference length. ROUGE measures quality; Coverage measures quantity. Both are needed.

Conclusions: D&C pushes coverage 63.7% → 86.2% (+22.5pp), BERTScore +18.6%. The minor ROUGE-2 drop is a systemic side effect of merge rewriting, not quality regression. Map-Refine wins overall.

Next steps: constrain rewriting in the merge prompt, introduce extractive summarization, run a fair head-to-head on the same dataset.