第二十四届全国计算语言学学术会议

The 24th Chinese National Conference on Computational Linguistics (CCL 2025)




Proceedings of the 24th Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

pdf(全部) bib(全部)

难度可控的词义辨析选择题自动生成
pdf bib 刘廷超, 王雨, 荀恩东

针对汉语词义辨析选择题的自动生成任务,本文提出一种基于检索增强生成(RAG)技术的智能出题框架。该框架通过构建融合词汇等级、词频与句子长度的多维度难度评估模型,实现习题难度的个性化控制。研究通过整合语言要素知识库与BCC语料库,有效提升语境自然性与干扰项质量,并引入格式校验、逻辑验证与答案唯一性检测的多维校验机制,确保输出题目符合教学规范。实验结果显示,该方法在出题成功率、答案正确率与内容多样性等关键指标上显著优于传统微调模型,展现出良好的教学适配性与应用潜力,为汉语教学智能化提供新的技术路径。
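下面给出一个仅作示意的最小打分草图(Python):假设词汇等级与词频已归一化到[0,1],用线性加权把词汇等级、词频与句长组合成难度分数;其中的特征名、权重与归一化方式均为本文之外的假设,并非论文的原始实现。

```python
# 示意:多维度难度打分(线性加权)。特征与权重均为假设,并非论文的实际模型。
def difficulty_score(vocab_level: float, word_freq: float, sent_len: int,
                     weights=(0.5, 0.3, 0.2), max_len: int = 50) -> float:
    """vocab_level、word_freq 假定已归一化到 [0, 1];词频越高难度越低。"""
    freq_difficulty = 1.0 - word_freq               # 高频词更容易
    len_difficulty = min(sent_len / max_len, 1.0)   # 句子越长越难(截断到 1)
    w1, w2, w3 = weights
    return w1 * vocab_level + w2 * freq_difficulty + w3 * len_difficulty

# 用法示例:分数越高,题目越难
print(difficulty_score(vocab_level=0.8, word_freq=0.2, sent_len=30))
```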

基于多样性数据重组增强的藏汉神经机器翻译
pdf bib 薛嘉怡, 陈锦明, 陈波, 鲍薇, 赵小兵

高资源语言的神经机器翻译虽已取得显著进展,但低资源语言面临更严重的平行数据不足的问题。为此,提出一种面向藏汉神经机器翻译的多样性数据重组增强方法(DiRec)。该方法利用大语言模型的双向语言能力,对已有藏汉平行数据进行成分重组、句型重组和风格重组三种数据重组,经过两轮质量自动筛选后得到多样性增强数据。在藏汉机器翻译的实验中,相较于基线模型,基于DiRec的模型的泛化能力指标提升4.83个百分点,BLEU提高0.55,chrF++提高0.20。最后分析了不同数据重组方式对翻译模型性能的影响。

DeepSeek等大语言模型幽默生成能力及其特征的评测分析
pdf bib 蒋彦廷, 应以周

以人类的笑话文本为基础,比较评测了4个大语言模型生成幽默笑点句的能力。总的来看,目前DeepSeek-R1的中文幽默生成能力强于GPT-4o、Qwen2.5-7B和Qwen3模型,但距离人类的幽默能力还有明显的差距。各模型基于固定表达生成笑点句时,或多或少存在“思维定势”问题。测查了人类与大语言模型幽默文本的9项语言特征。DeepSeek与人类的相似笑点最多,BLEU-4匹配度也最高。与人类相比,AI生成的笑点句更倾向于使用高频常见的词,未登录词、网络新词的比例更低,在长度上普遍更长。基于Sentence-BERT模型获取语义表示,大模型的笑点句在语义联想距离上普遍比人类的笑点句更短。强化谐音双关、语义双关等修辞手法的运用,是大模型提高幽默文本生成能力的重要途径。最后,我们讨论了本文评价方式的优劣,并展望了增强大模型幽默能力的3个策略:优化提示工程、构建幽默多模态大模型、在推理中增强幽默文本的可解释性。

面向法律事件检测的大模型协同主动学习框架
pdf bib 崔婷婷, 昝红英, 籍欣萌, 宋金旺, 张坤丽, 贾玉祥

法律事件检测任务旨在识别并分类法律文本中的事件。然而,复杂的法律案件使得收集高质量标注数据面临巨大挑战。目前领域数据标注主要依赖人工,成本高昂且耗时。尽管传统的主动学习能够减少部分标注需求,但仍依赖于人工干预。大模型的发展为自动化数据标注带来了可能性,但如何确保标注的可靠性仍是亟待解决的问题。为此,本文提出了创新的协作训练范式,使用主动学习迭代选择训练数据,并利用大模型生成高质量标注,使用评估筛选机制保留高质量标注,大幅减少了人工标注的工作量。在两个事件检测基准数据集上的实验表明,该方法在低资源场景下显著降低了人工标注需求,在部分情况下可以接近监督学习的性能。
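下面用一个极简的Python骨架示意“主动选样→大模型标注→质量筛选→增量训练”的协同迭代流程;其中llm_annotate、quality_filter等函数均为占位实现(真实场景需调用大模型与评估筛选机制),整体只是对摘要流程的一种可能理解,并非论文代码。

```python
import random

# 占位函数:真实场景中分别对应大模型标注、评估筛选、不确定性度量与增量训练
def llm_annotate(text):            # 调用大语言模型生成候选事件标注
    return "EVENT_TYPE_PLACEHOLDER"

def quality_filter(text, label):   # 评估筛选机制:仅保留高质量标注
    return True

def uncertainty(model, text):      # 如用预测熵度量样本不确定性
    return random.random()

def train(model, labeled):         # 用累计的高质量标注增量训练事件检测模型
    return model

def active_learning_loop(model, pool, rounds=3, k=5):
    labeled = []
    for _ in range(rounds):
        pool.sort(key=lambda x: uncertainty(model, x), reverse=True)
        batch, pool = pool[:k], pool[k:]          # 选出最不确定的 k 条样本
        for text in batch:
            label = llm_annotate(text)
            if quality_filter(text, label):
                labeled.append((text, label))
        model = train(model, labeled)
    return model, labeled

model, data = active_learning_loop(model=None, pool=[f"案件文本{i}" for i in range(20)])
print(len(data))
```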

基于多维度答案筛选的低资源语言开放域问答方法
pdf bib 王新阳, 关昕, 张利飞, 余正涛, 黄于欣

开放域问答通常是从大规模数据中检索多个相关文档,并利用大语言模型对文档内容进行理解生成答案。然而,面向缅甸语、老挝语等低资源语言,检索到的数据可能存在问题无关的噪声文档,且大语言模型对低资源语言理解能力弱,生成答案错误率高。对此,提出一种基于多维度答案筛选的低资源语言开放域问答方法,将现有基于大模型直接理解文档生成答案的过程,转换成多个候选答案生成并筛选的多阶段过程。在答案生成阶段,从文档中抽取多样化的候选答案,在筛选阶段,设计多维度答案筛选策略,通过全局篇章答案验证、局部证据答案验证以及不同答案相关性排序,筛选出最优答案。在四种东南亚低资源语言开放域问答数据集上的实验结果表明,基于GPT-4o-mini、DeepSeek-V3等大语言模型底座,提出方法相比思维链、摘要验证等最优方法都取得了更好的性能,验证了多阶段答案生成筛选过程在低资源开放域问答任务中有效性。

基于区域顶点标注的司法文本实体关系联合抽取
pdf bib 乐滢滢, 孙媛媛, 林鸿飞

司法领域中的实体关系联合抽取在许多下游任务中(如量刑预测、知识库构建等)具有重要意义。然而,由于垂直领域中的数据资源稀缺,而且司法文本中存在复杂的长句以及关系重叠现象,这使得信息抽取工作颇具挑战性。为应对这一挑战,我们首先标注了一个包含多个罪名的司法领域的专有数据集,然后提出了一种基于三元组区域顶点的联合抽取填表法。我们采用多标签分类对三元组的边界进行标注,以此提取三元组,从而充分利用实体的边界信息。此外,为融入实体对之间的距离信息,我们引入了距离嵌入,并采用扩张卷积来捕捉多尺度上下文信息。我们在司法数据集上对模型进行了评估。实验结果表明,我们的模型在这个数据集上均取得了最先进的性能。

基于大模型增强的两阶段高效事件共指消解方法
pdf bib 吴耀宗, 齐帅, 王方圆, 鲍琛龙, 唐晋韬

本文针对两阶段事件共指消解方法存在的触发词词目启发机制缺乏同义词聚类能力和小模型理解触发词指代事件能力有限等问题,提出了一种基于大模型增强的两阶段高效的事件共指消解方法,一阶段引入大模型进行同义词聚类,二阶段大模型提供触发词解释文本增强小模型。此外,设计了引导小模型侧重触发词特征向量的损失函数。本文方法在保持近似线性时间复杂度的同时,在ECB+和GVC数据集上的CoNLL F1得分分别提升了2.9和8.0。

MIDF: 基于模态交互和关系引导决策融合的多模态知识图谱补全
pdf bib 曾啸尘, 赵晖, 英迪

多模态知识图补全(MMKGC)通过融合实体间的结构化语义信息与多模态特征,从给定的多模态知识图谱(MMKG)中发现未观察到的潜在事实。然而,现有方法普遍忽略了实体表示过程中不同模态的交互,同时缺乏对补全过程中模态之间互补性的关注。为了解决这些不足,我们提出了一种新的模型MIDF(模态交互和决策融合)来处理多模态的交互和互补。该模型首先设计了一个实体多模态交互融合模块,将实体的图像和文本特征提前交互后,再与结构特征进行融合,充分学习实体的嵌入。为了在补全过程中进一步利用不同模态之间的互补性,我们设计了关系引导的决策融合模块。通过使用不同模态的预测结果以及关系引导的权重,进一步利用模态的互补性,融合预测结果。在DB15K和MKG-W上的广泛实验证明,我们的MIDF优于现有的最先进的模型,证明了我们方法的有效性。

基于大语言模型的中文医学命名实体识别
pdf bib 吕腾啸, 罗凌, 吕慧怡, 孙媛媛, 王健, 林鸿飞

从中文文本中准确识别医学命名实体是实现中文医疗信息结构化的关键。传统机器学习方法在面对中文医学实体边界模糊和嵌套结构复杂等问题时效果有限。本文提出一种基于大语言模型的中文医学命名实体识别方法,首先通过任务重构将识别过程转化为文本生成任务,设计了适配的标注策略以统一处理平面与嵌套实体,然后引入实体筛选器过滤错误候选实体,最后通过大语言模型决策进行冲突消解与多模型集成提升系统整体鲁棒性。在CMeEE-V2与CCKS2019两个数据集上实验结果显示,所提方法在识别准确性与鲁棒性方面均达到当前先进水平,F1值分别为0.7785和0.8821。

例句质量评估体系构建及大语言模型例句生成能力评估
pdf bib 方明炜, 朱君辉, 鲁鹿鸣, 杨尔弘, 杨麟儿

本研究针对大语言模型(LLMs)生成例句的教学适用性问题,基于二语习得认知理论构建了多维例句质量评估体系,涵盖规范性、语境独立性、典型度、词汇适切性及句法复杂度五大核心维度。通过采集汉语词典与教材的优质例句作为基准语料,结合特征工程构建了机器学习模型(准确率为98.6%),验证了评估框架的有效性。在此基础上,本研究利用该评估框架对LLMs生成例句与传统人工编纂词典中的例句进行了系统对比分析。研究结果表明:LLMs在语法典型度、词汇难度、汉字笔画数方面展现出与传统词典例句相当的质量水平,而在语境独立性、语义典型度、词汇常用度方面仍存在一定不足。进一步研究发现,不同提示策略影响例句生成质量,其中融合语言特征约束型提示策略优化效果最佳。本研究首次实现LLMs生成例句教育适应性的量化评估,为智能语言教辅系统开发提供了兼具理论指导意义与实践应用价值的评估范式。

大语言模型汉字富语义能力评测
pdf bib 余艺喆, 董明, 何婷婷

中文相较于以英文为代表的表音文字具有富语义的特点,单个汉字蕴含了读音、字形结构、偏旁部首等丰富的语义特征,在构建自然语言处理相关应用时具有独特的价值,可以视作额外的特征,提升在特定任务的表现。近年来,大语言模型飞速发展,展现出海量的知识储备和强大的推理能力,其中,大模型对汉字富语义特征的掌握可以视作大模型中文能力的基础。然而,目前对于大模型汉字富语义能力评测研究较少,针对性地评测大模型在汉字富语义方面的能力边界,有助于了解大模型中英文能力差异性、并推测大模型在字形、字音相关下游任务上的表现。因此,本研究从汉字的结构、偏旁、读音、笔画、多音字和部件六个维度,对大语言模型进行了全面评测,旨在深入探究其对汉字基本富语义特征的掌握程度。本研究以GB2312 标准字符集和现代汉语词典为依据,围绕汉字的结构、偏旁、读音、笔画、多音字和部件六个维度,构建了一系列“问题-答案”对,并制定了科学合理的评分标准。在此基础上,对十余种主流的大语言模型进行了深入评测。同时,为探究模型在中英文能力上的差异,将上述中文评测任务翻译为英文,并选取了三个代表性模型进行对比评测。此外,本研究进一步从汉字结构推理、偏旁推理、读音推理三个关键角度出发,设计了一系列推理评测任务,旨在深入评估大语言模型对汉字富语义特征的推理能力。本研究的评测结果具有重要的参考价值,可为大语言模型相关领域的研究人员在中文下游任务优化、基础模型选择等关键环节提供参考和启发。

面向工艺规范的树结构检索增强生成方法研究
pdf bib 姜禹辰, 王裴岩, 冯煜博, 余卓, 纪贵阳

检索增强生成(Retrieval-Augmented Generation,RAG)是一种有效优化大语言模型在工艺规范问答任务中性能的方法。然而,基于固定文本长度分块的朴素RAG(Naive RAG)在构建工艺规范问答任务时表现不佳。主要原因在于工艺规范是一类复杂的技术文档,采用固定文本长度分块会丢失工艺规范段落层级之间的结构关系以及隐含的知识关联关系,导致输出结果质量下降。因此,本文提出了一种利用工艺规范篇章段落间隐含的树结构关系来构建RAG的方法,该方法有效解决了固定文本长度分块导致的段落之间的知识关联丢失问题。实验结果表明,树结构RAG在评价指标上优于朴素RAG,其中ACC平均提升3.81%,ROUGE-L提升3.28%,BLEU-4提升2.97%,验证了树结构RAG的有效性。

基于检索增强生成的两阶段常识推理方法
pdf bib 李东洋, 袁志勇, 车超

常识推理任务是指模型利用日常经验知识对隐含信息进行推断,从而理解和预测现实世界中的合理情境。当前研究趋势之一是通过引入外部知识库来获得额外的背景知识。然而现有的常识推理模型存在引入的外部信息不够精准和融合不充分的问题,致使其在实际应用中的表现不佳。针对上述问题,本文提出了一种基于检索增强生成的两阶段常识推理方法。该方法基于维基百科构建了包含6.28M篇文章的知识库,使用检索增强生成方法,赋予模型语义相关的上下文作为补充信息,辅助模型推理。同时,为了节省时间和资源,本文提出了一种两阶段推理策略,将简单问题交由小模型处理,将复杂问题交由大模型完成。在OpenBookQA等多个数据集上的实验结果证明,本文方法展现出优越的性能,而且适配不同的骨干网络和大模型,可做到即插即用。

基于思维链和知识迁移的多语言问答推理研究
pdf bib 罗健, 孙媛

近年来,大型语言模型如ChatGPT显著提高了机器对自然语言的理解能力,其中,问答推理任务在推动语言理解能力和人机交互智能化方面具有重要意义,但目前仍面临诸多挑战。本文针对现有大模型资源消耗大、小模型推理能力弱,低资源语言推理能力受限等问题,提出了融合思维链和微调技术的方法,通过Human-Thinking提示策略优化大模型推理能力,并借助大模型指令微调提升小模型推理性能,引入多角色协作机制进一步优化推理步骤质量。通过探索跨语言思维链提示方法,利用高资源语言知识弥补低资源语言不足,采用双通道机制和投票打分机制整合不同语言推理知识,提升模型在低资源语言的推理表现。实验结果表明,本文方法能有效提升小型模型在多语言问答推理的能力,具有一定的研究价值。

基于检索增强思维提示的汉语框架语义解析方法
pdf bib 李迎旭, 陈涛, 黎议泽, 李斌阳

汉语框架语义解析基于框架语义学理论,旨在通过识别句子中词语所激活的语义框架,分析句子中各个成分的语义角色,从而揭示语言背后的深层语义结构,进一步更好地抽取事件关系和语境信息。大语言模型出现后,其强大的通用文本理解与生成能力被广泛应用于各种自然语言处理任务中。然而,当前大语言模型在汉语框架语义解析任务中存在推理路径简单、准确率过低的不足,尤其在思维链的逻辑连贯性和检索增强生成的深度应用上存在欠缺。为此,本文提出了一种面向汉语框架语义解析的思维提示方法。该方法结合检索增强生成(RAG)与链式思维(CoT)技术,引导大语言模型完成汉语框架语义解析任务。我们在CFN2.1数据集上的实验结果表明,与现有最优方法相比,该方法的框架识别准确率提升13.52%,论元识别F1提升2.24%,角色识别F1提升5.09%。

基于数据合成的多模态讽刺隐喻理解大模型的构建
pdf bib 戴凌睿, 李浩, 吴云芳

讽刺和隐喻是文学与语言表达中常见的修辞手法,以往相关研究多聚焦于分类任务上,且更多的基于英文数据进行探索。随着大模型与多模态大模型的不断涌现,模型对各种自然语言处理任务与多模态任务的处理能力得到了显著的提高。本文利用GPT-4o进行自动数据合成,来训练多模态大模型,实现了图文多模态讽刺隐喻综合理解任务。本文训练出能理解图片或图文讽刺隐喻内容,并进行详细解释或配文的参数量较小的多模态大模型,并且保证了模型具备良好的鲁棒性和通用性能。本文精心设计了数据构造方法,包括数据源的选择,指令数据的合成,回复数据的合成,来获得了一批高质量的多模态讽刺隐喻指令微调数据。我们选用了当前表现较好的多模态大模型作为骨干模型,使用合成数据并结合公开多模态图文数据集进行训练。在模型评测方面,本文分别从讽刺隐喻理解能力和通用能力进行评测,验证了模型的可用性。本文的数据以及模型权重将在后续放置在https://github.com/652897698/Multimodal-LLMs-for-Sarcasm-and-Metaphor-Undrerstanding

基于构式知识本体抽取评价名词的评价对象
pdf bib 周红照

针对中文评价对象抽取缺少评价名词的专门研究和深层情感知识本体的问题,提出构式知识本体驱动的评价名词—评价对象抽取方法。根据评价对象在情感构式中充当何种句法成分,归纳概括出主语型、定语型、同位语型等九种情感构式;提炼九种情感构式的意义模式,精准指出机器自动识别每一意义模式中的评价对象所依据的形式特征;定义形式符号与逻辑运算规则,把九种模式及其形式特征转化为机器可读的形式语言;创建语义词典与情感构式规则库,编程实现为评价名词—评价对象智能抽取系统CUCNsas。实验结果表明:CUCNsas在1万条《人民日报》和《新闻联播》测试语料上的准确率为88.3%、召回率为82.1%、F1值为85.1%。在中文评价名词—评价对象抽取任务上,着眼于句子整体形-义配对关系的构式语法,相较于语义特征法、短语结构文法和依存语法更具优势。

引入反思机制的机器译文质量估计方法
pdf bib 万洁, 李茂西

在缺乏人工参考译文对照的情况下,如何自动地评估机器译文的质量?现有一种机器译文质量估计方法利用异构翻译系统对源语言句子进行直接翻译,把生成的译文作为伪参考译文,将机器译文和伪参考译文进行对比来评估机器译文的质量。为了使生成的伪参考译文能够帮助机器译文质量估计方法准确地识别当前机器译文中存在的错误,本文提出引入反思机制的伪参考译文生成方法,并将其应用在机器译文质量估计任务中。生成伪参考译文的异构翻译系统是一个反思智能体,该反思智能体将待评估机器译文作为生成伪参考译文过程中的关键元素,它的推理步骤包括对机器译文进行回译、对源语言句子和回译进行智能反思、基于反思结果生成对机器译文的修正意见以及生成候选伪参考译文。在WMT'23句子级别机器译文质量估计任务基准数据集上的实验结果表明,所提方法显著提高了机器译文质量估计的效果。

现代汉语同音字家族系统的效率与易学性:来自计算与模拟的证据
pdf bib 肖哲

信息论视角的语言研究揭示了语言系统中普遍存在的效率与易学性的认知约束。本研究探讨了现代汉语中同音字家族系统的认知约束,发现(1)在系统内部,家族效率与易学性正相关;(2)相比计算模拟系统和拼音化系统,同音字家族系统易学性虽较低,但效率更高;(3)无论是否考虑声调、声符和生僻字,家族系统均表现出上述特点。结果表明,汉语同音字家族系统对效率与易学性进行了权衡,揭示了其庞大规模形成背后的认知机制。

基于古汉语大语言模型的多任务学习探究
pdf bib 姚欣宇, 王梦笛, 高原, 高歌, 陈波, 赵小兵

随着大语言模型在多任务学习领域展现强大泛化能力,其在低资源古汉语场景的应用价值亟待探索。本文基于LLaMA3-Chinese-8B利用21GB高质量古汉语语料进行增量预训练,接着进行十项任务微调(包括句读、词性标注、命名实体识别(NER)、事件识别、翻译、词语解释、反向词典、历史人物知识、诗歌赏析、诗歌生成),设计了单任务微调和双任务组合微调两种策略,通过55组实验量化了任务之间的正增益与负增益,首次系统揭示了古汉语多任务学习中的增益关系。实验结果表明,不同任务之间存在协同效应与任务干扰效应,并且具有不对称性。基础类古汉语任务之间表现出更强的协同效应,相比之下,翻译类和生成类任务之间协同效应表现较弱。同时,受双任务设定的影响,不同古汉语任务的稳定性存在明显差异。

TibLex:一种基于拉丁编码的藏文词表优化策略
pdf bib 更尕多杰, 孙媛

预训练语言模型通过大规模无监督学习在多任务场景展现卓越性能,但其研究多集中于中英文等高资源语言。藏语等低资源语言因数据稀缺及形态复杂(黏着语特性、音节结构多样),导致主流子词分词方法存在语义割裂与形态失配问题,制约模型训练效率与表征质量。为此,本文提出基于拉丁化编码的藏文扩展分词策略TibLex(Tibetan Latinization-based Extended Tokenizer)。该方法通过将输入文本进行编码转写,将每个藏文音节根据其字形或发音转换为一个短序列,然后基于编码文本使用子词分词构建词汇表。实验表明,TibLex相较主流分词器具有双重优势:(1)通过拉丁化降维处理,使词表不规则组合减少15%,输入序列长度平均缩短36.10%,显著提升计算效率。(2)音译分词器可将同音异形字编码为相同音译序列并输出一致的分词结果,从而实现对同音错别字的鲁棒性处理。与此同时,基于TibLex训练的预训练模型在下游任务中保持竞争力,验证了该方法在低资源语言场景的有效性。本工作为解决形态复杂语言的分词瓶颈提供了新范式,其编码框架可扩展至蒙古文、梵文等文字系统,为跨语言NLP研究提供技术支撑。

多模态嵌入的全局对齐增强下的基于强化学习的扩散模型
pdf bib 尤昊辰, 刘宝静

扩散模型作为新一代生成模型,在文本引导图像生成任务中展现出卓越性能。然而,现有预训练扩散模型的训练目标通常无法直接对齐用户偏好或下游任务需求,导致其生成结果难以兼顾图文语义一致性与主观美学质量。为此,近年来研究者提出将强化学习引入扩散微调过程,使模型在奖励信号引导下优化生成策略,代表性方法如策略优化扩散模型与去噪扩散策略优化已取得显著成果。然而,此类方法所依赖的奖励函数多为黑盒式打分器,难以捕捉生成图像与输入文本之间的结构性语义关系,缺乏对模态间对齐结构的显式建模。为解决上述问题,本文提出一种融合强化学习与结构对齐正则的文本引导扩散模型微调方法GARD(Geometry-Aligned Reinforced Diffusion)。该方法在强化学习微调框架下,引入一种基于嵌入空间几何结构的对齐正则项,即通过计算图像与文本嵌入向量构成的平行多面体体积,衡量其语义对齐程度,并与奖励信号与散度正则共同构成统一优化目标,从而在提升生成质量的同时增强多模态语义一致性。实验结果表明,GARD 在多个公开数据集上相较于现有方法在语义一致性、审美得分与训练稳定性等方面均实现显著提升,验证了本文方法在多模态结构对齐建模与强化学习微调融合方面的有效性与通用性。
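摘要中提到用图像与文本嵌入构成的平行多面体体积衡量语义对齐程度。下面给出计算这类体积的一种常见做法(Gram行列式开平方)的最小示例;向量维度与归一化方式为假设,正则项与奖励信号的具体组合方式以论文为准。

```python
import numpy as np

# 示意:用 Gram 行列式计算若干嵌入向量张成的平行多面体体积。
# 体积越小说明向量方向越接近(越"共线"),可作为对齐程度的一种度量。
# 这只是体积概念的一种常见实现方式,并非论文的原始代码。
def parallelepiped_volume(vectors: np.ndarray) -> float:
    """vectors: 形状 (k, d),每行是一条归一化后的图像或文本嵌入。"""
    gram = vectors @ vectors.T               # k x k 的 Gram 矩阵
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))

rng = np.random.default_rng(0)
img = rng.normal(size=(1, 512)); txt = rng.normal(size=(1, 512))
pair = np.vstack([img / np.linalg.norm(img), txt / np.linalg.norm(txt)])
print(parallelepiped_volume(pair))   # 两个单位向量时等于 |sin(夹角)|
```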

目标自适应的可解释立场检测:新任务及大模型实验
pdf bib 蓝伊, 王子豪, 陈波, 赵小兵

传统立场检测通常假设目标已知,且仅输出立场类别(支持,反对,中立),难以应对目标不确定、立场判断需要有具体依据的情形。为此,本文提出目标自适应的可解释立场检测新任务,定义模型的输出为目标、观点和立场标签。具体地,构建了首个中文高质量立场检测数据集,并设计多维评估标准;评估了多种大语言模型的基线性能。实验发现:DeepSeek-V3在目标识别与立场分类表现最优,GPT-4o在观点生成上领先;大语言模型在目标明确时具备较强目标自适应能力,但处理存在反讽现象的输入时性能下降。数据集和实验结果公布于https://github.com/Cassieyy1102/TAISD。

基于提示探针的大模型知识掌握能力评测
pdf bib 王淳昱, 陈波, 徐洋, 赵小兵

大语言模型在知识密集型任务中的表现高度依赖其内化知识的覆盖面和掌握程度。然而,当前缺乏系统化、细粒度的评测方法以刻画模型对不同类别知识的掌握能力。为此,本文提出一种基于提示探针的方法,系统评估大语言模型在常识性知识、事实性知识和专业领域知识方面的掌握情况。首先构建了一个高质量的知识探针评测数据集KPE-Pro(Knowledge Probing & Evaluation for Proficiency)。然后设计提示模板对多个主流大语言模型进行系统评测。评测结果表明,大语言模型在常识性知识方面表现较好,ERNIE X1模型取得整体最好成绩;在事实性知识上,大语言模型的表现较弱,轻量模型的知识掌握能力明显不足。评测数据公开于:https://github.com/cyuu313/KPE-Pro。

基于关系结构感知增强的知识图谱规则挖掘方法
pdf bib 徐会亲, 黄琪, 章程, 刘祥棋, 罗文兵, 王明文

知识图谱推理(KGR)旨在通过对知识图谱中蕴含的逻辑规则进行挖掘和应用,进而推断和发现新事实。该任务广泛应用于智能问答、语义搜索和推荐系统等领域。近年来,由于基于嵌入的知识图谱推理算法缺乏可解释性,一些研究者开始研究基于规则的知识图谱推理方法。然而,现有基于规则的推理方法在理解关系语义时难以处理关系之间的隐式关联信息且容易陷入局部最优解。为此,本文提出了一种基于关系结构感知增强的规则挖掘模型ReSA。该方法通过构建关系图,显式地建模关系之间的层次结构,提高规则挖掘的效率。同时,ReSA还通过全局规则融合模块和相对关系编码器,结合全局语义建模和局部结构建模,增强模型对规则体整体逻辑的感知能力。实验表明,ReSA模型在WN18RR等数据集上取得了显著的性能提升,MRR指标相较于现有最新规则挖掘方法提升了4个百分点。

AntIF:大语言模型抗干扰能力评估
pdf bib 罗雅晶, 侯钰涛, 陈云, 陈冠华

本文提出了一种多智能体协同的干扰数据生成框架,旨在评测分析大语言模型在复杂干扰下的鲁棒性。该框架以数学领域为起点,逐步扩展至医学、法律、科学及通用场景,构建了涵盖拼写干扰、数字干扰、类型干扰与谣言干扰四类干扰的跨领域数据集AntIF,共计近5000条数据。在此基础上,本文对主流开源语言模型进行了系统的抗干扰能力评估,并结合不同的提示工程策略与模型微调方法,深入分析了AntIF 在提升模型鲁棒性方面的实际效果。

K-CoT:基于关键词思维链提示的中文排比句生成研究
pdf bib 钟茂生, 甘家其, 张鹤君, 谢林康, 李宏伟

本文针对中文排比句研究面临的高质量语料匮乏和细粒度标注缺失两大挑战,构建了一个包含主题、情感基调、排比标志词和关键词多维标注的中文排比句语料库。基于此,本文提出了一种基于关键词引导的思维链排比句生成框架K-CoT,通过模拟人类修辞创作的认知过程,将排比句生成分解为“主题解构-特征映射-关键词生成-句式合成”的渐进式推理流程。在ChatGLM和Llama等主流模型上的实验表明,本文提出的K-CoT在排比句生成任务上取得了显著的性能提升。本文为排比句研究提供了一个新颖的数据集,也为生成模型的修辞能力优化提供了可解释的技术路径,其分阶段推理机制对提升语言模型的语义可控性具有普适意义。

面向信息处理的中国手语音系学标注加工规范
pdf bib 赵源, 金澎, 敬思远, 姚登峰, 孟子文

随着对手语进行大规模数据化处理的需求日益增强,手语的音系学标注及规范化工作愈发迫切。然而,手语作为一种视觉-空间语言,不同于有声语言,其多信道(手形、位置、手掌朝向、运动方式以及面部表情、躯干动作等非手动特征)信息的复杂性与缺乏统一标注规范,一直制约着手语语料库构建与自动分析技术的发展。针对这一问题,本研究在手语音系学理论的指导下,提出了一套面向中国手语音系学标注加工的系统化规范。该规范由原则和细则两部分构成:原则部分明确标注对象的粒度、标注单位的界定与分层方式;细则部分则给出多信道特征的具体标注实例与操作指南。该规范的实施为中国手语多信道特征的系统标注提供基础支撑,将有助于推动手语识别、翻译、生成以及教学平台的深入发展,加速中国手语信息处理标准化与规范化的进程。

基于证据理论和局部语义区分的嵌套命名实体识别
pdf bib 徐波波, 叶娜, 蒋明翀

嵌套命名实体识别(NER)是自然语言处理中一个基本任务,其目的是通过计算机辅助技术识别并提取嵌套实体及其对应语义类型。目前嵌套命名实体识别的主流研究方法是基于跨度的方法,该方法将实体识别视为一个跨度分类任务,可以有效地处理嵌套实体。然而,基于跨度的嵌套命名实体识别方法无法准确区分相似实体之间的细微语义区别。并且通过枚举的方式会产生大量噪声跨度,影响模型性能。针对上述问题,本文提出一种方法,既能够量化模型预测的不确定性,通过不确定性辅助模型的推理,降低噪声跨度对模型性能的影响,还能通过局部语义区分模块区分出实体间的语义区别。具体来说,针对噪声跨度对模型性能产生影响的问题,本文设计了一种不确定度引导的KNN辅助决策机制,用于在不确定性较高时对预测结果进行校正。此外,针对嵌套命名实体识别模型对实体边界模糊与语义重叠问题的识别能力不足,利用局部语义区分模块,通过建模当前跨度与邻域跨度的表示差异,引导模型关注细粒度语义差异,从而提升嵌套实体的识别准确性。该方法在GENIA 英文数据集和自建中文嵌套数据集上分别取得了81.27%和82.26%的F1 值,对比基线模型分别提升了0.52%和1.48%的F1值,验证了它对嵌套命名实体识别任务的有效性。

基于多模型协同的儿童互联网新闻风险管理与价值观引导框架
pdf bib 梁宇蓝, 王悦, 于东, 刘鹏远, 康晨

随着互联网在儿童群体中的广泛普及,新闻内容的“毒性遗留”与价值观缺失已成为亟待解决的安全挑战。本文提出了一种多模型协同的儿童新闻改写框架(CRV-LLM),旨在从词汇、事件、标题和价值观四个维度,对原始新闻文本进行深度风险识别与精准改写。CRV-LLM集成了四个轻量化风险检测模型和R1-Distill-Qwen-32B改写模型,通过模型间的协同与反馈,能够在保证儿童可读性的前提下,有效剔除潜在有害信息并植入积极价值引导。实验结果表明,CRV-LLM框架在安全性、教育性等核心指标上优于主流模型,且推理效率提升62%,为儿童互联网内容安全管理提供了一种高效、可扩展的技术方案。

大语言模型和知识图谱协同的查询扩展方法
pdf bib 张旷, 涂新辉, 刘晗

查询扩展旨在通过丰富查询来提升检索效果。在大语言模型结合伪相关反馈的查询扩展方法中,伪相关文档中的噪声及不连贯信息严重影响了大语言模型的扩展质量。为此,本文提出一种大语言模型和知识图谱协同的查询扩展方法(LKQE)。LKQE 首先检索出相关文档并提取关键句,然后利用大语言模型从中抽取知识三元组,并补全实体关系构建出知识图谱,最终在知识图谱指导下生成高质量扩展文本。实验结果表明,与基线模型相比,LKQE 在 DL19 与 DL20 数据集上的表现具有显著优势。

基于自提示多模态大语言模型和语义感知离散扩散模型的图像描述生成算法
pdf bib 陈宇峰, 江爱文, 黄琪, 王明文

近年来,非自回归图像描述生成技术凭借其双向传播和并行词语生成的能力受到广泛关注。与此同时,基于离散扩散方法的研究也取得了显著进展。然而,在离散噪声添加与去噪过程中,现有方法仍面临图像文本关联性低、目标物体遗漏、描述准确性不足以及词语重复等关键问题。为应对这些挑战,我们提出一种基于语义感知的离散扩散模型。该模型通过可学习查询机制构建语义感知模块,以捕捉与图像物体级语义特征的潜在关联从而更好地生成图像描述。在此基础模型之上,我们进一步引入自提示优化框架,利用大语言模型生成与图像细节内容更相符的丰富描述。在COCO数据集上的综合实验表明,本方法在图像描述任务中取得一定的提升,其性能优于现有的相关方法。

融合MOE的多任务学习文档级企业新闻事件抽取
pdf bib 郑傲泽, 张坤丽, 王影, 袁颂瑞, 田怡豪, 昝红英

企业新闻事件抽取是支撑企业动态分析与产业决策的关键技术。企业新闻事件抽取具有文本篇幅较长,内容多元化的特点,面临多事件抽取和论元分散等核心挑战。大语言模型(Large Language Model,LLM)虽然具有强大的长距离依赖建模和语义关联能力,但通用大语言模型难以满足企业级应用对专业性与资源效率的需求。本文提出了融合MoE的多任务学习企业新闻事件抽取模型(MoE-Enhanced Multi-Task Learning for Corporate News Event Extraction,MoE-ML-CNEE)。通过构建统一微调数据集与多任务联合训练范式,将事件检测与论元抽取构建为结构化语言模板,增强模型全局建模能力。设计MoELoRA模块,利用动态路由机制实现多专家网络在低秩空间的知识共享与特征解耦,进一步提升模型事件抽取性能。实验表明,MoE-ML-CNEE模型在ChiFinAnn和DuEE-fin公共数据集和自建企业新闻数据集的事件检测、事件论元抽取结果均优于现有基线模型。

基于大语言模型多维度特征增强的医学命名实体识别方法
pdf bib 蒋明翀, 叶娜, 徐波波

医学命名实体识别在医疗信息提取和知识图谱构建中至关重要,但因医学领域的专业性和复杂性,面临数据稀缺、特征不显著及上下文利用不足的挑战。本文提出LLM-MedNER方法,充分利用大语言模型(LLM)的预训练知识,通过提示工程生成语义等价但表达多样的增强文本,并提取多维度特征,包括关键字集合、语义描述、词性信息及医学实体关联特征,从而显著提升模型的特征表达能力。方法采用双通道MacBERT-BiGRU编码模块并行学习原始文本特征与大语言模型增强特征,通过交叉注意力机制融合不同语义特征。随后,引入自适应多粒度扩张卷积层,通过不同膨胀率的一维卷积捕获多尺度的局部上下文信息,进一步丰富词表示。并在输出层引入Biaffine模块实现实体边界及类型的精准识别。对比实验表明,LLM-MedNER在多个医学命名实体识别数据集上的表现优于现有基线方法;消融实验进一步证实各模块的有效性。

传统价值观成语当代语境表现分析——基于BCC语料库的计量研究
pdf bib 孙浩, 刘洋洋, 杜惠东, 刘鹏远, 于东, 康晨

中华优秀传统文化是提升我国新时代文化软实力的重要源泉,将传统价值观和成语相结合,有助于继承和弘扬我们的优秀文明。本文提出了传统价值观成语当代语境表现的研究框架,基于BCC语料库对传统价值观成语语料数量分布和成语传统价值观偏好分布特征、在当代语境中的情感倾向及高频词分布特点、社会话题及道德特征进行计量研究,并提出了传统价值观成语的当代社会话题及道德适应性指数,以系统研究传统价值观成语的当代语境表现。本文为传统文化的当代计量研究提供了新的视角,也为数字人文领域的相关研究提供了参考依据,旨在增强中华优秀传统文化在当今新时代的影响力,为中华文明的传承与创新作出贡献。

基于LLM与跨语言嵌入的中亚低资源语言平行语料库构建方法
pdf bib 袁琦, 阿力木木拉提

在“一带一路”倡议持续推进的背景下,中国与中亚国家交流日益深化,对高质量的跨语言信息处理技术提出了迫切需求。然而,中文与中亚国家语言之间的平行语料库资源极度匮乏,且现有资源质量参差不齐,严重制约了机器翻译、跨语言信息检索、情感分析等下游任务的发展。针对中亚国家低资源语言,本文提出一种融合神经机器翻译(NMT)与跨语言语义匹配的平行语料构建框架。该方法通过定向爬取中亚国家官方渠道的单语新闻数据,利用DeepSeek模型的多语言翻译能力生成伪平行句对,再通过LaBSE模型获取跨语言句子嵌入向量,基于余弦相似度动态阈值和边距实现噪声过滤。实验表明,该方法在BLEU分数指标上较传统回译方法提升了0.65,最终构建包含8万句对的多领域平行语料库,覆盖政治、经济、文化等核心领域,该语料库为提升中亚低资源语言的机器翻译、跨语言信息检索、文本分类等下游任务的生成质量奠定了坚实的基础。
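下面用一个最小示例示意“余弦相似度+动态阈值/边距”的句对过滤思路;假设句子嵌入已由LaBSE等模型预先计算好,动态阈值此处仅以候选相似度均值示意,阈值与边距的具体设定以论文为准。

```python
import numpy as np

# 示意:基于余弦相似度与边距的伪平行句对过滤(嵌入假定已由 LaBSE 等模型算好)
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(src_emb, tgt_emb, pairs, margin=0.05):
    """pairs: [(i, j), ...] 候选句对;动态阈值此处取候选相似度均值(仅作示意)。"""
    sims = [cosine(src_emb[i], tgt_emb[j]) for i, j in pairs]
    threshold = float(np.mean(sims))
    return [(i, j) for (i, j), s in zip(pairs, sims) if s >= threshold + margin]

rng = np.random.default_rng(1)
src = rng.normal(size=(4, 768)); tgt = rng.normal(size=(4, 768))
print(filter_pairs(src, tgt, [(0, 0), (1, 1), (2, 3)]))
```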

基于双系统推理框架的法律判决研究
pdf bib 尹圣迪, 白泽文, 林鸿飞, 杨亮

法律判决预测是法律人工智能领域的一项重要任务。本文提出了一种基于外部知识的可解释性双系统推理框架,来解决现有方法在刑期预测任务中精度不高且可解释性不强的问题。该框架借鉴认知科学领域的双系统理论,利用大型语言模型的文本理解和生成能力,模拟人类法官处理案件时的决策过程,最终给出具有清晰推理路径的刑期预测结果。此外,通过构建一个高质量思考增强数据集和一个外部法条知识库,提升了模型的解释能力并且有效地抑制法条判断模型出现法条幻觉。实验结果表明,该框架显著提升了CAIL-small和CAIL-big数据集中刑期预测子任务上的精度和可解释性。

面向对话场景的构式数据集
pdf bib 薛旭晶, 李俊材, 苏雪峰, 杨沛渊, 柴清华, 李茹

大语言模型在多种自然语言处理任务中展现出强大的语义理解能力。现有研究通常基于各类语义解析数据集对大语言模型进行评估,然而,这些数据集难以覆盖对话语料中常见的口语化表达与特定结构表达语义的语言现象,无法有效评估大语言模型在对话场景中的细粒度语义理解能力。为此,本文面向对话语料构建了一个包含2146条语句、1748个构式的中文构式数据集,实现语义信息细粒度表达的同时有效覆盖了现有语义解析评估数据集的缺口。基于该数据集,本文选取了其中部分代表性构式,结合框架语义学理论,提出了构式识别与构式语义理解两项评测任务,以系统评估大语言模型在对话场景中识别构式与理解深层语义的能力。实验结果表明,当前大语言模型在构式识别方面仍存在明显不足;且在缺乏思维链推理的引导下,难以理解构式所承载的深层语义。

主题感知的多意图识别与槽位填充联合建模方法
pdf bib 罗晶, 王炜华, 曹越, 飞龙, 高光来

意图识别与槽位填充是口语理解中的两个子任务,联合建模这两项任务能够利用共享特征提升任务间的协同建模效果。然而,现有方法普遍缺乏对句子主题语义的显式建模,难以捕捉更充分的全局语义信息,尤其在多意图场景下系统建模性能下降严重。为缓解上述问题,本文提出了一种主题感知的意图识别与槽位填充联合建模方法,该方法构造了主题提取模块以学习句子主题分布表示,结合主题引导的意图和槽位表示增强网络插入主题信息,使得模型在识别句子意图和填充槽位过程中能够显式建模主题信息。实验结果表明,本文所提出方法在多意图公开数据集MixATIS和MixSNIPS上分别获得了50.9%和84.8%的整体准确率,相较多个基线模型取得了更优的性能表现。

中文阅读中的信息密度与认知资源动态分配研究
pdf bib 梁玉豪, 钟晓路, 杨泉

本研究通过分析北京句子语料库的眼动数据,运用混合效应模型和贝叶斯分析方法,系统考察了信息密度在汉语阅读过程中的表现及其与视觉复杂度因素的交互作用。研究结果表明,信息密度对注视时长具有显著正向预测作用,信息密度越高的词汇,受试者的注视时间越长,这与预测编码理论中“预测误差”增加导致加工负荷增加的假设一致;同时,信息密度在跳读行为分析中显示出显著负向预测作用,表明信息密度较高的词越不容易被跳读,支持了读者依据信息分布动态分配注意力的“调节假设”。研究还发现了汉语阅读的语言特异性表现:首先,词长效应在中文中呈现与拼音文字不同的模式,长词在中文中更易被跳读;其次,视觉复杂度与语言预测性之间存在非线性交互,支持了“语言特定性假设”。基于这些发现,本研究提出了中文阅读的“双通道加工模型”,即语言预测(信息密度)与视觉编码(笔画数、词长)共同调节认知资源的动态分配,这一理论框架不仅解释了中文阅读的特异性机制,也为跨语言认知加工研究提供了新视角。

Ti-MISO:基于TiLamb的藏文多模态生成式文本摘要
pdf bib 巩鑫, 闫晓东, 常浩远, 田金超

为了解决现有单一文本特征生成的藏文摘要质量较低的问题,提出了一种基于TiLamb的多模态生成式文本摘要模型——Ti-MISO。该模型采用ViT(Vision Transformer)模型从图像中提取视觉特征,同时利用预训练微调的TiLamb(Tibetan Large Language Model Base)模型提取藏文文本特征,再通过跨模态交叉注意力机制实现图文特征深层次融合,最终将融合的特征送入模型,借助束搜索算法平衡生成质量更高的摘要。为验证方法有效性,与基于相同语料的其他四种模型进行了对比实验。实验结果表明,Ti-MISO在ROUGE-1、ROUGE-2、ROUGE-L和BLEU四项评价指标上均取得最佳成绩,显示出模型在融合视觉与语言信息、提升摘要质量方面的显著优势。此外,通过一系列消融实验进一步验证了采用ViT模型进行图像特征提取及交叉注意力融合策略的重要性。加入图像信息后采用交叉注意力机制进行特征融合,使融合后的特征保留更多关键信息,帮助模型更加精确地捕捉重点,从而生成的摘要在概括性和可读性上都有明显提升。

基于自监督表征蒸馏的Whisper低资源语音识别优化方法
pdf bib 胡剑, 董凌, 王文君, 相艳, 高盛祥, 余正涛

Whisper是一种强大的多语言语音识别模型,在英语等高资源语言上表现优异,但在缅甸语等部分低资源语言的性能仍受限于预训练数据的不足。为此,本文提出了一种基于自监督表征蒸馏的Whisper低资源语音识别优化方法。通过跨模型表征蒸馏机制,实现自监督模型表征向Whisper编码器的知识迁移,提升对缅甸语等语言的表征建模能力。实验结果表明,该方法在缅甸语、柬埔寨语、乌兹别克语和旁遮普语ASR任务中有效降低了字符错误率,验证了所提方法的有效性。

大语言模型可以分析花园幽径句吗?—基于跨语言数据集的实证研究
pdf bib 李琦, 纪悦, 李洪政

花园幽径句是在句法或语义上存在局部或临时歧义的一类特殊句子,在汉语和英语中都普遍存在,对于语言处理和认知机制等研究具有重要价值。本文聚焦于大语言模型理解分析花园幽径句的能力。本研究首先构建了一个具有典型结构的英汉双语花园幽径句数据集。随后基于该数据集开展了跨语言、跨模型的句法结构分析及语义理解的对比实验,考察多个大语言模型处理不同语言花园幽径句的消歧和理解分析能力,并对比了大模型与传统句法分析器Stanford Parser模型的分析能力。实验结果显示大语言模型测试结果呈现出与人类认知相似的花园幽径效应,可以利用名词合理性及动词偏向性为线索辅助消除句子歧义,英语句子的消歧能力显著优于汉语。语言模型句法分析与语义分析准确率具有较大差异。本实证研究揭示了大语言模型处理不同条件歧义句的表现差异,为语言处理和认知机制等提供了新的计算视角证据。

基于依存树库的越南语依存距离研究
pdf bib 罗云, 闫丹辉, 马延周

依存语法框架下的依存距离是衡量句法分析难度的重要指标。本文基于UD-Vietnamese依存树库对越南语依存距离的分布及影响越南语依存距离均值的因素进行分析研究。研究发现,越南语依存距离分布符合幂律分布和指数分布的混合模型;句长、长距离依存关系、依存方向均能对依存距离均值产生重要影响。该研究结果有助于从依存语法的角度揭示越南语的句法特点和规律,为提出更科学合理的依存句法分析算法提供语言学支撑。
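依存距离通常取支配词与从属词线性位置之差的绝对值,句子层面再取平均。下面给出一个从CoNLL-U格式计算平均依存距离的最小示例(示例句为虚构的越南语短句,仅用于说明计算方式,并非论文所用统计脚本)。

```python
# 示意:由 CoNLL-U 格式计算平均依存距离(|支配词位置 - 从属词位置|,根节点不计)
def mean_dependency_distance(conllu_sentence: str) -> float:
    distances = []
    for line in conllu_sentence.strip().splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        token_id, head = cols[0], cols[6]
        if "-" in token_id or "." in token_id or head == "0":  # 跳过多词行与根节点
            continue
        distances.append(abs(int(token_id) - int(head)))
    return sum(distances) / len(distances) if distances else 0.0

sample = """1\tTôi\t_\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tđọc\t_\tVERB\t_\t_\t0\troot\t_\t_
3\tsách\t_\tNOUN\t_\t_\t2\tobj\t_\t_"""
print(mean_dependency_distance(sample))   # (|1-2| + |3-2|) / 2 = 1.0
```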

基于特征融合的大模型生成文本作者检测
pdf bib 赵晰莹, 白梓萌, 张妍, 袁彩霞, 王小捷

大语言模型在高效生成文本的同时也带来了文本滥用的问题,如何有效地区分不同大模型生成的文本成为了关键的挑战。为了解决这个问题,本文首先构建了一个面向多分类的大模型生成文本检测任务的数据集LGT-AA,包含7个领域的人类和10个常用大模型生成的94k条文本;其次,本文提出了一种提取不同大模型生成文本的全局性区分性特征的方案,并与分布特征进行融合构建文本检测器,提升了对生成文本的检测能力。实验结果表明,本文提出的方法在不同模型组合下和不同生成模型类别下都取得了更优的性能。

基于个性化记忆策略的小参数语言模型高效对齐方法
pdf bib 朱孟笑, 唐沛林, 沙九, 冯冲, 拉马杰, 闫丹智草

在信息爆炸的时代背景下,大模型每天都需处理庞大的知识与数据量。面对缺乏大规模工业级训练设施的现实,小参数模型成为了一种必要选择。然而,这些模型的信息处理需求远远超出其自然存储能力,这引发了一个核心问题:小参数模型应该记住什么,又应该忘记什么?传统的全记忆学习方法由于模型参数容量有限而不再高效,尝试记住一切不仅效率低,还可能引起过重的认知负担,降低思考质量。本文旨在重新定义有限记忆资源下的大语言模型记忆策略。本文首先将模型的记忆划分为内部记忆与外部记忆两个维度,并系统探讨了哪些知识应被优先内化为内部记忆。基于此,我们提出一种个性化记忆策略,针对不同类型的内部知识构建对应的对齐机制,使模型记忆更符合人类偏好与推理需求。这一策略不仅显著增强了小参数模型的理解能力与深度推理能力,也从根本上挑战了“记得越多越好”的传统假设,展示了战略性记忆选择在提升学习效率方面的巨大潜力。此外,本文还构建了关于内部记忆的训练集和评测数据集,并在仅使用3B参数规模的模型上进行了系统实验。实验结果显示,本文方法在该评测数据上实现了最佳效果,甚至在多个指标上超越了闭源模型及参数规模达70B的大型模型。为推动行业发展,我们已开源整个训练策略、模型权重及对应的评测数据集和评测方法。

人机价值观驱动的对话情绪生成模型
pdf bib 马志强, 叶浩然, 刘佳, 吕凯

对话系统情绪生成任务旨在生成待回复话语的情绪类别。针对现有情绪生成模型忽视了用户与模型价值观一致性对情绪生成的调节与引导作用,导致对话系统生成情绪与用户期望情绪之间存在偏差、降低对话系统与用户之间情绪共鸣的问题,本文提出一种人机价值观驱动的对话情绪生成模型——HVDEGM,通过多阶段的门控机制动态引入用户价值观特征来引导情绪生成。该模型基于价值观一致性原理,设计了三个单元。首先情境修正注意力单元通过两次注意力机制增强了情绪与语义特征信息,其次价值观融合单元通过多阶段融合门控动态平衡了用户价值观特征与对话系统历史价值观特征的权重,最后反应调节单元通过双向注意力与交叉注意力机制,强化了情绪、语义、价值观特征之间的互补关联。模型在新构建的价值观对话数据集ValueCon上进行实验,实验结果表明,HVDEGM相比DialogueRNN、DialogueGCN等基线模型在Precision、Recall、F1及情绪共鸣度等指标分别提升了2.9%、2.5%、0.9%和4.1%,证明了所提出方法的有效性。

基于强化学习的大语言模型古文释义选择研究
pdf bib 徐维潞, 黄书剑

古文释义选择任务对语言模型的语义理解与语境匹配能力提出了较高挑战。本文提出一种基于强化学习的训练框架,通过结果导向的奖励设计,引导大语言模型优化古文释义判断策略。实验表明,相比监督微调(Supervised Fine-tuning, SFT),强化学习方法在准确率指标上表现更优。进一步分析发现,强化学习仅在释义选择任务上的训练不仅提升了模型的古文翻译能力,还在古汉语通用能力评估基准(ACLUE)上展现出更优的跨任务迁移性。相较之下,SFT训练后的模型在翻译与其他古文任务中的表现出现明显下降。本研究为古文处理任务提供了新的训练范式,验证了强化学习在非推理类语言任务中的有效性与泛化潜力。

历时演变视角下的古汉语分词:时期嵌入与大规模语料库的应用
pdf bib 柯永红

古汉语自动分词是古籍数字化和智能化处理的关键环节,但古汉语在数千年演变过程中呈现出显著的历时性差异,对构建通用的分词模型构成了严峻挑战。为应对这一挑战,本研究构建了一个覆盖上古、中古及近代三个主要历史时期的大规模古汉语分词标注语料库,在此基础上,本文提出了一种基于时期嵌入(Period Embedding)的古汉语历时分词模型RoBERTa-PeriodEmb-Fusion-CRF。该模型以预训练语言模型roberta-classical-chinese-large-char为骨干,通过引入可学习的时期向量来感知文本的时代背景,并设计了非线性融合层以有效整合时期信息与上下文语义表示,最后结合条件随机场(CRF)进行序列解码。在构建的历时语料库上的大量实验结果表明,与不包含时期信息的强基线模型相比,本文提出的模型在整体分词性能(F1值达到0.9505)以及跨时期文本的适应性上均取得了显著提升。本研究不仅验证了显式建模时期信息对于提升古汉语分词效能的重要性,也为构建高性能、通用的古汉语处理工具提供了有益的思路和数据支持。
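下面用PyTorch给出“可学习时期向量与上下文表示非线性融合”这一思路的最小草图;隐藏维度、时期数量、标签数与融合方式均为假设,且省略了RoBERTa编码与CRF解码部分,仅示意时期信息注入的位置。

```python
import torch
import torch.nn as nn

# 示意:可学习时期嵌入与字级上下文表示的非线性融合(省略 RoBERTa 编码与 CRF 解码)
class PeriodFusion(nn.Module):
    def __init__(self, hidden=768, num_periods=3, num_labels=4):
        super().__init__()
        self.period_emb = nn.Embedding(num_periods, hidden)   # 上古/中古/近代
        self.fuse = nn.Sequential(nn.Linear(hidden * 2, hidden), nn.Tanh())
        self.classifier = nn.Linear(hidden, num_labels)       # 如 BMES 分词标签

    def forward(self, token_repr, period_id):
        # token_repr: (batch, seq_len, hidden),来自预训练编码器的字级表示
        p = self.period_emb(period_id).unsqueeze(1).expand_as(token_repr)
        fused = self.fuse(torch.cat([token_repr, p], dim=-1))
        return self.classifier(fused)   # 发射分数,真实模型中再交给 CRF 解码

x = torch.randn(2, 8, 768)
logits = PeriodFusion()(x, torch.tensor([0, 2]))
print(logits.shape)   # torch.Size([2, 8, 4])
```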

控制句长的句子可读性研究:大语言模型驱动的数据集构建与评估
pdf bib 李罗希, 李炜, 邵艳秋

文本可读性评估研究旨在衡量文本对特定读者的理解难度,可以分为文档级和句子级。句长这一因素在句子级的难度分类中起主导作用,现有的句子级研究普遍未能控制该变量,从而掩盖了其他深层语言因素在句子难度中的作用。鉴于此,本文提出构建句长受控的句子难度分级语料库。然而,传统人工标注在构建该数据集上存在效率低、质量难以保证的问题。为解决这个问题,本文提出一种大语言模型驱动的智能受控改写方法,利用生成式人工智能从开放语料中自动筛选内容生成候选句,再通过专家审核来保证质量,最终构建了包含二分类三分类的控制句长句子难度分级语料库。在此数据集上的实验结果显示,传统特征分类模型的准确率在控制句长后显著下降,揭示了传统方法的局限性。大语言模型仍具有高准确率,表明其具备识别句长无关语义难度的能力。

基于细粒度时空建模的语音驱动手势生成模型
pdf bib 万浩聪, 刘长红, 杨海, 江爱文, 王明文

语音驱动手势生成技术根据输入的语音自动生成丰富的虚拟角色动作,在数字动画、虚拟现实和人机交互等领域具有广泛的应用前景。虽然现有方法在时序连贯性方面取得一定进展,但由于缺乏对关节间局部交互的显式建模,生成的肢体动作往往存在机械感且缺乏自然性。针对这一问题,提出一种基于细粒度时空注意力的扩散模型,从细粒度层面建模骨架关节点间的动态依赖关系。具体而言,设计了一种时空Transformer,其中空间注意力层显式建模了关节间的空间结构关系,而时序注意力层捕获手势运动的动态性。此外,通过自适应实例归一化技术AdaIN引入说话者身份控制,实现个性化手势生成。在BEAT、BEAT2和SHOW数据集上验证了所提模型的有效性。

跨语言方位词对“左-右”的语义衍化与语义关联模式探究
pdf bib 王梦焰, 安纪元, 杨麟儿, 杨尔弘

“左-右”作为普遍空间概念,其语义不断向政治、文化等领域衍化,但对其系统性的跨语言比较仍付阙如。本研究依托词汇类型学框架,选取汉语、英语、挪威语等十种语言,对“左-右”方位词的语义衍化路径与对应关联进行量化分析。在梳理权威词典义项的基础上,利用大语言模型(LLM)生成补充语料,并经母语者审核校对,最终构建跨语言方位词对“左-右”的语义网络。结果表明,“左-右”普遍沿“空间→政治→文化”三阶衍化,对应性语义衍化呈现高度跨语言一致性。该发现为二元对立概念的跨语言普适性提供了新的实证支持,亦丰富了方位词语义演变的类型学证据。本文提出的“智能体设计+上下文学习+多语对齐控制+母语者验证”混合模式为低资源语言语料扩展与语义研究提供了可复制方案。研究成果可服务于跨语言语义探索及基于对立概念的语言教学设计。

多领域翻译中语义消歧的话题方向盘方法
pdf bib 满志博, 张玉洁, 陈圆梦, 陈钰枫, 徐金安

近年,大语言模型(Large Language Models, LLMs)在通用文本翻译任务上的翻译质量取得大幅的提升,但是在面对多领域文本时,翻译质量呈现明显下降。如何利用有限的领域双语平行语料增强领域翻译知识成为主要的研究目标,已有方法大多使用人为设置的领域标签学习语义表示,导致其在消歧知识的获取上受到限制,如何构建有效的消歧知识成为一种挑战。为此,本文提出一种多领域翻译中语义消歧的话题方向盘方法,旨在增强大语言模型在多领域上的语义消歧能力,具体包括:(1)基于话题模型的语义表示获取机制:我们首先利用ETM自动聚类算法获取细小颗粒度的话题语义表示用于之后构建消歧知识,这种话题的表示更贴近语义,也更适合作为语义单元来构建语义表示。然后,我们设计TopicModel函数将大模型的表示转换成话题的语义表示。(2)基于话题方向盘的领域消歧知识获取机制:我们设计可学习的变换矩阵,通过建模不同领域下话题分布的投影方向获取多领域上的语义消歧知识。话题的语义表示经过领域方向投影的再次变换后,有效的语义消歧特征得到强化,从而提升大语言模型在不同领域下的语义消歧能力。我们选取Qwen-2.5-1.5B作为基础模型,在英语-汉语以及德语-汉语两个多领域翻译任务上进行实验验证。实验结果表明,该方法在平均BLEU值和COMET均超出基线模型,进一步我们对于翻译质量的提升与消歧效果之间的关系进行了分析,并通过翻译实例给出详细说明。

藏汉篇章机器翻译研究及语料库构建
pdf bib 田佳乐, 江静, 李亚超

篇章机器翻译旨在使用计算机将源语言篇章自动翻译为具有相同语义的目标语言篇章,是机器翻译的前沿研究热点。相对于传统的句子级翻译,以篇章作为翻译单位,模型能够更有效地利用上下文信息,提升翻译的一致性与连贯性,具有广阔的应用前景和研究价值。与资源丰富语言(如汉语、英语、法语等)机器翻译研究相比,藏语机器翻译资源稀缺,公开可用的数据集数量有限,在篇章级机器翻译方面的探索尚无公开论文发表。鉴于此,本文首先构建一个藏汉翻译数据集,标注了句子级、段落级和篇章级的边界,为藏汉篇章翻译任务提供高质量的多粒度标注数据集。然后,本文基于该数据集研究了藏汉篇章机器翻译,并对比机器翻译在句子层面、段落层面和篇章层面翻译效果的差异。本文对所构建的藏汉篇章翻译语料库予以开源,希望能推动相关研究的发展。链接:https://github.com/liyc7711/tb-zh-mt。

基于关联神经元识别的知识编辑方法
pdf bib 吴钰璋, 穆永誉, 王成龙, 何荞至, 肖桐, 马安香, 张春良, 朱靖波

近年来,大语言模型展现出了从训练语料中存储并提取知识的优秀能力,但相应地,其可靠性也容易遭受训练语料中错误信息的破坏,进而产生信息过时、错误回复等问题。基于神经元识别的知识编辑方法通过在模型中识别并微调与目标知识相关的知识神经元,实现对模型内部知识的精确修改。然而,本文研究发现,知识的表达形式会显著影响知识神经元的识别结果,例如,现有神经元识别方法对于同一知识的不同表达形式识别得到的神经元集合平均重叠率只有21.86%。这就导致只对单一的表达形式进行知识编辑无法覆盖到与这个知识相关的所有神经元,所以现有知识编辑方法的鲁棒性往往较差。为了全面且准确地识别到与某一知识相关的所有神经元,本文设计了一种轻量级关联神经元识别器(Lightweight Associated Neuron Detector,LAND),通过学习不同表达形式的知识识别出的知识神经元集合之间的差异,从而在知识神经元识别的过程中,自动补全因表达形式差异而未被检出的知识神经元。实验结果表明,LAND方法能够将不同表达形式的文本识别出的知识神经元平均重叠率提升至96%以上,在不同句式的知识编辑成功率上较基线方法最多提升了10.83个百分点。

基于大语言模型的立法文本可读性研究
pdf bib 胡钦谙, 崔玉珍

我国现行有效法律在内容上所呈现出的多样性及其庞大体量,使得人工方式的可读性评估难以实现全面覆盖。本研究采用大语言模型对立法文本可读性进行评估,以深度学习的端到端方式摆脱了可读性研究对传统语言特征工程的路径依赖。研究表明,大语言模型对立法文本可读性的自动化评分与人工评分具有显著相关性。本研究从部门法等维度出发,系统揭示了不同法律的显著特征差异,刻画了我国现行有效法律文本可读性的整体面貌;并通过大语言模型文本生成与人工校验,从法律适用的角度探讨了提升立法文本可读性的可能路径,为立法语言的优化提供参考。

言行不一:大语言模型决策中的隐性偏见
pdf bib 林莘茹, 李璐旸, 刘湘婷

大语言模型的隐性偏见会隐蔽地影响模型的决策过程,使其在应用中难以保证公平性。本文首先构建基于决策的提示数据集进行隐性偏见评估,实验结果表明性能强的大语言模型可能表现出更严重的隐性偏见。进而为了缓解模型的隐性偏见,本文探索了自我反思和模型编辑两类方法。实验发现前者有助于识别隐性偏见,但无法在回答中去偏。在模型编辑实验中通过构建纠偏数据集,得出对模型后四层进行微调可获得最佳去偏效果,这一结论显示出有限参数调整在缓解隐性偏见方面的潜力。

语音相似性调节听觉词汇语义加工的脑电研究
pdf bib 耿立波, 赵悦, 陈茜

语音与语义的交互加工机制是理解语言认知过程的核心问题之一。以往研究多集中于词汇层面的线性处理路径,而对音节内部语音片段在语义加工中的作用关注不足。为探索语音信息在词汇语义加工中的调节机制,并为心理加工模型建构提供实证依据,本研究采用事件相关电位(ERP)技术,结合听觉启动范式,考察汉语双音节词尾音(第二音节韵母)相似性对语义加工的影响。实验操控词对的尾音相似性(相同/不同)与语义关系(相关/无关),通过语义判断任务测量被试的行为反应与脑电指标。结果发现:(1)尾音相似词对在晚期N400时间窗内诱发更大的负波幅,提示尾音信息对语义加工过程具有显著调节作用;(2)语义启动效应在尾音不同条件下显著,而在尾音相同条件下消失,显示语音信息可影响语义加工的时间进程与效应强度。研究表明,在听觉词汇加工中,语音片段的结构特征(如尾音相似性)不仅被高度感知,而且会通过调节语义预激活和整合过程参与语义建构。这些发现支持语音中语义交互模型的构想,揭示了语言加工过程中低层语音输入对高层语义处理的动态影响,为听觉词汇识别的认知心理模型建构提供了重要证据。

HMUM:面向仇恨模因检测的多阶段多模态理解模型
pdf bib 裴淑娟, 左家莉, 何乐, 万剑怡, 王明文

随着社交媒体的广泛普及,模因(meme)已成为信息传播与舆论引导的重要载体,其中蕴含的仇恨内容对网络生态与公共安全构成威胁,尤其是通过图像暗示、文化隐喻或社会符号等方式表达的隐性仇恨模因,具有更强的隐蔽性与误导性,给仇恨模因检测任务带来显著挑战。针对上述问题,本文提出了一种仇恨模因理解模型(Hateful Meme Understanding Model,HMUM),在Qwen2.5-VL-72B-Instruct模型基础上引入LoRA微调,并设计了一种多模态多阶段的提示学习框架。该框架通过阶段性引导模型依次完成文本识别、情绪建模与仇恨性推理,逐步增强其对模因语义与情感的理解能力,从而有效提升模型在中文语境下检测语义隐晦、情绪复杂仇恨模因的准确性。在公开数据集ToxiCN MM上的实验结果表明,HMUM(Qwen)在整体任务中取得了显著性能提升,在隐性仇恨模因子集检测方面,相较于基线模型表现出更强的优势。为评估其在更广泛隐性场景中的检测能力,本文构建了以隐性仇恨模因为主的数据集ITTD-220,实验结果显示,HMUM(Qwen)在该数据集上的检测性能同样优于现有模型,验证了其出色的泛化能力。

基于动态子空间重构的跨语言词向量对齐及应用
pdf bib 顾晓洋, 胡玲, 徐月梅

无监督双语词典归纳(Bilingual Lexicon Induction,BLI)通过学习映射函数对齐两种不同语言的单语词嵌入空间,从而推导单词翻译,在相似语言对中取得显著成功。然而,传统方法依赖单一线性映射,在远距离或低资源语言对上性能欠佳。为解决此问题,本文提出DM-BLI,一个基于动态多子空间对齐的无监督双语词典归纳算法及其应用框架。首先,DM-BLI通过多子空间映射提升对齐精度,重构源语言词嵌入空间,采用无监督聚类识别子空间,结合粗略全局对齐定位目标空间对应子空间,并通过簇内和簇间对比学习优化映射矩阵。在包含5个高资源和5个低资源语言对的有监督和无监督实验中显著提升性能。此外,DM-BLI基于所构建的词典使用logits lens技术评估大语言模型(Large Language Model, LLM)的跨语言能力,通过翻译和重复任务计算余弦相似度,结合词向量空间语义特征验证模型生成翻译的语义合理性。相较传统LLM的跨语言评估方法仅以静态的BLI翻译对为标准,DM-BLI能识别未被词典覆盖但语义合理的翻译,显著提升评估的鲁棒性和语义泛化能力,更准确全面地衡量大语言模型的跨语言语义映射能力。我们的代码发布https://github.com/huling-2/DM-BLI.git.
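下面给出“按聚类得到的子空间分别学习线性映射”的一种常见实现示意:对每个子空间用基于SVD的正交Procrustes求解映射矩阵;聚类与簇内/簇间对比学习在此省略,该草图只是与DM-BLI思路相近的通用做法,并非论文源码。

```python
import numpy as np

# 示意:对每个聚类(子空间)用正交 Procrustes 求源语到目标语的映射矩阵 W,
# 使 ||XW - Y|| 最小;聚类与对比学习优化省略,仅演示子空间级线性对齐。
def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt                      # 正交映射 W

def subspace_mappings(X, Y, cluster_ids):
    maps = {}
    for c in set(cluster_ids):
        idx = [i for i, cid in enumerate(cluster_ids) if cid == c]
        maps[c] = procrustes(X[idx], Y[idx])
    return maps

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4)); Y = rng.normal(size=(6, 4))
print({c: W.shape for c, W in subspace_mappings(X, Y, [0, 0, 0, 1, 1, 1]).items()})
```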

Self-Supervised Contrastive Learning for Content-Centric Speech Representation
pdf bib Li Jinlong, Dong Ling, Wang Wenjun, Yu Zhengtao and Gao Shengxiang

Self-supervised learning (SSL) speech models have achieved remarkable performance across various tasks, with the learned representations often exhibiting a high degree of generality and applicability to multiple downstream tasks. However, these representations contain both speech content and some paralinguistic information, which may be redundant for content-focused tasks. Decoupling this redundant information is challenging. To address this issue, we propose a Self-Supervised Contrastive Representation Learning method (SSCRL), which effectively disentangles paralinguistic information from speech content by aligning similar content speech representations in the feature space using self-supervised contrastive learning with pitch perturbation and speaker perturbation features. Experimental results demonstrate that the proposed method, when fine-tuned on the LibriSpeech 100-hour dataset, achieves superior performance across all content-related tasks in the SUPERB Benchmark, generally outperforming prior approaches.

RankLLM: A Multi-Criteria Decision-Making Method for LLM Performance Evaluation in Sentiment Analysis
pdf bib Xue Huzhi, Zhao Butian, Xie Haihua and Sun Zeyu

Large Language Models (LLMs) have made significant advancements in sentiment analysis, yet their quality and reliability vary widely. Existing LLM evaluation studies are limited in scope, lack a comprehensive framework for integrating diverse capabilities, and fail to quantify the impact of prompt design on performance. To address these gaps, this paper introduces a set of LLM evaluation criteria with detailed explanations and mathematical formulations, aiding users in understanding LLM limitations and selecting the most suitable model for sentiment analysis. Using these criteria, we apply the Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS), a classic decision-making method, to rank the performance of LLMs in sentiment analysis. We evaluated six popular LLMs on three Twitter datasets covering different topics and analyzed the impact of prompt design by assessing model-prompt combinations. Additionally, a validation experiment on a publicly available annotated dataset further confirms our ranking results. Finally, our findings offer valuable insights into the evaluation and selection of LLMs for sentiment analysis.
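A minimal NumPy sketch of the classic TOPSIS procedure referenced above (vector normalization, weighting, distances to the ideal and anti-ideal solutions, closeness-based ranking). The decision matrix, weights, and the assumption that all criteria are benefit-type are illustrative only and not taken from the paper.

```python
import numpy as np

# Minimal TOPSIS sketch: rows = LLMs, columns = benefit-type criteria (higher is better).
# The decision matrix and weights below are invented for illustration.
def topsis(matrix: np.ndarray, weights: np.ndarray) -> np.ndarray:
    norm = matrix / np.sqrt((matrix ** 2).sum(axis=0))        # vector normalization
    weighted = norm * weights
    ideal, anti_ideal = weighted.max(axis=0), weighted.min(axis=0)
    d_pos = np.linalg.norm(weighted - ideal, axis=1)          # distance to ideal solution
    d_neg = np.linalg.norm(weighted - anti_ideal, axis=1)     # distance to anti-ideal
    return d_neg / (d_pos + d_neg)                            # closeness: higher = better

scores = np.array([[0.82, 0.75, 0.90],
                   [0.78, 0.80, 0.85],
                   [0.88, 0.70, 0.80]])
weights = np.array([0.5, 0.3, 0.2])
print(topsis(scores, weights))   # rank models by descending closeness
```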

A Chunk-based Chain of Thought Prompting Method for Mitigating Over-Correction in Chinese Grammatical Error Correction
pdf bib Chang Xinquan and Zhu Junguo

Large Language Models (LLMs) have demonstrated remarkable capabilities in semantic understanding and text generation. However, when applied to downstream tasks such as Chinese Grammatical Error Correction (CGEC), they often suffer from over-correction issues, where grammatically correct parts are mistakenly altered. Moreover, while some existing methods aim to address over-correction in Sequence-to-Sequence (Seq2Seq) models, they are difficult to adapt to decoder-only LLMs. To address these challenges, we propose a Chunk-based Chain of Thought (CoT) Prompting Method. Our study is structured into three key components. Initially, we identify specific types of grammatical errors in the input sentences. Following this, sentences are segmented into smaller chunks, and each chunk is analyzed to match the detected error types. Ultimately, the aggregated information guides LLMs in performing localized correction within the input sentences. The experimental results have proved the effectiveness of our method in mitigating over-correction, achieving a higher F0.5 score while maintaining robust grammatical error correction performance. This method provides innovative perspectives on employing LLMs to enhance the precision and granularity of the CGEC task.

Linguistic Differences between AI and Human Comments in Weibo: Detect AI-Generated Text through Stylometric Features
pdf bib Li Ziqi and Zhang Qi

LLM-enhanced social robots (LLM-Bots) generate responses similar to human interactions and pose risks to social media platforms. Distinguishing AI-generated texts (AIGTs) from human-written content is important for mitigating these threats. However, current AIGT detection technologies face limitations in social media contexts, including inadequate performance on short texts, poor interpretability, and a reliance on synthetic datasets. To address these challenges, this study first constructs a social media dataset composed of 463,382 Weibo comments to capture real-world interactions between LLM-Bots and human users. Second, a stylometric feature set tailored to Chinese social media is developed. We conduct a comparative analysis of these features to reveal linguistic differences between human-written and AI-generated comments. Third, we propose a lightweight stylometric feature-based self-attention classifier (SFSC). This model achieves a strong F1-score of 91.8% for detecting AI-generated short comments in Chinese while maintaining low computational overhead. Additionally, we provide interpretable criteria for the SFSC in AIGT detection through feature importance analysis. This study advances detection for AI-generated short texts in Chinese social media.

Unveiling the Linguistic Acceptability Judgments of Large Language Models in Multilingual Contexts
pdf bib Xing Fuyu, Huang Haoyu, Mo Dawei, Yang Xinzhuo, Gao Zixuan, Wang Wei, Wang Zimu and Zhang Haiyang

Linguistic acceptability judgments are essential for evaluating how language models internalize human-like grammatical knowledge. Though some studies have evaluated large language models (LLMs) in this context, existing research lacks systematic exploration of diverse learning paradigms in a multilingual setting. In this paper, we present the first multilingual evaluation of LLMs across four languages (English, Chinese, Japanese, and Russian) in the field of linguistic acceptability. Our evaluation spans both general-purpose (i.e., GPT-4o, GPT-4o mini, DeepSeek-V3, GLM-4-32B, and the Qwen series) and reasoning-oriented (QwQ-32B-Preview and DeepSeek-R1-32B) models under zero-shot and monolingual, cross-lingual and multilingual fine-tuning settings, with comparisons to pre-trained language model (PLM) baselines. Our analysis highlights the strong generalizability of large-scale LLMs through zero-shot prompting, the challenges of fine-tuning small-sized LLMs with skewed training data, the effectiveness of multilingual fine-tuning for low-resource languages, the scaling law exhibited on the task, and the limitation of reasoning-oriented models on the task, even when “aha moments” occur during the reasoning process.

Self-Preference: An Automated Method for Preference-Aligned Data Constructed from Business Metrics
pdf bib Gao Feng, Zhang Xuan, Ni Boyi, Wang Chunping and Chen Lei

Large language models (LLMs) have become integral components of various AI solutions, with the reinforcement learning from human feedback (RLHF) stage playing a critical role in aligning model outputs with human preferences. However, generating the human preference data required for RLHF is often costly and time-consuming due to its reliance on human evaluation. This study addresses this challenge within the dialogue scenarios of the fintech industry. We leverage rich, non-confidential, multi-turn dialogue data, such as call center dialogue records, which include associated business metrics (e.g., problem-solving rates, turnover ratios) to construct preference-aligned data. We introduce Self-Preference, an automated method for creating preference-aligned data guided by these objective business metrics. The approach involves clustering dialogue histories based on their semantic representations and calculating a well-designed conditional probability ratio that correlates sequences with business metrics to generate preference data. In contrast to traditional preference alignment data generation methods that depend on subjective human evaluations, Self-Preference significantly reduces labeling costs and mitigates model-induced biases. Experimental results indicate that models trained with Self-Preference generated data demonstrate a strong positive correlation with target business metrics, highlighting the method’s effectiveness in facilitating efficient, goal-oriented alignment of LLMs.

MQM-MSC: Enhancing Translation Quality Estimation Interpretability with Mask-Driven Self-Correction in Large Language Models
pdf bib Cai Guanghui and Zhu Junguo

Large Language Models (LLMs) have demonstrated significant potential in interpretable translation quality estimation by providing both holistic ratings and fine-grained feedback. However, state-of-the-art methods, such as GEMBA-MQM, still suffer from an excessive number of false positives in error prediction, leading to misalignment with human annotations and reducing interpretability. To address this issue, we propose MQM-MSC, a novel training-free framework that employs a mask-driven self-correction (MSC) mechanism. The core of MSC is to use masks to highlight error spans in the initial prediction, enabling the model to re-evaluate these masked portions and verify their correctness. This approach mirrors human cognitive processes: when individuals express inconsistent judgments about the same issue at different times, it often indicates that their initial assessment was flawed. Similarly, MSC exploits contradictions between two evaluations to identify and filter false positives, thereby improving the accuracy and reliability of error annotations. Experimental results show that MQM-MSC effectively reduces false positives across four LLMs and three language pairs, consistently improving the reliability and quality of error annotations in the GEMBA-MQM approach.

EDGE: Enhanced Debiased Gradient Extraction for Robust Fine-tuning
pdf bib Li Jinglong, Zhang Kun, Zou Chenyu, Shi Wei, Li Xin and Wei Si

Recent advances in large-scale pre-training have substantially enhanced the robustness and generalization capabilities of foundation models (e.g., Qwen3 and Llama-4). However, when fine-tuning them on downstream tasks, these models often latch onto dataset-specific biases, learning spurious correlations tied to easy-to-learn but non-robust features. This undermines their performance under distribution shifts, despite strong in-distribution (ID) accuracy. Existing fine-tuning methods, including full-parameter and parameter-efficient techniques, primarily optimize for ID performance and largely overlook out-of-distribution (OOD) robustness. Meanwhile, debiasing has been explored in full fine-tuning, while debiasing strategies on Parameter-Efficient Fine-Tuning (PEFT) remain underexplored. To this end, in this paper, we propose Enhanced Debiased Gradient Extraction (EDGE), a lightweight gradient projection-based method that explicitly suppresses bias-amplifying updates during the fine-tuning process. EDGE is a model-agnostic, plug-and-play debiasing method that operates without relying on predefined bias types or labels. It seamlessly integrates with both full and parameter-efficient fine-tuning, and generalizes across NLP and vision tasks. Experiments on synthetic and real-world benchmarks demonstrate that EDGE effectively reduces bias and consistently improves OOD generalization, offering a unified and practical framework for robust adaptation under dataset bias.

Improving Abstract Reasoning Ability of Large Language Models through Mixture Program-based Data Synthesis
pdf bib Wang Yile and Huang Hui

Abstract reasoning is a challenging task that involves identifying patterns from limited input-output grids and applying them to new grids. With the development of large language models (LLMs), recent studies attempt to transfer the problems to a textual format and tackle abstract reasoning tasks using models such as GPT-4. However, the overall accuracy is still low, which also results in the poor quality of abstract reasoning data directly synthesized by GPT-4, making it unsuitable as effective fine-tuning data. In this paper, we propose mixture program-based data synthesis strategies, including low-level code-based synthesis, high-level DSL-based synthesis, and shuffle-based synthesis. Through these strategies, we construct diverse and valid abstract reasoning instruction data to help improve the general abstract reasoning ability of LLMs across multiple datasets. Experimental results show that, by supervised fine-tuning Qwen-2.5-7B on our synthesized instruction data, the resulting model shows improved abstract reasoning ability and outperforms various strong baseline LLMs, including the closed-source model GPT-4 and open-source models such as LLaMA-3 and Qwen-2.5. We release the logs by GPT and our model at https://github.com/szu-tera/ARC.

Fine-tuning GEC Model Based on Language Family Corpus
pdf bib Liu Yitao and Mark Dras

It is widely known that the first language (L1) of English learners influences their language study, causing them to make biased errors. However, research on using L1 information to improve Grammatical Error Correction (GEC) models is relatively limited. Among the limited research, a common method is to train a set of GEC models, where each model is trained on a corpus from one (and only one) specific L1 background. This method has been proven efficient, while the waste of the training / fine-tuning data makes it suffer from the data limitation issue. This paper introduces a novel method to address this issue by exploiting the linguistic similarities between a language family and its member languages. We expand the fine-tuning data from one specific L1 background to its language family, making the quantity increase exponentially. We use the Italic language family corpus as our language family corpus and experiment with two approaches facing two situations, mainly differing in development data. The results show that, for the approach that uses the Italic language family corpus as the fine-tuning data and uses development data where the L1 background is the same as that of the test data, the GEC models improve clearly; however, the way it influences the models is not uniform, and varies by error types.

Cross-modal Ambiguity Learning with Heterogeneous Interaction Analysis For Rumor Detection
pdf bib Fan Zhuo, Zhu Qing and Xiao Yang

Rumor detection on social media has recently attracted significant attention. Due to the complex user group and lack of regulation, rumor-spreaders intentionally disseminate rumors to sway public opinion, severely harming the general interests. Existing approaches generally perform rumor detection by analyzing both image and text modalities, and pay less attention to the interaction behaviors in social media, which can assist in distinguishing rumors from normal information. Furthermore, the images associated with rumors are often inconsistent or manipulated; how to distinguish these different features and utilize them effectively has become crucial in preventing the widespread dissemination of rumors. To address the aforementioned issues, we propose Cross-modal Ambiguity Learning with Heterogeneous Interaction Analysis (CAHIA) for rumor detection. Specifically, we design a novel heterogeneous graph feature extractor to fully utilize the different types of behavioral patterns in social interaction networks, and we design a frequency inception net to extract manipulated visual features and adopt different fusing strategies to detect various types of rumors according to the ambiguity between text and image. Finally, a hierarchical cross-modal fusing mechanism is used to simulate the process by which users view and determine the authenticity of posts. Extensive experimental results demonstrate that CAHIA outperforms state-of-the-art models on four large-scale datasets for rumor detection in social media.

BiSaGA: A Novel Bidirectional Sparse Graph Attention Adapter for Evidence-Based Fact-Checking
pdf bib Ran Junfeng, Luo Weiyao, Tian Zailong, Zhao Guangxiang, Zhu Dawei, Wu Longyun, Huang Hailiang and Li Sujian

Evidence-based fact-checking aims to verify or debunk claims using evidence and has greatly benefited from advancements in Large Language Models (LLMs). This task relies on clarifying and discriminating relations between entities. However, autoregressive LLMs struggle with understanding relations presented in different orders or narratives, as their unidirectional nature hampers effective performance. To address this challenge, we propose a novel method that leverages bidirectional attention as an external adapter to facilitate two-way information aggregation. Additionally, we employ hierarchical sparse graphs to merge local and global information and introduce an efficient feature-compression technique to minimize the number of adapter parameters. Experimental results on both English and Chinese datasets demonstrate the significant improvements achieved by our approach, showcasing state-of-the-art performance in the evidence-based fact-checking task.

RJAG: Retrieval Judgment Augmented Generation
pdf bib Wang Kuangzhi, Hu Zhenhua, Ren Min and Tao Xiangzhi

Large Language Models (LLMs) inevitably suffer from hallucinations, as relying solely on their parametric knowledge cannot guarantee the accuracy of generated content. To enhance text generation, retrieval-augmented generation (RAG) has been proposed to incorporate external knowledge. However, its effectiveness heavily depends on the relevance of retrieved documents, which poses a critical challenge: how to ensure the accuracy and reliability of model responses when retrieval results are inaccurate. Tackling this challenge, we propose Retrieval Judgment Augmented Generation (RJAG), a method that enhances RAG through an LLM-driven fine-grained relevance judgment mechanism and a task-adaptive knowledge combination strategy. RJAG judges and dynamically combines retrieved documents for both open-ended generation and closed-ended selection tasks. Additionally, large-scale web search is also included to expand the knowledge beyond static corpora. Experimental results on multiple benchmarks show that RJAG outperforms existing RAG methods, significantly enhancing accuracy and reliability while maintaining the system’s simplicity. Code is available at https://github.com/wangkz2023/RJAG.

Statistically Optimized SGNS Model: Enhancing Word Vector Representation with Global Semantic Weight
pdf bib Liu Yulin, Xiong Feng, Liu Wanwei and Wu Minghui

Addressing the limitations of the Skip-gram with Negative Sampling (SGNS) model related to negative sampling, subsampling, and its fixed context window mechanism, this paper first presents an in-depth statistical analysis of the optimal solution for SGNS matrix factorization, deriving the theoretically optimal distribution for negative sampling. Building upon this analysis, we propose the concept of Global Semantic Weight (GSW), derived from Pointwise Mutual Information (PMI). We integrate GSW with word frequency information to improve the effectiveness of both negative sampling and subsampling. Furthermore, we design dynamic adjustment mechanisms for the context window size and the number of negative samples based on GSW, enabling the model to adaptively capture contextual information commensurate with the semantic importance of the center word. Notably, our optimized model maintains the same time complexity as the original SGNS implementation. Experimental results demonstrate that our proposed model achieves competitive performance against state-of-the-art word embedding models including SGNS, CBOW, and GloVe, across multiple benchmark tasks. Compared with the current mainstream dynamic word vector models, this work emphasizes achieving a balance between efficiency and performance within a static embedding framework, and provides potential complementary support for complex models such as LLMs.
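A small illustrative sketch of deriving PMI from a word-context co-occurrence matrix and aggregating it into a per-word weight. The mean-positive-PMI aggregation used here is an assumption for illustration; the paper's exact GSW definition may differ.

```python
import numpy as np

# Illustrative sketch: compute PMI from a word-context co-occurrence matrix and
# aggregate it into a per-word weight. The mean-positive-PMI aggregation is an
# assumption for illustration, not necessarily the paper's exact GSW formula.
def pmi_matrix(cooc: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    total = cooc.sum()
    p_wc = cooc / total
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    return np.log((p_wc + eps) / (p_w * p_c + eps))

def global_semantic_weight(cooc: np.ndarray) -> np.ndarray:
    ppmi = np.maximum(pmi_matrix(cooc), 0.0)   # keep positive PMI only
    return ppmi.mean(axis=1)                   # one weight per word

cooc = np.array([[10., 2., 0.], [2., 8., 1.], [0., 1., 5.]])
print(global_semantic_weight(cooc))
```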

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models
pdf bib Chen Yushuo, Tang Tianyi, Xiang Erge, Li Linjiang, Zhao Wayne Xin, Wang Jing, Chai Yunpeng and Wen Ji-Rong

In the real world, large language models (LLMs) can serve as assistants to help users accomplish their jobs, and also support the development of advanced applications. For the wide application of LLMs, inference efficiency is an essential concern, which has been widely studied in existing work, and numerous optimization algorithms and code libraries have been proposed to improve it. Nonetheless, users still find it challenging to compare the effectiveness of all the above methods and understand the underlying mechanisms. In this work, we propose a coarse-to-fine method that encompasses both experimental and analytical components. This method can be applied across various models and inference libraries. Specifically, we examine four usage scenarios within two practical applications. We further provide both theoretical and empirical fine-grained analyses of each module in the Transformer architecture. Our method can serve as a general and invaluable tool for researchers to evaluate various code libraries and improve inference strategies across different LLMs. We open-source the supporting dataset, code, and evaluation scripts at the link: https://github.com/RUCAIBox/Inference-Efficiency-Evaluation.

MASP: A Multilingual Dataset for Probing Scalar Modifier Understanding in LLMs
pdf bib Gao Xinyu, Ding Nai and Liu Wei

This study aims to test how large language models (LLMs) understand gradable adjectives and how their understanding compares with that of humans, under the framework of formal semantics. We introduce a diagnostic dataset, referred to as the Modifier-Adjective Scale Probe (MASP), to evaluate how well LLMs understand a gradable adjective (e.g., long) when the adjective is combined with one modifier (e.g., very long or slightly long, a condition referred to as degree modification) or is further negated (e.g., very not long and not very long, a condition referred to as compositional negation). The dataset consists of over 80,000 natural language inference questions in both Chinese and English. We apply the MASP dataset to test both humans and 11 popular LLMs, including GPT-4o and Gemini-2.0-Flash. The results show that most LLMs can correctly understand whether a modifier boosts (e.g., very) an adjective. However, they fail to understand the modifiers that weaken the degree and the negation forms of modifiers. Furthermore, we parameterize the human and LLM behavior, and find that the judgment patterns of LLMs differ from those of humans, especially in the Chinese tests. These findings suggest that LLMs are still not well aligned with humans in terms of the interpretation of simple adjective phrases, and MASP provides a new approach to quantify the interpretation of adjective phrases in LLMs.

HFSD-V2C: Zero-Shot Visual Voice Cloning Via Hierarchical Face-Styled Diffusion Model
pdf bib Liu Yaping, Wang Linqin, Gao Shengxiang, Yu Zhengtao and Dong Ling

The goal of this work is zero-shot visual voice cloning (ZS-V2C), which aims to generate speech samples with unseen speaker identity and prosody derived from a video clip and an acoustic reference. ZS-V2C presents greater challenges as: 1) unseen speaker modeling; and 2) unseen prosody modeling. Unlike previous works, we propose a novel ZS-V2C framework that incorporates a hierarchical face-styled diffusion model (HFSD-V2C). Specifically, first, we leverage cross-modal biometrics to predict unseen speaker embeddings based on facial features. Then, we jointly model the unseen prosodic features at the text, speech and video levels. Finally, a diffusion model is constructed based on the embeddings of the unseen speaker and prosodic features, enabling the generation of expressive and diverse speech. Extensive experiments on the LRS2 and GRID benchmark datasets demonstrate the superior performance of our proposed method.

CRAF:Cross-Modal Representation Alignment and Fusion for Speech Translation
pdf bib Guo Zhenbei, Wu Wenzhou, Lai Hua, Xiang Yan, Huang Yuxin and Yu Zhengtao

The end-to-end speech translation task involves directly transforming speech into the text of another language, bypassing the generation of an intermediate transcription. However, existing methods may lose key information during cross-modal length alignment and fail to effectively integrate different representations, resulting in low quality of the fused representation. To address these issues, we propose an efficient method named CRAF for effective cross-modal alignment and fusion for speech translation, which reduces information loss and enhances the integration of cross-modal representations. First, CRAF minimizes information loss by improving the cross-modal length alignment, ensuring the alignment process retains more critical information from the speech modality. Second, CRAF strengthens the integration of cross-modal representations by allowing the model to combine complementary features from diverse modalities, enhancing its capacity to concentrate on the most pertinent and critical information. Finally, we evaluate CRAF by conducting extensive experiments on eight language pairs from the MuST-C dataset. Experiments show that the average BLEU score of CRAF achieves 29.0, outperforming other comparison methods. Our code is available at https://github.com/wu-wen-zhou/first/tree/master.

Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models
pdf bib Sun Wangtao, Zhang Chenxiang, Zhang Xueyou, Yu Xuanqing, Huang Ziyang, Xu Haotian, He Shizhu, Zhao Jun and Liu Kang

Although Large Language Models (LLMs) have demonstrated strong instruction-following ability, they are further supposed to be controlled and guided by inferential rules in real-world scenarios to be safe, accurate, and intelligent. This demands the possession of inferential rule-following capability of LLMs. However, no prior work has made a clear evaluation of the inferential rule-following capability of LLMs. Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis based on the evaluation results provides insights into the improvements for LLMs toward a better inferential rule-following intelligent agent. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract inferential rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://gitee.com/forangel2014/llm-rule-following-code

Lao-English Code-Switched Speech Synthesis Via Neural Codec Language Modeling
pdf bib Liu Yaping, Wang Linqin, Gao Shengxiang, Yu Zhengtao, Dong Ling and Tian Tian

This paper addresses the challenges of data scarcity and limited speaker resources in Lao-English code-switched speech synthesis. We propose a neural encoder-decoder-based method for mixed-lingual speech synthesis. The method first extracts phoneme-level speech representations and employs a dot-product attention mechanism to map Lao and English phonemes into a shared latent space, thereby enhancing the model’s capability to represent cross-lingual phonetic information. In addition, a language ID embedding module is extended to explicitly indicate the language of each input token, helping the model distinguish and adapt to language-specific pronunciation characteristics. Experiments are conducted on the open-source English dataset LibriTTS and a proprietary Lao speech corpus. Both subjective evaluations (MOS, AB preference tests) and objective metrics (RMSE) demonstrate that the proposed approach significantly outperforms the baseline VALL-E X model in terms of naturalness and language-switching fluency. Furthermore, ablation studies confirm that both the shared phoneme latent space and the language ID module play critical roles in improving synthesis quality. This approach offers a novel solution for integrating low-resource languages into mixed-lingual speech synthesis.

UMAD: Enhancing LLM Debiasing via Multi-Agent Debate and Token-Level Bias Interpretation
pdf bib Gu Hanwen, Ma Jie, Qin Ying and Hu Ling

Textual data often contain biases that compromise fairness in AI systems, particularly in sensitive areas such as gender, race, and politics. While large language models (LLMs) have shown success across various tasks, they still face limitations due to inherent biases within the models and restrictive safety policies that hinder direct bias mitigation. To overcome these challenges, we propose UMAD (Unsupervised Multi-Agent Debate), a novel framework that leverages a Multi-Agent Debate mechanism alongside Best-Worst Scaling (BWS) to foster more effective discussions among LLMs, facilitating the identification of biases. By combining this with gradient-based interpretation techniques, UMAD extracts token-level bias insights, which are then integrated into models using in-context learning. This enhances the debiasing performance, as shown by our experiments across three bias categories—gender, religion, and politics—using five different LLMs. Our approach demonstrates significant improvements in metrics, with large models matching or even surpassing GPT-4 in Style Accuracy (STA). We release our code at: https://github.com/Couen/UMAD.git.

Instruction-Driven In-Context Learning for Domain-Specific Chinese Spelling Correction
pdf bib Hyunsoo Park, Wu Hongqiu and Zhao Hai

This paper investigates domain adaptation in Chinese Spelling Correction (CSC) based on the instruction-following ability of large language models (LLMs). In the instructions, we include a variety of domain-specific requirements for spelling correction, such as the domain’s formality or writing tone, which go beyond the considerations of previous CSC research. To evaluate the LLMs’ performance on instruction-following, we propose IDSpell, a semi-supervised construction pipeline for a CSC dataset containing a wide range of domain-specific sentences along with specific instructions. We construct a dataset with IDSpell and evaluate it on Qwen2.5 and GPT-4o, where we find that instructions have a meaningful influence on correction, increasing the average F1 score by 10.4% compared to when the instructions are not provided. To further enhance the result, we propose Contrastive Prompting, a method incorporating contrastive false examples into the prompt to better guide the model to understand the instruction. Experiments demonstrate that our method outperforms baseline prompting with an average improvement of 5.4%. Our dataset and code are publicly available for further research.

TAG: Dialogue Summarization Based on Topic Segmentation and Graph Structures
pdf bib Shen Yatian, Hao Qichao, Deng Guosong, Wang Songyang and Zhang Eryan

In recent years, dialogue summarization has emerged as a rapidly growing area of research in natural language processing. Dialogue summarization is challenging due to dispersed key information, redundant expressions, ambiguous topic identification, and difficult content selection. To address these challenges, we propose an innovative approach to dialogue summarization that integrates topic segmentation and graph-structured modeling. Specifically, we first perform topic segmentation of the dialogue through clustering and quantify the key information in each utterance, thereby capturing the dialogue topics more effectively. Then, a redundancy graph and a keyword graph are constructed to suppress redundant information and extract key content, thereby enhancing the conciseness and coherence of the summary. Evaluations were conducted on the DialogSum, SAMSum, CSDS, and NaturalConv datasets. The experimental results demonstrate that the proposed method significantly outperforms existing benchmark models in terms of summary accuracy and information coverage. The Rouge-1 scores achieved were 48.03%, 53.75%, 60.78%, and 81.48%, respectively, validating its effectiveness in the dialogue summarization task. Our code is available at https://anonymous.4open.science/r/TAG-E64A.

DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation
pdf bib Huang Tianyou, Chen Xinglu, Zhang Jingshen, Qiu Xinying and Niu Ruiying

This paper introduces DualReward, a novel reinforcement learning framework for automatic distractor generation in cloze tests. Unlike conventional approaches that rely primarily on supervised learning or static generative models, our method employs a dual reward structure with adaptive scaling that differentiates between human-created gold standard distractors and model-generated candidates. The framework dynamically adjusts reward signal intensity based on model performance and confidence. We evaluate our approach on both passage-level (CLOTH-F) and sentence-level (MCQ) cloze test datasets, demonstrating consistent improvements over state-of-the-art baselines. Experimental results show that our adaptive reward scaling mechanism provides modest but consistent benefits on homogeneous datasets (CLOTH-F) and more substantial improvements (3.48-3.86% in P@1) on diverse, cross-domain data (MCQ), suggesting its particular effectiveness for handling varied question types and domains. Our work offers a flexible framework that effectively balances learning from reliable human examples while exploring novel, high-quality distractors for automated test generation.
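The abstract does not spell out the reward formulation; the hypothetical sketch below only illustrates the general shape of a dual reward with adaptive scaling, where gold distractors carry a larger base weight than generated candidates and the reward shrinks as model confidence grows. All function names and constants are illustrative assumptions.

```python
# Hypothetical sketch of a dual reward with adaptive scaling (not the authors' exact formulation).
def dual_reward(is_gold: bool, quality: float, model_confidence: float,
                gold_weight: float = 1.0, candidate_weight: float = 0.5) -> float:
    """Reward a distractor candidate.

    is_gold: whether the candidate is a human-written gold distractor.
    quality: a [0, 1] score from an external scorer (e.g., a discriminator).
    model_confidence: the generator's confidence in the candidate, in [0, 1].
    """
    base = gold_weight if is_gold else candidate_weight
    # Adaptive scaling: when the model is already confident, shrink the reward
    # so training focuses on harder, lower-confidence candidates.
    scale = 1.0 - 0.5 * model_confidence
    return base * quality * scale

print(dual_reward(is_gold=True, quality=0.9, model_confidence=0.3))   # gold, low confidence
print(dual_reward(is_gold=False, quality=0.7, model_confidence=0.8))  # generated, high confidence
```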

Enabling Real-Time Conversations with Minimal Training Costs
pdf bib Xu Wang, Wang Haoyu, Wang Shuo, Zhao Weilin, Han Xu, Yan Yukun, Zhao Haiyan, Zhang Yudi, Tao Zhe, Liu Zhiyuan and Che Wanxiang

Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the duplex capability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of input and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.

DSMR-SQL: Enhancing Text-to-SQL with Dual-Strategy SQL Generation and Multi-Role SQL Selection
pdf bib Huang Yiming, Guo Jiyu, Zeng Jichuan, Gao Cuiyun, Han Peiyi and Liu Chuanyi

Recent advancements in Large Language Models (LLMs) have markedly improved SQL generation. Nevertheless, existing approaches typically rely on single-model designs, limiting their capacity to effectively handle complex user queries. In addition, current methods often face difficulties in selecting the optimal SQL from multiple candidates. To mitigate these limitations, this study presents DSMR-SQL, a two-stage framework consisting of: (1) Dual-Strategy SQL Generation: DSMR-SQL aims to produce a broader spectrum of SQL queries by using multiple models with two strategies, Supervised Fine-Tuning and In-Context Learning; (2) Multi-Role SQL Selection: DSMR-SQL seeks to identify the SQL that best aligns with user intent by introducing a collaborative framework involving three roles (i.e., Proposer, Critic, Summarizer). Extensive experiments on various datasets substantiate the efficacy of DSMR-SQL in enhancing SQL generation.

Quality-aware Neural Machine Translation with Self-evaluation
pdf bib Cui Jiajia, Mu Lingling, Liu Qiuhui and Xu Hongfei

The performance of neural machine translation relies on a large amount of data, but crawled sentence pairs are of varying quality. Low-quality sentence pairs may provide helpful translation knowledge but also teach the model to generate low-quality translations. Making the model aware of the quality of training instances may help it distinguish between good and bad translations while still leveraging the translation knowledge. In this paper, we evaluate the quality of training instances with the average per-token loss (negative log-likelihood) from translation models, convert the quality scores into embeddings through vector interpolation, and feed the quality embedding into the translation model during its training. During inference, we ask the model to decode with the best quality score to generate good translations. Experiments on the IWSLT 14 German-to-English, WMT 14 English-to-German and WMT 22 English-to-Japanese translation tasks show that our method effectively leads to consistent and significant improvements across multiple metrics.
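A minimal PyTorch sketch of the two ingredients named in the abstract: the average per-token negative log-likelihood as a quality score, and vector interpolation between two quality embeddings. Tensor shapes, the normalization constant, and all function names are assumptions for illustration, not the paper's code.

```python
# Sketch of quality-aware training signals (PyTorch; shapes and names are illustrative).
import torch
import torch.nn.functional as F

def per_token_nll(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Average negative log-likelihood per non-pad target token, one score per sentence."""
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    nll = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none")
    mask = (targets != pad_id).float()
    return (nll * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

def quality_embedding(score: torch.Tensor, emb_low: torch.Tensor, emb_high: torch.Tensor,
                      max_nll: float = 10.0) -> torch.Tensor:
    """Map an NLL-based quality score to an embedding by linear interpolation."""
    alpha = (score / max_nll).clamp(0.0, 1.0).unsqueeze(-1)  # 0 = best quality, 1 = worst
    return (1 - alpha) * emb_high + alpha * emb_low

logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))
emb_low, emb_high = torch.randn(64), torch.randn(64)
print(quality_embedding(per_token_nll(logits, targets), emb_low, emb_high).shape)  # (2, 64)
```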

Proceedings of the 24th Chinese National Conference on Computational Linguistics (Volume 2: Evaluations)

pdf(全部) bib(全部)

CCL25-Eval任务1系统报告:基于上下文学习与格式化约束的空间语义理解
pdf bib 郑奕扬

本系统报告详细介绍了我们团队参加第五届中文空间语义理解评测(SpaCE2025)的方法和成果。SpaCE2025旨在评估大语言模型在空间语义理解和空间推理能力上的表现,涵盖空间信息正误判断、空间异形同义判断、空间参照实体判断、中文空间方位关系推理和英文空间方位关系推理五个子任务。针对不同任务,我们采用基于上下文的有监督微调和格式化约束的逻辑推理框架,结合LoRA高效微调Qwen2.5-7B-Instruct和DeepSeek-R1-Distill-Qwen-7B模型,设计了约束提取、排列遍历和求解器求解的推理流程。在测试集上,我们在信息正误判断、异形同义判断、参照实体判断、中文方位推理、英文方位推理分别取得0.6454、0.7082、0.7720、0.6254、0.5997的准确率,综合排名第二。

CCL25-Eval任务1系统报告:基于数据、训练、推理三阶协同增强的空间语义理解
pdf bib 华中天, 罗毅, 王梦园, 于美佳, 韩英杰

SpaCE2025以空间语义理解为核心,聚焦于具有较高难度的空间语义理解任务,旨在评估大语言模型(LLM)在空间语言能力和空间推理能力两方面的表现。面对空间语义复杂、训练数据缺失和模型参数限制等挑战,本文提出了一个基于数据、训练、推理三阶协同增强的模型优化框架,针对空间语言能力和空间推理能力两个子任务分别设计了两套不同的优化方案。对于空间语言能力任务,我们利用DeepSeek-R1结合空间词表对训练集进行了扩充,对Qwen系列LLM进行了LoRA微调,在推理过程中使用了测试时增强来进一步优化结果;对于空间推理能力任务,我们将空间语言能力数据集也纳入训练集,对DeepSeek-R1-Distill-Qwen-7B模型进行微调,并对模型预测结果进行了累计投票集成。最终,我们的方法排名第六,总体准确率得分为58.54%。此外,本文还报告了一些尝试过但未能提升模型表现的其他方法。

CCL25-Eval任务1系统报告:使用思维链和投票集成增强大型语言模型空间语义理解
pdf bib 刘海芯, 昝红英, 宋金旺, 李一帆, 孔露露

本技术报告详细介绍了我们团队在第五届空间语义理解评测(SpaCE2025)中的方法与成果。SpaCE2025 继续聚焦大语言模型在空间语义理解方面的能力评估,涵盖空间语言理解与空间推理两个核心维度,共设置五个子任务:空间信息正误判断、空间参照实体判断、空间异形同义判断、中文空间方位关系推理以及英文空间方位关系推理。我们通过设计结构化提示词并引入思维链推理机制,结合LoRA 微调技术和投票集成方法,有效提升了大语言模型在空间语义理解任务中的表现。在最终评测中,我们团队五个子任务的综合准确率为0.5983,整体排名第五。

Overview of CCL25-Eval Task 1: The Fifth Spatial Cognition Evaluation
pdf bib Qin Yuhang, Xiao Liming, Hu Nan, Deng Sirui, Ma Jingyuan, Cui Hyang, Zhang Zihan, Tsai Chihsu, Ding Jingkun, Kang Sumin, Sui Zhifang and Zhan Weidong

The Fifth Spatial Cognition Evaluation (SpaCE2025) presents a benchmark aimed at evaluating the spatial semantic understanding and reasoning capabilities of Large Language Models (LLMs), primarily in Chinese. It consists of five subtasks: (1) Retrieving Spatial Referents (RSR), (2) Detecting Spatial Semantic Anomalies (DSA), (3) Recognizing Synonymous Spatial Expression (RSE), (4) Spatial Position Reasoning (SPR) in Chinese, and (5) SPR in English. The fourth and fifth subtasks share the same content and structure, differing only in language, and are designed to assess the cross-linguistic spatial reasoning capability of LLMs. A total of 12 teams submitted their final results, and the best-performing team achieved an accuracy of 0.7931. The results suggest that while LLMs are capable of handling basic spatial semantic understanding tasks such as RSR, their performance on more complex tasks, such as DSA and RSE, still requires improvement. Additionally, finetuning methods that effectively activate LLMs’ reasoning ability are essential to improve their performance.

System Report for CCL25-Eval Task 2: Enhanced Chinese Frame Semantic Parsing with Pre-trained Model and Linguistic Features
pdf bib Liu Yahui, Qiao Ziheng, Gong Chen and Zhang Min

This paper presents our system submitted to the Chinese Frame Semantic Parsing evaluation task at the 24th China National Conference on Computational Linguistics (CCL2025). For the three subtasks of Frame Identification (FI), Argument Identification (AI), and Role Identification (RI), we utilized a larger Chinese pre-trained model as the foundation and adopted specific optimization strategies for the FI and RI subtasks. Specifically, we incorporated word segmentation structure information and updatable pre-trained target word embeddings in the FI subtask, and explored the use of Focal Loss combined with target word embeddings and word segmentation structure information in the RI subtask. Furthermore, a voting mechanism was employed in both the FI and RI subtasks to enhance performance. Our system ultimately achieved first place on TestA and second place on TestB.

CCL25-Eval2系统报告:基于高效参数旋转位置编码的汉语框架语义解析
pdf bib 黄永清

汉语框架语义解析(Chinese Frame Semantic Parsing)是中文自然语言处理中的一项重要任务,其目标是从句子中提取框架语义结构,实现对句子中事件或情境的深层理解,对阅读理解、文本摘要、关系抽取等下游任务具有重要意义。本文将框架识别和论元角色识别任务建模为分类任务,将论元范围识别任务建模为抽取任务,使用预训练语言模型进行微调,并通过对抗训练、指数滑动平均、分组学习率、参数高效旋转位置编码等策略提升模型性能。

Overview of CCL25-Eval Task2: Chinese Frame Semantic Parsing Evaluation
pdf bib Xu Hao, Li Juncai, Yan Zhichao, Liu Haikun, Su Xuefeng, Zhang Jiayang and Li Ru

Chinese Frame Semantic Parsing (CFSP) aims to extract fine-grained frame semantic structures from text, providing rich semantic information to enhance the capabilities of natural language understanding models in semantic representation and downstream applications. Building on the CCL-2024 CFSP evaluation task and motivated by the prevalent phenomenon of nested semantic roles in sentences, we update the nested role annotation data by simultaneously labeling all nested semantic roles. Based on this enhancement, we publish a more challenging CFSP evaluation task for CCL-2025. The evaluation dataset consists of 22,000 annotated examples involving 703 frames, including nested annotations covering 101 semantic roles. The evaluation task, divided into three subtasks (frame identification, argument identification, and role identification), has attracted wide attention from both industry and academia, with a total of 156 teams participating. As for the evaluation results, Yongqing Huang from Guangdong Province won first place with a final score of 70.76. In this paper, we report key information about the evaluation task, including key concepts, the evaluation dataset, the top-3 results and the corresponding methods. More information about this task can be found on the website for the CCL-2025 CFSP evaluation task.

System Report for CCL25-Eval Task 2: Solving Frame Semantic Parsing with LLMs
pdf bib Du Jingtao

Frame Semantic Parsing (FSP) is a critical task in natural language processing (NLP) that involves identifying semantic frames, argument spans, and their corresponding roles within a sentence. This paper presents a novel approach to Chinese Frame Semantic Parsing by fine-tuning the Qwen3 large language model to simultaneously address three sub-tasks: Frame Identification, Argument Identification, and Role Identification. We propose a unified prompt-based framework with iterative refinements, including direct argument output for span identification and a majority-voting mechanism for frame prediction. Our experiments demonstrate significant improvements in argument and role identification through modified output formats, while frame identification benefits from ensemble voting. However, integrating Chain-of-Thought (CoT) reasoning with model-generated explanations yielded suboptimal results, suggesting limitations in the auxiliary model’s performance. This work highlights the potential of fine-tuned large language models for complex semantic parsing tasks and identifies avenues for further optimization.

System Report for CCL25-Eval Task 3: Hallucination Mitigation in Chinese Abstract Meaning Representation Parsing with a Multi-Agent Approach
pdf bib Chen Rongbo, Bai Xuefeng, Chen Kehai and Zhang Min

This paper introduces our system for the Fifth Chinese Abstract Meaning Representation (CAMR) Parsing Evaluation task at the 24th China National Conference on Computational Linguistics (CCL 2025). Our framework formulates both CAMR parsing and document-level coreference resolution as sequence-to-sequence generation tasks, employing large language models (LLMs) to produce linearized CAMR sequences and coreference sequences. To mitigate hallucinations in generated graphs, we design a multi-agent system comprising: (1) two detection agents for automated error detection and hallucination identification; (2) a refinement agent that corrects graph structures based on detected inconsistencies. Experimental results show that: (1) recent LLMs, especially Qwen-3, achieve promising performance in CAMR parsing; (2) the proposed multi-agent system can effectively identify and correct hallucinations in CAMR predictions; and (3) sequence-to-sequence methods exhibit significant limitations in document-level coreference resolution due to context length constraints.

CCL25-Eval任务3总结报告:第五届中文抽象语义表示解析评测
pdf bib 许智星, 张艺璇, 李斌, 徐静, 曲维光, 周俊生

本文为第五届中文抽象语义表示解析评测(CAMRP 2025)的总结报告。CAMRP 2025包含两个子任务:中文抽象语义表示(CAMR)句子级解析任务和CAMR篇章共指解析任务。评测任务共有96支队伍报名,4支队伍提交结果,最终总计26份有效成绩。哈尔滨工业大学(深圳)团队在开放测试下取得了84.72%的F值,为CAMRP评测系列五年来的历史最好成绩。该团队在篇章共指消解任务中同样获得了最高61.15%的好成绩,相比baseline有较大提升。参赛队伍的实验结果表明,尽管基于监督微调和图聚合的策略在句子级解析任务中展现出了较好的性能,但大模型对于细粒度的篇章共指关系识别仍然存在挑战。如何有效利用CAMR结构化信息来提升大模型篇章共指解析的性能,仍是未来研究的重要方向。

CCL25-Eval任务4系统报告:基于叙实性分类和语境特征的大语言模型叙实性推理
pdf bib 张箫驿, 鲁嘉琪, 张达, 陈笑宇, 卢达威

叙实性推理是机器理解文本隐含事实的关键能力之一,核心在于结合动词的语义判断动词宾语命题的真值。本研究基于首届中文叙实性推理评测任务4(FIE2025)开展叙实性推理研究,经过前期对不同模型的测验和比对,选择了Deepseek-R1模型为基座模型。提示语的总体撰写思路是:首先将动词叙实性进行分类,从传统的三分法扩展至五分法(叙实、弱叙实、反叙实、非叙实、半叙实),同时,对自然语料与人造语料进行差异化处理,再针对部分语义复杂的动词编写更加细致的判断规则。最终结果显示,自然语料的正确率达到0.9155,人造语料的正确率为0.9541,总正确率达到0.9261。

System Report for CCL25-Eval Task 4: From Plain to Hierarchical —Knowledge-Augmented Prompting for Chinese Factivity Inference
pdf bib Park Minjun and Lee Seulki

To improve the factivity inference capability of large language models (LLMs), we adopted a Retrieval-Augmented Generation (RAG) framework using a curated bibliography on Chinese factivity semantics. We compared a baseline without retrieval against two RAG-based strategies, showing that hierarchical prompting with RAPTOR yields the highest accuracy. Using recursive summarization from the bottom up, RAPTOR allows models to access document context at multiple abstraction levels, resulting in more accurate and stable inference. Our findings contribute to deeper Chinese semantic inference through linguistic knowledge-augmented prompting in factivity inference and textual entailment.
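For readers unfamiliar with RAPTOR, the toy sketch below shows the bottom-up recursive summarization idea in isolation: leaf chunks are grouped and summarized repeatedly until a single root remains, so retrieval can later draw on several abstraction levels. The summarize() stub and fixed-size grouping stand in for the LLM calls and clustering a real RAPTOR pipeline would use.

```python
# Minimal sketch of RAPTOR-style bottom-up summarization (illustrative; a real system
# would cluster chunks and call an LLM instead of the naive summarize() below).
from typing import List

def summarize(texts: List[str], max_chars: int = 120) -> str:
    # Stand-in for an LLM summarization call: concatenate and truncate.
    return " ".join(texts)[:max_chars]

def build_tree(chunks: List[str], group_size: int = 2) -> List[List[str]]:
    """Return tree levels: leaves first, then increasingly abstract summaries."""
    levels = [chunks]
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = [summarize(current[i:i + group_size])
                   for i in range(0, len(current), group_size)]
        levels.append(parents)
    return levels

chunks = ["note on factive verbs", "note on semi-factives",
          "note on negation contexts", "note on speech-act verbs"]
for depth, level in enumerate(build_tree(chunks)):
    print(f"level {depth}: {len(level)} node(s)")
```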

CCL25-Eval 任务四系统报告:现代汉语动词的叙实性语用推理机制
pdf bib 赵培翔, 李明珠, 梅立亚, 王芳, 高年鑫, 赵浪

叙实性推理是一项与事件真实性判断密切相关的语义理解任务,主要关注语言表达中的事实性信息传递。本次测评任务基于沈家煊(2003)提出的“行、知、言”三域理论,对动词叙实性分类体系进行了进一步细分。这一改进不仅为汉语叙实性研究提供了更为精细的分析工具,还显著提升了大语言模型对“叙实性”语义的理解能力。测试结果表明,在不微调赛道上,我们团队在测试集上的最终正确率达到93.41%。

CCL25-Eval任务四系统报告:宏观模式提示与高效微调在叙实性推理中的应用
pdf bib 李泽群, 钟元浩, 柴成亮

本文研究了利用大语言模型进行谓词引导的叙实性推理任务。在不微调场景下,针对Gemini 2.5 Pro模型,我们构建了基于谓词类型的思维链(CoT)提示,并创新性地让模型学习整个带答案的样本集以归纳宏观模式和规则,最终形成高效的提示词模板。在微调场景下,我们选用Qwen3-32b模型,利用llama factory进行LoRA微调,并使用llama.cpp完成模型向gguf格式的转换、量化及Ollama部署。实验结果展示了所提方法的有效性,其中在不微调赛道上,基于宏观模式提示的方法取得了94.01%的准确率;在微调赛道上,基于微调模型的系统取得了92.61%的准确率。

System Report for CCL25-Eval Task 4: Factivity Inference Based on Dynamic Few-Shot Learning
pdf bib Gu Sunyan, Lu Taoyu, Liu Siqi, Guo Kan and Yan Shao

This paper presents the implementation approach we employed in the First Chinese Factivity Inference Evaluation 2025 (FIE2025). Factivity inference (FI) is a semantic understanding task related to judging the truth value of events, based on the use of semantic verbal elements such as “believe”, “falsely claim”, and “realize”. We approach factivity inference as a large language model (LLM)-based task. We aim to enhance the LLM’s discriminative capability by adequately integrating task-specific information via prompts, as well as by constructing dynamic few-shot datasets for fine-tuning. Additionally, we incorporate data augmentation and ensemble strategies to further boost performance. Our approach achieves a score of 93.41% in the official evaluation of the shared task, ranking second on the leaderboard.

CCL25-Eval任务四系统报告: 基于层次化思维链构造与推理模型高效微调的中文叙实性推理
pdf bib 闫强, 范意兴, 钟芸霏

本文介绍了我们在第二十四届中国计算语言学大会(CCL 2025)中文叙实性推理评测(FIE2025)中荣获双赛道第一名和第二名的系统方案。针对中文叙实性推理任务中模型需要从谓词语义正确推断事件真实性的挑战,我们提出了层次化思维链(Hierarchical Chain-of-Thought, HCoT)推理框架,通过结构化的多级推理过程引导模型逐步识别关键谓词、分析其叙实性类型及其在否定、疑问等复杂语境下的叙实性变化。在非微调赛道中,我们通过集成多种强大的推理型大模型(如Deepseek-R1-671B、Deepseek-v3-671B、GPT-4o、Gemini-2.5-pro-0506等)的预测结果,并采用自适应投票策略,取得了0.9376的分数。在微调赛道上,我们构建了高质量的思维链指令数据集,发现专注于推理能力的基础模型(如DeepSeek-R1-Distill-Qwen-32B)经微调后在叙实性推理任务上优于同等规模甚至更大参数量的通用大模型(如Qwen2.5-72B-Instruct)。通过伪标签训练进一步优化,最终在官方评测中取得0.9396的最高正确率。实验结果表明,我们提出的层次化思维链结构与推理模型的结合在中文叙实性推理任务中具有显著优势,特别是在处理复杂语境和隐含语义的情况下。

System Report for CCL25-Eval Task 4: Prompting, Scheduling, and Arbitration Strategies for Chinese Factivity Inference
pdf bib Liu Daohuan, Xia Lun, Zhang Yuxuan, Yang Xinyu and Kong Fanzhen

This report presents the methodology and findings of prompting large language models (LLMs) for Chinese Factivity Inference (FI). We evaluated five LLMs, among which DeepSeek-R1 demonstrated the best overall performance. A combination of Chain-of-Thought (CoT), few-shot, and system-level instructions was used for the final prompting. Additionally, we introduced a pairwise task scheduling strategy and a multi-agent disagreement arbitration mechanism to further enhance inference quality. Experimental results show that the integration of prompting, scheduling, and arbitration strategies significantly improves performance, with DeepSeek-R1 achieving 91.7% overall accuracy on the evaluation set. The report also highlights findings regarding LLM behavior on FI tasks and outlines potential directions for future improvement.
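The report names a disagreement arbitration mechanism without giving details; the generic sketch below shows one plausible shape of such a scheme, majority voting with an arbiter consulted only when no clear majority exists. The voter and arbiter functions are stubs, not the authors' prompts or models.

```python
# Generic sketch of voting plus disagreement arbitration (model calls are stubs).
from collections import Counter
from typing import Callable, List

def decide(item: str, voters: List[Callable[[str], str]],
           arbiter: Callable[[str, List[str]], str]) -> str:
    votes = [voter(item) for voter in voters]
    label, freq = Counter(votes).most_common(1)[0]
    # Clear majority: accept the vote; otherwise escalate to the arbiter.
    if freq > len(votes) // 2:
        return label
    return arbiter(item, votes)

# Stub "models" for demonstration only.
voters = [lambda x: "factive", lambda x: "factive", lambda x: "non-factive"]
arbiter = lambda x, votes: votes[0]
print(decide("他知道会议已经取消。", voters, arbiter))  # -> "factive"
```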

CCL25-Eval任务四系统报告:基于RAG与谓词相似性方法的叙实性检测智能体
pdf bib 王昱, 杨倩, 梁科, 杨怿恒, 翟雨, 黄居仁

本文聚焦于“叙实性推理”任务,即判断语言中事件真实性的语义理解能力。该任务不依赖外部知识,而基于语言结构本身进行推理,对当前大语言模型(LLMs)提出挑战。为解决模型在叙实性漂移、多义词处理等方面的不足,作者提出一种结合RAG(检索增强生成)与谓词相似性的方法,构建了一个融合参数化与非参数化知识的叙实性检测智能体系统。该系统通过分步提示与知识库支持,实现了更高的一致性、准确性与可解释性,在评测任务中取得了0.9240的稳健表现。

CCL25-Eval任务四系统报告:基于多策略知识融合的叙实性推理方法研究
pdf bib 李宏宇, 杨智惠, 胡韧奋

FIE2025任务旨在使用大语言模型对文本及相关假设进行叙实性推理。我们参加了微调和非微调两个赛道,分别在人工数据集和自然数据集上采用提示词优化和词表RAG策略融合语言学知识,并利用模型集成投票方法提升判断准确率。评测结果显示,我们的方法在非微调赛道取得了0.9351的成绩,在微调赛道取得了0.9261的成绩,均位列第三名。

Overview of CCL25-Eval Task 4: Factivity Inference Evaluation 2025
pdf bib Cong Guanliang, Wu Junchao, Chen Yang, Xun Tianqi, Derek F. Wong, Li Bin and Yuan Yulin

This paper presents the results of FIE2025, a shared task aimed at evaluating the ability of Large Language Models (LLMs) to perform factivity inference on Chinese texts: whether LLMs can correctly discern the veridical information of propositions encoded in complement clauses. The responses to the task mirror the extent to which LLMs can grasp the implicit truth judgments made by human speakers through texts, as well as their subjective stances. Such a capability is crucial for autonomous inference in intelligent agents and for achieving fluid human–AI interaction. The task was hosted on the Alibaba Tianchi platform and evaluated through two tracks: with and without finetuning. A mixed dataset was constructed, combining both synthetic sentences and authentic corpus instances. The dataset comprises a total of about 3,000 items labeled by expert linguists, including 845 (300+545) manually created items and 2,143 (700+1,443) items selected from existing corpora. A total of 404 results from 74 teams were successfully submitted to the Tianchi system. Overall, under current technological conditions, the key to successful factivity inference lies in whether LLMs effectively identify different types of predicates and various contextual conditions from the given texts. Models that support long-context prompt inputs tend to achieve the best inference performance when provided with numerous shots. This shared task deepened our understanding of the factivity phenomenon in Chinese, expanded the influence of factivity research within the field of natural language processing, and provided an exploratory precedent for future activities focusing on factivity inference in Chinese and potentially other languages.

System Report for CCL25-Eval Task 5: Data Augmentation and Large Language Model Fine Tuning for Chinese Ancient Poetry Comprehension and Inference
pdf bib Li Chengfei, Wang Chunyu, Liu Bin, Li Hanlin, Zhang Wenya, Gao Hui and Wu Yue

This paper introduces our approach to the CCL25-Eval evaluation task for ancient poetry comprehension and inference, which aims to enhance the capabilities of large language models (LLMs) in processing context-dependent texts with strong cultural backgrounds. Addressing the dual challenges of semantic analysis and emotional inference in ancient poetry, we propose a solution that integrates Qwen-series LLMs with systematic data augmentation and LoRA-based parameter-efficient fine-tuning. We construct a high-quality dataset and design multi-phase training and inference strategies. Particularly for emotional inference, we explore two approaches: emotion lexicon-based indirect matching and emotion appreciation-based direct judgment of emotional lexicon options. Experimental results indicate that: 1) data augmentation significantly improves the model's overall performance; 2) the emotion appreciation-based direct judgment approach achieves an accuracy of 0.865, ranking first in Task A; 3) attempts with Qwen3 and reinforcement learning approaches do not significantly improve Task B results, but demonstrate good performance in sentence semantic similarity scores and format stability.

CCL25-Eval任务5系统报告:基于风格改写与投票机制的中文古诗词赏析评测
pdf bib 周盼盼, 杨清怡

本研究聚焦于古诗文理解与情感推理任务,面向CCL-EVAL任务5评测中的关键词解释、关键句意译与情感分类三个子任务,以古典诗词为核心语料,通过高质量数据清洗、模型改写和情感推理优化等策略,提升模型对复杂语义和历史情感的建模能力,探索了语言风格适配与生成策略对模型性能的影响。实验表明,经过指令微调的Qwen2.5-14B-Instruct在多项指标上优于7B模型,尤其在情感推理任务中表现突出,准确率达0.714。此外,基于多次生成结果的加权投票机制有效提高了输出稳定性。然而,引入其他古诗文数据训练与模型风格改写未提升任务正确率,暴露出数据一致性与评测机制适配性方面的问题与挑战。本研究验证了大模型在古诗文理解中的能力及提升潜力,未来可从数据质量提升、评测优化与计算效率控制等方面进一步改进。

System Report for CCL25-Eval Task 5: Hierarchical Multi-Task Prompt Fine-Tuning and PPO Reinforcement for Classical Chinese Poetry Comprehension and Sentiment Reasoning
pdf bib Tang Jingjun and Tang Zhiwen

We present a hierarchical multi-task framework to enhance classical Chinese poetry understanding and sentiment reasoning using large language models. Centered on Qwen2.5-14B-Instruction or Xunzi-Qwen-14B, we construct a 1,225-sample corpus of Tang and Song poems with parallel translations and multi-label sentiment annotations (e.g., nostalgia, patriotism, contemplation). The task is divided into comprehension, translation, and sentiment inference, each guided by dynamic prompting and task-specific templates. We employ mixed supervised fine-tuning to better capture syntactic and metaphorical patterns. For sentiment reasoning, we apply proximal policy optimization (PPO) with a custom reward function, boosting accuracy from 0.771 to 0.807 (p < 0.01). Our model achieves a 0.714 comprehensive score, outperforming single-task baselines by 12.6%. Ablation studies further confirm the benefits of multi-task learning in promoting cross-task knowledge transfer. Keywords: Classical Chinese Poetry, Multi-Task Fine-Tuning, Data Augmentation, Proximal Policy Optimization

System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5
pdf bib Xie Haotao

Recently, large language models (LLMs) have achieved promising progress in the fields of classical Chinese translation and the generation of classical poetry. However, domain-specific research on precise translation and affective-semantic understanding of classical poetry remains limited. The main challenge is that most studies treat the poetic appreciation task as a general-domain problem, neglecting the distinctive features of poetic appreciation, while high-quality and domain-specific datasets are extremely limited. To address this limitation, we decompose the task into three subtasks: term interpretation, semantic interpretation, and emotional inference. Based on multiple open-source datasets, we perform data cleansing and alignment to construct the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which comprises 49,404 high-quality instruction–response pairs explicitly optimized for this domain. We then propose a domain-specialized LLM, called PoetryQwen, by applying Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-14B model. Experimental results on the CCL25-Eval Task 5 benchmark demonstrate that PoetryQwen achieves a score of 0.757, representing a 9.7% improvement over the Qwen2.5-14B-Instruct baseline (0.690). These findings clearly indicate that PoetryQwen significantly enhances performance in precise translation and emotional understanding of classical poetry. We present a new dataset and methodological considerations intended to support the domain-specific optimization of LLMs.

CCL25-Eval 任务5系统报告:基于千问大模型的古诗词理解与推理研究
pdf bib 王珏

中国古典诗词语言凝练、意境深远,对自然语言处理系统提出了严峻挑战。本次评测聚焦于古诗词理解与推理,包括词语释义、句子翻译和情感分析三项子任务。本文基于Qwen2.5-14B-Instruct 模型,在LLaMA Factory 框架下采用监督微调(SFT)与LoRA 参数高效微调策略,提升模型在few-shot 条件下的表现。训练数据来自官方发布的多类别JSON 格式语料,经整合与指令格式转换后用于模型训练。实验表明,LoRA 微调显著优于zero-shot 基线。本研究验证了参数高效微调方法在有限数据场景下的有效性。

Overview of CCL25-Eval Task 5: Chinese Classical Poetry Appreciation Evaluation
pdf bib Pei Zhenwu, Zhu Yingjie, Chen Rongbo, Bai Xuefeng, Chen Kehai and Zhang Min

This paper presents a review of CCL2025-Eval Task 5: Chinese Classical Poetry Appreciation Evaluation (CCPA). The primary aim of this task is to evaluate the ability of language models to perform deep semantic understanding and aesthetic appreciation of Chinese classical poetry. The evaluation comprises two tracks: (1) Poetic content understanding, which examines models’ ability to interpret both fine-grained and coarse-grained semantics; (2) Poetic emotion recognition, which evaluates models’ capacity to identify and analyze emotional expressions. A total of 55 teams registered for the task, among which 7 teams provided valid submissions. The paper provides an in-depth analysis of the submissions and results from all participating teams.

System Report for CCL25-Eval Task 6: Chinese Essay Rhetoric Recognition Using LoRA, In-context Learning and Model Ensemble
pdf bib Lai Yuxuan, Wang Xiajing and Zheng Chen

Rhetoric recognition is a critical component in automated essay scoring. By identifying rhetorical elements in student writing, AI systems can better assess linguistic and higher-order thinking skills, making it an essential task in the area of AI for education. In this paper, we leverage Large Language Models (LLMs) for the Chinese rhetoric recognition task. Specifically, we explore Low-Rank Adaptation (LoRA) based fine-tuning and in-context learning to integrate rhetoric knowledge into LLMs. We formulate the outputs as JSON to obtain structured outputs and translate keys into Chinese. To further enhance the performance, we also investigate several model ensemble methods. Our method achieves the best performance on all three tracks of the CCL 2025 Chinese essay rhetoric recognition evaluation task, winning the first prize.

CCL25-Eval 任务6系统报告:基于数据增强及大小模型协同的中小学作文修辞识别
pdf bib 宗绪泉, 安纪元, 付祥, 鲁鹿鸣, 朱浩楠, 杨麟儿, 杨尔弘

CCL25-Eval任务6提出了一个段落级、多层次、细粒度中小学修辞识别与理解任务。针对修辞分类任务的特点,本文构建了一种以数据增强为核心、结合高效监督微调的多策略融合框架,并融合语句层面修辞识别与段落句间关系建模及识别,以全面提升模型的修辞理解能力。针对修辞成分抽取任务的特点,本文采用先进行修辞类别判定,后在该基础上进行修辞相关实体识别的两阶段处理策略,有效提升了整体识别精度。结果表明,本文所提出的方法能够有效对修辞进行识别和抽取,三个赛道上的分数分别达到了43.47、51.71、38.27,总成绩位列第二。

System Report for CCL25-Eval Task 6: Enhancing Chinese Essay Rhetoric Recognition through Targeted Data Augmentation and Model Ensemble Voting
pdf bib Tang Jingjun and Tang Zhiwen

This paper presents our approach to the Second Chinese Essay Rhetoric Identification and Understanding Competition, which focuses on analyzing rhetorical features in essays written by primary and secondary school students. The competition includes three tasks: multi-label classification of rhetorical forms, divided into 9 coarse-grained and 19 fine-grained categories; multi-label classification of rhetorical content, comprising 5 coarse-grained and 11 fine-grained categories specific to certain rhetorical types; and extraction of rhetorical components, including connectives, descriptive objects, and specific rhetorical content. To address the challenge of limited training data, we applied targeted data augmentation and manual corrections to build a high-quality dataset. We then fine-tuned large language models using one-shot and in-context learning. Finally, we employed an ensemble strategy that integrates model predictions through a voting mechanism. Our system achieved a score of 52.78 and ranked third in the competition.

Overview of CCL25-Eval Task6: Chinese Essay Rhetoric Recognition Evaluation
pdf bib Lu Yujiang, Liu Nuowei, Ren Yupei, Zhu Yicheng, Lan Man, Bai Xiaopeng, Xu Mofan and Liao Qingyu

Literary grace in Chinese composition writing is a hallmark of linguistic sophistication, often realized through various rhetorical devices. The automatic identification and analysis of rhetorical devices in essays play a crucial role in educational NLP applications, particularly for assessing writing proficiency and facilitating pedagogical interventions. Although prior research has predominantly focused on coarse-grained recognition of limited rhetorical devices at the sentence level, these approaches prove inadequate for handling complex rhetorical structures and emerging educational demands. In this paper, we present CCL25-Eval Task 6: Chinese Essay Rhetoric Recognition Evaluation (CERRE), a novel framework comprising three distinct evaluation tracks at the document level: (1) Fine-grained Form-level Categories Recognition, (2) Fine-grained Content-level Categories Recognition, and (3) Rhetorical Component Extraction. The evaluation attracted 29 registered participating teams, with 8 teams submitting valid system outputs. In particular, two participating systems demonstrated superior performance by exceeding the baseline metrics in the complete evaluation criteria.

CCL25-Eval任务7系统报告:基于古典汉语理解的双阶段多域微调解析框架
pdf bib 魏祺哲

古典汉语作为中华传统文化的重要载体,其语言表达高度凝练且语义复杂,给现代大语言模型带来挑战。为提升中文文学语言理解能力,本文提出一种新的解析框架,采用双阶段多域微调训练策略:第一阶段利用指令生成技术获取大量数据集,随后在此数据集上进行稀疏微调,实现基础适应;第二阶段则在高质量标注数据上通过冻结参数在不同域精调,提升具体任务表现。实验基于第一届中国文学语言理解评测(争鸣)七项任务,此微调框架得到的结果显著优于基线,验证了双阶段多域微调方法的有效性,相关模型已开源于https://huggingface.co/wqz123/D2Dtest。

CCL25-Eval任务7系统报告:学而不思则罔?
pdf bib 郑陈锐, 朱奕澄, 王欣雨, 姜伟麟, 吴会腾

以DeepSeek-R1为代表,思考普遍被认为是一种提高大语言模型性能的方法。在CCL25-Eval争鸣中文阅读理解任务下,本文分别探索了思考和非思考两种模型在这项任务下的潜力。具体来说,在古代文学知识理解任务中,本文构建了古汉语特定领域的知识数据集,用大模型蒸馏并整理了高质量思考数据集,在这些数据基础之上同样进行LoRA微调,发现思考模型虽然性能有巨大提升,但依旧比不上原本的非思考模型。最后,开源并提交了基于Qwen2.5的SongPanda模型。

System Report for CCL25-Eval Task 7: A Two-stage Framework for Aligning LLM to Chinese Literature via Fine-Tuning and Prompting
pdf bib Su Fan, Qin Yiming, Zhao Aijia, Wang Zhenxu and Huang Zekang

This system report presents our approach and results for the First Chinese Literary Language Understanding Evaluation (ZhengMing) task at CCL25-Eval. The ZhengMing evaluation benchmark consists of seven subtasks: Biases in Modern Literary Criticism, Modern Literary Criticism Mining, Classical Chinese Literature Comprehension, Literary Reading Comprehension, Literary Named Entity Recognition, Literary Language Style Transfer, and Literary Work Style Prediction. To address these tasks, we propose a two-stage framework named StageAli to align large language models (LLMs) to the Chinese literature domain. In the first stage, we employ Low-Rank Adaptation (LoRA) to fine-tune an LLM on Chinese literary datasets, aiming to adapt the model to the Chinese literature domain. In the second stage, we utilize a combination of prompting strategies to further unleash the potential of the fine-tuned model in addressing the Chinese Literary Language Understanding task. Our proposed StageAli framework achieves second place in the overall evaluation, demonstrating the effectiveness of our method.

Overview of CCL25-Eval Task 7: Chinese Literary Language Understanding Evaluation
pdf bib Wang Kang, Wang Qing, Peng Min, Yue Kun and Hu Gang

The 24th Chinese Computational Linguistics Conference (CCL25-Eval) features 12 technical evaluation tasks. Among them, Task 7 is the Chinese Literary Language Understanding Evaluation (ZhengMing). ZhengMing is a universal and scalable evaluation framework designed to assess natural language processing (NLP) tasks in the literary domain, such as text classification, text generation, automated question answering, relation extraction, and machine translation. The ZhengMing framework aims to evaluate the performance of large language models (LLMs) in the literary field at a fine-grained level. For this task, 89 teams signed up for the competition, with 5 teams ultimately submitting results. The highest score achieved is 0.65. This paper presents and discusses the dataset, task descriptions, competition results, and other relevant information for this evaluation task. More details are available at https://github.com/isShayulajiao/CCL25-Eval-ZhengMing.

CCL25-Eval任务7系统报告:微调与提示协同增强大语言模型的文学语义理解
pdf bib 杨清怡, 周盼盼

本报告基于第一届中国文学语言理解评测(争鸣)任务,对Qwen2.5-7B-Instruct模型进行了低秩适配(Low-Rank Adaptation, LoRA)微调实验。任务包括五项主任务:古代文学知识理解、文学阅读完形填空、文学命名实体识别、文学作品风格预测和文学风格转换;另有两项域外任务,涉及现代文学批评倾向与批评挖掘。在有限计算资源条件下,采用LoRA技术实现了高效参数更新,并结合少量样本提示和高质量指令设计,提升了模型在少样本条件下的鲁棒性与泛化能力。实验结果显示,该方法在五项主任务上取得了良好表现,并在域外任务中展现出显著的跨领域能力。其中,在批评挖掘任务中取得了0.847的准确率,体现了较强的抽象推理与知识迁移能力。基于本报告方法训练的模型在所有任务的平均指标为0.540,在参赛队伍中排名第三。

System Report for CCL25-Eval Task 8: Improving ICD Coding with Large Language Models via Disease Entity Recognition
pdf bib Lv Tengxiao, Li Juntao, Liu Chao, Yuan Haobin, Luo Ling, Wang Jian and Lin Hongfei

With the widespread adoption of Electronic Medical Records (EMRs), automated coding of the International Classification of Diseases (ICD) has become increasingly essential. However, the complexity of Chinese clinical texts presents significant challenges to traditional methods. To address these issues, CCL25-Eval Task 8 organized the Chinese EMRs ICD Diagnosis Coding Evaluation. This paper presents a method based on Large Language Models (LLMs), which divides the task into primary and other diagnosis coding. For the primary diagnosis, a confidence-guided semantic retrieval strategy is applied, while ensemble learning enhanced with Named Entity Recognition (NER) is used for other diagnoses. The proposed approach achieved 83.42% accuracy on the official test set, ranking second in the evaluation.

System Report for CCL25-Eval Task 8: Structured ICD Coding with LLM-Augmented Learning and Group-specific Classifiers
pdf bib Wang Bo, Zhang Kaiyuan, Feng Chong, Shi Ge, Ye Jinhua, Teng Jiahao, Wang Shouzhen, Meng Fanqing, Yuan Changsen and Zhuang Yan

The International Classification of Diseases (ICD) provides a standardized framework for encoding diagnoses, serving critical roles in clinical scenarios. Automatic ICD coding aims to assign formalized diagnostic codes to medical records for documentation and analysis, which is challenged by an extremely large and imbalanced label space, noisy and heterogeneous clinical text, and the need for interpretability. In this paper, we propose a structured multi-class classification framework that partitions diseases into clinically coherent groups, enabling group-specific data augmentation and supervision. Our method combines input compression with generative and discriminative fine-tuning strategies tailored to primary and secondary diagnoses, respectively. On the CCL2025-Eval Task 8 benchmark for Chinese electronic medical records, our approach ranked first in the final evaluation.

CCL25-Eval任务8系统报告:基于规则奖励与自主思考强化学习的中文电子病历ICD诊断编码探索
pdf bib 邹游, 张蕾, 梁晓东, 莫坤东, 郭子滔, 危枫, 王晨子

世界卫生组织国际疾病分类ICD诊断编码的自动生成是医疗信息化的核心挑战,面临主诊断单标签分类准确性不足、其他诊断多标签预测不完整以及长尾分布等技术瓶颈。本文系统研究探索了大语言模型在中文电子病历ICD诊断编码任务中的微调范式创新,针对生成式微调、判别式微调,以及强化学习分别提出了不同的微调训练策略。其中,创新性地设计针对医疗特性的基于规则奖励的强化学习框架(RBRs-RL),通过动态难度校准、Token级梯度优化和超长奖励塑造策略改进了GRPO算法的效率和性能,同时结合提出的策略轮动数据增强迭代训练(SRADIT)策略,实现了强化微调性能上限的提升。此外,本文还系统比较了生成式与判别式微调在中文诊断ICD编码任务中的性能边界,同时构建了端到端的临床决策优化框架,为奖励微调提供有效路径。并且针对推理阶段,本文设计了一种温度调节集成共识预测方法(TCECP),提升了推理的稳定性和可靠性。最后基于Qwen2.5-7B模型的微调实验结果表明,通过本文提出的优化后的RBR-R1式强化微调方法,在CCL25-Eval任务8的A榜和B榜分别取得80.98和82.33的优异成绩,其效果显著超越传统SFT的性能上限。综上所述,本文的探索与发现为医疗诊断编码系统的实际应用提供了重要的技术参考。

System Report for CCL25-Eval Task 8: ClinSplitFT: Enhancing ICD Coding in Chinese EMRs with Prompt Engineering and Candidate Set Splitting
pdf bib Chen Pusheng, Tan Qiangyu and Tang Zhiwen

CCL25-Eval Task 8 focuses on ICD coding from clinical narratives. The challenge of this task lies in the imbalanced and complex label space, with primary diagnoses having a small, focused set of labels and secondary diagnoses involving a much larger, intricate set. To address these challenges, we propose ClinSplitFT (Clinical Code Split Fine-Tuning), a novel framework that enhances ICD coding accuracy using large language models (LLMs). The key innovation of ClinSplitFT is its candidate set split strategy, which splits the full candidate set into several manageable subsets and fine-tunes the model separately on each. During inference, predictions from all subsets are aggregated to produce the final output. This split-based fine-tuning approach enables more focused learning and better generalization in multi-label settings, making it an effective solution for clinical code prediction at scale. Experimental results show significant improvements in ICD coding performance. The code for our system is publicly available at https://github.com/277CPS/ICD-Code-prediction.
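A minimal sketch of the candidate-set splitting idea described above: the full ICD label set is partitioned into subsets, a per-subset predictor is queried, and the predictions are unioned. The subset size and the keyword-matching stub predictor are illustrative assumptions; in ClinSplitFT each subset would have its own fine-tuned LLM.

```python
# Sketch of candidate-set splitting for multi-label prediction (predictor is a stub).
from typing import Callable, List, Set

def split_candidates(candidates: List[str], subset_size: int) -> List[List[str]]:
    return [candidates[i:i + subset_size] for i in range(0, len(candidates), subset_size)]

def predict_codes(record: str, candidates: List[str],
                  subset_predictor: Callable[[str, List[str]], Set[str]],
                  subset_size: int = 50) -> Set[str]:
    predictions: Set[str] = set()
    for subset in split_candidates(candidates, subset_size):
        # Each call only has to choose among a small, focused label subset.
        predictions |= subset_predictor(record, subset)
    return predictions

# Toy demonstration with a keyword-matching stub predictor.
codes = [f"I{i:02d}" for i in range(120)]
stub = lambda record, subset: {c for c in subset if c in record}
print(predict_codes("诊断: I10, I25", codes, stub))  # -> {'I10', 'I25'}
```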

CCL25-Eval任务8总结报告:中文电子病历ICD诊断编码评测
pdf bib 梁镇鹏, 李传龙, 廉颖, 陈国强, 管红娇, 鹿文鹏

中文电子病历国际疾病分类(ICD)诊断编码评测依托第二十四届中国计算语言学大会(CCL)举办。该评测聚焦于自然语言处理技术在智能医疗领域的应用,旨在从真实脱敏的电子病历文本中自动分析关键临床表征,实现主诊断及其他诊断ICD编码的精准预测与分配,从而辅助临床医生与专业编码员提升编码工作的准确性和效率。本次评测在阿里云天池平台进行,获得了学术界与工业界的广泛关注和积极参与。数据显示,共有445支队伍报名参赛,其中A榜和B榜分别有85支和36支队伍成功提交了有效结果。最终,8支表现优异的队伍受邀撰写并分享了其技术报告,为推动该领域的技术进步与方法创新贡献了宝贵经验。本次评测的详细信息可参见相关发布页面。

CCL25-Eval任务9系统报告:中医辨证辨病及处方生成中的少样本数据增强方法
pdf bib 左梓呈, 任佳敏, 吐尔地.托合提

中医药在临床诊断和治疗中发挥了不可或缺的作用。中医辨证辨病及中药处方生成任务包含两个富有挑战性的子任务:中医多标签辨证辨病和中药处方推荐。由于缺乏高质量的标注数据,之前的方法大多需要引入外部数据,容易出现知识滞后的问题。因此,我们提出了一种融合大模型与可控文本生成的混合增强策略。具体而言,通过设计基于词汇独立性的数据增强,并结合微调大模型进行可控文本生成,在少量标注样本的基础上构建高质量扩展数据集。然后采用LoRA微调技术适配此任务。实验结果表明,该方案分别获得了0.553和0.4515的得分。在不需要引入额外数据的情况下,也能获得较好的效果。

CCL25-Eval任务9系统报告:一种面向中医辨证与处方生成任务的检索增强大模型方法
pdf bib 康益扬, 姚佳琪, 吕腾啸, 徐博, 罗凌, 孙媛媛, 林鸿飞

本文面向CCL2025-Eval任务9中的中医辨证辨病与中药处方推荐两个子任务,提出了一套基于大语言模型的系统性方法。在子任务1中,本文基于QLoRA方法对Qwen2.5-7B、Mistral-7B和Baichuan-7B三种预训练模型进行高效微调,并引入多模型集成投票策略。在子任务2中,本文设计了融合向量检索、监督微调与强化学习的中药推荐框架,通过相似度检索构建候选处方集合,并利用强化学习优化模型的生成能力。最终在评测中获得总分0.5171(Task1得分0.5710,Task2得分0.4632),排名第四,验证了所提方法的有效性与实用性。

CCL25-Eval 任务9系统报告:基于大模型及指令微调方法的中医辨证辨病及中药处方生成研究
pdf bib 李南书

辨证论治是中医认识疾病和治疗疾病的核心原则和方法,其基本思想是通过望、闻、问、切的方法,收集患者症状、舌苔、脉象等临床信息,通过分析、综合,辨清疾病的病因、病机,概括、判断为某种性质的证,进而制定个性化的治疗方案,开具合适的中药处方予以治疗。本研究探究如何增强大模型根据格式化、标准化的中医病例自动生成相对应的辨证辨病及中药处方的能力。本研究将任务拆分为辨证辨病与中药处方生成两个任务,使用的训练框架是LLamafactory,使用的大模型是开源模型(qwen2.5-7B-Instruct(Qwen Team, 2024),qwen3-4B)。首先设置lora参数为LLamafactory默认参数,修改参数中验证集比例为0.2,epoch为5,进行lora监督微调,获得验证集相对最佳的epoch。然后,设置lora参数为默认,修改其中的epoch参数为验证集最佳epoch+1,同时对模型进行全数据lora调参优化,择其中相对最优者。最后对全数据进行full微调,与lora调参最优模型比较,择其更优者。最终在B榜中获得score1:0.648,score2:0.4259,总score:0.5369,综合排名第一的成绩。

System Report for CCL25-Eval Task 9: Leveraging Chain-of-Thought and Multi-task Learning for Optimized Traditional Chinese Medicine Diagnosis and Treatment
pdf bib Zhang Jian, Zhu Wei and Tang Zhiwen

This paper introduces an intelligent diagnostic system for Traditional Chinese Medicine (TCM) that emulates clinical reasoning through a phased multi-turn dialogue process. The system architecture is divided into three sequential stages: syndrome differentiation, disease diagnosis, and prescription generation. Each stage leverages Chain-of-Thought (CoT) techniques to ensure coherent reasoning, maintaining contextual continuity and consistency throughout the diagnostic process. To optimize model performance, we employ a multi-task fine-tuning approach, combining data from all three stages for training the Qwen2.5-7B-Instruct model. Experimental results show that the system achieves strong performance across all diagnostic tasks. Error analysis reveals that the accuracy of the first two stages, syndrome differentiation and disease diagnosis, has a significant impact on the quality of the generated prescriptions. This work provides a scalable framework for intelligent TCM diagnosis, advancing both medical knowledge reasoning and the application of domain-specific large language models.
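As an illustration of the phased multi-turn design described above (not the authors' prompts), the sketch below chains three stub LLM calls so that each later stage is conditioned on the outputs of the earlier ones.

```python
# Illustrative sketch of a phased TCM diagnostic pipeline (the llm() call is a stub;
# prompts are simplified and not the authors' actual templates).
def llm(prompt: str) -> str:
    return "<model output for: " + prompt[:40] + "...>"  # placeholder for a real LLM call

def diagnose(case_record: str) -> dict:
    # Stage 1: syndrome differentiation, conditioned only on the case record.
    syndrome = llm(f"病历:{case_record}\n请先逐步推理,再给出辨证结果。")
    # Stage 2: disease diagnosis, conditioned on the record and the stage-1 output.
    disease = llm(f"病历:{case_record}\n辨证:{syndrome}\n请逐步推理,给出辨病结果。")
    # Stage 3: prescription generation, conditioned on all previous stages.
    prescription = llm(f"病历:{case_record}\n辨证:{syndrome}\n辨病:{disease}\n请给出中药处方。")
    return {"syndrome": syndrome, "disease": disease, "prescription": prescription}

print(diagnose("患者恶寒发热,舌苔薄白,脉浮紧。"))
```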

CCL25-Eval任务9总结报告:中医辨证辨病及中药处方生成评测
pdf bib 王聪, 赵直倬, 李一硕, 管红娇, 王怡斐, 李振宇, 鹿文鹏

中医辨证辨病及中药处方生成评测任务专注于中医“辨证论治”。该任务由齐鲁工业大学(山东省科学院)与山东中医药大学附属医院联合发起,基于真实病历构建了中医“辨证论治”全流程公开数据集TCM-TBOSD,覆盖10类中医证型、4类中医疾病及381种常见中药。评测任务设立两个子任务:中医多标签辨证辨病与中药处方推荐,旨在系统评估大模型在中医诊疗全过程中的建模与推理能力。本次评测受到了学术界与产业界的广泛关注,评测共吸引123支队伍参与,35支队伍晋级复赛,最终提交了8份高质量技术报告。评测结果表明,大语言模型在中医任务中展现出良好的适应性与发展潜力,为中医智能化提供了可行路径与技术参考。本次评测任务的详细信息可在评测网址查看。

Overview of CCL25-Eval Task 10: Fine-grained Chinese Hate Speech Identification Evaluation Task
pdf bib Lu Junyu, Bai Zewen, Yin Shengdi, Yang Liang and Lin Hongfei

This paper provides an overview of the CCL25-Eval Task 10, i.e., Fine-grained Chinese Hate Speech Identification Evaluation. The primary objective of this task is to perform a fine-grained analysis of hateful samples. In addition to binary classification, systems are required to identify and extract the comment target, argument span, and the associated targeted group within each sample, thereby enhancing the model’s capability in fine-grained detection and improving the interpretability of its decisions. In total, more than 300 teams registered for the task, with 100 teams submitting valid results. We present the submitted results and provide a comprehensive analysis of the technical approaches adopted by the top-performing teams. The dataset used in this task has been made publicly available.

System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition
pdf bib Wang Jiahao, Liu Ramen, Zhang Longhui and Li Jing

This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates task reformulation (TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at https://github.com/king-wang123/CCL25-SRAG-MAV.

System Report for CCL25-Eval Task 10: Prompt-Driven Large Language Model Merge for Fine-Grained Chinese Hate Speech Detection
pdf bib Wu Binglin, Zou Jiaxiu and Li Xianneng

The proliferation of hate speech on Chinese social media poses urgent societal risks, yet traditional systems struggle to decode context-dependent rhetorical strategies and evolving slang. To bridge this gap, we propose a novel three-stage LLM-based framework: Prompt Engineering, Supervised Fine-tuning, and LLM Merging. First, context-aware prompts are designed to guide LLMs in extracting implicit hate patterns. Next, task-specific features are integrated during supervised fine-tuning to enhance domain adaptation. Finally, merging fine-tuned LLMs improves robustness against out-of-distribution cases. Evaluations on the STATE-ToxiCN benchmark validate the framework’s effectiveness, demonstrating superior performance over baseline methods in detecting fine-grained hate speech.

CCL25-Eval任务10系统报告:面向细粒度中文仇恨言论识别的大语言模型增强
pdf bib 林凡钧, 张晏玮, 黄杨, 姚之远

本文介绍了我们在第二十四届中文计算语言学大会细粒度中文仇恨言论识别任务中的参赛系统。该任务要求构建结构化仇恨四元组(评论对象、论点、目标群体、是否仇恨),提升模型的细粒度检测与可解释性。我们基于大语言模型,首先评估了LoRA参数高效微调效果,优化了超参数配置;其次对标注数据进行结构化处理,增强数据规范性;最后优化提示词设计,引导模型生成准确的结构化输出。实验表明,三阶段优化提升了模型性能。

CCL25-Eval任务10系统报告:基于动态线索增强提示与多阶段渐进优化的中文仇恨言论检测方法
pdf bib 阮禄, 翟波, 张蕾, 鲍烈, 王泽宇, 危枫, 王晨子

随着社交媒体的迅速普及,用户生成内容呈指数级增长,同时也助长了仇恨言论的扩散。因此,有效检测仇恨言论已成为自然语言处理研究领域的一项关键挑战。为推动中文仇恨言论检测技术的发展,本文提出了一种新颖的大语言模型微调框架,该框架融合了动态线索增强提示和多阶段渐进优化方法。所提出的方法将复杂的细粒度仇恨言论识别任务分解为两个相辅相成的子任务:仇恨倾向分类和仇恨信息提取。为此采用了两种专门的训练策略:动态线索增强提示微调(DCA-SFT)用于优化模型的分类性能,而动态线索增强强化学习(DCA-RL)则用于提升模型的信息提取能力。具体而言,在DCA-SFT阶段,引入判别式分类并采用多标签独热(Multi-Hot)编码作为输出表示形式,以提高模型的多类别分类准确率。在DCA-RL阶段,通过知识蒸馏的方式,将闭源大语言模型在执行仇恨信息提取任务时的思维链(CoT)知识迁移至小参数模型,同时引入基于规则奖励的强化微调策略来增强小参数模型在信息提取任务中的逻辑推理能力。实验结果证明了该方法的有效性,在CCL25-Eval任务10的初赛排行榜上以0.3864的F1值,排名第二;在决赛排行榜上以0.3591的F1值,位列第三。

CCL25-Eval任务11系统报告:基于大模型微调的汉字硬笔书写质量自动评价
pdf bib 孔露露, 昝红英, 宋金旺, 刘海芯, 李一帆, 罗哲伟

本技术报告探讨了通过微调本地视觉语言模型,实现汉字硬笔书写质量自动评价的技术方案。针对传统评价方法难以提供准确性反馈的问题,我们团队采用精心设计的prompt并结合微调的方式构建了一个高效的汉字硬笔书写质量自动评价系统。我们采用Qwen2.5-VL-7B-Instruct模型作为基础,通过LoRA微调技术实现了汉字书写质量等级分类(子任务一)和个性化评语生成(子任务二)的功能。系统地融合了视觉特征分析与语言生成能力,在训练过程中采用了梯度检查点、BF16混合精度训练等技术优化显存使用,并设计了针对性的损失函数和评估指标。实验结果表明,我们的方法能够有效实现汉字书写质量的细粒度评价。

System Report for CCL25-Eval Task 11: Enhancing Chinese Character Handwriting Evaluation with Multimodal Large Language Models
pdf bib Hong Xiaoqing, Li Yunhan and Ni Lyu

With the development of smart devices, students’ ability to handwrite Chinese characters has generally been decreasing. Chinese character handwriting receives increasing attention because the standardization of Chinese character handwriting is one of the most important components of national education in China. Due to inadequate professional teachers and labor-intensive evaluation means, it is difficult to provide large-scale, personalized, and low-latency evaluation feedback in Chinese character handwriting education. Recently, large language models (LLMs) have made outstanding achievements in natural language understanding and generation. Thus, the multimodal large language model (MLLM) is an efficient means to resolve these difficulties. We introduce an enhanced neural network architecture, referred to as ACBAM-VGG16, which is developed by augmenting the CBAM-VGG16 framework with adversarially generated examples. Leveraging this model, we propose customized training and inference mechanisms for MLLMs, specifically targeting two downstream tasks: quality assessment of handwritten Chinese character images and generation of descriptive textual comments. We introduce an effective inference strategy that allows an MLLM to maintain high performance in scenarios where limited training data are available for model fine-tuning, improving the average F1 score by 6.74%. Moreover, we design a hierarchical MLLM fine-tuning framework to ensure the precision and diversity of generated comments. In the comparison of various MLLMs, the proposed framework increases the weighted average of ROUGE-1, ROUGE-2, and ROUGE-L by 2.33%-9.94%.

System Report for CCL25-Eval Task 11: Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models
pdf bib Zheng Chen, Lai Yuxuan, Lu Haoyang, Ma Wentao, Yang Jitao and Wang Jian

The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performances across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.

Overview of CCL25-Eval Task 11: Evaluation of the Quality of Handwritten Chinese Characters
pdf bib Wang Meng, Lu Shicong, Hu Zhidan, Su Chen and Cao Yujie

As an important means of disseminating Chinese cultural heritage, the development of Chinese handwriting skills faces dual challenges in the digital era: insufficient pedagogical resources and a lack of personalized feedback. At the 24th China National Conference on Computational Linguistics (CCL 2025), we organized a handwritten Chinese character evaluation task focusing on writing quality grading and comments generation. This benchmark utilized an expert-annotated calligraphic dataset to enhance task efficacy. Eight teams participated in the evaluation, three of which submitted valid entries. In the character grading subtask, the top-performing team achieved an F1-score of 90.5%, whereas the optimal system in the comments generation subtask attained a score of 52.8%.

CCL25-Eval任务12系统报告:基于语音识别与大语言模型的中文语音实体关系三元组抽取方法
pdf bib 乔志善

本文针对中文语音实体关系三元组抽取任务,提出了一种基于语音识别模型与大语言模型相结合的Pipeline解决方案。该方法首先利用SenseVoice语音识别模型将语音转换为文本,通过热词检测与拼音相似度匹配技术对转录文本进行纠错优化,然后采用微调后的Qwen2.5-7B-Instruct进行实体关系三元组抽取。在数据预处理阶段,我们设计了一套完整的流水线,包括:(1)基于HanLP的命名实体识别构建热词库;(2)拼音相似度匹配算法进行音近字纠错;(3)阿拉伯数字到中文数字的转换;(4)热词引导的语音识别优化。在模型训练方面,我们构建了高质量的指令微调数据集,采用统一的prompt模板对大语言模型进行监督微调,使其能够从语音转录文本中准确提取结构化的三元组信息。实验结果表明,我们的方法在中文语音实体关系三元组抽取任务上取得了良好的性能。热词引导机制显著提升了语音识别在专有名词上的准确率,拼音相似度匹配有效解决了语音识别中的同音字错误问题,基于大语言模型的三元组抽取模块则展现出优秀的泛化能力和推理性能。
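下面给出一个仅作示意的拼音相似度纠错片段(并非原系统实现):用 pypinyin 的 lazy_pinyin 取拼音序列,用 difflib 计算相似度,并在相似度超过阈值时用热词替换转写片段;其中阈值与热词表均为示例假设。

```python
# 拼音相似度热词纠错的最小示意(阈值与热词表为示例,实际系统的匹配规则可能不同)。
from difflib import SequenceMatcher
from pypinyin import lazy_pinyin

def pinyin_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, " ".join(lazy_pinyin(a)), " ".join(lazy_pinyin(b))).ratio()

def correct_with_hotwords(span: str, hotwords: list, threshold: float = 0.8) -> str:
    """若存在拼音最相近且超过阈值的热词,则用其替换转写片段。"""
    best, best_score = span, 0.0
    for hw in hotwords:
        score = pinyin_similarity(span, hw)
        if score > best_score:
            best, best_score = hw, score
    return best if best_score >= threshold else span

hotwords = ["张伟明", "李晓红"]
print(correct_with_hotwords("章维明", hotwords))  # 同音错误 -> "张伟明"
```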

System Report for CCL25-Eval Task 12: Surpassing LLMs with a Simple Pipeline for Mandarin Spoken Entity-Relation Extraction
pdf bib Song Wuganjing

We present a strong and practical pipeline system for Mandarin spoken entity and relation extraction (Spoken-ERE), which integrates an industrial-grade ASR module (FireRedASR) with a span-based joint entity-relation extraction model. Unlike recent approaches that rely on large language models (LLMs) for end-to-end spoken information extraction, our method uses a modular pipeline design that is lightweight, interpretable, and easy to deploy. Despite its simplicity, our system achieves top-tier performance in a recent shared task workshop, outperforming several 5× larger LLM-based systems by 20% in F1-score. We demonstrate through experiments that with robust ASR and a well-designed span-based model, classical pipelines remain competitive and, in some scenarios, even preferable to LLM-based solutions for spoken information extraction in Mandarin.

CCL25-Eval任务12系统报告:基于端到端模型以及指令微调方法的面向中文语音的实体关系三元组抽取研究
pdf bib 李南书

传统的关系三元组抽取任务主要集中于书面文本,通过识别实体及其相互关系来构建结构化的知识图谱。然而,语音作为人机交互的主要形式之一,在智能助手、智能客服、语音搜索等诸多应用中发挥着日益重要的作用。因此,如何高效、准确地从语音数据中提取有价值的结构化信息成为研究的热点之一。本研究通过测试模型在数据集上的性能,探究如何增强模型在三元组抽取任务中的能力。本研究使用的训练框架是LLamafactory,使用的大模型是两个7B量级的开源模型(qwen2-audio,qwen2.5-omni(Qwen Team, 2025)),首先任取其中的一个模型(本研究选取的为qwen2-audio),设置lora参数为LLamafactory默认参数,修改参数中验证集比例为0.2,epoch为5,进行lora监督微调,获得验证集最佳的epoch。然后,设置lora参数为默认,修改其中的epoch参数为验证集最佳epoch+1,同时对两个模型进行全数据lora监督微调,择其更优者,最后进行进一步的lora调参,以期模型在该任务上达到相对最优性能。最终在B榜获得了end-to-end赛道的第二名,分数为0.5292。

CCL25-Eval任务12总结报告:面向中文语音的实体关系三元组抽取
pdf bib 穆文轩, 宁金忠, 潘怡霖, 帕尔哈提.吐拉江, 孙媛媛, 李松涛, 尹伟鸣, 季延旭, 张益嘉, 林鸿飞

中文语音实体关系三元组抽取任务(Chinese Speech Entity-Relation Triple Extraction Task, CSRTE)是第二十四届中国计算语言学大会中的一项技术评测,旨在从中文语音数据中自动识别并提取实体及其相互关系,构建结构化的语音关系三元组(头实体、关系、尾实体)。本任务的目标是提升中文语音关系三元组抽取的准确性与效率,增强模型在不同语境和复杂语音场景下的鲁棒性,实现从语音输入到文本三元组输出的全流程自动化处理。本次评测有助于推动中文语音信息抽取技术的发展,促进语音与自然语言处理技术的深度融合,为智能应用提供更加丰富且精准的基础数据支持。此次评测共有257支队伍报名参赛,其中59支队伍提交了A榜成绩。成绩排名前15的队伍晋级B榜,并且表现突出的前几支队伍提交了技术报告。