Evaluation Tasks


CCL25-Eval Evaluation Results Announced

The 24th China National Conference on Computational Linguistics (CCL 2025) will be held from August 11 to 14, 2025, in Jinan, Shandong Province. The conference is organized by the Chinese Information Processing Society of China and hosted by Qilu University of Technology (Shandong Academy of Sciences).

As part of the conference, the CCL25-Eval shared tasks on Chinese language processing were once again organized. After a call for tasks, the CCL25-Eval Organizing Committee selected 12 evaluation tasks covering areas such as semantic analysis, medical NLP, classical Chinese poetry understanding, essay rhetoric recognition, domain-specific applications, multimodal learning, and Chinese speech processing.

Following nearly six months of evaluation, each task awarded first, second, and third prizes based on the final results. Official certificates of honor will be issued by the Chinese Information Processing Society of China. Summary papers and outstanding technical reports from the evaluation will be included in both the CCL Anthology and the ACL Anthology.

Evaluation Chairs:

  • Hongfei Lin (Dalian University of Technology)
  • Bin Li (Nanjing Normal University, libin.njnu@gmail.com)
  • Hongye Tan (Shanxi University, tanhongye@sxu.edu.cn)

Evaluation Tasks

Task 1: The Fifth Evaluation on Spatial Semantic Understanding (SpaCE 2025)

Task Description

Spatial expressions describe the positional relationships between objects and are a frequent phenomenon in natural language. Accurately interpreting the semantics of spatial expressions in text requires not only linguistic knowledge but also spatial cognition, the ability to mentally construct spatial scenes, and the capacity to perform spatial reasoning grounded in world knowledge.

The Spatial Cognition Evaluation (SpaCE) series, launched in 2021, aims to assess how well machines understand spatial semantics and has held four previous editions. Results from those editions show that large language models (LLMs) have reached human-level performance on tasks with clear formal features and straightforward form-meaning mappings, such as semantic role labeling, but considerable room for improvement remains on tasks requiring deeper cognitive and semantic understanding.

To further assess the spatial cognition abilities of LLMs, SpaCE 2025 presents the fifth edition of the evaluation. Compared to previous editions, this year’s evaluation features an expanded dataset, a more balanced data distribution, and a greater focus on cognitively demanding tasks. In addition, bilingual (Chinese-English) datasets are provided for the reasoning subtask, enabling cross-lingual evaluation of spatial semantic understanding.

SpaCE 2025 includes four subtasks (a toy scoring sketch follows the list):

  • Spatial Consistency Judgment: Determine whether a given text describes a spatially coherent and plausible scene.
  • Spatial Referent Identification: Identify the omitted reference object for directional terms in the text.
  • Paraphrase Judgment of Spatial Expressions: Distinguish whether two spatial expressions describe the same or different scenes.
  • Spatial Relation Inference: Infer the position of an entity or unknown spatial relations based on given spatial scenes and known relations among entities.
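
As a concrete illustration of how such subtasks might be scored, here is a minimal sketch that micro-averages accuracy over items pooled from all four subtasks. The item format and field names are assumptions for illustration, not the official submission schema.

```python
# Minimal sketch: micro-averaged accuracy over SpaCE-style subtask items.
# The "subtask"/"gold"/"pred" fields are illustrative, not the official format.

def overall_accuracy(items):
    """items: list of dicts like {"subtask": ..., "gold": ..., "pred": ...}"""
    if not items:
        return 0.0
    correct = sum(1 for it in items if it["pred"] == it["gold"])
    return correct / len(items)

examples = [
    {"subtask": "consistency", "gold": True, "pred": True},
    {"subtask": "referent", "gold": "门", "pred": "窗"},       # wrong referent
    {"subtask": "paraphrase", "gold": "same", "pred": "same"},
    {"subtask": "inference", "gold": "north", "pred": "north"},
]
print(f"overall accuracy: {overall_accuracy(examples):.2%}")  # 75.00%
```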

Organizers and Contacts

  • Evaluation Organizers: Weidong Zhan, Zhifang Sui (Peking University)
  • Task Contact: Liming Xiao (Ph.D. student, Peking University), lmxiao@stu.pku.edu.cn

Evaluation Overview

  • Registered Teams: 38
  • Result Submissions: 12
  • Awarded Teams: 6
  • Accepted Papers: 4
  • Oral Presentations: 4

Award | Rank | Team Member(s) & Affiliation(s) | Overall Accuracy (%)
First Prize | 1 | Zhishan Qiao (China University of Petroleum, East China); Weihong Liu, Chao Zhang, Zhanyang Liu (Jiangnan University) | 79.31
Second Prize | 2 | Yiyang Zheng (Shanghai University) | 66.06
Second Prize | 3 | Yongqing Huang (Individual); Yulin Liu (Chongqing College of Foreign Studies) | 63.36
Third Prize | 4 | Yongquan Lai, Tilei Peng (Ping An Property & Casualty Insurance Company); Ruifeng Xu (Advisor, Harbin Institute of Technology, Shenzhen Graduate School) | 61.40
Third Prize | 5 | Haixin Liu, Jinwang Song, Yifan Li, Lulu Kong (Zhengzhou University); Hongying Zan (Advisor, Zhengzhou University) | 59.83
Third Prize | 6 | Zhongtian Hua, Yi Luo, Mengyuan Wang (School of Computer Science and Artificial Intelligence, Zhengzhou University); Meijia Yu (School of Agricultural Equipment Engineering, Henan University of Science and Technology); Yingjie Han (Advisor, Zhengzhou University) | 58.54

Task 2: The Third Evaluation on Chinese Frame Semantic Parsing

Task Description

Frame Semantic Parsing (FSP) is a fine-grained semantic analysis task grounded in Frame Semantics. Its goal is to extract frame-semantic structures from sentences, enabling deep understanding of events or situations described in text. FSP plays an important role in downstream tasks such as reading comprehension, text summarization, and relation extraction.

However, semantic role nesting is common among linguistic elements in sentences. For example, in the sentence “我的眼睛什么也看不见了” (“My eyes can’t see anything anymore”), the phrase “我的眼睛” (“my eyes”) serves as the “body part” role in the [Free perception] frame, while “我” (“I”) simultaneously acts as the “perceiver” with agency. Traditional semantic role labeling approaches often prioritize the coarser role (“body part”), omitting finer-grained roles such as the agentive “perceiver.” Furthermore, with the rapid advances in large language models (LLMs), coarse-grained tasks such as semantic role labeling are widely considered solved, and LLMs already perform well on them in real-world applications; for more fine-grained and complex semantic scenarios, however, LLMs still fall short.

To further evaluate and enhance the ability of models to understand fine-grained linguistic semantics, we launched this evaluation task based on the CFN 2.1 corpus. Compared to the previous two editions, which were based on CFN 1.0 and CFN 2.0, this year’s task places greater emphasis on model performance under semantic nesting. It also targets a weakness of current tools, which often miss roles when frame elements are nested or merged (e.g., in “hiring a secretary,” “secretary” is both the employee and the position).

This evaluation includes the following three subtasks (a span-matching sketch follows the list):

  • Frame Identification: Identify the frames evoked by given target words or constructions in sentences.
  • Argument Identification: Identify the boundary spans of arguments governed by target words or constructions.
  • Role Classification: Predict the semantic role labels of arguments identified in the previous subtask.
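
For the argument identification subtask, a natural scoring choice is span-level F1, where a predicted span counts only if its boundaries exactly match a gold span. The sketch below assumes (start, end) character offsets; the official scorer may differ.

```python
# Hedged sketch of span-level F1 for argument identification.
# Span representation as (start, end) character offsets is an assumption.

def span_f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                      # exact-boundary matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# "我的眼睛什么也看不见了", target 看: illustrative gold vs. predicted spans
gold = [(0, 4), (4, 7)]   # 我的眼睛 / 什么也
pred = [(0, 4), (5, 7)]   # second boundary is off by one character
print(f"span F1 = {span_f1(gold, pred):.2f}")  # span F1 = 0.50
```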

Organizers and Contact

  • Evaluation Organizers: Ru Li, Hongye Tan (Shanxi University); Baobao Chang (Peking University); Jeff Z. Pan (University of Edinburgh)
  • Task Contact: Hao Xu (Ph.D. student, Shanxi University), 202322407052@email.sxu.edu.cn

Evaluation Overview

  • Registered Teams: 156
  • Result Submissions: 16
  • Awarded Teams: 3
  • Accepted Papers: 4
  • Oral Presentations: 2

Award | Rank | Team Member | Affiliation | Task 1 Acc (%) | Task 2 F1 (%) | Task 3 F1 (%) | Final Score
First Prize | 1 | Yongqing Huang | Individual | 71.80 | 85.96 | 58.59 | 70.76
Second Prize | 2 | Yahui Liu, Ziheng Qiao; Advisors: Zhenghua Li, Chen Gong, Min Zhang | Soochow University | 71.90 | 85.52 | 57.62 | 70.27
Third Prize | 3 | Jingtao Du | Lianyungang Daily | 71.08 | 85.23 | 57.78 | 70.01

Task 3: The Fifth Chinese Abstract Meaning Representation Parsing Evaluation (CAMRP 2025)

Task Description

Semantic analysis remains a central and challenging task in natural language processing. Abstract Meaning Representation (AMR) encodes sentences as single-rooted directed graphs with strong semantic expressiveness, widely applied in downstream tasks such as machine reading comprehension and text summarization. Chinese Abstract Meaning Representation (CAMR), built upon AMR, incorporates annotations tailored to the characteristics of the Chinese language, including concept alignment and relation alignment, and addresses AMR’s lack of function word representation.

Since CoNLL 2020, the Chinese AMR Parsing Evaluation (CAMRP) has been held for four consecutive years. The performance of CAMR parsers has approached that of English AMR, achieving a high level of sentence-level semantic understanding. To further extend AMR parsing from the sentence level to discourse-level coreference resolution, this year’s evaluation introduces 500 new CAMR discourse documents. These documents are derived from 6,237 sentences in the Penn Chinese Treebank and cover multiple genres such as economics, sports, and daily life, aiming to assess the capability of parsing systems in discourse-level coreference resolution.

CAMRP 2025 includes the following two subtasks (a graph-comparison sketch follows the list):

  • CAMR Parsing: Given a word-segmented sentence, output its corresponding CAMR graph, including concept and relation alignment information.
  • Discourse Coreference Resolution: Given a passage composed of several word-segmented sentences, output all coreference chains in the passage, including the coreference relation, sentence ID, and coreferent expression.
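
To give a feel for how parser output might be compared against gold graphs, the toy sketch below scores a graph as a set of (concept, relation, concept) triples. Official Smatch-style metrics additionally search over variable alignments, which this sketch deliberately omits; the example triples are invented.

```python
# Toy graph comparison as triple overlap (variable-alignment search omitted).

def triple_f1(gold_triples, pred_triples):
    gold, pred = set(gold_triples), set(pred_triples)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

gold = {("想-01", ":arg0", "孩子"), ("想-01", ":arg1", "回家")}
pred = {("想-01", ":arg0", "孩子"), ("想-01", ":arg1", "学校")}
print(f"triple F1 = {triple_f1(gold, pred):.2f}")  # 0.50
```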

Organizers and Contact

  • Evaluation Organizers: Bin Li, Weiguang Qu, Junsheng Zhou (Nanjing Normal University)
  • Task Contact: Zhixing Xu (Ph.D. student, Nanjing Normal University), xzx0828@live.com

Evaluation Overview

  • Registered Teams: 96
  • Result Submissions: 4
  • Awarded Teams: 1
  • Accepted Papers: 2
  • Oral Presentations: 2

Award | Rank | Team Member | Affiliation | Task A F1 (%, testA) | Task A F1 (%, testB) | Task B F1 (%)
First Prize | 1 | Rongbo Chen, Xuefeng Bai, Kehai Chen | Harbin Institute of Technology (Shenzhen) | 82.03 | 76.80 | 61.15

Task 4: The First Evaluation on Chinese Factivity Inference (FIE 2025)

Task Description

Factivity Inference (FI) is a semantic understanding task related to judging the truth of events, and is a specific form of Factuality Inference (FactI). In human communication, factivity inference reflects a speaker’s ability to infer the factuality (true or false) of an event based on certain verbal elements (e.g., “believe,” “falsely claim,” “realize”). For example, from both the affirmative sentence “They realized the situation was irreversible” and its negative form “They didn’t realize the situation was irreversible,” one can still infer the same fact: “The situation was irreversible.”

The knowledge used in factivity inference is a type of analytical linguistic knowledge—focused on semantic relationships between components within the language—rather than world knowledge. For instance, in the example above, the verb “realize” presupposes the truth of its complement clause (“the situation was irreversible”), regardless of whether the verb is negated.

As a key mechanism in linguistic reasoning, factivity inference provides clear formal linguistic cues and serves as an essential semantic foundation for downstream tasks such as textual entailment recognition, hallucination detection, and belief revision. It also holds value in applications such as information retrieval, information extraction, question answering, and sentiment analysis. In the age of large language models (LLMs), which are increasingly functioning as interactive agents, the ability to extract factual information and interpret the speaker’s subjective attitude toward event truthfulness is critical for autonomous reasoning and human-AI interaction.

To enhance the semantic understanding ability of LLMs in Chinese and promote deeper comprehension of human discourse, we introduce the First Evaluation on Chinese Factivity Inference (FIE 2025).

This evaluation focuses on two central questions:

  • How do LLMs perform on factivity inference in Chinese? How does their performance vary under different contextual conditions?
  • How do different prompt engineering strategies—such as varying the number of shots, using chain-of-thought (CoT) prompting, or altering question phrasing—impact LLM performance in factivity inference? What are the most effective prompt design techniques for optimizing performance in this task?

Participants are required to design their own prompts based on the provided test set and independently select suitable models for evaluation. Questions must be posed to the models via API. There are no restrictions on model choice, prompt format, or questioning method—diverse and composite testing strategies are encouraged to achieve the best results.
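
For illustration, a participant’s harness might look like the sketch below: build a zero-shot prompt and post it to an LLM API. The endpoint URL, request/response schema, and answer labels are placeholders, not part of the task specification.

```python
# Illustrative zero-shot harness for factivity inference via an LLM API.
# The endpoint, payload schema, and response field are placeholders.
import requests

PROMPT_TEMPLATE = (
    "根据下面的句子，判断事件“{event}”是否为事实。\n"
    "句子：{sentence}\n"
    "请回答：事实 / 非事实 / 不确定。"
)

def query_model(sentence: str, event: str, api_url: str, api_key: str) -> str:
    prompt = PROMPT_TEMPLATE.format(sentence=sentence, event=event)
    resp = requests.post(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"prompt": prompt, "temperature": 0.0},  # placeholder schema
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["output"]  # placeholder response field

# e.g. query_model("他们没有意识到情况已无法挽回。", "情况已无法挽回",
#                  "https://example.com/v1/complete", "YOUR_KEY")
```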

Organizers and Contact

  • Evaluation Organizers: Yulin Yuan (University of Macau); Bin Li (Nanjing Normal University)
  • Task Contact: Guanliang Cong (Ph.D. student, University of Macau), guanliang.cong@connect.um.edu.mo

Evaluation Overview

  • Registered Teams: 218
  • Result Submissions: 70
  • Awarded Teams: 10
  • Accepted Papers: 10
  • Oral Presentations: 4

No-Finetune Track

Rank | Award | Team Member | Overall Accuracy (%) | Human-Annotated Accuracy (%) | Natural-Corpus Accuracy (%)
1 | First Prize | Zequn Li, Yuanhao Zhong (School of Computer Science, BIT); Advisors: Chengliang Chai, Xin Xin (School of Computer Science, BIT) | 94.01 | 97.8 | 92.58
2 | Second Prize | Qiang Yan (Institute of Computing Technology, CAS, Key Laboratory of Network Data Science and Technology); Yunfei Zhong (School of Artificial Intelligence, Beijing Normal University); Yixing Fan (Institute of Computing Technology, CAS, Key Laboratory of Network Data Science and Technology) | 93.76 | 97.8 | 92.24
3 | Second Prize | Hongyu Li, Zhihui Yang, Renfen Hu (School of International Chinese Language Education, Beijing Normal University) | 93.51 | 98.17 | 91.75
4 | Third Prize | Peixiang Zhao, Mingzhu Li, Liya Mei, Fang Wang, Nianxin Gao (School of Humanities, Kunming University); Lang Zhao (School of Journalism and Communication, Hunan Normal University); Advisors: Yao Deng, Shihua Xie, Wei Bao, Jingjing Feng (School of Humanities, Kunming University) | 93.41 | 97.61 | 91.82
5 | Third Prize | Wenyuan Zhang, Shuaiyi Nie (Institute of Information Engineering, CAS); Advisor: Tingwen Liu (Institute of Information Engineering, CAS) | 92.66 | 97.8 | 90.71
6 | Third Prize | Xiaoyi Zhang, Jiaqi Lu, Da Zhang, Xiaoyu Chen, Dawei Lu (School of Liberal Arts, Renmin University of China) | 92.61 | 95.41 | 91.55

Finetune Track

Rank | Award | Team Member | Overall Accuracy (%) | Human-Annotated Accuracy (%) | Natural-Corpus Accuracy (%)
1 | First Prize | Zequn Li, Yuanhao Zhong (School of Computer Science, BIT); Advisors: Chengliang Chai, Xin Xin (School of Computer Science, BIT) | 94.01 | 97.8 | 92.58
2 | Second Prize | Qiang Yan (Institute of Computing Technology, CAS, Key Laboratory of Network Data Science and Technology); Yunfei Zhong (School of Artificial Intelligence, Beijing Normal University); Yixing Fan (Institute of Computing Technology, CAS, Key Laboratory of Network Data Science and Technology) | 93.76 | 97.8 | 92.24
3 | Second Prize | Hongyu Li, Zhihui Yang, Renfen Hu (School of International Chinese Language Education, Beijing Normal University) | 93.51 | 98.17 | 91.75
4 | Third Prize | Peixiang Zhao, Mingzhu Li, Liya Mei, Fang Wang, Nianxin Gao (School of Humanities, Kunming University); Lang Zhao (School of Journalism and Communication, Hunan Normal University); Advisors: Yao Deng, Shihua Xie, Wei Bao, Jingjing Feng (School of Humanities, Kunming University) | 93.41 | 97.61 | 91.82
5 | Third Prize | Wenyuan Zhang, Shuaiyi Nie (Institute of Information Engineering, CAS); Advisor: Tingwen Liu (Institute of Information Engineering, CAS) | 92.66 | 97.8 | 90.71
6 | Third Prize | Xiaoyi Zhang, Jiaqi Lu, Da Zhang, Xiaoyu Chen, Dawei Lu (School of Liberal Arts, Renmin University of China) | 92.61 | 95.41 | 91.55

Task 5: The First Evaluation on Chinese Ancient Poetry Appreciation (CAPA)

Task Description

Chinese ancient poetry is a cultural treasure known for its linguistic conciseness, musical beauty, and structural elegance, such as parallelism, tonal patterns, and rhyming schemes. Understanding the semantics of Chinese classical poetry requires not only knowledge of its unique linguistic features but also the ability to integrate historical and cultural context, as well as the perception of nature and human emotions conveyed through the poems.

The Evaluation on Chinese Ancient Poetry Appreciation (CAPA) aims to assess the ability of natural language processing models to deeply interpret and appreciate the content and emotional undertones of ancient Chinese poetry. This evaluation includes two subtasks:

  • Poetic Text Understanding: This subtask requires explanation of the semantics of phrases and complete lines in poems. For example, given the line “The singing girl knows not the grief of a fallen nation, and still sings ‘The Courtyard Flowers’ across the river,” the system must interpret the term “singing girl” as well as the entire line. This task is evaluated in a question-answering format.
  • Emotional Appreciation of Poetry: This subtask requires inferring the emotions conveyed by the poet, such as patriotism, homesickness, or admiration, based on a comprehensive understanding of the poem. This task is evaluated in a multiple-choice format.

Final scores are based on the combined performance across both subtasks. This evaluation focuses on assessing a model’s intrinsic understanding of Chinese poetry. Participants are allowed to fine-tune open-source large language models, but external knowledge access methods such as Retrieval-Augmented Generation (RAG) are prohibited.

Organizers and Contact

  • Evaluation Organizers: Xuefeng Bai, Kehai Chen (Harbin Institute of Technology, Shenzhen)
  • Task Contact: Zhenwu Pei (Harbin Institute of Technology, Shenzhen), 23S151077@stu.hit.edu.cn

Evaluation Overview

  • Registered Teams: 55
  • Result Submissions: 6
  • Awarded Teams: 3
  • Accepted Papers: 6
  • Oral Presentations: 4

Rank | Award | Affiliation | Team Member | Final Score | Task A | Task B
1 | First Prize | Beihang University | Haotao Xie (Beihang University, Hangzhou Institute for Advanced Study) | 0.757 | 0.847 | 0.666
2 | Second Prize | Qilu Normal University | Hanlin Li, Wenya Zhang (Qilu Normal University); Advisors: Chengfei Li, Bin Liu (Institute of AI Education, Qilu Normal University); Chunyu Wang (School of Geography and Tourism, Qilu Normal University) | 0.755 | 0.865 | 0.644
3 | Third Prize | China Unicom Data Intelligence Co., Ltd. | Jiangze Yan (China Unicom Data Science and AI Research Institute); Haoting Zhuang (Beijing JD Century Trading Co., Ltd.) | 0.725 | 0.903 | 0.389

Task 6: The Second Evaluation on Chinese Essay Rhetorical Device Recognition and Understanding

Task Description

In Chinese essay writing, stylistic expression—often manifested through the use of rhetorical devices—is a key indicator of linguistic expressiveness. Accurately identifying and understanding rhetorical usage in student essays not only reflects writing proficiency and expressive ability but also supports teachers in assessing essay quality and guiding students in improving their language skills.

In recent years, studies on rhetorical device recognition in essays have generally relied on feature matching and alignment techniques, performing coarse-grained identification of rhetorical types such as parallelism and metaphor based on sentence structure and semantic features. Some works have focused on single rhetorical types (e.g., similes) with dedicated model designs. A small number of recent studies have started to explore fine-grained classification and component extraction for four types of rhetoric: metaphor, personification, hyperbole, and parallelism.

As a continuation of the 2024 CCL Shared Task on Rhetorical Device Recognition in Chinese Student Essays, this year’s evaluation again uses data collected from real-world teaching scenarios. The essays, written by native Chinese-speaking primary and secondary students, were digitized using OCR provided by CamScanner’s BeeClassroom. The dataset includes narrative and argumentative compositions.

Compared to the first edition, this year’s evaluation features the following improvements:

  1. In addition to metaphor, personification, hyperbole, and parallelism, four new rhetorical types are included: repetition, rhetorical question, interrogative question, and descriptive imitation—thus expanding the coverage of expressive linguistic forms.
  2. The previous evaluation focused on sentence-level recognition, while this year’s evaluation targets paragraph- and document-level recognition, enabling the handling of cross-sentence rhetorical structures.

This evaluation covers 8 rhetorical device types: metaphor, personification, hyperbole, parallelism, repetition, rhetorical question, interrogative question, and descriptive imitation, and includes three subtasks:

  1. Rhetorical Form Type Classification (all 8 rhetorical types)
  2. Rhetorical Content Type Classification (4 high-frequency types: metaphor, personification, hyperbole, parallelism)
  3. Rhetorical Component Extraction (the same 4 high-frequency types)

Organizers and Contact

  • Evaluation Organizers: Yujiang Lu, Nuowei Liu, Yupei Ren, Yicheng Zhu, Man Lan (School of Computer Science and Technology, East China Normal University); Xiaopeng Bai, Mofan Xu (Department of Chinese Language and Literature, ECNU); Qingyu Liao (Shanghai Linguan Data Technology Co., Ltd.)
  • Task Contact: Yujiang Lu (East China Normal University), yujianglu@stu.ecnu.edu.cn

Evaluation Overview

  • Registered Teams: 29
  • Result Submissions: 8
  • Awarded Teams: 3
  • Accepted Papers: 4
  • Oral Presentations: 4

Award | Rank | Affiliation | Team Member | Track 1 Score | Track 2 Score | Track 3 Score | Final Score
First Prize | 1 | The Open University of China | Yuxuan Lai, Xiajing Wang, Huan Zhang, Zequn Niu, Chen Zheng | 64.81 | 63.07 | 63.94 | 63.94
Second Prize | 2 | Beijing Language and Culture University | Xuquan Zong, Jiyuan An, Xiang Fu, Luming Lu, Haonan Zhu, Liner Yang, Erhong Yang | 59.71 | 60.36 | 61.26 | 60.45
Third Prize | 3 | Yunnan University | Jingjun Tang, Zhiwen Tang | 55.36 | 60.97 | 42.02 | 52.78

Task 7: The First Chinese Literary Language Understanding Evaluation (ZhengMing)

Task Description

Chinese literature embodies rich artistic expression, historical and cultural depth, and profound emotions, posing major challenges to natural language processing (NLP) models. These models must deeply understand the unique linguistic features, cultural background, and rhetorical techniques of literary texts, while distinguishing between classical and modern literary styles. This includes handling polysemy, symbolic language, and the interplay between social realities and emotional expression.

Advanced semantic understanding and reasoning capabilities are essential. Models need to analyze the stylistic and emotional characteristics of different historical periods and literary schools, especially the accurate recognition of rhetorical strategies such as parallelism and metaphor. Cultural sensitivity and emotional perception are also crucial for handling the complexity and diversity of Chinese literature.

The Chinese Literary Language Understanding Evaluation – ZhengMing aims to comprehensively assess a model’s literary language understanding through five core subtasks:

  • Classical Literature Knowledge Understanding: Given a set of multiple-choice questions based on classical Chinese texts, the model must select the best answer, testing knowledge of ancient literature and language comprehension.
  • Literary Cloze Test: Given a literary text and context, the model must complete missing parts, evaluating its understanding of literary style, linguistic features, and overall coherence.
  • Literary Named Entity Recognition: The model must identify named entities and their types (e.g., person, location) from literary texts, assessing its text analysis capabilities.
  • Literary Style Prediction: Given a literary excerpt, the model must predict the likely author based on stylistic and linguistic features, evaluating its recognition of literary style and authorial traits.
  • Literary Style Transfer: The model is required to translate Classical Chinese into Modern Chinese while preserving original meaning, intent, and emotion—testing accuracy in style transfer and semantic retention.

Additionally, ZhengMing provides two out-of-domain tasks to evaluate the generalization ability of models across different texts and tasks:

  • Modern Literary Criticism Stance Detection: Given a literary critique, the model must determine the sentiment polarity (positive, negative, or neutral), assessing sentiment analysis and opinion recognition.
  • Modern Literary Criticism Target Extraction: The model must extract the targets of critique (e.g., works, characters) from modern literary criticism texts, evaluating performance in identifying relevant commentary content.

Organizers and Contact

  • Evaluation Organizers: Gang Hu, Kun Yue (Yunnan University); Min Peng (Wuhan University); Siguang Chen (Sichuan University); Shan He (Yunnan Normal University)
  • Task Contact: Gang Hu (Yunnan University), hugang@ynu.edu.cn

Evaluation Overview

  • Registered Teams: 89
  • Result Submissions: 6
  • Awarded Teams: 3
  • Accepted Papers: 5
  • Oral Presentations: 4

Award | Rank | Team Member | Affiliation
First Prize | 1 | Chenrui Zheng, Yicheng Zhu, Xinyu Wang, Weilin Jiang (Department of Chinese Language and Literature, East China Normal University); Huiteng Wu (School of Law, Shandong University); Advisor: Xiaopeng Bai (Department of Chinese Language and Literature, East China Normal University) | East China Normal University
Second Prize | 2 | Fan Su, Siqi Lü, Wei Wan, Yiming Qin (School of Information, Yunnan University); Advisor: Liang Duan (School of Information, Yunnan University) | Yunnan University
Third Prize | 3 | Qingyi Yang, Panpan Zhou (School of International Chinese Language Education, Beijing Normal University) | Beijing Normal University

Task 8: Evaluation on ICD Diagnostic Coding for Chinese Electronic Medical Records

Task Description

In recent years, as population aging accelerates and health awareness increases, healthcare systems are facing mounting service demands. The widespread adoption of electronic medical records (EMRs) in the process of healthcare informatization offers new solutions to these challenges. To standardize and facilitate the sharing of medical data, the World Health Organization has developed the International Classification of Diseases (ICD)—a system that translates tens of thousands of diseases and combinations into a standardized alphanumeric coding framework. This standard forms the basis for medical data exchange and analysis across regions and institutions.

However, manually converting EMR text into ICD codes is both labor-intensive and error-prone. Developing automated ICD coding systems can significantly enhance coding efficiency and consistency, while also providing more reliable data support for disease research and medical administration.

To this end, the current evaluation introduces a specialized dataset designed to assess ICD diagnostic coding for Chinese EMRs. The dataset, constructed from de-identified medical records, includes 1,485 instances, covering 5 primary diagnoses and 32 secondary diagnoses under the ICD-10 coding standard.
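
Automated ICD coding is naturally framed as multi-label prediction over code sets. As a rough illustration (micro-F1 as the metric is an assumption here, and the codes are generic ICD-10 examples):

```python
# Sketch: micro-F1 over predicted vs. gold ICD code sets, per record.

def micro_f1(gold_sets, pred_sets):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # codes predicted and present in gold
        fp += len(pred - gold)   # spurious codes
        fn += len(gold - pred)   # missed codes
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

gold = [{"I10", "E11.9"}, {"J18.9"}]
pred = [{"I10"}, {"J18.9", "I10"}]
print(f"micro-F1 = {micro_f1(gold, pred):.2f}")  # 0.67
```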

Organizers and Contact

  • Evaluation Organizers: Hongjiao Guan, Wenpeng Lu (Qilu University of Technology [Shandong Academy of Sciences]); Ying Lian, Guoqiang Chen (The First Affiliated Hospital of Shandong First Medical University)
  • Task Contact: Zhenpeng Liang (Master’s student, Qilu University of Technology), icdevaluator@163.com

Evaluation Overview

  • Registered Teams: 445
  • Result Submissions: 36
  • Awarded Teams: 5
  • Accepted Papers: 5
  • Oral Presentations: 3

Award | Rank | Team Member | Affiliation
First Prize | 1 | Kaiyuan Zhang, Chong Feng, Ge Shi, Bo Wang, Jinhua Ye (School of Computer Science, Beijing Institute of Technology) | Beijing Institute of Technology
Second Prize | 2 | Tengxiao Lü, Juntao Li, Chao Liu, Haobin Yuan; Advisor: Ling Luo (Information Retrieval Lab, Dalian University of Technology) | Dalian University of Technology
Second Prize | 3 | You Zou, Lei Zhang, Xiaodong Liang, Meng Ran, Jiajun Li (China Telecom Chongqing Branch) | China Telecom Chongqing Branch
Third Prize | 4 | Zhiwen Jiang (individual); Advisors: Yihang Wang, Yuying Qiu (East China Normal University) | Individual
Third Prize | 5 | Yongqing Huang (individual), Yulin Liu (Chongqing College of Foreign Studies) | Chongqing College of Foreign Studies

Task 9: Evaluation on Traditional Chinese Medicine (TCM) Syndrome-Disease Diagnosis and Herbal Prescription Generation

Task Description

Traditional Chinese Medicine (TCM), as a fundamental component of Chinese traditional medicine, has developed over thousands of years into a distinctive theoretical and diagnostic system. It has made significant contributions to public health both in China and globally. At the heart of TCM is the principle of syndrome differentiation and treatment, which involves collecting clinical information through the four diagnostic methods—inspection, listening/smelling, inquiry, and palpation (pulse diagnosis)—to analyze symptoms, tongue coating, and pulse condition. Practitioners then synthesize this information to determine the underlying syndrome (证, zheng), identify the disease (病, bing), and formulate personalized herbal prescriptions.

To promote the integration of artificial intelligence into TCM and to advance its modernization, this evaluation task introduces a new dataset designed to assess the performance of algorithms in syndrome-disease differentiation and herbal prescription recommendation. The dataset is based on de-identified clinical records and includes:

  • 10 types of TCM syndromes (referred to as syndromes)
  • 4 types of TCM diseases (referred to as diseases)
  • 381 types of Chinese herbs

In total, the dataset comprises 1,500 instances.

The evaluation consists of two subtasks (a toy scoring sketch follows the list):

  • Multi-label Syndrome and Disease Classification
    Given a clinical document, the system must predict both the syndromes and diseases. (Details available on the task website.)
  • Herbal Prescription Recommendation
    Given a clinical document, the system must recommend an appropriate herbal prescription. (Details available on the task website.)
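
As a toy illustration of the second subtask, a recommended prescription can be scored against the ground-truth prescription as set overlap. The F1 formulation and the herb names below are illustrative only; the official scoring is defined on the task website.

```python
# Toy set-overlap F1 between a recommended and a gold prescription.

def set_f1(gold: set, pred: set) -> float:
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

gold_rx = {"黄芪", "白术", "茯苓", "甘草"}   # illustrative herbs
pred_rx = {"黄芪", "白术", "党参"}
print(f"prescription F1 = {set_f1(gold_rx, pred_rx):.2f}")  # 0.57
```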

Organizers and Contact

  • Evaluation Organizers: Wenpeng Lu, Hongjiao Guan (Qilu University of Technology [Shandong Academy of Sciences]); Yifei Wang (Affiliated Hospital of Shandong University of Traditional Chinese Medicine)
  • Task Contact: Cong Wang (Master’s student, Qilu University of Technology), tcmtbosd@163.com

Evaluation Overview

  • Registered Teams: 123
  • Result Submissions: 35
  • Awarded Teams: 7
  • Accepted Papers: 5
  • Oral Presentations: 4

Award | Rank | Team Member | Subtask 1 Score | Subtask 2 Score | Final Score
First Prize | 1 | Nanshu Li (Ganxi Tumor Hospital) | 0.648 | 0.4259 | 0.5369
Second Prize | 2 | Yan Li, Yun Zhang | 0.602 | 0.4348 | 0.5184
Second Prize | 3 | Yiyang Kang, Jiaqian Yao (School of Computer Science and Technology, Dalian University of Technology); Advisors: Yuanyuan Sun, Bo Xu (School of Computer Science and Technology, Dalian University of Technology) | 0.571 | 0.4632 | 0.5171
Third Prize | 4 | Hongyu Liu (School of Computer Science and Technology, Dalian University of Technology) | 0.577 | 0.43 | 0.5035
Third Prize | 5 | Zicheng Zuo, Jiamin Ren (Xinjiang University); Advisor: Turdi Tuohuti (Xinjiang University) | 0.553 | 0.4515 | 0.5022
Third Prize | 6 | Jiankun Zhang (Institute of Software, Chinese Academy of Sciences), Xinxin Zhang (Beijing Zhilan Medical Technology Co., Ltd.); Advisors: Longlong Ma (Institute of Software, CAS), Bo An (Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences) | 0.584 | 0.4188 | 0.5014
Third Prize | 7 | Han Li, Shuang Li (Shandong Vocational College) | 0.569 | 0.4306 | 0.4998

Task 10: Evaluation on Fine-grained Chinese Hate Speech Detection

Task Description

With the widespread adoption of social media, user-generated content has exploded in volume, leading to the proliferation of hate speech. Hate speech refers to harmful language that expresses hatred or incites violence against specific individuals or groups based on attributes such as race, religion, gender, region, sexual orientation, or physical condition. Laws and regulations such as the Law of the People’s Republic of China on Public Security Administration Punishments and the Administrative Measures for Internet Information Services explicitly prohibit hate speech.

Effectively identifying hate speech has become a key concern in natural language processing (NLP) research. The Evaluation on Fine-grained Chinese Hate Speech Detection aims to advance this area by constructing structured hate speech quadruples consisting of:

  • Comment Target
  • Argument
  • Target Group
  • Hate/Non-hate Label

The goal is to promote the development of Chinese hate speech detection technologies, enhance the governance of online misconduct, and support the creation of a more civil and respectful internet environment.
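
One plausible way to score such structured output, sketched below, is hard-match F1: a predicted quadruple counts only if all four elements match a gold quadruple exactly. Softer matching on the target and argument spans is a common complement; the example content here is invented.

```python
# Sketch: hard-match F1 over (target, argument, target group, label) quadruples.

def quad_f1(gold_quads, pred_quads):
    gold, pred = set(gold_quads), set(pred_quads)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

gold = {("某地人", "都很懒", "Region", "hate")}
pred = {("某地人", "很懒", "Region", "hate")}   # argument span differs
print(f"hard-match F1 = {quad_f1(gold, pred):.2f}")  # 0.00
```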

Organizers and Contact

  • Evaluation Organizers: Hongfei Lin, Liang Yang (Dalian University of Technology); Junyu Lu, Zewen Bai (Ph.D. candidates, Dalian University of Technology); Shengdi Yin (Master’s student, Dalian University of Technology)
  • Task Contact: Zewen Bai (Ph.D. candidate, Dalian University of Technology), dlutbzw@mail.dlut.edu.cn

Evaluation Overview

  • Registered Teams: 394
  • Result Submissions: 140
  • Awarded Teams: 8
  • Accepted Papers: 5
  • Oral Presentations: 4

Award | Rank | Affiliation | Team Member | Final Score
First Prize | 1 | School of Data Science and Engineering, East China Normal University | Fanjun Lin, Yanwei Zhang, Yang Huang, Zhiyuan Yao | 0.3641
Second Prize | 2 | School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen | Jiahao Wang | 0.3636
Second Prize | 3 | Big Data and AI Center, China Telecom Chongqing Branch | Lu Ruan, Bo Zhai, Kundong Mo, Yi Shen, Zeyu Wang; Advisors: Feng Wei, Chenzi Wang | 0.3591
Third Prize | 4 | School of Computer Science, Zhuhai College of Science and Technology | Weijie Zhou, Yuxiang Fan, Xinyu Guan, Canjie Zhu; Advisor: Hushuang Ma | 0.3566
Third Prize | 5 | School of Economics and Management, Dalian University of Technology | Binglin Wu, Jiaxiu Zou; Advisor: Xianneng Li | 0.3555
Third Prize | 6 | School of Cyberspace Security, Zhongyuan University of Technology | Junshuai Zhang, Zhongyang Zhao, Chenyang Li, Qinghao Zhu, Mingyang Ji; Advisors: Long Zhang, Qiusheng Zheng | 0.3545
Honorable Prize | 7 | School of Computer Science, Beijing Institute of Technology | Cheng Yang, Xingchen Zhang, Yicheng Liu, Youchao Zhou; Advisor: Shumin Shi | 0.349
Honorable Prize | 8 | School of Computer Science and Technology, Dalian University of Technology | Qi Lin, Junjia Shang | 0.3352

Task 11: Evaluation on Undergraduates’ Handwritten Chinese Character Quality

Task Description

Handwriting proficiency in Chinese characters is a key aspect of undergraduate students’ language competence and reflects their humanistic literacy. With the rise of digital tools in the information age, students’ ability to write Chinese characters by hand has significantly declined. Meanwhile, handwriting instruction in universities faces challenges such as limited teaching staff and the lack of effective assessment methods. Inadequate training and feedback outside the classroom further hinder students' progress in handwriting skills.

In the field of handwriting quality evaluation, traditional deep learning approaches often fall short in providing fine-grained, personalized textual assessments. Large language models (LLMs), with their advanced natural language understanding and generation capabilities, offer promising solutions. LLMs can generate detailed, individualized feedback based on input features, simulating the evaluation style of human experts.

The Evaluation on Undergraduates’ Handwritten Chinese Character Quality aims to leverage multimodal LLMs for both image understanding and text generation, addressing current limitations in handwriting evaluation. The goal is to move from manual, uniform assessment to intelligent, personalized feedback. The task includes two subtasks:

  • Handwritten Chinese Characters Grading: Given an image of handwritten Chinese characters, the system must classify the writing quality into appropriate levels.
  • Comment Generation of Handwritten Chinese Characters: Based on the handwriting image, the system must generate personalized, fine-grained evaluation comments and feedback.

All handwriting samples are collected from teaching practice assignments submitted by teacher-education students at the hosting university. Quality ratings and feedback comments were manually annotated by professional calligraphy instructors.

Organizers and Contact

  • Evaluation Organizers: Meng Wang, Zhidan Hu, Na Tian (Jiangnan University)
  • Task Contact: Shicong Lu (Master’s student, Jiangnan University), zsdlsc@163.com

Evaluation Overview

  • Registered Teams: 8
  • Result Submissions: 3
  • Awarded Teams: 3
  • Accepted Papers: 4
  • Oral Presentations: 4

Award | Rank | Team Member | Affiliation | Accuracy (%)
First Prize | 1 | Jinwang Song, Lulu Kong, Haixin Liu, Yifan Li, Zhewei Luo; Advisor: Hongying Zan | Zhengzhou University | 64.1
Second Prize | 2 | Xiaoqing Hong, Yunhan Li; Advisor: Lü Ni | East China Normal University | 62.8
Third Prize | 3 | Yuxuan Lai (The Open University of China), Jian Wang, Jitao Yang, Wentao Ma, Haoyang Lu (Open University Online Education Technology Co., Ltd.); Advisor: Chen Zheng (The Open University of China) | The Open University of China; Open University Online Education Technology Co., Ltd. | 59.3

Task 12: Evaluation on Entity-Relation Triple Extraction from Chinese Speech (CSRTE)

Task Description

Traditional entity-relation triple extraction tasks have primarily focused on written texts, identifying entities and their relations to construct structured knowledge graphs. However, as speech becomes an increasingly dominant modality in human-computer interaction—through applications like intelligent assistants, customer service, and voice search—extracting structured information from speech has become a critical research challenge.

The Chinese Speech Entity-Relation Triple Extraction Task (CSRTE) aims to enable end-to-end automatic recognition and extraction of named entities and their relations from Chinese speech data, ultimately forming structured triples (head entity, relation, tail entity). The task focuses on improving both the accuracy and efficiency of triple extraction in spoken Chinese, while enhancing the robustness of systems in diverse discourse contexts and noisy speech environments. The ultimate goal is to achieve seamless processing from spoken input to structured output.

For this evaluation, we annotated nearly 20,000 human-read Chinese speech samples from the Common Voice 17 and AISHELL datasets. These include over 40,000 named entities and more than 20,000 relation triples. By organizing this task, we aim to promote advancements in Chinese speech information extraction, facilitate the integration of speech and natural language processing technologies, and provide high-quality foundational data for intelligent applications.
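
The results table below distinguishes pipeline systems from end-to-end (E2E) ones. A pipeline-track submission might be structured like the skeleton below, in which both stages are placeholders into which any ASR model and any text-based relation extractor could be plugged; this is not the organizers' reference implementation.

```python
# Skeleton of a pipeline-track system: ASR, then text-based triple extraction.
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

def transcribe(audio_path: str) -> str:
    """Placeholder ASR stage (e.g., a model trained on AISHELL-style data)."""
    raise NotImplementedError("plug in an ASR system here")

def extract_triples(text: str) -> List[Triple]:
    """Placeholder text-based entity-relation extraction stage."""
    raise NotImplementedError("plug in a text RE system here")

def pipeline(audio_path: str) -> List[Triple]:
    # Errors compound across stages, which is one motivation for the
    # separate end-to-end (E2E) track.
    return extract_triples(transcribe(audio_path))
```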

Organizers and Contact

  • Evaluation Organizers: Jinzhong Ning, Wenxuan Mu, Yilin Pan (Dalian Maritime University); Yijia Zhang (Professor, Dalian Maritime University); Paerhati Tulajiang (Dalian University of Technology / Xinjiang Normal University); Yuanyuan Sun (Professor, Dalian University of Technology); Hongfei Lin (Professor, Dalian University of Technology)
  • Task Contact: Jinzhong Ning, ningjinzhong@dlmu.edu.cn

Evaluation Overview

  • Registered Teams: 257
  • Result Submissions: 59
  • Awarded Teams: 7
  • Accepted Papers: 4
  • Oral Presentations: 2

Award | Rank | Team Member | Affiliation | F1 | Track
First Prize | 3 | Yongqing Huang (individual) | Individual | 0.5302 | E2E
Second Prize | 1 | Zhishan Qiao, Chao Zhang, Weihong Liu, Zhanyang Liu | China University of Petroleum (East China); Jiangnan University | 0.5937 | Pipeline
Second Prize | 4 | Nanshu Li | Ganxi Tumor Hospital | 0.5292 | E2E
Third Prize | 2 | Wugan-jing Song | The Hong Kong University of Science and Technology | 0.5308 | Pipeline
Third Prize | 5 | Haodong Ren, Yuyao Sun | Xidian University | 0.5181 | E2E
Third Prize | 6 | Jingtao Du | Lianyungang Daily | 0.5152 | E2E
Third Prize | 8 | Haotao Xie, Mengting Li, Xiaolin Hong, Canyu Chen | Beihang University; North China University of Technology | 0.4846 | Pipeline