Evaluation Tasks

The 25th China National Conference on Computational Linguistics (CCL26-Eval)
Technical Evaluation Task Release

Conference Website: http://cips-cl.org/static/CCL2026/en/index.html

The 25th China National Conference on Computational Linguistics (CCL 2026) will be held in Yichang, Hubei Province, from October 15 to 18, 2026, organized by the Chinese Information Processing Society of China (CIPS). CCL is a premier conference of CIPS and brings together the largest community of natural language processing scholars and experts in China. After thirty years of development, it is widely regarded as the largest, most authoritative, and most influential NLP conference in China. As computational language processing has advanced in China, CCL has become the primary national forum for disseminating new academic and technical work in computational linguistics.

CCL 2026 will again host the Chinese language processing technology evaluation, CCL26-Eval. Following the initial call for evaluation tasks, the CCL26-Eval organizing committee has confirmed 14 tasks, covering research directions such as semantic analysis, discourse and pragmatic analysis, cross-lingual and low-resource NLP, knowledge graphs, NLP applications in healthcare, education, humanities, and legal domains, as well as generative AI and core capabilities of large language models. Researchers are welcome to participate. Each evaluation task will award first, second, and third prizes based on the competition results, and CIPS will issue official honorary certificates. Summary papers and outstanding technical reports will be included in the CCL Anthology and ACL Anthology.

Evaluation Chairs: Hongfei Lin (Dalian University of Technology, hflin@dlut.edu.cn), Hongye Tan (Shanxi University, tanhongye@sxu.edu.cn), Liang Yang (Dalian University of Technology, liang@dlut.edu.cn)

I. Fundamental NLP Tasks

1. Semantic Analysis / Discourse and Pragmatic Analysis

Task 1: The Second Chinese Factivity Inference Evaluation

Task Overview

Factivity Inference (FI) is a semantic understanding task concerned with judging the truthfulness of events, and is a form of Factuality Inference (FactI). In human communication, factivity inference is mainly manifested in language users' ability to recover the psychological states of the speaker and of the sentence subject from certain verbal components (such as "believe," "falsely claim," and "realize"), and to infer from them whether the related events are true or false. For example, from both the affirmative sentence "They realized the situation was irreversible" and the corresponding negative sentence "They did not realize the situation was irreversible," one can infer that, in the speaker's view, the situation was in fact irreversible. The knowledge used for such inference is analytical knowledge of language: it depends little on world knowledge and mainly involves semantic relationships between components within the language.

To further enhance the semantic understanding capabilities of large language models for Chinese and achieve deep machine understanding of human communicative discourse, we are launching the Second Chinese Factivity Inference Evaluation, building on FIE2025. This evaluation focuses on the factivity inference performance of large language models under complex contextual conditions and few-shot prompting.

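Compared to FIE2025, this evaluation's dataset covers more factive predicates (approximately 500) and more diverse contextual conditions, such as negation words "不" (not), "没有" (did not), and "差点" (almost); negative intentions "不敢" (dare not), "不想" (do not want to), "不愿" (unwilling to), and "难以" (find it hard to); passivization operations "被" (passive marker) and "被迫" (be forced to); evaluative adverbs "正确地" (correctly) and "错误地" (wrongly); and polyphonic markers "并不" (not at all) and "绝不" (absolutely not).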

Task Description

Participating teams need to design their own prompts using the sample set and evaluation set released by the organizers, and organize the responses from LLMs into a unified output format. Each evaluation data entry is presented as a textual entailment pair <Aa, a>, and the dataset is saved in JSON format.

The model needs to judge the truth value of the entailed sentence a based on the content of the main entailing sentence Aa, and provide a confidence level for the judgment. This evaluation continues to set up two tracks: non-finetuning and finetuning. The non-finetuning track does not allow any modifications to the model itself; the finetuning track allows fine-tuning model parameters using sample set data.
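
As an illustration only, one entailment pair and the corresponding judgment might be handled as sketched below. The field names (`id`, `premise`, `hypothesis`, `label`, `confidence`), the two label strings, and the 0-1 confidence range are all assumptions made for this sketch; the official data schema and output format are those published on the task website.

```python
import json

# Hypothetical record layout; the official field names are set by the organizers.
record_json = """
{
  "id": "demo-001",
  "premise": "They did not realize the situation was irreversible.",
  "hypothesis": "The situation was irreversible."
}
"""

ALLOWED_LABELS = {"true", "false"}  # assumed label set


def format_judgment(record: dict, label: str, confidence: float) -> dict:
    """Attach a truth-value judgment and a confidence level to one pair."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if not 0.0 <= confidence <= 1.0:  # assumed confidence range
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    return {"id": record["id"], "label": label, "confidence": round(confidence, 2)}


record = json.loads(record_json)
# "realize" is factive, so the embedded clause holds even under negation.
result = format_judgment(record, label="true", confidence=0.93)
print(json.dumps(result, ensure_ascii=False))
```

Validating the label and confidence before writing the output helps catch malformed LLM responses early, whichever concrete schema the organizers specify.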

Organizers and Contact Persons

Task Organizers: Yulin Yuan (Professor, University of Macau), Bin Li (Professor, Nanjing Normal University)
Task Contact: Guanliang Cong (Ph.D. student, University of Macau, guanliang.cong@connect.um.edu.mo); Tianqi Xun (Ph.D. student, University of Macau, tianqi.xun@connect.um.edu.mo)

Task Awards

This evaluation will set up first, second, and third prizes for both the non-finetuning and finetuning tracks. First Prize: 0-1, Second Prize: 0-2, Third Prize: 0-3. Prize amounts to be determined.

Task Website

https://github.com/UM-FAH-Yuan/FIE2026

Task 2: Non-Literal Meaning Translation and Understanding Evaluation

Task Overview

This evaluation focuses on Chinese-English translation and identification of non-literal expressions such as proverbs, idioms, slang, and allusions, examining models' understanding of non-literal meaning, cross-lingual cultural mapping ability, and pragmatic effect preservation. The task constructs a complementary "generation + discrimination" evaluation framework to test models' non-literal expression generation ability and standard non-literal meaning recognition ability. The evaluation data comprises 5,000 high-quality samples, covering Gold (idiomatic/proverbial equivalent expressions) and Silver (explanatory paraphrases) references. This evaluation includes two subtasks.

Subtask 1: Non-Literal Chinese-to-English Translation
Given a Chinese sentence containing non-literal expressions such as proverbs, idioms, etc., the model needs to generate a natural, idiomatic English translation with cultural mapping characteristics, prioritizing equivalent substitution using existing English idioms, proverbs, maxims, or fixed collocations.
Subtask 2: Non-Literal Chinese-English Selection
Given a Chinese sentence containing a non-literal expression and several English candidates, the model needs to perform multiple-choice selection, identifying and outputting the Gold-labeled candidates that are recognized English equivalents of the Chinese expression in context.

Organizers and Contact Persons

Organizers: Dongyu Zhang (Professor, Dalian University of Technology)
Task Contact: Senqi Yang (Ph.D. student, Dalian University of Technology, ysq1997@mail.dlut.edu.cn)

Task Awards

This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.

Task Website

https://github.com/DUTIR-YSQ/CCL2026-Non-literal-Translation-Task

2. Cross-Lingual, Minority Language, and Low-Resource NLP

Task 3: Cross-Lingual Financial Evaluation Benchmark for LLMs (MapFinBen)

Task Overview

MapFinBen is the first multilingual financial evaluation benchmark specifically designed to assess the cross-lingual capabilities of large language models between high-resource and low-resource languages. The benchmark covers five representative financial tasks, comprehensively reflecting the diverse needs in real financial application scenarios.

In terms of language settings, MapFinBen covers both high-resource languages (English and Chinese) and multiple low-resource languages (Indonesian, Spanish, Greek, and Japanese), effectively addressing the over-reliance on high-resource languages in existing financial language model evaluations. Through unified task design and evaluation standards, this framework can systematically assess the financial task processing capabilities of large models across languages and resource conditions.

MapFinBen consists of five subtasks:

Subtask 1: Financial Answer Selection (FinAS) — Given a financial text with corresponding questions and candidate options, the model selects the correct answer that best matches the question semantics and financial context.
Subtask 2: Financial Question Answering (FinQA) — Given a financial text, the model answers related financial questions based on the text content.
Subtask 3: Financial Sentiment Analysis (FinSA) — Given a financial text, the model identifies the emotional tendency expressed and classifies it as positive, neutral, or negative.
Subtask 4: Financial Topic Classification (FinTC) — Given a financial text and candidate topic categories, the model categorizes the text into the most appropriate financial topic category.
Subtask 5: Financial Text Summarization (FinTS) — Given a financial text, the model extracts and generates a concise, accurate summary covering the core information.

Organizers and Contact Persons

Organizers: Gang Hu, Kun Yue (Yunnan University), Min Peng (Wuhan University), Lei Shi (Yunnan Normal University)
Task Contact: Xiaoyong Kong (kongxiaoyong@stu.ynu.edu.cn)

Task Awards

This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.

Task Website

https://github.com/HgITSE/MapFinBen

Task 4: Low-Resource Burmese Template Sentence Inference Evaluation

Task Overview

In translation for low-resource languages such as Burmese, the inference of fixed template sentences, as a domain-specific task, is significantly influenced by internal linguistic factors such as parts of speech, place names, and diverse cultural values, all of which affect the final translation quality.

Format and convention differences: For example, Chinese expressions such as "第1名" (1st place) and "第3章" (Chapter 3) are rendered in Burmese with "number" or "no.," which must be immediately followed by Burmese numerals.
Place name transliteration conflicts: Place name transliteration often conflicts with Burmese-specific pronunciation and historical conventions, so direct transliteration from Chinese causes confusion.
Diverse cultural values: Translation is shaped by race, religion, and collectivism, so text cannot simply be translated literally. Local cultural sensitivities and religious background must be fully considered.

As a template sentence inference task, this evaluation aims to improve the translation quality of large translation models for Burmese and achieve deeper machine understanding of human fixed template sentences.

Organizers and Contact Persons

Organizers: Ziyan Chen, Jinsong Liu (Transn Information Technology Co., Ltd.), Shaolin Zhu (Tianjin University)
Task Contact: Hong Ren (Ph.D. student, Tianjin University, rhong@tju.edu.cn), Chuan Wu (Master's student, Tianjin University, wuchuan@tju.edu.cn)

Task Awards

This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS. Prize money sponsored by Transn Information Technology Co., Ltd.

Task Website

https://github.com/merc11/CCL-2026

II. NLP Applications

1. Knowledge Graphs

Task 5: Minor Grain Breeding Information Extraction Evaluation (MGBIE)

Task Overview

The field of minor grain breeding has accumulated a vast amount of knowledge presented in natural language, widely distributed across papers, variety approval documents, and cultivation technical regulations. These texts record breeding material sources, target traits, and measurement results, as well as cultivation management conditions, stress treatment information, and molecular marker evidence. Due to the dense professional terminology, diverse concept expressions, and frequent nested expressions of material names and experimental elements, key information extraction and unified structuring remain challenging, limiting the development of knowledge retrieval, evidence synthesis, and breeding decision support applications.

The Minor Grain Breeding Information Extraction Evaluation (MGBIE) aims to systematically evaluate information extraction models' capabilities in professional terminology recognition, breeding context understanding, key information extraction, and structured expression. The MGBIE dataset contains 2,000 samples in total (1,000 training, 400 validation, 600 test).

MGBIE 2026 includes two subtasks:

Named Entity Recognition for Minor Grain Breeding: Identify and extract key entity information from breeding-related texts, outputting entity boundaries and type labels. The entity type system covers 12 categories: crop, variety, trait, growth period, gene, QTL, molecular marker, chromosome, breeding method, parent/cross combination, abiotic stress, and biotic stress.
Relation Extraction for Minor Grain Breeding: Extract semantic relations between identified entities, represented as relation triplets. The relation type system includes 6 semantic relations: contains, adopts, has, affects, occurs_in, and locates_in.
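
The two output types described above (typed entity spans and relation triplets) can be sketched as simple data structures. This is a minimal illustration, not the official submission format: the class names, the character-offset convention, the helper `find_entity`, and the example sentence are hypothetical, and only the 12 entity types and 6 relation types are taken from the task description.

```python
from dataclasses import dataclass

# Type inventories as listed in the task description.
ENTITY_TYPES = frozenset({
    "crop", "variety", "trait", "growth period", "gene", "QTL",
    "molecular marker", "chromosome", "breeding method",
    "parent/cross combination", "abiotic stress", "biotic stress",
})
RELATION_TYPES = frozenset({
    "contains", "adopts", "has", "affects", "occurs_in", "locates_in",
})


@dataclass(frozen=True)
class Entity:
    text: str
    start: int   # inclusive character offset (assumed convention)
    end: int     # exclusive character offset (assumed convention)
    label: str

    def __post_init__(self):
        if self.label not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.label}")


@dataclass(frozen=True)
class Relation:
    head: Entity
    relation: str
    tail: Entity

    def __post_init__(self):
        if self.relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation}")

    def as_triplet(self) -> tuple:
        return (self.head.text, self.relation, self.tail.text)


def find_entity(sentence: str, text: str, label: str) -> Entity:
    """Locate the first occurrence of `text` and wrap it as a typed entity."""
    start = sentence.index(text)
    return Entity(text, start, start + len(text), label)


# Made-up example sentence, for illustration only.
sentence = "Foxtail millet variety X1 has high plant height."
variety = find_entity(sentence, "X1", "variety")
trait = find_entity(sentence, "plant height", "trait")
relation = Relation(variety, "has", trait)
print(relation.as_triplet())
```

Checking labels against the fixed inventories at construction time keeps invalid types out of a submission file, regardless of the serialization the organizers require.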

Organizers and Contact Persons

Organizers: Zhiwei Hu, Zhaosheng Kong, Jianhua Gao (Houji Laboratory of Shanxi Province, Shanxi Agricultural University); Hongye Tan, Zhichao Yan, Ru Li (Shanxi University); Qianqian Xie (Wuhan University)
Task Contact: Senjie Yang (Master's student, Shanxi University, yangsenjie1@sxu.edu.cn)

Task Awards

This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.

Task Website

https://github.com/zhiweihu1103/CCL2026-MGBIE

2. NLP Applications in Healthcare, Education, Humanities, and Legal Domains

Task 6: Automatic ICD Coding for Chinese Electronic Medical Records

Task Overview

In recent years, with the intensification of population aging and increased health awareness, the healthcare system faces growing service pressure. In the process of healthcare informatization, the widespread application of electronic medical records provides new possibilities for addressing this challenge. To achieve standardized management and sharing of medical data, the World Health Organization developed the International Classification of Diseases (ICD) standard, converting tens of thousands of diseases and their combinations into a standardized alphanumeric coding system.

However, manual ICD coding of electronic medical record texts is not only time-consuming and labor-intensive but also prone to errors arising from differences in coders' professional skills. Automatic ICD coding systems can improve both coding efficiency and accuracy while providing more reliable data support for disease research and medical management. Against this background, this task has constructed an automatic ICD coding dataset for Chinese electronic medical records from de-identified medical record data. The dataset covers 10 departments, 19 main disease codes, 16 main procedure codes, and various other disease and procedure codes, totaling 2,200 records.

Organizers and Contact Persons

Organizers: Hongjiao Guan, Wenpeng Lu (Qilu University of Technology / Shandong Academy of Sciences), Ying Lian, Guoqiang Chen (First Affiliated Hospital of Shandong First Medical University)
Task Contact: Chuanlong Li (Master's student, Qilu University of Technology, icdevaluator@163.com)

Task Awards

This evaluation sets up 1 first prize, 3 second prizes, and 6 third prizes, with honorary certificates provided by CIPS.

Task Website

https://github.com/QLU-NLP/icdevaluator-26

Task 7: Cross-Lingual Sentiment Analysis Consistency for Literary Texts (BCCL-CSA)

Task Overview

With the rapid development of Multilingual Large Language Models (MLLMs), NLP technology has matured in handling modern general-purpose corpora. However, existing sentiment analysis techniques still face significant challenges when dealing with Chinese classical literature, which is characterized by high context dependency and deep cultural heritage. Emotional expression in Chinese classical literature is typically implicit and often "expresses aspirations through objects," relying on specific imagery, historical allusions, and complex rhetoric to convey emotions rather than directly using emotional adjectives. To this end, this evaluation proposes the Bilingual Chinese Classical Literature Cross-lingual Sentiment Analysis evaluation task (BCCL-CSA).

Subtask 1: Fine-Grained Sentiment Recognition
Participating systems need to independently capture sentiment features from given Chinese classical original texts and their corresponding English translations. The evaluation assesses:
1. Sentiment polarity accuracy (Acc_pol): Accurately identifying text sentiment polarity (positive, neutral, negative).
2. Emotion distribution precision (F1_emo): Accurately predicting probability distribution across six basic emotions (happiness, sadness, fear, anger, surprise, disgust).
The subtask score combines the two metrics: Sub_Score_1 = 0.4 × Acc_pol + 0.6 × F1_emo
Subtask 2: Cross-Lingual Sentiment Representation Consistency
This task focuses on the stability of sentiment mapping across languages. Metrics include:
Polarity judgment consistency (Con_label) and emotion distribution similarity (Sim_dist).

Dataset: CCL-SEL, sourced from 12 Chinese classical works, with 250 Chinese-English sentence pairs per work.

Final Ranking Score: Total_Score = 0.5 × Sub_Score_1 + 0.5 × Sub_Score_2
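
The weighted scores above can be reproduced with a few lines of arithmetic. The sketch below assumes every metric is reported on a 0-1 scale; because this announcement does not state how Sub_Score_2 is derived from Con_label and Sim_dist, that value is passed in directly, and the numbers in the example are made up.

```python
def _check_unit_interval(name: str, value: float) -> None:
    # Assumption: all metrics and sub-scores lie in [0, 1].
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")


def sub_score_1(acc_pol: float, f1_emo: float) -> float:
    """Sub_Score_1 = 0.4 * Acc_pol + 0.6 * F1_emo."""
    _check_unit_interval("acc_pol", acc_pol)
    _check_unit_interval("f1_emo", f1_emo)
    return 0.4 * acc_pol + 0.6 * f1_emo


def total_score(sub_score_1_value: float, sub_score_2_value: float) -> float:
    """Total_Score = 0.5 * Sub_Score_1 + 0.5 * Sub_Score_2."""
    _check_unit_interval("sub_score_1", sub_score_1_value)
    _check_unit_interval("sub_score_2", sub_score_2_value)
    return 0.5 * sub_score_1_value + 0.5 * sub_score_2_value


# Made-up metric values, for illustration only.
s1 = sub_score_1(acc_pol=0.80, f1_emo=0.70)
total = total_score(s1, sub_score_2_value=0.60)
print(f"Sub_Score_1 = {s1:.3f}")   # 0.4*0.80 + 0.6*0.70 = 0.740
print(f"Total_Score = {total:.3f}")  # 0.5*0.740 + 0.5*0.60 = 0.670
```

Because F1_emo carries the larger weight, a gain of one point in emotion prediction moves Sub_Score_1 more than the same gain in polarity accuracy.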

Organizers and Contact Persons

Organizers: Haiyang Zhang, Xiaojun Zhang (Xi'an Jiaotong-Liverpool University); Ruifeng Xu (Harbin Institute of Technology, Shenzhen)
Task Contact: Jingshi Zhou (Jingshi.Zhou@outlook.com)

Task Awards

First Prize: 1, Second Prize: 2, Third Prize: 3.

Task Website

https://github.com/Jingshi-Zhou/-BCCL-CSA-2026-

Task 8: Evidence-Based Fact-Checking for LLM-Generated Chinese Medical Content

Task Overview

Evidence-based Medical Fact-checking is a critical task aimed at verifying the authenticity of online medical content. As the internet becomes the primary channel for the public to obtain health information, the spread of medical misinformation poses severe challenges to public health safety. This task requires models to not only understand medical claims but also combine retrieved relevant evidence to determine the degree of support for claims (e.g., supported, refuted, or insufficient evidence).

Given a set of medical assertions generated by large language models and their corresponding evidence, the model should predict the correct label (i.e., veracity):

Supported: Evidence fully supports the claim content;
Partially Supported: Evidence supports part of the claim but with uncertainty or uncovered details;
Refuted: Evidence contradicts the claim content;
Uncertain: Evidence is related to the claim but insufficient to confirm or refute it;
Not Applicable: Evidence is completely unrelated to the claim.

Organizers and Contact Persons

Organizers: Jionglong Su, Zhengyong Jiang, Wei Wang (Xi'an Jiaotong-Liverpool University)
Task Contact: Tong Chen (Xi'an Jiaotong-Liverpool University, Tong.Chen19@student.xjtlu.edu.cn)

Task Awards

This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.

Task Website

https://github.com/AshleyChenNLP/MedFact

Task 9: The Second Chinese Classical Poetry Appreciation Evaluation

Task Overview

Chinese classical poetry is characterized by high conciseness and linguistic musicality, emphasizing parallelism, tonal patterns, and rhyme. Accurate understanding of classical poetry semantics requires not only mastery of the linguistic features of classical poetry but also knowledge of historical and cultural backgrounds, combined with cognition of the natural scenes and emotional expressions depicted in the poems, for comprehensive reasoning and understanding.

To further measure models' depth of language understanding and cultural reasoning ability in Chinese classical poetry appreciation, we launch the Second Chinese Classical Poetry Appreciation Evaluation. Building on the first edition, this evaluation further focuses on deep understanding and complex reasoning abilities, introducing more challenging advanced tasks. The specific task settings are:

Subtask 1: Poetry Comprehension

Word-level understanding: Explaining phrase-level semantics in classical poetry (Q&A format).
Verse-level understanding: Explaining verse-level semantics in classical poetry (Q&A format).
Sentiment understanding: Inferring the emotions conveyed by the poet through the work (multiple-choice format).
Allusion recognition: Determining whether verses contain allusions and providing explanations (Q&A format).

Subtask 2: Poetry Reasoning

Poetry analogy: Discovering shared relationships between different things in classical poetry and associations among imagery (Q&A format).
Poetry analysis: Based on poetry content and context, analyzing given options and determining the most reasonable statement (multiple-choice format).

The final ranking will be determined by comprehensive performance across both subtasks. Participating teams may fine-tune open-source LLMs, but RAG and other techniques that draw on external knowledge to answer questions are prohibited.

Organizers and Contact Persons

Organizers: Xuefeng Bai, Kehai Chen (Harbin Institute of Technology, Shenzhen)
Task Contact: Yingjie Zhu, Zhenwu Pei (Harbin Institute of Technology, Shenzhen, zhuyj@stu.hit.edu.cn)

Task Awards

First Prize: 1, total prize of 3,000 RMB;
Second Prize: 1, total prize of 2,000 RMB;
Third Prize: 1, total prize of 1,000 RMB.

All prizes will be distributed within 10 working days after the announcement.

Task Website

https://github.com/HITICI-NLPGroup/CCPA-EvalTask

III. Generative AI and LLM Core Capabilities

Task 10: Scenario-Based Commonsense Reasoning Evaluation (SCoRE)

Task Overview

Reasoning is an advanced cognitive function involving the analysis, induction, and deduction of new information based on existing knowledge. It plays a fundamental role in human intelligence. While previous benchmarks have primarily focused on evaluating LLMs' reasoning capabilities in complex, specialized domains, they often overlook a key aspect of human-like cognition: commonsense reasoning. Evaluating this commonsense reasoning capability in large language models is crucial for AI development, as it significantly influences LLMs' decision-making in everyday situations and is essential for moving toward human-like intelligence in Artificial General Intelligence (AGI).

To diagnose LLMs' commonsense reasoning capabilities comprehensively and at a fine granularity, we propose the Scenario-based Commonsense Reasoning Evaluation dataset (SCoRE). Tasks in this dataset fall into five categories based on the commonsense domain involved:

Spatial Commonsense Reasoning: Given a spatial scene and known spatial relationships between entities, this task requires machines to reason about entities' positions and unknown spatial relationships.
Temporal Commonsense Reasoning: Given a temporal narrative scene with events and known temporal relationships (e.g., sequential order, duration, relative or absolute time points), this task requires machines to reason about specific moments on a timeline and unknown temporal spans or sequential relationships.
Social Commonsense Reasoning: Given a social interaction scene with known interpersonal relationships (e.g., kinship, workplace, friendship, or mentorship), this task requires machines to reason about individuals' roles in social networks and implicit or unknown social relationships.
Natural Commonsense Reasoning: Given a set of natural objects and known attribute constraints (e.g., category, physical properties, functions, or sensory features), this task requires machines to reason about one-to-one correspondences between objects and descriptions, and unknown attributes or classification features.
Integrated Commonsense Reasoning: This task constructs multi-dimensional reasoning problems requiring machines to simultaneously process constraints and commonsense from spatial, temporal, natural attribute, and social relation domains, building unified reasoning models for collaborative analysis and decision-making.

Organizers and Contact Persons

Organizers: Weidong Zhan, Zhifang Sui (Peking University)
Task Contact: Nan Hu (Ph.D. student, Peking University, hunan@stu.pku.edu.cn)

Task Awards

First Prize: 0-1;
Second Prize: 0-2;
Third Prize: 0-4.

Task Website

https://pku-space.github.io/SCoRE2026/

Task 11: Automated Hazard Analysis and Risk Assessment for Autonomous Driving

Task Overview

As automotive electrical/electronic (E/E) architecture becomes more intelligent and connected, functional safety has grown into a systematic safety engineering discipline covering software-hardware co-design, and has become a cornerstone for the deployment and mass production of autonomous driving technology. Within this system, Hazard Analysis and Risk Assessment (HARA) performs the core functions of risk identification and top-level safety requirement definition. The process systematically models vehicle operating scenarios, potential functional failure modes, and environmental factors, extracts key features such as vehicle motion states, road topology, and traffic participant distributions, and quantitatively assesses risk along the Severity (S), Exposure (E), and Controllability (C) dimensions to determine the Automotive Safety Integrity Level (ASIL).

To promote the application of large models and AI technology in functional safety, we propose this evaluation task and have constructed a structured dataset focusing on evaluating autonomous driving safety logic reasoning and requirement generation. The dataset is derived from de-identified real industrial project data, focusing on the core high-risk failure mode "unintended driving force/torque output," containing 3,000 high-quality annotated data points.

This evaluation includes two subtasks:

Hazard Event Identification and Scenario Description Generation: The model must accurately identify potential hazard events based on given vehicle operating conditions and environmental parameters, and generate structured hazard scenario descriptions compliant with engineering standards.
Risk Parameter Assessment and Level Reasoning: The model must reason and output key HARA risk indicators (S/E/C) based on scenario features, and determine the corresponding safety integrity level.
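
For readers unfamiliar with how S/E/C ratings map to a safety integrity level, the sketch below follows the ISO 26262-3 convention, which this announcement does not spell out. It assumes the standard class ranges S1-S3, E1-E4, and C1-C3 (the zero classes S0, E0, and C0, which carry no ASIL, are left out) and uses the common shortcut that the sum of the three class indices reproduces the standard's determination table. The function name `determine_asil` is hypothetical, and the task's dataset may encode ratings and levels differently.

```python
def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Map S/E/C class indices to QM or ASIL A-D (ISO 26262-3 convention).

    Assumed ranges: severity 1-3, exposure 1-4, controllability 1-3.
    """
    if not 1 <= severity <= 3:
        raise ValueError("severity must be 1..3 (S1-S3)")
    if not 1 <= exposure <= 4:
        raise ValueError("exposure must be 1..4 (E1-E4)")
    if not 1 <= controllability <= 3:
        raise ValueError("controllability must be 1..3 (C1-C3)")

    # Sum shortcut: 10 -> D, 9 -> C, 8 -> B, 7 -> A, anything lower -> QM.
    total = severity + exposure + controllability
    return {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}.get(total, "QM")


# Worst case on all three dimensions yields the highest integrity level.
print(determine_asil(severity=3, exposure=4, controllability=3))  # ASIL D
# Severe but rare and easily controllable hazards need only quality management.
print(determine_asil(severity=3, exposure=1, controllability=1))  # QM
```

A lookup of this kind can serve as a consistency check on model output: if a system predicts S, E, and C together with a level, the level should agree with the one implied by the three ratings.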

Organizers and Contact Persons

Organizers: Xu Yang (Beijing Institute of Technology), Haiyang Zhang (Xi'an Jiaotong-Liverpool University), Wei Wang (Xi'an Jiaotong-Liverpool University)
Task Contact: Zimu Wang (Ph.D. student, Xi'an Jiaotong-Liverpool University, Zimu.Wang19@student.xjtlu.edu.cn)

Task Awards

First Prize: 1, total prize of 5,000 RMB
Second Prize: 1, total prize of 3,000 RMB
Third Prize: 1, total prize of 2,000 RMB

Sponsorship

Prize money sponsored by UCharts Technology (Fuzhou) Co., Ltd.

Task Website

https://ccl2026-hara.github.io

Task 12: Youku Accessible Theater Cup — Accessible Structured Subtitle Generation for Hearing-Impaired Groups

Task Overview

As China's work on information accessibility enters the "institutional guarantee" stage, subtitles have become a key accessibility service through which hearing-impaired and elderly groups access audio-visual information. However, existing technical evaluations lack benchmarks that target real application scenarios while jointly considering "readability," "core information accuracy," and "response speed." This task systematically evaluates the complete pipeline from "speech/video input" to "structured subtitle documents for human reading," with a particular focus on two major pain points in high-information-density real scenarios (e.g., healthcare, finance, government services): "social time lag" and "critical information loss."

The evaluation task is designed with two parallel tracks:

Track A: PC — Simulates cloud or high-performance desktop environments to explore performance upper bounds with no computing resource limitations.
Track B: Mobile — Simulates mobile device (phone, AR glasses) real-time communication scenarios with explicit constraints on model size, memory usage, and real-time performance.

Each track includes two subtasks:

Subtask 1: Foundation Subtitle Generation (Foundation Track)
Evaluates speech transcription, timestamp alignment, noise robustness, and other fundamental capabilities.
Subtask 2: Structured Readable Subtitle Generation (Structured Track)
Evaluates the model's comprehensive ability to generate structured subtitles that conform to human reading habits, including reasonable segmentation, punctuation, speaker identification, and core keyword accuracy.

Data Scale: The evaluation has constructed a multi-scenario real speech/video test set of approximately 30-50 hours, covering four typical scenarios: news speeches, film and variety shows, real-life conversations, and multi-person meetings.

Organizers and Contact Persons

Organizers: Dengfeng Yao (Beijing Union University / Tsinghua University)
Task Contact: Jie Shi (Master's student, Beijing Union University, 20251083510951@buu.edu.cn)

Task Awards

This evaluation sets up first, second, and third prizes, with honorary certificates issued by CIPS; sponsored awards are also established, with prizes supported by leading technology companies such as Alibaba.

Task Website

https://github.com/ALINOSJ/IASSGE-2026

Task 13: In-Image Translation Quality Evaluation

Task Overview

With the acceleration of globalization and the growing demand for cross-lingual communication, In-Image Translation has become an important branch of machine translation. Unlike traditional text translation, in-image translation requires simultaneously processing visual and linguistic information, covering text detection, recognition, translation, and rendering, with broad application value in cross-border e-commerce, travel guides, and multilingual content localization. Chinese in-image translation faces unique challenges: high visual complexity of Chinese characters, diverse writing directions (horizontal/vertical), significant text length differences with target languages, and rich cultural connotations.

This evaluation focuses on designing and training automatic evaluation systems that can accurately score image translation results from multiple dimensions. It aims to: establish standardized benchmarks, promote methodological innovation, explore evaluation paradigms through open competition, and establish reproducible, comparable evaluation standards for the community.

Organizers and Contact Persons

Organizers: Haijun Li, Zifu Shang, Jie Liang, Zhao Xu, Weihua Luo
Task Contact: Yuxuan Han (Alibaba Cloud Technical Expert, baileng.hyx@alibaba-inc.com)

Task Awards

First Prize: 1, total prize of 20,000 RMB
Second Prize: 1, total prize of 10,000 RMB
Third Prize: 2, total prize of 5,000 RMB

Sponsorship

Prize money sponsored by Alibaba Cloud, with honorary certificates issued by CIPS.

Task Website

https://tianchi.aliyun.com/competition/entrance/532463

Task 14: Evaluation of Conversational Implicature and Metaphor Ability in Chinese

Task Overview

Understanding implied meaning is a core ability in human communication. Previous evaluations of large language models have focused more on performance in specific domains, while relatively little attention has been paid to pragmatic inference and metaphor comprehension. This evaluation is organized into two main tracks to systematically assess large language models' ability to understand conversational meaning and metaphor in Chinese contexts.

Track 1: Conversational Implicature Understanding

To achieve communicative goals, participants in a conversation usually follow a set of basic principles, which Grice summarized as the Cooperative Principle. This theory proposes four maxims: quantity, quality, relation, and manner. The maxim of quantity requires speakers to provide an appropriate amount of information, neither too much nor too little; the maxim of quality requires speakers to be truthful and adequately supported by evidence; the maxim of relation requires utterances to be relevant to the current topic; and the maxim of manner requires expression to be clear and orderly, while avoiding obscurity and ambiguity. Based on these maxims, Grice proposed the theory of conversational implicature: when speakers violate these maxims or sub-maxims, listeners are driven to go beyond the literal meaning of an utterance and infer the speaker's implicit meaning. This track evaluates a model's ability to identify and understand conversational implicature.

Subtask 1: Conversational Implicature Identification
Given a multi-turn dialogue, the model is required to identify which utterance spoken by a specified character contains conversational implicature.
Subtask 2: Conversational Implicature Selection
Given a multi-turn dialogue, the model is required to choose from four options the correct meaning of the utterance that contains conversational implicature.
Subtask 3: Conversational Implicature Explanation
Given a multi-turn dialogue with the utterance containing conversational implicature explicitly marked, the model is required to generate an explanation.

Track 2: Metaphor Understanding and Generation

As an important way for humans to understand the world, metaphor plays a key role in concept formation and thinking. Metaphorical ability is related not only to language expression itself, but also to higher-level cognitive processes such as creative thinking, abstract reasoning, and knowledge transfer. People use concrete and familiar source domains to understand abstract and unfamiliar target domains, and this mapping mechanism runs through everyday language and thought. This track evaluates a model's ability to identify, interpret, and creatively use metaphors.

Subtask 1: Metaphor Identification
Given a passage, the model is required to determine whether a sentence in the passage uses metaphor. If a metaphor is present, the model should extract the tenor and vehicle of the metaphorical expression.
Subtask 2: Metaphor Explanation Generation
Given a passage, the model is required to explain the meaning of the metaphorical sentence in non-metaphorical language.
Subtask 3: Metaphor Sentence Generation
Without restricting the topic, the model is required to generate an appropriate metaphorical expression on its own.

Organizers and Contact Persons

Organizers: Erhong Yang, Tianlin Yang, Yan Yue, Weihua An (Beijing Language and Culture University)
Task Contact: Yixuan Zhang (Ph.D. student, Beijing Language and Culture University, blcuicall@163.com)

Task Awards

This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.

Task Website

https://github.com/blcuicall/CCIME2026

Please contact the task organizers or evaluation chairs if you have any questions.