Evaluation Tasks
The 25th China National Conference on Computational Linguistics (CCL26-Eval)
Technical Evaluation Task Release
Conference Website: http://cips-cl.org/static/CCL2026/en/index.html
The 25th China National Conference on Computational Linguistics (CCL 2026) will be held in Yichang, Hubei Province, from October 15 to 18, 2026, organized by the Chinese Information Processing Society of China (CIPS). CCL is a premier conference of CIPS and the largest community of natural language processing scholars and experts in China. After thirty years of development, CCL is widely regarded as the most authoritative, influential, and largest NLP conference in China. With the advancement of computational language processing in China, CCL has become the primary forum for disseminating new academic and technical work in computational linguistics nationwide.
This conference will continue to organize the Chinese language processing technology evaluation, CCL26-Eval. After the initial call for evaluation tasks, the CCL26-Eval organizing committee has confirmed 13 tasks, covering semantic analysis; discourse and pragmatic analysis; cross-lingual and low-resource NLP; knowledge graphs; NLP applications in healthcare, education, humanities, and legal domains; and generative AI and core capabilities of large language models. Researchers are welcome to participate in the evaluation competition. Each task will award first, second, and third prizes based on the competition results, and CIPS will issue official honorary certificates. Summary papers and outstanding technical reports will be included in the CCL Anthology and ACL Anthology.
Evaluation Chairs: Hongfei Lin (Dalian University of Technology, hflin@dlut.edu.cn), Hongye Tan (Shanxi University, tanhongye@sxu.edu.cn), Liang Yang (Dalian University of Technology, liang@dlut.edu.cn)
I. Fundamental NLP Tasks
1. Semantic Analysis / Discourse and Pragmatic Analysis
Task 1: The Second Chinese Factivity Inference Evaluation
Task Overview
Factivity Inference (FI) is a semantic understanding task related to judging the truthfulness of events, and is a form of Factuality Inference (FactI). In human communication, factivity inference ability is mainly manifested in language users' ability to acquire the psychological states of the speaker and the subject of a sentence from the use of certain verbal components (such as "believe," "falsely claim," "realize," etc.), and to infer the truthfulness (true or false) of related events accordingly. For example, from both the affirmative sentence "They realized the situation was irreversible" and the corresponding negative sentence "They did not realize the situation was irreversible," one can infer that in the speaker's view, the fact exists that "the situation was irreversible." The knowledge used for such inference is analytical knowledge of language, which is less influenced by world knowledge and mainly involves semantic relationships between internal components of language.
To further enhance the semantic understanding capabilities of large language models for Chinese and achieve deep machine understanding of human communicative discourse, we are launching the Second Chinese Factivity Inference Evaluation, building on FIE2025. This evaluation focuses on the factivity inference performance of large language models under complex contextual conditions and few-shot prompting.
Compared to FIE2025, this evaluation's dataset covers a larger number of factive predicates (approximately 500) and more diverse contextual conditions, such as negation words "不, 没有, 差点," negative intentions "不敢, 不想, 不愿, 难以," passivization operations "被, 被迫," evaluative adverbs "正确地, 错误地," polyphonic markers "并不, 绝不," etc.
Task Description
Participating teams need to design their own prompts using the sample set and evaluation set released by the organizers, and organize the responses from LLMs into a unified output format. Each evaluation data entry is presented as a textual entailment pair <Aa, a>, and the dataset is saved in JSON format.
The model needs to judge the truth value of the entailed sentence a based on the content of the main entailing sentence Aa, and provide a confidence level for the judgment. This evaluation continues to set up two tracks: non-finetuning and finetuning. The non-finetuning track does not allow any modifications to the model itself; the finetuning track allows fine-tuning model parameters using sample set data.
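As a rough illustration of the output described above, the sketch below attaches a truth-value judgment and a confidence score to one entailment pair. The field names (`id`, `premise`, `hypothesis`, `label`, `confidence`), the label vocabulary, and the [0, 1] confidence range are assumptions made for this example only; the official schema is the one published in the task repository.

```python
import json

# Hypothetical label vocabulary and field names; the official schema is
# defined by the organizers and may differ.
ALLOWED_LABELS = {"true", "false"}


def format_prediction(entry: dict, label: str, confidence: float) -> dict:
    """Attach a truth-value judgment and a confidence score to one <Aa, a> pair."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return {
        "id": entry["id"],
        "premise": entry["premise"],        # the entailing sentence Aa
        "hypothesis": entry["hypothesis"],  # the entailed sentence a
        "label": label,
        "confidence": round(confidence, 4),
    }


# Example entry built from the sentence pair quoted in the task overview.
raw = (
    '{"id": 1, '
    '"premise": "They did not realize the situation was irreversible.", '
    '"hypothesis": "The situation was irreversible."}'
)
entry = json.loads(raw)
prediction = format_prediction(entry, "true", 0.93)
print(json.dumps(prediction, ensure_ascii=False))
```

Validating the label and confidence before writing the file keeps malformed model responses out of the submission.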
Organizers and Contact Persons
- Task Organizers: Yulin Yuan (Professor, University of Macau), Bin Li (Professor, Nanjing Normal University)
- Task Contact: Guanliang Cong (Ph.D. student, University of Macau, guanliang.cong@connect.um.edu.mo); Tianqi Xun (Ph.D. student, University of Macau, tianqi.xun@connect.um.edu.mo)
Task Awards
This evaluation will award first, second, and third prizes in both the non-finetuning and finetuning tracks. Number of winners: First Prize 0-1, Second Prize 0-2, Third Prize 0-3. Prize amounts are to be determined.
Task Website
https://github.com/UM-FAH-Yuan/FIE2026
Task 2: Non-Literal Meaning Translation and Understanding Evaluation
Task Overview
This evaluation focuses on Chinese-English translation and identification of non-literal expressions such as proverbs, idioms, slang, and allusions, examining models' understanding of non-literal meaning, cross-lingual cultural mapping ability, and pragmatic effect preservation. The task constructs a complementary "generation + discrimination" evaluation framework to test models' non-literal expression generation ability and standard non-literal meaning recognition ability. The evaluation data comprises 5,000 high-quality samples, covering Gold (idiomatic/proverbial equivalent expressions) and Silver (explanatory paraphrases) references. This evaluation includes two subtasks.
- Subtask 1: Non-Literal Chinese-to-English Translation
Given a Chinese sentence containing non-literal expressions such as proverbs, idioms, etc., the model needs to generate a natural, idiomatic English translation with cultural mapping characteristics, prioritizing equivalent substitution using existing English idioms, proverbs, maxims, or fixed collocations.
- Subtask 2: Non-Literal Chinese-English Selection
Given a Chinese sentence with non-literal expression and several English candidates, the model needs to perform multi-choice selection, identifying and outputting Gold-labeled items that constitute recognized equivalent substitution relationships with the Chinese in the English context.
Organizers and Contact Persons
- Organizers: Dongyu Zhang (Professor, Dalian University of Technology)
- Task Contact: Senqi Yang (Ph.D. student, Dalian University of Technology, ysq1997@mail.dlut.edu.cn)
Task Awards
This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.
Task Website
https://github.com/DUTIR-YSQ/CCL2026-Non-literal-Translation-Task
2. Cross-Lingual, Minority Language, and Low-Resource NLP
Task 3: Cross-Lingual Financial Evaluation Benchmark for LLMs (MapFinBen)
Task Overview
MapFinBen is the first multilingual financial evaluation benchmark specifically designed to assess the cross-lingual capabilities of large language models between high-resource and low-resource languages. The benchmark covers five representative financial tasks, comprehensively reflecting the diverse needs in real financial application scenarios.
In terms of language settings, MapFinBen covers both high-resource languages (English and Chinese) and multiple low-resource languages (Indonesian, Spanish, Greek, and Japanese), effectively addressing the over-reliance on high-resource languages in existing financial language model evaluations. Through unified task design and evaluation standards, this framework can systematically assess the financial task processing capabilities of large models across languages and resource conditions.
MapFinBen consists of five subtasks:
- Subtask 1: Financial Answer Selection (FinAS) — Given a financial text with corresponding questions and candidate options, the model selects the correct answer that best matches the question semantics and financial context.
- Subtask 2: Financial Question Answering (FinQA) — Given a financial text, the model answers related financial questions based on the text content.
- Subtask 3: Financial Sentiment Analysis (FinSA) — Given a financial text, the model identifies the emotional tendency expressed and classifies it as positive, neutral, or negative.
- Subtask 4: Financial Topic Classification (FinTC) — Given a financial text and candidate topic categories, the model categorizes the text into the most appropriate financial topic category.
- Subtask 5: Financial Text Summarization (FinTS) — Given a financial text, the model extracts and generates a concise, accurate summary covering the core information.
Organizers and Contact Persons
- Organizers: Gang Hu, Kun Yue (Yunnan University), Min Peng (Wuhan University), Lei Shi (Yunnan Normal University)
- Task Contact: Xiaoyong Kong (kongxiaoyong@stu.ynu.edu.cn)
Task Awards
This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.
Task Website
https://github.com/HgITSE/MapFinBen
Task 4: Low-Resource Burmese Template Sentence Inference Evaluation
Task Overview
In translation for low-resource languages such as Burmese, the inference of fixed template sentences, as a domain-specific task, is significantly influenced by internal linguistic factors such as parts of speech, place names, and diverse cultural values, all of which affect the final translation quality.
- Format and convention differences: For example, Chinese expressions like "第1名" (1st place) and "第3章" (Chapter 3) are rendered in Burmese with a word meaning "number" or "no.," which must be immediately followed by Burmese numerals.
- Place name transliteration conflicts: Place name transliteration often conflicts with Burmese-specific pronunciation and historical conventions, so transliterating directly from Chinese causes confusion.
- Diverse cultural values: Translation is shaped by ethnicity, religion, and collectivist values, so a literal rendering is often inadequate. Local cultural sensitivities and religious background must be fully considered.
As a template sentence inference task, this evaluation aims to improve the translation quality of large translation models for Burmese and achieve deeper machine understanding of fixed template sentences.
Organizers and Contact Persons
- Organizers: Ziyan Chen, Jinsong Liu (Transn Information Technology Co., Ltd.), Shaolin Zhu (Tianjin University)
- Task Contact: Hong Ren (Ph.D. student, Tianjin University, rhong@tju.edu.cn), Chuan Wu (Master's student, Tianjin University, wuchuan@tju.edu.cn)
Task Awards
This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS. Prize money sponsored by Transn Information Technology Co., Ltd.
Task Website
https://github.com/merc11/CCL-2026
II. NLP Applications
1. Knowledge Graphs
Task 5: Minor Grain Breeding Information Extraction Evaluation (MGBIE)
Task Overview
The field of minor grain breeding has accumulated a vast amount of knowledge presented in natural language, widely distributed across papers, variety approval documents, and cultivation technical regulations. These texts record breeding material sources, target traits, and measurement results, as well as cultivation management conditions, stress treatment information, and molecular marker evidence. Due to the dense professional terminology, diverse concept expressions, and frequent nested expressions of material names and experimental elements, key information extraction and unified structuring remain challenging, limiting the development of knowledge retrieval, evidence synthesis, and breeding decision support applications.
The Minor Grain Breeding Information Extraction Evaluation (MGBIE) aims to systematically evaluate information extraction models' capabilities in professional terminology recognition, breeding context understanding, key information extraction, and structured expression. The MGBIE dataset contains 2,000 samples total (1,000 training, 400 validation, 600 test).
MGBIE 2026 includes two subtasks:
- Named Entity Recognition for Minor Grain Breeding: Identify and extract key entity information from breeding-related texts, outputting entity boundaries and type labels. The entity type system covers 12 categories: crop, variety, trait, growth period, gene, QTL, molecular marker, chromosome, breeding method, parent/cross combination, abiotic stress, and biotic stress.
- Relation Extraction for Minor Grain Breeding: Extract semantic relations between identified entities, represented as relation triplets. The relation type system includes 6 semantic relations: contains, adopts, has, affects, occurs_in, and locates_in.
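The two subtasks above imply a simple output structure: typed entity spans and relation triplets over them. The following is a minimal sketch of how a system might validate its own output against the 12 entity categories and 6 relation types listed in this announcement. The label spellings for entity types, the `Entity` class, the character-offset convention, and the example sentence are illustrative assumptions, not the official data format.

```python
from dataclasses import dataclass

# The 12 entity categories from the announcement; the exact label strings
# used here are assumed spellings, not the official tag set.
ENTITY_TYPES = {
    "crop", "variety", "trait", "growth_period", "gene", "qtl",
    "molecular_marker", "chromosome", "breeding_method",
    "parent_cross_combination", "abiotic_stress", "biotic_stress",
}

# The 6 relation types named in the announcement.
RELATION_TYPES = {"contains", "adopts", "has", "affects", "occurs_in", "locates_in"}


@dataclass(frozen=True)
class Entity:
    text: str
    start: int  # inclusive character offset (assumed convention)
    end: int    # exclusive character offset (assumed convention)
    type: str


def is_valid_entity(entity: Entity) -> bool:
    """Check the type label and that the span boundaries are well-formed."""
    return entity.type in ENTITY_TYPES and 0 <= entity.start < entity.end


def is_valid_triplet(head: Entity, relation: str, tail: Entity) -> bool:
    """A triplet is valid if both entities are valid and the relation is known."""
    return (
        is_valid_entity(head)
        and is_valid_entity(tail)
        and relation in RELATION_TYPES
    )


# Purely illustrative sentence; not taken from the MGBIE dataset.
text = "Variety A of foxtail millet shows drought tolerance."


def find(span: str, type_: str) -> Entity:
    """Locate a span in the example text and wrap it as an Entity."""
    start = text.index(span)
    return Entity(span, start, start + len(span), type_)


crop = find("foxtail millet", "crop")
variety = find("Variety A", "variety")
print(is_valid_triplet(crop, "contains", variety))
```

Which entity types may legally fill the head and tail of each relation is not stated in the announcement, so this sketch does not enforce any such constraint.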
Organizers and Contact Persons
- Organizers: Zhiwei Hu, Zhaosheng Kong, Jianhua Gao (Houji Laboratory of Shanxi Province, Shanxi Agricultural University); Hongye Tan, Zhichao Yan, Ru Li (Shanxi University); Qianqian Xie (Wuhan University)
- Task Contact: Senjie Yang (Master's student, Shanxi University, yangsenjie1@sxu.edu.cn)
Task Awards
This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.
Task Website
https://github.com/zhiweihu1103/CCL2026-MGBIE
2. NLP Applications in Healthcare, Education, Humanities, and Legal Domains
Task 6: Automatic ICD Coding for Chinese Electronic Medical Records
Task Overview
In recent years, with the intensification of population aging and increased health awareness, the healthcare system faces growing service pressure. In the process of healthcare informatization, the widespread application of electronic medical records provides new possibilities for addressing this challenge. To achieve standardized management and sharing of medical data, the World Health Organization developed the International Classification of Diseases (ICD) standard, converting tens of thousands of diseases and their combinations into a standardized alphanumeric coding system.
However, manual ICD coding of electronic medical record texts is not only time-consuming and labor-intensive but also prone to errors caused by differences in coders' professional skills. Automatic ICD coding systems can improve both coding efficiency and accuracy while providing more reliable data support for disease research and medical management. Against this background, this task has constructed an automatic ICD coding dataset from de-identified Chinese electronic medical records. It contains 2,200 records covering 10 departments, 19 main disease codes, 16 main procedure codes, and a variety of other disease and procedure codes.
Organizers and Contact Persons
- Organizers: Hongjiao Guan, Wenpeng Lu (Qilu University of Technology / Shandong Academy of Sciences), Ying Lian, Guoqiang Chen (First Affiliated Hospital of Shandong First Medical University)
- Task Contact: Chuanlong Li (Master's student, Qilu University of Technology, icdevaluator@163.com)
Task Awards
This evaluation sets up 1 first prize, 3 second prizes, and 6 third prizes, with honorary certificates provided by CIPS.
Task Website
https://github.com/QLU-NLP/icdevaluator-26
Task 7: Cross-Lingual Sentiment Analysis Consistency for Literary Texts (BCCL-CSA)
Task Overview
With the rapid development of Multilingual Large Language Models (MLLMs), NLP technology has matured in handling modern general-purpose corpora. However, existing sentiment analysis techniques still face significant challenges when dealing with Chinese classical literature, which is characterized by high context dependency and deep cultural heritage. Chinese classical literature's emotional expression features typical "implicit expression" and "expressing aspirations through objects," relying on specific imagery, historical allusions, and complex rhetoric to convey emotions rather than directly using emotional adjectives. To this end, this evaluation proposes the Bilingual Chinese Classical Literature Cross-lingual Sentiment Analysis evaluation task (BCCL-CSA).
- Subtask 1: Fine-Grained Sentiment Recognition
Participating systems need to independently capture sentiment features from the given Chinese classical source texts and their corresponding English translations. The evaluation assesses:
1. Sentiment polarity accuracy (Acc_pol): Accurately identifying text sentiment polarity (positive, neutral, negative).
2. Emotion distribution precision (F1_emo): Accurately predicting the probability distribution across six basic emotions (happiness, sadness, fear, anger, surprise, disgust).
3. Subtask score: Sub_Score_1 = 0.4 × Acc_pol + 0.6 × F1_emo
- Subtask 2: Cross-Lingual Sentiment Representation Consistency
This subtask focuses on the stability of sentiment mapping across languages. Its metrics are polarity judgment consistency (Con_label) and emotion distribution similarity (Sim_dist).
Dataset: CCL-SEL, sourced from 12 Chinese classical works, with 250 Chinese-English sentence pairs per work.
Final Ranking Score: Total_Score = 0.5 × Sub_Score_1 + 0.5 × Sub_Score_2
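The weighted scores above can be worked through numerically. The sketch below implements the two stated formulas; the function names and the metric values are made up for illustration. Because this announcement does not say how Con_label and Sim_dist are combined into the Subtask 2 score, that score is taken as a given input rather than computed.

```python
def sub_score_1(acc_pol: float, f1_emo: float) -> float:
    """Subtask 1 score: 0.4 x polarity accuracy + 0.6 x emotion F1."""
    return 0.4 * acc_pol + 0.6 * f1_emo


def total_score(score_1: float, score_2: float) -> float:
    """Final ranking score: equal-weight average of the two subtask scores."""
    return 0.5 * score_1 + 0.5 * score_2


# Illustrative metric values, not real results.
acc_pol = 0.80
f1_emo = 0.60
score_1 = sub_score_1(acc_pol, f1_emo)

# Placeholder Subtask 2 score; its formula is not given in the announcement.
score_2 = 0.70

print(f"Sub_Score_1 = {score_1:.2f}")
print(f"Total_Score = {total_score(score_1, score_2):.2f}")
```

With these weights, a one-point gain in F1_emo moves Sub_Score_1 by 0.6 points, compared with 0.4 points for the same gain in Acc_pol.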
Organizers and Contact Persons
- Organizers: Haiyang Zhang, Xiaojun Zhang (Xi'an Jiaotong-Liverpool University); Ruifeng Xu (Harbin Institute of Technology, Shenzhen)
- Task Contact: Jingshi Zhou (Jingshi.Zhou@outlook.com)
Task Awards
First Prize: 1, Second Prize: 2, Third Prize: 3.
Task Website
https://github.com/Jingshi-Zhou/-BCCL-CSA-2026-
Task 8: Evidence-Based Fact-Checking for LLM-Generated Chinese Medical Content
Task Overview
Evidence-based Medical Fact-checking is a critical task aimed at verifying the authenticity of online medical content. As the internet becomes the primary channel for the public to obtain health information, the spread of medical misinformation poses severe challenges to public health safety. This task requires models to not only understand medical claims but also combine retrieved relevant evidence to determine the degree of support for claims (e.g., supported, refuted, or insufficient evidence).
Given a set of medical assertions generated by large language models and their corresponding evidence, the model should predict the correct label (i.e., veracity):
- Supported: Evidence fully supports the claim content;
- Partially Supported: Evidence supports part of the claim but with uncertainty or uncovered details;
- Refuted: Evidence contradicts the claim content;
- Uncertain: Evidence is related to the claim but insufficient to confirm or refute it;
- Not Applicable: Evidence is completely unrelated to the claim.
Organizers and Contact Persons
- Organizers: Jionglong Su, Zhengyong Jiang, Wei Wang (Xi'an Jiaotong-Liverpool University)
- Task Contact: Tong Chen (Xi'an Jiaotong-Liverpool University, Tong.Chen19@student.xjtlu.edu.cn)
Task Awards
This evaluation sets up first, second, and third prizes, with honorary certificates provided by CIPS.
Task Website
https://github.com/AshleyChenNLP/MedFact
Task 9: The Second Chinese Classical Poetry Appreciation Evaluation
Task Overview
Chinese classical poetry is characterized by high conciseness and linguistic musicality, emphasizing parallelism, tonal patterns, and rhyme. Accurate understanding of classical poetry semantics requires not only mastery of the linguistic features of classical poetry but also knowledge of historical and cultural backgrounds, combined with cognition of the natural scenes and emotional expressions depicted in the poems, for comprehensive reasoning and understanding.
To further measure models' depth of language understanding and cultural reasoning ability in Chinese classical poetry appreciation, we launch the Second Chinese Classical Poetry Appreciation Evaluation. Building on the first edition, this evaluation further focuses on deep understanding and complex reasoning abilities, introducing more challenging advanced tasks. The specific task settings are:
Subtask 1: Poetry Comprehension
- Word-level understanding: Explaining phrase-level semantics in classical poetry (Q&A format).
- Verse-level understanding: Explaining verse-level semantics in classical poetry (Q&A format).
- Sentiment understanding: Inferring the emotions conveyed by the poet through the work (multiple-choice format).
- Allusion recognition: Determining whether verses contain allusions and providing explanations (Q&A format).
Subtask 2: Poetry Reasoning
- Poetry analogy: Identifying analogous relationships between different things in classical poetry, such as associations between images (Q&A format).
- Poetry analysis: Based on poetry content and context, analyzing given options and determining the most reasonable statement (multiple-choice format).
The final ranking will be determined by comprehensive performance across both subtasks. Participating teams may fine-tune open-source LLMs, but retrieval-augmented generation (RAG) and other techniques that draw on external knowledge to answer questions are prohibited.
Organizers and Contact Persons
- Organizers: Xuefeng Bai, Kehai Chen (Harbin Institute of Technology, Shenzhen)
- Task Contact: Yingjie Zhu, Zhenwu Pei (Harbin Institute of Technology, Shenzhen, zhuyj@stu.hit.edu.cn)
Task Awards
- First Prize: 1, total prize of 3,000 RMB;
- Second Prize: 1, total prize of 2,000 RMB;
- Third Prize: 1, total prize of 1,000 RMB.
All prizes will be distributed within 10 working days after the announcement.
Task Website
https://github.com/HITICI-NLPGroup/CCPA-EvalTask
III. Generative AI and LLM Core Capabilities
Task 10: Scenario-Based Commonsense Reasoning Evaluation (SCoRE)
Task Overview
Reasoning is an advanced cognitive function involving the analysis, induction, and deduction of new information based on existing knowledge. It plays a fundamental role in human intelligence. Previous benchmarks have primarily evaluated LLMs' reasoning capabilities in complex, specialized domains, and often overlook a key aspect of human-like cognition: commonsense reasoning. Evaluating this capability in large language models is crucial for AI development, as it significantly influences LLMs' decision-making in everyday situations and is essential for progress toward human-like Artificial General Intelligence (AGI).
To diagnose LLMs' commonsense reasoning capabilities comprehensively and at a fine granularity, we propose the Scenario-based Commonsense Reasoning Evaluation dataset (SCoRE). Tasks in this dataset fall into five categories based on the commonsense domain involved:
- Spatial Commonsense Reasoning: Given a spatial scene and known spatial relationships between entities, this task requires machines to reason about entities' positions and unknown spatial relationships.
- Temporal Commonsense Reasoning: Given a temporal narrative scene with events and known temporal relationships (e.g., sequential order, duration, relative or absolute time points), this task requires machines to reason about specific moments on a timeline and unknown temporal spans or sequential relationships.
- Social Commonsense Reasoning: Given a social interaction scene with known interpersonal relationships (e.g., kinship, workplace, friendship, or mentorship), this task requires machines to reason about individuals' roles in social networks and implicit or unknown social relationships.
- Natural Commonsense Reasoning: Given a set of natural objects and known attribute constraints (e.g., category, physical properties, functions, or sensory features), this task requires machines to reason about one-to-one correspondences between objects and descriptions, and unknown attributes or classification features.
- Integrated Commonsense Reasoning: This task constructs multi-dimensional reasoning problems requiring machines to simultaneously process constraints and commonsense from spatial, temporal, natural attribute, and social relation domains, building unified reasoning models for collaborative analysis and decision-making.
Organizers and Contact Persons
- Organizers: Weidong Zhan, Zhifang Sui (Peking University)
- Task Contact: Nan Hu (Ph.D. student, Peking University, hunan@stu.pku.edu.cn)
Task Awards
- First Prize: 0-1;
- Second Prize: 0-2;
- Third Prize: 0-4.
Task Website
https://pku-space.github.io/SCoRE2026/
Task 11: Automated Hazard Analysis and Risk Assessment for Autonomous Driving
Task Overview
As automotive electrical/electronic (E/E) architecture evolves toward intelligence and connectivity, functional safety has grown into a systematic safety engineering discipline covering software-hardware co-design, and has become a cornerstone of the deployment and mass production of autonomous driving technology. Within this discipline, Hazard Analysis and Risk Assessment (HARA) performs the core function of identifying risks and defining top-level safety requirements. The process systematically models vehicle operating scenarios, potential functional failure modes, and environmental factors; extracts key features such as vehicle motion states, road topology, and traffic participant distributions; and quantitatively assesses risk along the Severity (S), Exposure (E), and Controllability (C) dimensions to determine the Automotive Safety Integrity Level (ASIL).
To promote the application of large models and AI technology in functional safety, we propose this evaluation task and have constructed a structured dataset focusing on evaluating autonomous driving safety logic reasoning and requirement generation. The dataset is derived from de-identified real industrial project data, focusing on the core high-risk failure mode "unintended driving force/torque output," containing 3,000 high-quality annotated data points.
This evaluation includes two subtasks:
- Hazard Event Identification and Scenario Description Generation: The model must accurately identify potential hazard events based on given vehicle operating conditions and environmental parameters, and generate structured hazard scenario descriptions compliant with engineering standards.
- Risk Parameter Assessment and Level Reasoning: The model must reason and output key HARA risk indicators (S/E/C) based on scenario features, and determine the corresponding safety integrity level.
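To make the S/E/C-to-level step concrete, here is a minimal sketch of one way to derive an ASIL from the three risk parameters. The announcement does not specify the mapping used for the dataset labels. This example assumes the class ranges S0-S3, E0-E4, and C0-C3 and uses the commonly cited additive shorthand for the ISO 26262 determination table, in which any zero class or a sum below 7 yields QM (no ASIL). The function name `determine_asil` is hypothetical, and participants should check the task's own annotation guidelines.

```python
# Additive shorthand for the ASIL determination table (an assumption here):
# with S in 1..3, E in 1..4, C in 1..3, the sum S + E + C selects the level.
ASIL_BY_SUM = {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}


def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Map S (0-3), E (0-4), and C (0-3) class indices to QM or ASIL A-D."""
    if not (0 <= severity <= 3 and 0 <= exposure <= 4 and 0 <= controllability <= 3):
        raise ValueError("expected S0-S3, E0-E4, C0-C3")
    # A zero class (S0, E0, or C0) means no ASIL is assigned.
    if 0 in (severity, exposure, controllability):
        return "QM"
    total = severity + exposure + controllability
    # Sums below 7 fall under quality management (QM).
    return ASIL_BY_SUM.get(total, "QM")


# Illustrative inputs: a severe, frequent, hard-to-control hazard event
# versus a mild, rare, easily controlled one.
print(determine_asil(3, 4, 3))
print(determine_asil(1, 1, 1))
```

Keeping the mapping in a small lookup makes it easy to swap in an explicit table if the task's labeling scheme turns out to differ from this shorthand.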
Organizers and Contact Persons
- Organizers: Xu Yang (Beijing Institute of Technology), Haiyang Zhang (Xi'an Jiaotong-Liverpool University), Wei Wang (Xi'an Jiaotong-Liverpool University)
- Task Contact: Zimu Wang (Ph.D. student, Xi'an Jiaotong-Liverpool University, Zimu.Wang19@student.xjtlu.edu.cn)
Task Awards
- First Prize: 1, total prize of 5,000 RMB
- Second Prize: 1, total prize of 3,000 RMB
- Third Prize: 1, total prize of 2,000 RMB
Sponsorship
Prize money sponsored by UCharts Technology (Fuzhou) Co., Ltd.
Task Website
https://ccl2026-hara.github.io
Task 12: Youku Accessible Theater Cup — Accessible Structured Subtitle Generation for Hearing-Impaired Groups
Task Overview
In the context of China's information accessibility construction entering the "institutional guarantee" stage, subtitles have become a key accessibility service for hearing-impaired and elderly groups to access audio-visual information. However, existing technical evaluations lack benchmarks that target real application scenarios while comprehensively considering "readability," "core information accuracy," and "response speed." This task systematically evaluates the complete pipeline from "speech/video input" to generating "structured subtitle documents for human reading," particularly focusing on solving the two major pain points of "social time lag" and "critical information loss" in high-information-density real scenarios (e.g., healthcare, finance, government services).
The evaluation task is designed with two parallel tracks:
- Track A: PC — Simulates cloud or high-performance desktop environments to explore performance upper bounds with no computing resource limitations.
- Track B: Mobile — Simulates mobile device (phone, AR glasses) real-time communication scenarios with explicit constraints on model size, memory usage, and real-time performance.
Each track includes two subtasks:
- Subtask 1: Foundation Subtitle Generation (Foundation Track)
Evaluates speech transcription, timestamp alignment, noise robustness, and other fundamental capabilities.
- Subtask 2: Structured Readable Subtitle Generation (Structured Track)
Evaluates the model's comprehensive ability to generate structured subtitles that conform to human reading habits, including reasonable segmentation, punctuation, speaker identification, and core keyword accuracy.
Data Scale: The evaluation has constructed a multi-scenario real speech/video test set of approximately 30-50 hours, covering four typical scenarios: news speeches, film and variety shows, real-life conversations, and multi-person meetings.
Organizers and Contact Persons
- Organizers: Dengfeng Yao (Beijing Union University / Tsinghua University)
- Task Contact: Jie Shi (Master's student, Beijing Union University, 20251083510951@buu.edu.cn)
Task Awards
This evaluation sets up first, second, and third prizes, with honorary certificates issued by CIPS; sponsored awards are also established, with prizes supported by leading technology companies such as Alibaba.
Task Website
https://github.com/ALINOSJ/IASSGE-2026
Task 13: In-Image Translation Quality Evaluation
Task Overview
With the acceleration of globalization and the growing demand for cross-lingual communication, In-Image Translation has become an important branch of machine translation. Unlike traditional text translation, in-image translation requires simultaneously processing visual and linguistic information, covering text detection, recognition, translation, and rendering, with broad application value in cross-border e-commerce, travel guides, and multilingual content localization. Chinese in-image translation faces unique challenges: high visual complexity of Chinese characters, diverse writing directions (horizontal/vertical), significant text length differences with target languages, and rich cultural connotations.
This evaluation focuses on designing and training automatic evaluation systems that can accurately score image translation results from multiple dimensions. It aims to: establish standardized benchmarks, promote methodological innovation, explore evaluation paradigms through open competition, and establish reproducible, comparable evaluation standards for the community.
Organizers and Contact Persons
- Organizers: Haijun Li, Zifu Shang, Jie Liang, Zhao Xu, Weihua Luo
- Task Contact: Yuxuan Han (Alibaba Cloud Technical Expert, baileng.hyx@alibaba-inc.com)
Task Awards
- First Prize: 1, total prize of 20,000 RMB
- Second Prize: 1, total prize of 10,000 RMB
- Third Prize: 2, total prize of 5,000 RMB
Sponsorship
Prize money sponsored by Alibaba Cloud, with honorary certificates issued by CIPS.
Task Website
https://tianchi.aliyun.com/competition/entrance/532463
Please contact the task organizers or evaluation chairs if you have any questions.