Computation and Language
☆ MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, Markus Freitag
In this paper, we present our submissions to the unified WMT25 Translation
Evaluation Shared Task. For the Quality Score Prediction subtask, we create a
new generation of MetricX with improvements in the input format and the
training protocol, while for the Error Span Detection subtask we develop a new
model, GemSpanEval, trained to predict error spans along with their severities
and categories. Both systems are based on the state-of-the-art multilingual
open-weights model Gemma 3, fine-tuned on publicly available WMT data. We
demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture
with a regression head on top, can be trained to effectively predict both MQM
and ESA quality scores, and significantly outperforms its predecessor. We show
that our decoder-only GemSpanEval model, on the other hand, is competitive in
error span detection with xCOMET, a strong encoder-only sequence-tagging
baseline. With error span detection formulated as a generative task, we
instruct the model to also output the context for each predicted error span,
thus ensuring that error spans are identified unambiguously.
comment: Accepted to WMT25
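As a rough illustration of the encoder-plus-regression-head recipe described in
the abstract above (a sketch, not the authors' code), the snippet below uses a
tiny bidirectional Transformer encoder as a stand-in for the adapted Gemma 3
backbone and trains a scalar quality-score head with an MSE loss; all sizes and
data are placeholders.

    import torch
    import torch.nn as nn

    class QualityRegressor(nn.Module):
        """Bidirectional encoder with a regression head on the pooled output."""
        def __init__(self, vocab_size=32000, d_model=256, n_layers=2, n_heads=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)  # full, non-causal attention
            self.head = nn.Linear(d_model, 1)                      # predicts a quality score

        def forward(self, input_ids, attention_mask):
            x = self.encoder(self.embed(input_ids),
                             src_key_padding_mask=~attention_mask.bool())
            # mean-pool over non-padding positions
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (x * mask).sum(1) / mask.sum(1).clamp(min=1.0)
            return self.head(pooled).squeeze(-1)

    # One regression training step on placeholder (source, translation) inputs.
    model = QualityRegressor()
    ids = torch.randint(0, 32000, (2, 16))
    mask = torch.ones(2, 16, dtype=torch.long)
    scores = torch.tensor([0.2, 4.5])            # placeholder MQM/ESA-style labels
    loss = nn.MSELoss()(model(ids, mask), scores)
    loss.backward()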
☆ ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu
Virtual Reality (VR) games require players to translate high-level semantic
actions into precise device manipulations using controllers and head-mounted
displays (HMDs). While humans intuitively perform this translation based on
common sense and embodied understanding, whether Large Language Models (LLMs)
can effectively replicate this ability remains underexplored. This paper
introduces a benchmark, ComboBench, evaluating LLMs' capability to translate
semantic actions into VR device manipulation sequences across 262 scenarios
from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II,
and Vivecraft. We evaluate seven LLMs (GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro,
LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash), comparing them against
annotated ground truth and human performance. Our results reveal that while
top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition
capabilities, they still struggle with procedural reasoning and spatial
understanding compared to humans. Performance varies significantly across
games, suggesting sensitivity to interaction complexity. Few-shot examples
substantially improve performance, indicating potential for targeted
enhancement of LLMs' VR manipulation capabilities. We release all materials at
https://sites.google.com/view/combobench.
☆ Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig
Public research results on large-scale supervised finetuning of AI agents
remain relatively rare, since the collection of agent training data presents
unique challenges. In this work, we argue that the bottleneck is not a lack of
underlying data sources, but that a large variety of data is fragmented across
heterogeneous formats, tools, and interfaces. To this end, we introduce the
agent data protocol (ADP), a light-weight representation language that serves
as an "interlingua" between agent datasets in diverse formats and unified agent
training pipelines downstream. The design of ADP is expressive enough to
capture a large variety of tasks, including API/tool use, browsing, coding,
software engineering, and general agentic workflows, while remaining simple to
parse and train on without engineering at a per-dataset level. In experiments,
we unified a broad collection of 13 existing agent training datasets into ADP
format, and converted the standardized ADP data into training-ready formats for
multiple agent frameworks. We performed SFT on these data and demonstrated an
average performance gain of ~20% over corresponding base models, delivering
state-of-the-art or near-SOTA performance on standard coding, browsing, tool
use, and research benchmarks without domain-specific tuning. All code and data
are released publicly, in the hope that ADP could help lower the barrier to
standardized, scalable, and reproducible agent training.
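To make the "interlingua" idea above concrete, here is a hypothetical miniature
in Python: the field names and the conversion to chat-style SFT messages are
illustrative assumptions, not the actual ADP specification.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Step:
        """One agent step in a unified, framework-agnostic form (assumed fields)."""
        thought: Optional[str]      # free-text reasoning, if the source dataset has it
        action: str                 # e.g. "browse", "run_code", "call_api"
        action_input: str           # arguments serialized as text
        observation: str            # environment feedback

    @dataclass
    class Trajectory:
        task: str                   # natural-language task description
        steps: List[Step] = field(default_factory=list)

    def to_chat_messages(traj: Trajectory) -> list:
        """Convert a unified trajectory into generic chat-style SFT messages."""
        msgs = [{"role": "user", "content": traj.task}]
        for s in traj.steps:
            assistant = (s.thought + "\n" if s.thought else "") + f"{s.action}({s.action_input})"
            msgs.append({"role": "assistant", "content": assistant})
            msgs.append({"role": "tool", "content": s.observation})
        return msgs

    traj = Trajectory(
        task="Find the release year of Python 3.0.",
        steps=[Step(thought="Search the web.", action="browse",
                    action_input="Python 3.0 release year", observation="December 2008")],
    )
    print(to_chat_messages(traj))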
☆ Tongyi DeepResearch Technical Report
Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang
We present Tongyi DeepResearch, an agentic large language model, which is
specifically designed for long-horizon, deep information-seeking research
tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is
developed through an end-to-end training framework that combines agentic
mid-training and agentic post-training, enabling scalable reasoning and
information seeking across complex tasks. We design a highly scalable data
synthesis pipeline that is fully automatic, without relying on costly human
annotation, and empowers all training stages. By constructing customized
environments for each stage, our system enables stable and consistent
interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total
parameters, with only 3.3 billion activated per token, achieves
state-of-the-art performance across a range of agentic deep research
benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH,
WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We
open-source the model, framework, and complete solutions to empower the
community.
comment: https://tongyi-agent.github.io/blog
☆ ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking
Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, Yong Jiang
Parallel thinking expands exploration breadth, complementing the deep
exploration of information-seeking (IS) agents to further enhance
problem-solving capability. However, conventional parallel thinking faces two
key challenges in this setting: inefficiency from repeatedly rolling out from
scratch, and difficulty in integrating long-horizon reasoning trajectories
during answer generation, as limited context capacity prevents full
consideration of the reasoning process. To address these issues, we propose
ParallelMuse, a two-stage paradigm designed for deep IS agents. The first
stage, Functionality-Specified Partial Rollout, partitions generated sequences
into functional regions and performs uncertainty-guided path reuse and
branching to enhance exploration efficiency. The second stage, Compressed
Reasoning Aggregation, exploits reasoning redundancy to losslessly compress
information relevant to answer derivation and synthesize a coherent final
answer. Experiments across multiple open-source agents and benchmarks
demonstrate up to 62% performance improvement with a 10--30% reduction in
exploratory token consumption.
☆ AgentFold: Long-Horizon Web Agents with Proactive Context Management
Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang
LLM-based web agents show immense promise for information seeking, yet their
effectiveness on long-horizon tasks is hindered by a fundamental trade-off in
context management. Prevailing ReAct-based agents suffer from context
saturation as they accumulate noisy, raw histories, while methods that rigidly
summarize the full history at each step risk the irreversible loss of critical
details. Addressing these, we introduce AgentFold, a novel agent paradigm
centered on proactive context management, inspired by the human cognitive
process of retrospective consolidation. AgentFold treats its context as a
dynamic cognitive workspace to be actively sculpted, rather than a passive log
to be filled. At each step, it learns to execute a 'folding' operation, which
manages its historical trajectory at multiple scales: it can perform granular
condensations to preserve vital, fine-grained details, or deep consolidations
to abstract away entire multi-step sub-tasks. The results on prominent
benchmarks are striking: with simple supervised fine-tuning (without continual
pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp
and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or
matches open-source models of a dramatically larger scale, such as the
DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like
OpenAI's o4-mini.
comment: 26 pages, 9 figures
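A toy illustration of the multi-scale "folding" idea described above (not the
AgentFold implementation): the context is kept as a list of entries, and a fold
replaces a contiguous span of past steps with a single summary, whether a
granular condensation or a deep consolidation of a whole sub-task.

    from typing import List

    class FoldedContext:
        """Toy workspace: raw steps can be folded into summaries at any granularity."""
        def __init__(self):
            self.entries: List[str] = []   # mix of raw steps and folded summaries

        def append(self, step: str):
            self.entries.append(step)

        def fold(self, start: int, end: int, summary: str):
            """Replace entries[start:end] with one summary string.
            A short span -> granular condensation; a whole sub-task -> deep consolidation."""
            self.entries[start:end] = [f"[folded] {summary}"]

        def render(self) -> str:
            return "\n".join(self.entries)

    ctx = FoldedContext()
    for i in range(5):
        ctx.append(f"step {i}: clicked link, got page text ...")
    # consolidate the first four steps into one abstract summary
    ctx.fold(0, 4, "navigated site and located the relevant article")
    print(ctx.render())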
☆ WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu, Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Liwen Zhang, Xinyu Wang, Pengjun Xie, Jingren Zhou, Yong Jiang
Large Language Model (LLM)-based agents have emerged as a transformative
approach for open-ended problem solving, with information seeking (IS) being a
core capability that enables autonomous reasoning and decision-making. While
prior research has largely focused on improving retrieval depth, we observe
that current IS agents often suffer from low search efficiency, which in turn
constrains overall performance. A key factor underlying this inefficiency is
the sparsity of target entities in training tasks, which limits opportunities
for agents to learn and generalize efficient search behaviors. To address these
challenges, we propose WebLeaper, a framework for constructing high-coverage IS
tasks and generating efficient solution trajectories. We formulate IS as a
tree-structured reasoning problem, enabling a substantially larger set of
target entities to be embedded within a constrained context. Leveraging curated
Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic,
Union, and Reverse-Union, to systematically increase both IS efficiency and
efficacy. Finally, we curate training trajectories by retaining only those that
are simultaneously accurate and efficient, ensuring that the model is optimized
for both correctness and search performance. Extensive experiments in both
basic and comprehensive settings, conducted on five IS benchmarks (BrowseComp,
GAIA, xbench-DeepSearch, WideSearch, and Seal-0), demonstrate that our method
consistently achieves improvements in both effectiveness and efficiency over
strong baselines.
☆ AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, Yong Jiang
Training large language model agents on tasks at the frontier of their
capabilities is key to unlocking advanced reasoning. We introduce a data
synthesis approach inspired by the educational theory of the Zone of Proximal
Development (ZPD), which defines this frontier as tasks an LLM cannot solve
alone but can master with guidance. To operationalize this, we present the
AgentFrontier Engine, an automated pipeline that synthesizes high-quality,
multidisciplinary data situated precisely within the LLM's ZPD. This engine
supports both continued pre-training with knowledge-intensive data and targeted
post-training on complex reasoning tasks. From the same framework, we derive
the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent
capabilities on these frontier tasks. We train the AgentFrontier-30B-A3B model on
our synthesized data, which achieves state-of-the-art results on demanding
benchmarks like Humanity's Last Exam, even surpassing some leading proprietary
agents. Our work demonstrates that a ZPD-guided approach to data synthesis
offers a scalable and effective path toward building more capable LLM agents.
comment: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
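A schematic of the ZPD selection criterion implied above: keep only tasks the
model fails on its own but solves with guidance. The solver callables here are
placeholders for the engine's actual model calls; only the criterion itself is
taken from the abstract.

    from typing import Callable, Iterable, List

    def zpd_filter(tasks: Iterable[str],
                   solve_alone: Callable[[str], bool],
                   solve_with_guidance: Callable[[str], bool]) -> List[str]:
        """Keep tasks in the Zone of Proximal Development:
        unsolvable alone, solvable with guidance (hints, tools, references)."""
        return [t for t in tasks
                if not solve_alone(t) and solve_with_guidance(t)]

    # Toy stand-ins; real usage would query an LLM with and without guidance.
    tasks = ["easy arithmetic", "frontier synthesis question", "unsolvable puzzle"]
    alone = {"easy arithmetic": True, "frontier synthesis question": False, "unsolvable puzzle": False}
    guided = {"easy arithmetic": True, "frontier synthesis question": True, "unsolvable puzzle": False}
    print(zpd_filter(tasks, alone.get, guided.get))   # -> ['frontier synthesis question']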
☆ Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
LLM-based search agents are increasingly trained on entity-centric synthetic
data to solve complex, knowledge-intensive tasks. However, prevailing training
methods like Group Relative Policy Optimization (GRPO) discard this rich entity
information, relying instead on sparse, outcome-based rewards. This critical
limitation renders them unable to distinguish informative "near-miss"
samples (those with substantially correct reasoning but a flawed final answer)
from complete failures, thus discarding valuable learning signals. We
address this by leveraging the very entities discarded during training. Our
empirical analysis reveals a strong positive correlation between the number of
ground-truth entities identified during an agent's reasoning process and final
answer accuracy. Building on this insight, we introduce Entity-aware Group
Relative Policy Optimization (E-GRPO), a novel framework that formulates a
dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect
samples proportional to their entity match rate, enabling the model to
effectively learn from these "near-misses". Experiments on diverse
question-answering (QA) and deep research benchmarks show that E-GRPO
consistently and significantly outperforms the GRPO baseline. Furthermore, our
analysis reveals that E-GRPO not only achieves superior accuracy but also
induces more efficient reasoning policies that require fewer tool calls,
demonstrating a more effective and sample-efficient approach to aligning search
agents.
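The dense reward described above can be sketched in a few lines; the
partial-credit coefficient and the matching rule are assumptions for
illustration, not the paper's exact values.

    def entity_aware_reward(answer_correct: bool,
                            matched_entities: int,
                            total_entities: int,
                            partial_weight: float = 0.5) -> float:
        """Outcome reward, plus partial credit for incorrect 'near-miss' rollouts
        proportional to the fraction of ground-truth entities surfaced while reasoning."""
        if answer_correct:
            return 1.0
        if total_entities == 0:
            return 0.0
        return partial_weight * (matched_entities / total_entities)

    print(entity_aware_reward(False, 3, 4))   # near-miss: 0.375
    print(entity_aware_reward(False, 0, 4))   # complete failure: 0.0
    print(entity_aware_reward(True, 2, 4))    # correct answer: 1.0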
☆ STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, Jiaqi Wang
Despite rapid progress in Multi-modal Large Language Models and Large
Audio-Language Models, existing audio benchmarks largely test semantics that
can be recovered from text captions, masking deficits in fine-grained
perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning
over sound dynamics in time and 3D space, and introduce STAR-Bench to
measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six
attributes under absolute and relative regimes) with a Holistic Spatio-Temporal
Reasoning setting that includes segment reordering for continuous and discrete
processes and spatial tasks spanning static localization, multi-source
relations, and dynamic trajectories. Our data curation pipeline uses two
methods to ensure high-quality samples. For foundational tasks, we use
procedurally synthesized and physics-simulated audio. For holistic data, we
follow a four-stage process that includes human annotation and final selection
based on human performance. Unlike prior benchmarks where caption-only
answering reduces accuracy slightly, STAR-Bench induces far larger drops
(-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically
hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared
with humans and a capability hierarchy: closed-source models are bottlenecked
by fine-grained perception, while open-source models lag across perception,
knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear
path forward for developing future models with a more robust understanding of
the physical world.
comment: Homepage: https://internlm.github.io/StarBench/
☆ SPICE: Self-Play In Corpus Environments Improves Reasoning
Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston
Self-improving systems require environmental interaction for continuous
adaptation. We introduce SPICE (Self-Play In Corpus Environments), a
reinforcement learning framework where a single model acts in two roles: a
Challenger that mines documents from a large corpus to generate diverse
reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics,
the Challenger creates an automatic curriculum at the frontier of the
Reasoner's capability, while corpus grounding provides the rich,
near-inexhaustible external signal necessary for sustained improvement. Unlike
existing ungrounded self-play methods that offer more limited benefits, SPICE
achieves consistent gains across mathematical (+8.9%) and general reasoning
(+9.8%) benchmarks on multiple model families. Our analysis reveals how
document grounding is a key ingredient in SPICE to continuously generate its
own increasingly challenging goals and achieve them, enabling sustained
self-improvement.
☆ Dissecting Role Cognition in Medical LLMs via Neuronal Ablation
Large language models (LLMs) have gained significant traction in medical
decision support systems, particularly in the context of medical question
answering and role-playing simulations. A common practice, Prompt-Based Role
Playing (PBRP), instructs models to adopt different clinical roles (e.g.,
medical students, residents, attending physicians) to simulate varied
professional behaviors. However, the impact of such role prompts on model
reasoning capabilities remains unclear. This study introduces the
RP-Neuron-Activated Evaluation Framework (RPNA) to evaluate whether role
prompts induce distinct, role-specific cognitive processes in LLMs or merely
modify linguistic style. We test this framework on three medical QA datasets,
employing neuron ablation and representation analysis techniques to assess
changes in reasoning pathways. Our results demonstrate that role prompts do
not significantly enhance the medical reasoning abilities of LLMs. Instead,
they primarily affect surface-level linguistic features, with no evidence of
distinct reasoning pathways or cognitive differentiation across clinical
roles. Despite superficial stylistic changes, the core decision-making
mechanisms of LLMs remain uniform across roles, indicating that current PBRP
methods fail to replicate the cognitive complexity found in real-world medical
practice. This highlights the limitations of role-playing in medical AI and
emphasizes the need for models that simulate genuine cognitive processes
rather than linguistic imitation. We have released the related code at
https://github.com/IAAR-Shanghai/RolePlay_LLMDoctor
comment: 15 pages, 9 figures
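For readers unfamiliar with the technique, below is a generic neuron-ablation
forward hook of the kind such analyses rely on (a sketch, not the released RPNA
code): selected hidden units are zeroed during the forward pass so their
contribution to downstream behavior can be measured.

    import torch
    import torch.nn as nn

    def ablate_neurons(layer: nn.Module, neuron_ids):
        """Register a hook that zeroes the given output units of `layer`."""
        ids = torch.tensor(list(neuron_ids))
        def hook(module, inputs, output):
            output = output.clone()
            output[..., ids] = 0.0
            return output
        return layer.register_forward_hook(hook)

    # Toy model standing in for one transformer MLP block.
    mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
    x = torch.randn(2, 16)
    baseline = mlp(x)
    handle = ablate_neurons(mlp[0], neuron_ids=[3, 7, 42])   # silence three hidden units
    ablated = mlp(x)
    handle.remove()
    print((baseline - ablated).abs().max())   # nonzero: ablation changed the computation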
☆ InteractComp: Evaluating Search Agents With Ambiguous Queries
Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo
Language agents have demonstrated remarkable potential in web search and
information retrieval. However, these search agents assume user queries are
complete and unambiguous, an assumption that diverges from reality where users
begin with incomplete queries requiring clarification through interaction. Yet
most agents lack interactive mechanisms during the search process, and existing
benchmarks cannot assess this capability. To address this gap, we introduce
InteractComp, a benchmark designed to evaluate whether search agents can
recognize query ambiguity and actively interact to resolve it during search.
Following the principle of "easy to verify, interact to disambiguate," we
construct 210 expert-curated questions across 9 domains through a
target-distractor methodology that creates genuine ambiguity resolvable only
through interaction. Evaluation of 17 models reveals a striking failure: the best
model achieves only 13.73% accuracy despite 71.50% with complete context,
exposing systematic overconfidence rather than reasoning deficits. Forced
interaction produces dramatic gains, demonstrating latent capability that
current strategies fail to engage. Longitudinal analysis shows interaction capabilities
stagnated over 15 months while search performance improved seven-fold,
revealing a critical blind spot. This stagnation, coupled with the immediate
feedback inherent to search tasks, makes InteractComp a valuable resource for
both evaluating and training interaction capabilities in search agents. The
code is available at https://github.com/FoundationAgents/InteractComp.
☆ MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
Human evaluation of machine translation is in an arms race with translation
model quality: as our models get better, our evaluation methods need to be
improved to ensure that quality gains are not lost in evaluation noise. To this
end, we experiment with a two-stage version of the current state-of-the-art
translation evaluation paradigm (MQM), which we call MQM re-annotation. In this
setup, an MQM annotator reviews and edits a set of pre-existing MQM
annotations that may have come from themselves, another human annotator, or an
automatic MQM annotation system. We demonstrate that rater behavior in
re-annotation aligns with our goals, and that re-annotation results in
higher-quality annotations, mostly due to finding errors that were missed
during the first pass.
☆ Evolving Diagnostic Agents in a Virtual Clinical Environment
Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
In this paper, we present a framework for training large language models
(LLMs) as diagnostic agents with reinforcement learning, enabling them to
manage multi-turn diagnostic processes, adaptively select examinations, and
commit to final diagnoses. Unlike instruction-tuned models trained on static
case summaries, our method acquires diagnostic strategies through interactive
exploration and outcome-based feedback. Our contributions are fourfold: (i) We
present DiagGym, a diagnostics world model trained with electronic health
records that emits examination outcomes conditioned on patient history and
recommended examination, serving as a virtual clinical environment for
realistic diagnosis training and evaluation; (ii) We train DiagAgent via
end-to-end, multi-turn reinforcement learning to learn diagnostic policies that
optimize both information yield and diagnostic accuracy; (iii) We introduce
DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated
examination recommendations and 99 cases annotated with 973 physician-written
rubrics on the diagnostic process; (iv) We demonstrate superior performance across
diverse diagnostic settings. DiagAgent significantly outperforms 10
state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two
prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34%
higher diagnostic accuracy and a 44.03% improvement in examination
recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase
in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In
rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by
7.1% in weighted rubric score. These findings indicate that learning policies
in interactive clinical environments confers dynamic and clinically meaningful
diagnostic management abilities unattainable through passive training alone.
☆ Optimizing Retrieval for RAG via Reinforced Contrastive Learning
As retrieval-augmented generation (RAG) becomes increasingly widespread, the
role of information retrieval (IR) is shifting from retrieving information for
human users to retrieving contextual knowledge for artificial intelligence (AI)
systems, where relevance becomes difficult to define or annotate beforehand. To
address this challenge, we propose R3, a Retrieval framework optimized for RAG
through trial-and-feedback Reinforced contrastive learning. Unlike prior
approaches that rely on annotated or synthetic data for supervised fine-tuning,
R3 enables the retriever to dynamically explore and optimize relevance within
the RAG environment. During training, the retrieved results interact with the
environment to produce contrastive signals that automatically guide the
retriever's self-improvement. Extensive experiments across diverse tasks
demonstrate that R3 improves RAG performance by 5.2% over the original
retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving
comparable results to LLM-augmented retrieval and RAG systems built on
post-trained or instruction-tuned LLMs. It is both efficient and practical,
requiring only 4 GPUs and completing training within a single day.
☆ Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia
We ask where, and under what conditions, dyslexic reading costs arise in a
large-scale naturalistic reading dataset. Using eye-tracking aligned to
word-level features (word length, frequency, and predictability), we model how
each feature influences dyslexic time costs. We find that all three features
robustly change reading times in both typical and dyslexic readers, and that
dyslexic readers show stronger sensitivities to each, especially
predictability. Counterfactual manipulations of these features substantially
narrow the dyslexic-control gap by about one third, with predictability showing
the strongest effect, followed by length and frequency. These patterns align
with dyslexia theories that posit heightened demands on linguistic working
memory and phonological encoding, and they motivate further work on lexical
complexity and parafoveal preview benefits to explain the remaining gap. In
short, we quantify when extra dyslexic costs arise and how large they are, and
we offer actionable guidance for interventions and computational models for
dyslexic readers.
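A small synthetic illustration of the kind of model implied above: reading
times regressed on length, frequency, and predictability, with group
interaction terms so that dyslexic readers can show stronger sensitivities. The
data and coefficients are simulated, not the study's.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    dyslexic = rng.integers(0, 2, n)                 # 0 = control, 1 = dyslexic
    length = rng.normal(5, 2, n)
    freq = rng.normal(0, 1, n)                       # log frequency (standardized)
    pred = rng.normal(0, 1, n)                       # predictability (standardized)

    # Simulated reading times: dyslexic readers get amplified feature effects.
    rt = (250 + 10 * length - 15 * freq - 20 * pred
          + dyslexic * (80 + 5 * length - 5 * freq - 15 * pred)
          + rng.normal(0, 30, n))

    # Design matrix with main effects and group interactions.
    X = np.column_stack([np.ones(n), length, freq, pred, dyslexic,
                         dyslexic * length, dyslexic * freq, dyslexic * pred])
    beta, *_ = np.linalg.lstsq(X, rt, rcond=None)
    names = ["intercept", "length", "freq", "pred", "dyslexic",
             "dys:length", "dys:freq", "dys:pred"]
    for name, b in zip(names, beta):
        print(f"{name:>10}: {b:7.2f}")
    # The dys:* interaction terms quantify the extra per-feature cost for dyslexic readers.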
☆ OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning
Ziyou Hu, Zhengliang Shi, Minghang Zhu, Haitao Li, Teng Sun, Pengjie Ren, Suzan Verberne, Zhaochun Ren
Reward models (RMs) have become essential for aligning large language models
(LLMs), serving as scalable proxies for human evaluation in both training and
inference. However, existing RMs struggle on knowledge-intensive and long-form
tasks, where evaluating correctness requires grounding beyond the model's
internal knowledge. This limitation hinders them from reliably discriminating
subtle quality differences, especially when external evidence is necessary. To
address this, we introduce OpenRM, a tool-augmented long-form reward model that
systematically judges open-ended responses by invoking external tools to gather
relevant evidence. We train OpenRM with Group Relative Policy Optimization
(GRPO) on over 27K synthesized pairwise examples generated through a
controllable data synthesis framework. The training objective jointly
supervises intermediate tool usage and final outcome accuracy, incentivizing
our reward model to learn effective evidence-based judgment strategies.
Extensive experiments on three newly-collected datasets and two widely-used
benchmarks demonstrate that OpenRM substantially outperforms existing reward
modeling approaches. As a further step, we integrate OpenRM into both
inference-time response selection and training-time data selection. This yields
consistent gains in downstream LLM alignment tasks, highlighting the potential
of tool-augmented reward models for scaling reliable long-form evaluation.
☆ "Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue
Maintaining mutual understanding is a key component in human-human
conversation to avoid conversation breakdowns, in which repair, particularly
Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the
other to resolve), plays a vital role. However, Conversational Agents (CAs)
still fail to recognize user repair initiation, leading to breakdowns or
disengagement. This work proposes a multimodal model to automatically detect
repair initiation in Dutch dialogues by integrating linguistic and prosodic
features grounded in Conversation Analysis. The results show that prosodic cues
complement linguistic features and significantly improve the results of
pretrained text and audio embeddings, offering insights into how different
features interact. Future directions include incorporating visual cues and
exploring multilingual and cross-context corpora to assess robustness and
generalizability.
comment: 9 pages
☆ Relative Scaling Laws for LLMs
Scaling laws describe how language models improve with additional data,
parameters, and compute. While widely used, they are typically measured on
aggregate test sets. Aggregate evaluations yield clean trends but average over
heterogeneous subpopulations, obscuring performance disparities. We introduce
relative scaling laws, which track how performance gaps between test
distributions evolve with scale rather than focusing solely on absolute error.
Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP)
budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we
find diverse trajectories: academic domains on MMLU converge toward parity;
regional English dialects shift depending on population size; and clusters of
AI risk behaviours split, with capability- and influence-related risks
increasing during pretraining while adversarial risks do not. These results
show that although scaling improves overall performance, it is not a universal
equalizer. To support further study, we release all model checkpoints from this
work to enable practitioners to measure relative scaling laws alongside
traditional ones, in order to better prioritize robustness challenges in light of the
bitter lesson.
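A toy numerical sketch of the relative-scaling-law idea: fit a power law per
subpopulation (linear in log-log space) and track how the gap between
subpopulations changes with compute. The numbers are invented for illustration.

    import numpy as np

    flops = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
    # Hypothetical per-domain losses at each compute budget.
    loss_a = np.array([3.2, 2.9, 2.6, 2.4, 2.2])     # e.g. an academic domain
    loss_b = np.array([4.1, 3.9, 3.7, 3.6, 3.5])     # e.g. a rarer dialect

    def fit_power_law(c, loss):
        """loss ~ a * c^(-b): linear regression in log-log space."""
        slope, intercept = np.polyfit(np.log(c), np.log(loss), 1)
        return np.exp(intercept), -slope              # (a, b)

    a1, b1 = fit_power_law(flops, loss_a)
    a2, b2 = fit_power_law(flops, loss_b)
    gap = loss_b / loss_a                             # relative gap at each budget
    print(f"domain A exponent {b1:.3f}, domain B exponent {b2:.3f}")
    print("relative gap vs compute:", np.round(gap, 2))
    # A shrinking gap suggests convergence toward parity; a growing one, divergence.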
☆ Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
With the release of new large language models (LLMs) like Llama and Mistral,
zero-shot cross-lingual transfer has become increasingly feasible due to their
multilingual pretraining and strong generalization capabilities. However,
adapting these decoder-only LLMs to new tasks across languages remains
challenging. While parameter-efficient fine-tuning (PeFT) techniques like
Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as
soft prompt tuning, prefix tuning, and Llama Adapter are less explored,
especially for zero-shot transfer in decoder-only models. We present a
comprehensive study of three prefix-based methods for zero-shot cross-lingual
transfer from English to 35+ high- and low-resource languages. Our analysis
further explores transfer across linguistic families and scripts, as well as
the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix
methods outperform LoRA baselines by up to 6% on the Belebele benchmark.
Similar improvements were observed with Mistral v0.3 7B as well. Despite using
only 1.23M learnable parameters with prefix tuning, we achieve consistent
improvements across diverse benchmarks. These findings highlight the potential
of prefix-based techniques as an effective and scalable alternative to LoRA,
particularly in low-resource multilingual settings.
comment: 12 Pages
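A generic sketch of prefix-style adaptation under stated assumptions: a single
encoder layer stands in for the decoder-only LLM, and only input-level prefix
vectors are trained, which is closest to soft prompt tuning (true prefix tuning
also injects prefixes into each layer's key/value states).

    import torch
    import torch.nn as nn

    class PrefixTunedLM(nn.Module):
        """Frozen backbone and embeddings; only the prefix vectors are trained."""
        def __init__(self, backbone: nn.Module, embed: nn.Embedding, prefix_len: int = 16):
            super().__init__()
            self.backbone, self.embed = backbone, embed
            for p in list(self.backbone.parameters()) + list(self.embed.parameters()):
                p.requires_grad = False
            self.prefix = nn.Parameter(torch.randn(prefix_len, embed.embedding_dim) * 0.02)

        def forward(self, input_ids):
            tok = self.embed(input_ids)                               # (B, T, d)
            pre = self.prefix.unsqueeze(0).expand(tok.size(0), -1, -1)
            return self.backbone(torch.cat([pre, tok], dim=1))        # prefix prepended

    # Toy frozen "backbone": one transformer layer standing in for the LLM.
    embed = nn.Embedding(1000, 64)
    backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    model = PrefixTunedLM(backbone, embed)
    out = model(torch.randint(0, 1000, (2, 10)))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(out.shape, "trainable parameters:", trainable)              # only 16 x 64 = 1024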
☆ Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs NeurIPS 2025
The quadratic cost of attention hinders the scalability of long-context LLMs,
especially in resource-constrained settings. Existing static sparse methods
such as sliding windows or global tokens exploit the sparsity of attention to
reduce its cost, but adapt poorly to content-dependent variations in attention
because they are static. While previous work has
proposed several dynamic approaches to improve flexibility, they still depend
on predefined templates or heuristic mechanisms. Such strategies reduce
generality and prune tokens that remain contextually important, limiting their
accuracy across diverse tasks. To tackle these bottlenecks of existing methods
for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention
(DHSA), a data-driven framework that dynamically predicts attention sparsity
online without retraining. Our proposed DHSA adaptively segments sequences into
variable-length chunks, then computes chunk representations by aggregating the
token embeddings within each chunk. To avoid the bias introduced by varying
chunk lengths, we apply length-normalized aggregation that scales the averaged
embeddings by the square root of the chunk size. Finally, DHSA upsamples the
chunk-level similarity scores to token-level similarities to calculate
importance scores that determine which token-level interactions should be
preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and
LongBench show that DHSA matches dense attention in accuracy, while reducing
prefill latency by 20-60% and peak memory usage by 35%. Compared to other
representative baselines such as block sparse attention, DHSA achieves
consistently higher accuracy (6-18% relative gains) with comparable or lower
cost, offering an efficient and adaptable solution for long-context on-device
LLMs.
comment: Accepted to NeurIPS 2025 Workshop on Efficient Reasoning
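A simplified sketch of the two mechanisms spelled out above: length-normalized
chunk aggregation (mean embedding scaled by the square root of the chunk size)
and upsampling chunk-level similarities back to token-level importance scores.
Chunk boundaries are fixed here for brevity, whereas DHSA predicts them
adaptively.

    import torch

    def chunk_representations(token_emb: torch.Tensor, boundaries):
        """token_emb: (T, d). boundaries: list of (start, end) chunk spans.
        Returns (C, d) chunk embeddings: mean * sqrt(chunk_len) to offset length bias."""
        reps = []
        for s, e in boundaries:
            chunk = token_emb[s:e]
            reps.append(chunk.mean(dim=0) * (e - s) ** 0.5)
        return torch.stack(reps)

    def token_importance(query_chunk_sim: torch.Tensor, boundaries, seq_len: int):
        """Upsample chunk-level similarity scores to per-token importance."""
        scores = torch.empty(seq_len)
        for sim, (s, e) in zip(query_chunk_sim, boundaries):
            scores[s:e] = sim
        return scores

    T, d = 12, 8
    emb = torch.randn(T, d)
    bounds = [(0, 4), (4, 9), (9, 12)]                 # variable-length chunks
    chunks = chunk_representations(emb, bounds)        # (3, d)
    q = torch.randn(d)
    sim = chunks @ q / d ** 0.5                        # chunk-level similarity to a query
    imp = token_importance(sim, bounds, T)
    keep = imp >= imp.topk(8).values.min()             # e.g. keep the top-scoring tokens
    print(imp, keep)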
☆ Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way
Diffusion-based large language models (dLLMs) have exhibited substantial
potential for parallel text generation, which may enable more efficient
generation compared to autoregressive models. However, current dLLMs suffer
from fixed generation lengths, meaning that the generation length has to be set
before decoding as a hyper-parameter, which hurts both efficiency and
flexibility. To solve these problems, in this work, we
propose to train a diffusion LLM with native variable generation lengths,
abbreviated as dLLM-Var. Concretely, we aim to train the model to accurately
predict the [EOS] token in the generated text, which enables a dLLM to natively
infer in a block-diffusion manner while still retaining global bi-directional
(full) attention and high parallelism. Experiments on
standard benchmarks demonstrate that our method achieves a 30.1x speedup over
traditional dLLM inference paradigms and a 2.4x speedup relative to
autoregressive models such as Qwen and Llama. Our method achieves higher
accuracy and faster inference, elevating dLLMs beyond mere academic novelty and
supporting their practical use in real-world applications. Codes and models
have been released.
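A schematic decoding loop for the behavior described above: generate block by
block and stop as soon as the model emits [EOS], rather than fixing the output
length in advance. The block sampler is a toy placeholder for the actual
diffusion model.

    from typing import Callable, List

    EOS = "[EOS]"

    def decode_variable_length(gen_block: Callable[[List[str]], List[str]],
                               max_blocks: int = 8) -> List[str]:
        """Block-wise generation that ends as soon as a block contains [EOS]."""
        out: List[str] = []
        for _ in range(max_blocks):
            block = gen_block(out)                    # denoise/sample the next block
            if EOS in block:
                out.extend(block[: block.index(EOS)]) # truncate at the predicted [EOS]
                return out
            out.extend(block)
        return out                                    # fallback: hit the block budget

    # Toy sampler standing in for the diffusion model: emits [EOS] in the third block.
    def toy_sampler(prefix):
        n = len(prefix)
        return [f"tok{n + i}" for i in range(4)] if n < 8 else ["tok8", EOS, "pad", "pad"]

    print(decode_variable_length(toy_sampler))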
☆ ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
Autoformalization, which translates natural language mathematics into
machine-verifiable formal statements, is critical for using formal mathematical
reasoning to solve math problems stated in natural language. While Large
Language Models can generate syntactically correct formal statements, they
often fail to preserve the original problem's semantic intent. This limitation
arises because LLM approaches treat autoformalization as a simplistic
translation task, lacking the mechanisms for self-reflection and iterative
refinement that human experts naturally employ. To address these issues, we
propose ReForm, a Reflective Autoformalization method that tightly integrates
semantic consistency evaluation into the autoformalization process. This
enables the model to iteratively generate formal statements, assess their
semantic fidelity, and self-correct identified errors through progressive
refinement. To effectively train this reflective model, we introduce
Prospective Bounded Sequence Optimization (PBSO), which employs different
rewards at different sequence positions to ensure that the model develops both
accurate autoformalization and correct semantic validations, preventing
superficial critiques that would undermine the purpose of reflection. Extensive
experiments across four autoformalization benchmarks demonstrate that ReForm
achieves an average improvement of 17.2 percentage points over the strongest
baselines. To further ensure evaluation reliability, we introduce
ConsistencyCheck, a benchmark of 859 expert-annotated items that not only
validates LLMs as judges but also reveals that autoformalization is inherently
difficult: even human experts produce semantic errors in up to 38.5% of cases.
comment: Ongoing Work
☆ ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, Diyi Yang, Risa Wechsler, Ioana Ciuca
Frontier AI agents show increasing promise as scientific research assistants,
and may eventually be useful for extended, open-ended research workflows.
However, in order to use agents for novel research, we must first assess the
underlying faithfulness and correctness of their work. To evaluate agents as
research assistants, we introduce ReplicationBench, an evaluation framework
that tests whether agents can replicate entire research papers drawn from the
astrophysics literature. Astrophysics, where research relies heavily on
archival data and computational study while requiring little real-world
experimentation, is a particularly useful testbed for AI agents in scientific
research. We split each paper into tasks which require agents to replicate the
paper's core contributions, including the experimental setup, derivations, data
analysis, and codebase. Each task is co-developed with the original paper
authors and targets a key scientific result, enabling objective evaluation of
both faithfulness (adherence to original methods) and correctness (technical
accuracy of results). ReplicationBench is extremely challenging for current
frontier language models: even the best-performing language models score under
20%. We analyze ReplicationBench trajectories in collaboration with domain
experts and find a rich, diverse set of failure modes for agents in scientific
research. ReplicationBench establishes the first benchmark of paper-scale,
expert-validated astrophysics research tasks, reveals insights about agent
performance generalizable to other domains of data-driven science, and provides
a scalable framework for measuring AI agents' reliability in scientific
research.
☆ BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation ICASSP 2026
Automatic Speech Recognition (ASR) systems, despite large multilingual
training, struggle in out-of-domain and low-resource scenarios where labeled
data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training
and Distillation), a novel framework designed to adapt Whisper's encoder using
unlabeled data. Unlike traditional self-supervised learning methods, BEARD
uniquely combines a BEST-RQ objective with knowledge distillation from a frozen
teacher encoder, ensuring the encoder's complementarity with the pre-trained
decoder. Our experiments focus on the ATCO2 corpus from the challenging Air
Traffic Control (ATC) communications domain, characterized by non-native
speech, noise, and specialized phraseology. Using about 5,000 hours of
untranscribed speech for BEARD and 2 hours of transcribed speech for
fine-tuning, the proposed approach significantly outperforms previous baseline
and fine-tuned model, achieving a relative improvement of 12% compared to the
fine-tuned model. To the best of our knowledge, this is the first work to use a
self-supervised learning objective for domain adaptation of Whisper.
comment: Submitted to ICASSP 2026
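A rough sketch of the loss structure suggested by the abstract, under explicit
assumptions: a masked-prediction term with a frozen random-projection quantizer
(a simplified stand-in for BEST-RQ) is combined with an MSE distillation term
against a frozen teacher encoder. The architectures, masking scheme, and 0.5
weight are placeholders, not the paper's configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d_feat, d_model, n_codes = 80, 64, 128
    student = nn.GRU(d_feat, d_model, batch_first=True)        # stand-in trainable encoder
    teacher = nn.GRU(d_feat, d_model, batch_first=True)        # frozen teacher encoder
    for p in teacher.parameters():
        p.requires_grad = False

    # Frozen random projection + codebook (BEST-RQ-style quantizer, simplified).
    proj = torch.randn(d_feat, d_model)
    codebook = torch.randn(n_codes, d_model)
    classifier = nn.Linear(d_model, n_codes)

    feats = torch.randn(2, 50, d_feat)                          # unlabeled speech features
    mask = torch.rand(2, 50) < 0.3                              # frames to mask and predict

    with torch.no_grad():
        proj_feats = (feats @ proj).reshape(-1, d_model)
        targets = torch.cdist(proj_feats, codebook).argmin(-1).reshape(2, 50)
        teacher_out, _ = teacher(feats)

    masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)         # crude frame masking
    student_out, _ = student(masked)

    ssl_loss = F.cross_entropy(classifier(student_out)[mask], targets[mask])
    distill_loss = F.mse_loss(student_out, teacher_out)         # stay close to the teacher
    loss = ssl_loss + 0.5 * distill_loss                        # 0.5 is an arbitrary weight
    loss.backward()
    print(float(ssl_loss), float(distill_loss))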
☆ Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
The history of the Korean language is characterized by a discrepancy between
its spoken and written forms and a pivotal shift from Chinese characters to the
Hangul alphabet. However, this linguistic evolution has remained largely
unexplored in NLP due to a lack of accessible historical corpora. To address
this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly
licensed dataset spanning 1,300 years and 6 languages, as well as
under-represented writing systems like Korean-style Sinitic (Idu) and
Hanja-Hangul mixed script. This corpus contains 18 million documents and 5
billion tokens from 19 sources, ranging from the 7th century to 2025. We
leverage this resource to quantitatively analyze major linguistic shifts: (1)
Idu usage peaked in the 1860s before declining sharply; (2) the transition from
Hanja to Hangul was a rapid transformation starting around 1890; and (3) North
Korea's lexical divergence causes modern tokenizers to produce up to 51 times
higher out-of-vocabulary rates. This work provides a foundational resource for
quantitative diachronic analysis by capturing the history of the Korean
language. Moreover, it can serve as a pre-training corpus for large language
models, potentially improving their understanding of Sino-Korean vocabulary in
modern Hangul as well as archaic writing systems.
comment: Dataset and code available at https://github.com/seyoungsong/OKHC
☆ Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written
Textual humor is enormously diverse and computational studies need to account
for this range, including intentionally bad humor. In this paper, we curate and
analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to
better understand "bad" humor in English. Standard humor detection models
perform poorly on our corpus, and an analysis of literary devices finds that
these sentences combine features common in existing humor datasets (e.g., puns,
irony) with metaphor, metafiction and simile. LLMs prompted to synthesize
contest-style sentences imitate the form but exaggerate the effect by
over-using certain literary devices and by including far more novel
adjective-noun bigrams than human writers. Data, code and analysis are
available at https://github.com/venkatasg/bulwer-lytton
☆ Levée d'ambiguïtés par grammaires locales
Many words are ambiguous in terms of their part of speech (POS). However,
when a word appears in a text, this ambiguity is generally much reduced.
Disambiguating POS involves using context to reduce the number of POS
associated with words, and is one of the main challenges of lexical tagging.
The problem of labeling words by POS frequently arises in natural language
processing, for example for spelling correction, grammar or style checking,
expression recognition, text-to-speech conversion, text corpus analysis, etc.
Lexical tagging systems are thus useful as an initial component of many natural
language processing systems. A number of recent lexical tagging systems produce
multiple solutions when the text is lexically ambiguous or the uniquely correct
solution cannot be found. These contributions aim to guarantee a zero silence
rate: the correct tag(s) for a word must never be discarded. This objective is
unrealistic for systems that tag each word uniquely. This article concerns a
lexical disambiguation method adapted to the objective of a zero silence rate
and implemented in Silberztein's INTEX system (1993). We present here a formal
description of this method. We show that to verify a local disambiguation
grammar in this framework, it is not sufficient to consider the transducer
paths separately: one needs to verify their interactions. Similarly, if a
combination of multiple transducers is used, the result cannot be predicted by
considering them in isolation. Furthermore, when examining the initial labeling
of a text as produced by INTEX, ideas for disambiguation rules come
spontaneously, but grammatical intuitions may turn out to be inaccurate, often
due to an unforeseen construction or ambiguity. If a zero silence rate is
targeted, local grammars must be carefully tested. This is where a detailed
specification of what a grammar will do once applied to texts would be
necessary.
comment: in French language
☆ Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei
While Multimodal Large Language Models (MLLMs) excel at visual understanding,
they often struggle in complex scenarios that require visual planning and
imagination. Inspired by how humans use sketching as a form of visual thinking
to develop and communicate ideas, we introduce Latent Sketchpad, a framework
that equips MLLMs with an internal visual scratchpad. The internal visual
representations of MLLMs have traditionally been confined to perceptual
understanding. We repurpose them to support generative visual thought without
compromising reasoning ability. Building on frontier MLLMs, our approach
integrates visual generation directly into their native autoregressive
reasoning process. It allows the model to interleave textual reasoning with the
generation of visual latents. These latents guide the internal thought process
and can be translated into sketch images for interpretability. To realize this,
we introduce two components: a Context-Aware Vision Head that autoregressively
produces visual representations, and a pretrained Sketch Decoder that renders
these into human-interpretable images. We evaluate the framework on our new dataset
MazePlanning. Experiments across various MLLMs show that Latent Sketchpad
delivers comparable or even superior reasoning performance to their backbone.
It further generalizes across distinct frontier MLLMs, including Gemma3 and
Qwen2.5-VL. By extending the model's textual reasoning to visual thinking, our
framework opens new opportunities for richer human-computer interaction and
broader applications. More details and resources are available on our project
page: https://latent-sketchpad.github.io/.
☆ CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi Wang, Zhaowei Wang, Chunkit Chan, Yangqiu Song
Accurate confidence calibration in Large Language Models (LLMs) is critical
for safe use in high-stakes domains, where clear verbalized confidence enhances
user trust. Traditional methods that mimic reference confidence expressions
often fail to capture the reasoning needed for accurate confidence assessment.
We propose natural language critiques as a solution, ideally suited for
confidence calibration, as precise gold confidence labels are hard to obtain
and often require multiple generations. This paper studies how natural language
critiques can enhance verbalized confidence, addressing: (1) What to critique:
uncertainty (question-focused) or confidence (answer-specific)? Analysis shows
confidence suits multiple-choice tasks, while uncertainty excels in open-ended
scenarios. (2) How to critique: self-critique or critique calibration training?
We propose Self-Critique, enabling LLMs to critique and optimize their
confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration
training method that leverages natural language critiques to improve confidence
calibration, moving beyond direct numerical optimization. Experiments show that
CritiCal significantly outperforms Self-Critique and other competitive
baselines, even surpassing its teacher model, GPT-4o, in complex reasoning
tasks. CritiCal also shows robust generalization in out-of-distribution
settings, advancing LLMs' reliability.
☆ A word association network methodology for evaluating implicit biases in LLMs compared to humans
As Large language models (LLMs) become increasingly integrated into our
lives, their inherent social biases remain a pressing concern. Detecting and
evaluating these biases can be challenging because they are often implicit
rather than explicit in nature, so developing evaluation methods that assess
the implicit knowledge representations of LLMs is essential. We present a novel
word association network methodology for evaluating implicit biases in LLMs
based on simulating semantic priming within LLM-generated word association
networks. Our prompt-based approach taps into the implicit relational
structures encoded in LLMs, providing both quantitative and qualitative
assessments of bias. Unlike most prompt-based evaluation methods, our method
enables direct comparisons between various LLMs and humans, providing a
valuable point of reference and offering new insights into the alignment of
LLMs with human cognition. To demonstrate the utility of our methodology, we
apply it to both humans and several widely used LLMs to investigate social
biases related to gender, religion, ethnicity, sexual orientation, and
political party. Our results reveal both convergences and divergences between
LLM and human biases, providing new perspectives on the potential risks of
using LLMs. Our methodology contributes to a systematic, scalable, and
generalizable framework for evaluating and comparing biases across multiple
LLMs and humans, advancing the goal of transparent and socially responsible
language technologies.
comment: 24 pages, 13 figures, 3 tables
☆ Talk2Ref: A Dataset for Reference Prediction from Scientific Talks
Scientific talks are a growing medium for disseminating research, and
automatically identifying relevant literature that grounds or enriches a talk
would be highly valuable for researchers and students alike. We introduce
Reference Prediction from Talks (RPT), a new task that maps long and
unstructured scientific presentations to relevant papers. To support research
on RPT, we present Talk2Ref, the first large-scale dataset of its kind,
containing 6,279 talks and 43,429 cited papers (26 per talk on average), where
relevance is approximated by the papers cited in the talk's corresponding
source publication. We establish strong baselines by evaluating
state-of-the-art text embedding models in zero-shot retrieval scenarios, and
propose a dual-encoder architecture trained on Talk2Ref. We further explore
strategies for handling long transcripts, as well as training for domain
adaptation. Our results show that fine-tuning on Talk2Ref significantly
improves citation prediction performance, demonstrating both the challenges of
the task and the effectiveness of our dataset for learning semantic
representations from spoken scientific content. The dataset and trained models
are released under an open license to foster future research on integrating
spoken scientific communication into citation recommendation systems.
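A compact sketch of a dual-encoder trained with in-batch contrastive learning,
as proposed above; the tiny bag-of-embeddings encoders and all sizes are
stand-ins for the actual text encoders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BagEncoder(nn.Module):
        """Toy text encoder: mean of token embeddings, projected and L2-normalized."""
        def __init__(self, vocab=5000, dim=128):
            super().__init__()
            self.emb = nn.EmbeddingBag(vocab, dim)
            self.proj = nn.Linear(dim, dim)
        def forward(self, ids):
            return F.normalize(self.proj(self.emb(ids)), dim=-1)

    talk_enc, paper_enc = BagEncoder(), BagEncoder()

    # A batch of (talk transcript, cited paper) pairs; row i matches column i.
    talk_ids = torch.randint(0, 5000, (8, 256))   # (batch, tokens) transcript chunks
    paper_ids = torch.randint(0, 5000, (8, 64))   # title + abstract tokens

    t, p = talk_enc(talk_ids), paper_enc(paper_ids)
    logits = t @ p.T / 0.05                        # cosine similarities / temperature
    labels = torch.arange(8)
    loss = F.cross_entropy(logits, labels)         # in-batch negatives (InfoNCE)
    loss.backward()
    print(float(loss))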
☆ Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems
Hallucination remains one of the key obstacles to the reliable deployment of
large language models (LLMs), particularly in real-world applications. Among
various mitigation strategies, Retrieval-Augmented Generation (RAG) and
reasoning enhancement have emerged as two of the most effective and widely
adopted approaches, marking a shift from merely suppressing hallucinations to
balancing creativity and reliability. However, their synergistic potential and
underlying mechanisms for hallucination mitigation have not yet been
systematically examined. This survey adopts an application-oriented perspective
of capability enhancement to analyze how RAG, reasoning enhancement, and their
integration in Agentic Systems mitigate hallucinations. We propose a taxonomy
distinguishing knowledge-based and logic-based hallucinations, systematically
examine how RAG and reasoning address each, and present a unified framework
supported by real-world applications, evaluations, and benchmarks.
comment: 25 pages, 7 figures, 3 tables
☆ Iterative Critique-Refine Framework for Enhancing LLM Personalization
Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed
Personalized text generation requires models not only to produce coherent
text but also to align with a target user's style, tone, and topical focus.
Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich
profiles with user and neighbor histories, but they stop at generation and
often yield outputs that drift in tone, topic, or style. We present PerFine, a
unified, training-free critique-refine framework that enhances personalization
through iterative, profile-grounded feedback. In each iteration, an LLM
generator produces a draft conditioned on the retrieved profile, and a critic
LLM, also conditioned on the same profile, provides structured feedback on
tone, vocabulary, sentence structure, and topicality. The generator then
revises, while a novel knockout strategy retains the stronger draft across
iterations. We further study additional inference-time strategies such as
Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp,
Goodreads, and Amazon datasets, PerFine consistently improves personalization
over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5
refinement iterations, and scalability with increasing critic size. These
results highlight that post-hoc, profile-aware feedback offers a powerful
paradigm for personalized LLM generation that is both training-free and
model-agnostic.
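A schematic of the training-free critique-refine loop with a knockout step as
described above; the generator, critic, and judge callables are placeholders
for whatever LLMs are plugged in, and the concrete prompting is not shown.

    from typing import Callable

    def critique_refine(profile: str, task: str,
                        generate: Callable[[str, str, str], str],
                        critique: Callable[[str, str], str],
                        judge: Callable[[str, str, str], int],
                        n_iters: int = 3) -> str:
        """Iteratively draft, critique against the profile, revise, and keep the
        stronger of (current best, revision) via a knockout comparison."""
        best = generate(profile, task, "")                 # initial draft, no feedback
        for _ in range(n_iters):
            feedback = critique(profile, best)             # profile-grounded critique
            candidate = generate(profile, task, feedback)  # revision conditioned on feedback
            if judge(profile, candidate, best) == 0:       # 0 -> candidate wins the knockout
                best = candidate
        return best

    # Toy stand-ins (a real setup would call LLM APIs here).
    gen = lambda prof, task, fb: f"draft for '{task}' in style of {prof}" + (" [revised]" if fb else "")
    crit = lambda prof, draft: "match the user's casual tone"
    jdg = lambda prof, a, b: 0 if "[revised]" in a else 1
    print(critique_refine("user_42 profile", "review this cafe", gen, crit, jdg))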
☆ Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices LREC 2026
While new benchmarks for large language models (LLMs) are being developed
continuously to catch up with the growing capabilities of new models and AI in
general, using and evaluating LLMs in non-English languages remains a
little-charted landscape. We give a concise overview of recent developments in
LLM benchmarking, and then propose a new taxonomy for the categorization of
benchmarks that is tailored to multilingual or non-English use scenarios. We
further propose a set of best practices and quality standards that could lead
to a more coordinated development of benchmarks for European languages. Among
other recommendations, we advocate for a higher language and culture
sensitivity of evaluation methods.
comment: 12 pages, 1 figure. Submitted to the LREC 2026 conference
☆ SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko, Daria Pugacheva, Elena Tutubalina, Andrey Kuznetsov, Vlad Shakhuro
Multimodal large language models (MLLMs) have shown impressive capabilities
in vision-language tasks such as reasoning segmentation, where models generate
segmentation masks based on textual queries. While prior work has primarily
focused on perturbing image inputs, semantically equivalent textual
paraphrases, which are crucial in real-world applications where users express
the same intent in varied ways, remain underexplored. To address this gap, we introduce a
novel adversarial paraphrasing task: generating grammatically correct
paraphrases that preserve the original query meaning while degrading
segmentation performance. To evaluate the quality of adversarial paraphrases,
we develop a comprehensive automatic evaluation protocol validated with human
studies. Furthermore, we introduce SPARTA, a black-box, sentence-level
optimization method that operates in the low-dimensional semantic latent space
of a text autoencoder, guided by reinforcement learning. SPARTA achieves
significantly higher success rates, outperforming prior methods by up to 2x on
both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive
baselines to assess the robustness of advanced reasoning segmentation models.
We reveal that they remain vulnerable to adversarial paraphrasing-even under
strict semantic and grammatical constraints. All code and data will be released
publicly upon acceptance.
☆ Law in Silico: Simulating Legal Society with LLM-Based Agents
Since real-world legal experiments are often costly or infeasible, simulating
legal societies with Artificial Intelligence (AI) systems provides an effective
alternative for verifying and developing legal theory, as well as supporting
legal administration. Large Language Models (LLMs), with their world knowledge
and role-playing capabilities, are strong candidates to serve as the foundation
for legal society simulation. However, the application of LLMs to simulate
legal systems remains underexplored. In this work, we introduce Law in Silico,
an LLM-based agent framework for simulating legal scenarios with individual
decision-making and institutional mechanisms of legislation, adjudication, and
enforcement. Our experiments, which compare simulated crime rates with
real-world data, demonstrate that LLM-based agents can largely reproduce
macro-level crime trends and provide insights that align with real-world
observations. At the same time, micro-level simulations reveal that a
well-functioning, transparent, and adaptive legal system offers better
protection of the rights of vulnerable individuals.
☆ Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content NeurIPS 2025
Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir
Large language models are increasingly used for Islamic guidance, but risk
misquoting texts, misapplying jurisprudence, or producing culturally
inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar
on prompts from authentic Islamic blogs. Our dual-agent framework uses a
quantitative agent for citation verification and six-dimensional scoring (e.g.,
Structure, Islamic Consistency, Citations) and a qualitative agent for
five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality).
GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI
followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong
performance, models still fall short in reliably producing accurate Islamic
content and citations -- a paramount requirement in faith-sensitive writing.
GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led
qualitative pairwise wins (116/200). Fanar, though trailing, introduces
innovations for Islamic and Arabic contexts. This study underscores the need
for community-driven benchmarks centering Muslim perspectives, offering an
early step toward more reliable AI in Islamic knowledge and other high-stakes
domains such as medicine, law, and journalism.
comment: Accepted at 39th Conference on Neural Information Processing Systems
(NeurIPS 2025) Workshop: 5th Muslims in Machine Learning (MusIML) Workshop
☆ LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
The effectiveness of instruction-tuned Large Language Models (LLMs) is often
limited in low-resource linguistic settings due to a lack of high-quality
training data. We introduce LuxIT, a novel, monolingual instruction tuning
dataset for Luxembourgish developed to mitigate this challenge. We synthesize
the dataset from a corpus of native Luxembourgish texts, utilizing
DeepSeek-R1-0528, chosen for its demonstrated proficiency in Luxembourgish. Following
generation, we apply a quality assurance process, employing an LLM-as-a-judge
approach. To investigate the practical utility of the dataset, we fine-tune
several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base
models on Luxembourgish language proficiency examinations, however, yields
mixed results, with performance varying significantly across different models.
LuxIT represents a critical contribution to Luxembourgish natural language
processing and offers a replicable monolingual methodology, though our findings
highlight the need for further research to optimize its application.
☆ SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Evaluating the reasoning ability of language models (LMs) is complicated by
their extensive parametric world knowledge, where benchmark performance often
reflects factual recall rather than genuine reasoning. Existing datasets and
approaches (e.g., temporal filtering, paraphrasing, adversarial substitution)
cannot cleanly separate the two. We present SynthWorlds, a framework that
disentangles task reasoning complexity from factual knowledge. In SynthWorlds,
we construct parallel corpora representing two worlds with identical
interconnected structure: a real-mapped world, where models may exploit
parametric knowledge, and a synthetic-mapped world, where such knowledge is
meaningless. On top of these corpora, we design two mirrored tasks as case
studies: multi-hop question answering and page navigation, which maintain equal
reasoning difficulty across worlds. Experiments in parametric-only (e.g.,
closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings
reveal a persistent knowledge advantage gap, defined as the performance boost
models gain from memorized parametric world knowledge. Knowledge acquisition
and integration mechanisms reduce but do not eliminate this gap, highlighting
opportunities for system improvements. Fully automatic and scalable,
SynthWorlds provides a controlled environment for evaluating LMs in ways that
were previously challenging, enabling precise and testable comparisons of
reasoning and memorization.
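One way to illustrate the core idea of a synthetic-mapped world: every real entity name in a real-mapped corpus is replaced consistently with a synthetic name, so the interconnected structure is preserved while parametric knowledge about the entities becomes useless. The actual SynthWorlds construction is more involved; this is only a hedged sketch with made-up names.

```python
import re

def remap_corpus(documents, entity_map):
    """Apply a consistent real-name -> synthetic-name mapping to every document."""
    # Match longer names first so multi-word entities are not partially replaced.
    pattern = re.compile("|".join(re.escape(e) for e in sorted(entity_map, key=len, reverse=True)))
    return [pattern.sub(lambda m: entity_map[m.group(0)], doc) for doc in documents]

docs = ["Marie Curie was born in Warsaw.", "Warsaw is the capital of Poland."]
mapping = {"Marie Curie": "Velor Annis", "Warsaw": "Kestrel Bay", "Poland": "Tharnia"}
print(remap_corpus(docs, mapping))
# ['Velor Annis was born in Kestrel Bay.', 'Kestrel Bay is the capital of Tharnia.']
```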
☆ Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models EMNLP 2025
Recent efforts leverage knowledge distillation techniques to develop
lightweight and practical sentiment analysis models. These methods are grounded
in human-written instructions and large-scale user texts. Despite the promising
results, two key challenges remain: (1) manually written instructions are
limited in diversity and quantity, making them insufficient to ensure
comprehensive coverage of distilled knowledge; (2) large-scale user texts incur
high computational cost, hindering the practicality of these methods. To this
end, we introduce COMPEFFDIST, a comprehensive and efficient distillation
framework for sentiment analysis. Our framework consists of two key modules:
attribute-based automatic instruction construction and difficulty-based data
filtering, which correspondingly tackle the aforementioned challenges. Applying
our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we
enable 3B student models to match the performance of 20x larger teacher models
on most tasks. In addition, our approach greatly outperforms baseline methods
in data efficiency, attaining the same performance level with only 10% of the
data.
comment: Accepted by EMNLP 2025. 22 pages, 9 figures. The first two authors
contribute equally
☆ OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong
Computer-using agents powered by Vision-Language Models (VLMs) have
demonstrated human-like capabilities in operating digital environments like
mobile platforms. While these agents hold great promise for advancing digital
automation, their potential for unsafe operations, such as system compromise
and privacy leakage, is raising significant concerns. Detecting these safety
concerns across the vast and complex operational space of mobile environments
presents a formidable challenge that remains critically underexplored. To
establish a foundation for mobile agent safety research, we introduce
MobileRisk-Live, a dynamic sandbox environment accompanied by a safety
detection benchmark comprising realistic trajectories with fine-grained
annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety
detection framework that synergistically combines a Formal Verifier for
detecting explicit system-level violations with a VLM-based Contextual Judge
for assessing contextual risks and agent actions. Experiments show that
OS-Sentinel achieves 10%-30% improvements over existing approaches across
multiple metrics. Further analysis provides critical insights that foster the
development of safer and more reliable autonomous mobile agents.
comment: work in progress
☆ Text Simplification with Sentence Embeddings
Sentence embeddings can be decoded to give approximations of the original
texts used to create them. We explore this effect in the context of text
simplification, demonstrating that reconstructed text embeddings preserve
complexity levels. We experiment with a small feed-forward neural network to
effectively learn a transformation between sentence embeddings representing
high-complexity and low-complexity texts. We provide a comparison to Seq2Seq
and LLM-based approaches, showing encouraging results in our much smaller
learning setting. Finally, we demonstrate the applicability of our
transformation to an unseen simplification dataset (MedEASI), as well as
datasets from languages outside the training data (ES, DE). We conclude that
learning transformations in sentence embedding space is a promising direction
for future research and has potential to unlock the ability to develop small,
but powerful models for text simplification and other natural language
generation tasks.
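A minimal sketch (assumptions throughout: embedding dimension, architecture, data shapes) of learning a mapping from complex-text sentence embeddings to simplified-text embeddings with a small feed-forward network, in the spirit of the abstract above.

```python
import torch
import torch.nn as nn

dim = 768  # assumed sentence-embedding dimension

# Small feed-forward transformation from complex-text to simple-text embedding space.
transform = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
optimizer = torch.optim.Adam(transform.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Stand-ins for paired embeddings of complex sentences and their simplifications.
complex_emb = torch.randn(256, dim)
simple_emb = torch.randn(256, dim)

for _ in range(100):
    optimizer.zero_grad()
    pred = transform(complex_emb)      # map complex-text embeddings...
    loss = loss_fn(pred, simple_emb)   # ...toward their simplified counterparts
    loss.backward()
    optimizer.step()

# At inference time, the predicted embedding would be decoded back to text with an
# embedding-to-text decoder, as the abstract describes; that step is omitted here.
```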
☆ Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu
Recent advances in code agents have enabled automated software development at
the project level, supported by large language models (LLMs) and widely adopted
tools. However, existing benchmarks for code agent evaluation face two major
limitations: high annotation cost and expertise requirements, and rigid
evaluation metrics that rely primarily on unit tests. To address these
challenges, we propose an agent-driven benchmark construction pipeline that
leverages human supervision to efficiently generate diverse and challenging
project-level tasks. Based on this approach, we introduce PRDBench, a novel
benchmark comprising 50 real-world Python projects across 20 domains, each with
structured Product Requirement Document (PRD) requirements, comprehensive
evaluation criteria, and reference implementations. PRDBench features rich data
sources, high task complexity, and flexible metrics. We further employ an
Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of
various test types beyond unit tests. Extensive experiments on PRDBench
demonstrate its effectiveness in assessing the capabilities of both code agents
and evaluation agents, providing a scalable and robust framework for annotation
and evaluation.
☆ LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability EMNLP
Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin
Generating long, informative, and factual outputs remains a major challenge
for Large Language Models (LLMs). Existing benchmarks for long-form generation
typically assess real-world queries with hard-to-verify metrics or use
synthetic setups that ease evaluation but overlook real-world intricacies. In
this paper, we introduce \textbf{LongWeave}, which balances real-world and
verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval
constructs tasks by first defining verifiable targets within real-world
scenarios, then systematically generating corresponding queries, textual
materials, and constraints based on these targets. This ensures that tasks are
both realistic and objectively assessable, enabling rigorous assessment of
model capabilities in meeting complex real-world constraints. LongWeave
supports customizable input/output lengths (up to 64K/8K tokens) across seven
distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models
encounter significant challenges in long-form generation as real-world
complexity and output length increase.
comment: EMNLP Findings 2025
☆ Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Large Language Models (LLMs) are increasingly used to answer everyday
questions, yet their performance on culturally grounded and dialectal content
remains uneven across languages. We propose a comprehensive method that (i)
translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into
English and several Arabic dialects, (ii) converts them into open-ended
questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs
under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT)
rationales to fine-tune models for step-by-step reasoning. Using this method,
we extend an existing dataset in which QA pairs are aligned in parallel across
multiple language varieties, making it, to our knowledge, the first of its
kind. We conduct extensive experiments with both open and closed models. Our
findings show that (i) models underperform on Arabic dialects, revealing
persistent gaps in culturally grounded and dialect-specific knowledge; (ii)
Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii)
CoT improves judged correctness while yielding mixed n-gram-based metrics. The
developed dataset will be publicly released to support further research on
culturally and linguistically inclusive evaluation.
comment: Cultural Knowledge, Everyday Knowledge, Open-Ended Question,
Chain-of-Thought, Large Language Models, Native, Multilingual, Language
Diversity
☆ Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Training critiquing language models to assess and provide feedback on model
outputs is a promising way to improve LLMs for complex reasoning tasks.
However, existing approaches typically rely on stronger supervisors for
annotating critique data. To address this, we propose Critique-RL, an online RL
approach for developing critiquing language models without stronger
supervision. Our approach operates on a two-player paradigm: the actor
generates a response, the critic provides feedback, and the actor refines the
response accordingly. We first reveal that relying solely on indirect reward
signals from the actor's outputs for RL optimization often leads to
unsatisfactory critics: while their helpfulness (i.e., providing constructive
feedback) improves, the discriminability (i.e., determining whether a response
is high-quality or not) remains poor, resulting in marginal performance gains.
To overcome this, Critique-RL adopts a two-stage optimization strategy. In
stage I, it reinforces the discriminability of the critic with direct
rule-based reward signals; in stage II, it introduces indirect rewards based on
actor refinement to improve the critic's helpfulness, while maintaining its
discriminability via appropriate regularization. Extensive experiments across
various tasks and models show that Critique-RL delivers substantial performance
improvements. For example, it achieves a 9.02% gain on in-domain tasks and a
5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
comment: Preprint, 25 pages, 9 figures. Code:
https://github.com/WooooDyy/Critique-RL
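An illustrative sketch of the two-stage reward structure described above. The exact reward shaping in Critique-RL may differ; the weighting and helper names here are assumptions.

```python
def stage1_reward(critic_says_correct: bool, response_is_correct: bool) -> float:
    """Stage I: direct rule-based reward for discriminability, i.e., whether the
    critic correctly judges the quality of the actor's response."""
    return 1.0 if critic_says_correct == response_is_correct else 0.0

def stage2_reward(critic_says_correct: bool, response_is_correct: bool,
                  refined_is_correct: bool, alpha: float = 0.5) -> float:
    """Stage II: add an indirect reward based on whether the actor's refinement
    (guided by the critique) is correct, while keeping the discriminability term
    as regularization. The weight alpha is an assumption."""
    helpfulness = 1.0 if refined_is_correct else 0.0
    return helpfulness + alpha * stage1_reward(critic_says_correct, response_is_correct)
```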
☆ Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Reinforcement Learning with Verifiable Rewards (RLVR), particularly with
algorithms like Group Relative Policy Optimization (GRPO), has proven highly
effective in enhancing the reasoning capabilities of large language models.
However, a critical bottleneck in current pipelines lies in the limited
diversity of sampled trajectories during group rollouts. Homogeneous
trajectories and their associated rewards would diminish the return signals for
policy updates, thereby hindering effective policy learning. This lack of
diversity stems primarily from token-level stochastic sampling, where local
variations are likely to collapse into near-identical reasoning paths. To
address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a
novel rollout strategy designed to explicitly promote trajectory-level
diversity by enforcing branching into different candidate tokens likely to
yield distinct continuations. Specifically, LATR iteratively operates in three
stages: (1) branching at high-uncertainty generation steps, (2) performing
lookahead simulation for each new branch, and (3) pruning branches that
exhibit prolonged similarity during simulation. Compared with stochastic
sampling, LATR accelerates policy learning by 131% on average and improves
final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy
Optimization (DAPO) algorithms across different reasoning tasks. Our code and
data are publicly available at https://github.com/starreeze/latr.
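A simplified sketch of the branching step (stage 1) in the spirit of LATR, using next-token entropy as the uncertainty signal; the lookahead simulation and similarity-based pruning stages are omitted, and the model interface, threshold, and branching factor are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy of a token -> probability dictionary."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)

def branch_candidates(next_token_distribution, prefix, tau=2.0, k=3):
    """If the next-token distribution is high-entropy (uncertain), branch into the
    top-k candidate tokens; otherwise continue along a single greedy path.
    next_token_distribution is a hypothetical callable returning token probabilities."""
    probs = next_token_distribution(prefix)
    if entropy(probs) < tau:                   # tau is a tunable uncertainty threshold
        best = max(probs, key=probs.get)
        return [prefix + [best]]
    top_k = sorted(probs, key=probs.get, reverse=True)[:k]
    return [prefix + [tok] for tok in top_k]
```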
☆ MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
In recent years, many generalization benchmarks have shown language models'
lack of robustness in natural language inference (NLI). However, manually
creating new benchmarks is costly, while automatically generating high-quality
ones, even by modifying existing benchmarks, is extremely difficult. In this
paper, we propose a methodology for automatically generating high-quality
variants of original NLI problems by replacing open-class words, while
crucially preserving their underlying reasoning. We dub our generalization test
as MERGE (Minimal Expression-Replacements GEneralization), which evaluates the
correctness of models' predictions across reasoning-preserving variants of the
original problem. Our results show that NLI models perform 4-20% worse on
variants, suggesting low generalizability even on such minimally altered
problems. We also analyse how word class of the replacements, word probability,
and plausibility influence NLI models' performance.
comment: Pre-print
☆ ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan
The limited capacity for fine-grained visual perception presents a critical
bottleneck for Vision-Language Models (VLMs) in real-world applications.
Addressing this is challenging due to the scarcity of high-quality data and the
limitations of existing methods: supervised fine-tuning (SFT) often compromises
general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual
reasoning over visual perception. To bridge this gap, we propose a novel
two-stage task that structures visual perception learning as a coarse-to-fine
progressive process. Based on this task formulation, we develop ViPER, a
self-bootstrapping framework specifically designed to enable iterative
evolution through self-critiquing and self-prediction. By synergistically
integrating image-level and instance-level reconstruction with a two-stage
reinforcement learning strategy, ViPER establishes a closed-loop training
paradigm, where internally synthesized data directly fuel the enhancement of
perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the
Qwen-Viper series. With an average gain of 1.7% on seven comprehensive
benchmarks spanning various tasks and up to 6.0% on fine-grained perception,
Qwen-Viper consistently demonstrates superior performance across different
vision-language scenarios while maintaining generalizability. Beyond enabling
self-improvement in perceptual capabilities, ViPER provides concrete evidence
for the reciprocal relationship between generation and understanding, a
breakthrough toward developing more autonomous and capable VLMs.
☆ Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?
Emergent symbolic representations are critical for enabling developmental
learning agents to plan and generalize across tasks. In this work, we
investigate whether large language models (LLMs) can translate human natural
language instructions into the internal symbolic representations that emerge
during hierarchical reinforcement learning. We apply a structured evaluation
framework to measure the translation performance of commonly seen LLMs -- GPT,
Claude, Deepseek and Grok -- across different internal symbolic partitions
generated by a hierarchical reinforcement learning algorithm in the Ant Maze
and Ant Fall environments. Our findings reveal that although LLMs demonstrate
some ability to translate natural language into a symbolic representation of
the environment dynamics, their performance is highly sensitive to partition
granularity and task complexity. The results expose limitations in current LLMs'
capacity for representation alignment, highlighting the need for further
research on robust alignment between language and internal agent
representations.
☆ From Memorization to Reasoning in the Spectrum of Loss Curvature
We characterize how memorization is represented in transformer models and
show that it can be disentangled in the weights of both language models (LMs)
and vision transformers (ViTs) using a decomposition based on the loss
landscape curvature. This insight is based on prior theoretical and empirical
work showing that the curvature for memorized training points is much sharper
than for non-memorized ones, meaning that ordering weight components from high
to low curvature can reveal a distinction without explicit labels. This
motivates a weight editing procedure that suppresses recitation of untargeted
memorized data far more effectively than a recent unlearning method
(BalancedSubnet), while maintaining lower perplexity. Since the basis of
curvature has a natural interpretation for shared structure in model weights,
we analyze the editing procedure extensively on its effect on downstream tasks
in LMs, and find that fact retrieval and arithmetic are specifically and
consistently negatively affected, even though open-book fact retrieval and
general logical reasoning are conserved. We posit that these tasks rely heavily on
specialized directions in weight space rather than general purpose mechanisms,
regardless of whether those individual datapoints are memorized. We support
this by showing a correspondence between the activation strength of task data
on the low-curvature components that we edit out and the drop in task performance
after the edit. Our work enhances the understanding of memorization in neural
networks with practical applications towards removing it, and provides evidence
for idiosyncratic, narrowly-used structures involved in solving tasks like math
and fact retrieval.
☆ Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations
Large Language Models (LLMs), predominantly trained on adult conversational
data, face significant challenges when generating authentic, child-like
dialogue for specialized applications. We present a comparative study
evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b,
and NorBloom-7b) to generate age-appropriate Norwegian conversations for
children aged 5 and 9 years. Through a blind evaluation by eleven education
professionals using both real child interview data and LLM-generated text
samples, we assessed authenticity and developmental appropriateness. Our
results show that evaluators achieved strong inter-rater reliability (ICC=0.75)
and demonstrated higher accuracy in age prediction for younger children
(5-year-olds) compared to older children (9-year-olds). While GPT-4 and
NorBloom-7b performed relatively well, most models generated language perceived
as more linguistically advanced than the target age groups. These findings
highlight critical data-related challenges in developing LLM systems for
specialized applications involving children, particularly in low-resource
languages where comprehensive age-appropriate lexical resources are scarce.
comment: 11 pages excluding references and appendix. 3 figures and 6 tables
☆ Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
In this work, we tackle the Diacritic Restoration (DR) task for Arabic
dialectal sentences using a multimodal approach that combines both textual and
speech information. We propose a model that represents the text modality using
an encoder extracted from our own pre-trained model named CATT. The speech
component is handled by the encoder module of the OpenAI Whisper base model.
Our solution is designed following two integration strategies. The former
consists of fusing the speech tokens with the input at an early stage, where
the 1500 frames of the audio segment are averaged over 10 consecutive frames,
resulting in 150 speech tokens. To ensure embedding compatibility, these
averaged tokens are processed through a linear projection layer prior to
merging them with the text tokens. Contextual encoding is guaranteed by the
CATT encoder module. The latter strategy relies on cross-attention, where text
and speech embeddings are fused. The cross-attention output is then fed to the
CATT classification head for token-level diacritic prediction. To further
improve model robustness, we randomly deactivate the speech input during
training, allowing the model to perform well with or without speech. Our
experiments show that the proposed approach achieves a word error rate (WER) of
0.25 and a character error rate (CER) of 0.9 on the development set. On the
test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
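A minimal PyTorch sketch of the early-fusion strategy described above: 1500 Whisper encoder frames are averaged over windows of 10 to give 150 speech tokens, projected to the text-embedding width, and concatenated with the text tokens before the CATT encoder. The tensor widths (512 for the Whisper base encoder, 768 for the text encoder) are assumptions for illustration.

```python
import torch
import torch.nn as nn

speech_frames = torch.randn(1, 1500, 512)    # Whisper base encoder output (assumed shape)
text_embeddings = torch.randn(1, 64, 768)    # CATT token embeddings (assumed shape)

# Average every 10 consecutive frames: 1500 frames -> 150 speech tokens.
speech_tokens = speech_frames.view(1, 150, 10, 512).mean(dim=2)

# Linear projection so speech tokens match the text-embedding dimension.
proj = nn.Linear(512, 768)
speech_tokens = proj(speech_tokens)

# Early fusion: prepend the speech tokens to the text tokens for contextual encoding.
fused = torch.cat([speech_tokens, text_embeddings], dim=1)   # (1, 150 + 64, 768)
```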
☆ Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? NeurIPS 2025
Large Language Models (LLMs) often produce explanations that do not
faithfully reflect the factors driving their predictions. In healthcare
settings, such unfaithfulness is especially problematic: explanations that omit
salient clinical cues or mask spurious shortcuts can undermine clinician trust
and lead to unsafe decision support. We study how inference and training-time
choices shape explanation faithfulness, focusing on factors practitioners can
control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA
8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions),
and manipulate the number and type of few-shot examples, prompting strategies,
and training procedure. Our results show: (i) both the quantity and quality of
few-shot examples significantly impact model faithfulness; (ii) faithfulness is
sensitive to prompting design; (iii) the instruction-tuning phase improves
measured faithfulness on MedQA. These findings offer insights into strategies
for enhancing the interpretability and trustworthiness of LLMs in sensitive
domains.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS
2025) Workshop: NeurIPS 2025 Workshop on Evaluating the Evolving LLM
Lifecycle: Benchmarks, Emergent Abilities, and Scaling
☆ HACK: Hallucinations Along Certainty and Knowledge Axes
Adi Simhi, Jonathan Herzig, Itay Itzhak, Dana Arad, Zorik Gekhman, Roi Reichart, Fazl Barez, Gabriel Stanovsky, Idan Szpektor, Yonatan Belinkov
Hallucinations in LLMs present a critical barrier to their reliable usage.
Existing research usually categorizes hallucination by their external
properties rather than by the LLMs' underlying internal properties. This
external focus overlooks that hallucinations may require tailored mitigation
strategies based on their underlying mechanism. We propose a framework for
categorizing hallucinations along two axes: knowledge and certainty. Since
parametric knowledge and certainty may vary across models, our categorization
method involves a model-specific dataset construction process that
differentiates between those types of hallucinations. Along the knowledge axis,
we distinguish between hallucinations caused by a lack of knowledge and those
occurring despite the model having knowledge of the correct response. To
validate our framework along the knowledge axis, we apply steering mitigation,
which relies on the existence of parametric knowledge to manipulate model
activations. This addresses the lack of existing methods to validate knowledge
categorization by showing a significant difference between the two
hallucination types. We further analyze the distinct knowledge and
hallucination patterns between models, showing that different hallucinations do
occur despite shared parametric knowledge. Turning to the certainty axis, we
identify a particularly concerning subset of hallucinations where models
hallucinate with certainty despite having the correct knowledge internally. We
introduce a new evaluation metric to measure the effectiveness of mitigation
methods on this subset, revealing that while some methods perform well on
average, they fail disproportionately on these critical cases. Our findings
highlight the importance of considering both knowledge and certainty in
hallucination analysis and call for targeted mitigation approaches that
consider the factors underlying hallucinations.
comment: The code is available at
https://github.com/technion-cs-nlp/HACK_Hallucinations_Along_Certainty_and_Knowledge_axes
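A toy illustration of the two-axis categorization (knowledge x certainty) described above. How parametric knowledge and certainty are actually estimated is model-specific and handled by the paper's dataset construction process, not by this sketch.

```python
def categorize_hallucination(has_parametric_knowledge: bool, is_certain: bool) -> str:
    """Place a hallucination in one of four cells along the knowledge and certainty axes."""
    if has_parametric_knowledge and is_certain:
        # The subset the abstract flags as particularly concerning.
        return "certain hallucination despite correct internal knowledge"
    if has_parametric_knowledge:
        return "uncertain hallucination despite correct internal knowledge"
    if is_certain:
        return "certain hallucination caused by missing knowledge"
    return "uncertain hallucination caused by missing knowledge"
```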
☆ Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
Large Language Models (LLMs) encode vast amounts of knowledge in their
massive parameters, which can be located, traced, and analyzed. Despite
advances in neural interpretability, it is still not clear how to transfer
knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT).
A key problem is enabling effective and efficient knowledge transfer across
LLMs of different scales, which is essential for achieving greater flexibility
and broader applicability in transferring knowledge between LLMs. Due to neural
incompatibility, referring to the architectural and parametric differences
between LLMs of varying scales, existing methods that directly reuse layer
parameters are severely limited. In this paper, we identify the semantic
alignment in latent space as the fundamental prerequisite for LLM cross-scale
knowledge transfer. Instead of directly using the layer parameters, our
approach takes activations as the medium of layer-wise knowledge transfer.
Leveraging the semantics in latent space, our approach is simple and
outperforms prior work, better aligning model behaviors across varying scales.
Evaluations on four benchmarks demonstrate the efficacy of our method. Further
analysis reveals the key factors easing cross-scale knowledge transfer and
provides insights into the nature of latent semantic alignment.
comment: an early-stage version
☆ Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability
This paper explores the influence of external knowledge integration in
Natural Language Generation (NLG), focusing on a commonsense generation task.
We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input
concept sets with retrieved semantic relations from ConceptNet and includes
manually annotated outputs. Using the T5-Large model, we compare sentence
generation under two conditions: with full external knowledge and with filtered
knowledge where highly relevant relations were deliberately removed. Our
interpretability benchmark follows a three-stage method: (1) identifying and
removing key knowledge, (2) regenerating sentences, and (3) manually assessing
outputs for commonsense plausibility and concept coverage. Results show that
sentences generated with full knowledge achieved 91\% correctness across both
criteria, while filtering reduced performance drastically to 6\%. These
findings demonstrate that relevant external knowledge is critical for
maintaining both coherence and concept coverage in NLG. This work highlights
the importance of designing interpretable, knowledge-enhanced NLG systems and
calls for evaluation frameworks that capture the underlying reasoning beyond
surface-level metrics.
☆ MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Sarcasm is a complex form of figurative language in which the intended
meaning contradicts the literal one. Its prevalence in social media and popular
culture poses persistent challenges for natural language understanding,
sentiment analysis, and content moderation. With the emergence of multimodal
large language models, sarcasm detection extends beyond text and requires
integrating cues from audio and vision. We present MuSaG, the first German
multimodal sarcasm detection dataset, consisting of 33 minutes of manually
selected and human-annotated statements from German television shows. Each
instance provides aligned text, audio, and video modalities, annotated
separately by humans, enabling evaluation in unimodal and multimodal settings.
We benchmark nine open-source and commercial models, spanning text, audio,
vision, and multimodal architectures, and compare their performance to human
annotations. Our results show that while humans rely heavily on audio in
conversational settings, models perform best on text. This highlights a gap in
current multimodal models and motivates the use of MuSaG for developing models
better suited to realistic scenarios. We release MuSaG publicly to support
future research on multimodal sarcasm detection and human-model alignment.
☆ Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean ACL
Chanwoo Park, Suyoung Park, JiA Kang, Jongyeon Park, Sangho Kim, Hyunji M. Park, Sumin Bae, Mingyu Kang, Jaejin Lee
We present Ko-MuSR, the first benchmark to comprehensively evaluate
multistep, soft reasoning in long Korean narratives while minimizing data
contamination. Built following MuSR, Ko-MuSR features fully Korean narratives,
reasoning chains, and multiple-choice questions verified by human annotators
for logical consistency and answerability. Evaluations of four large language
models -- two multilingual and two Korean-specialized -- show that multilingual
models outperform Korean-focused ones even in Korean reasoning tasks,
indicating cross-lingual generalization of reasoning ability. Carefully
designed prompting strategies, which combine few-shot examples, reasoning
traces, and task-specific hints, further boost accuracy, approaching
human-level performance. Ko-MuSR offers a solid foundation for advancing Korean
NLP by enabling systematic evaluation of long-context reasoning and prompting
strategies.
comment: submitted to ACL ARR Rolling Review
☆ Beyond Line-Level Filtering for the Pretraining Corpora of LLMs ACL
While traditional line-level filtering techniques, such as line-level
deduplication and trailing-punctuation filters, are commonly used, these basic
methods can sometimes discard valuable content, negatively affecting downstream
performance. In this paper, we introduce two methods, pattern-aware line-level
deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF), by
enhancing the conventional filtering techniques. Our approach not only
considers line-level signals but also takes into account their sequential
distribution across documents, enabling us to retain structurally important
content that might otherwise be removed. We evaluate these proposed methods by
training small language models (1B parameters) in both English and Korean. The
results demonstrate that our methods consistently improve performance on
multiple-choice benchmarks and significantly enhance generative
question-answering accuracy on both SQuAD v1 and KorQuAD v1.
comment: submitted to ACL ARR Rolling Review
☆ VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025
Recent advances in text-to-video (T2V) generation highlight the critical role
of high-quality video-text pairs in training models capable of producing
coherent and instruction-aligned videos. However, strategies for optimizing
video captions specifically for T2V training remain underexplored. In this
paper, we introduce VC4VG (Video Captioning for Video Generation), a
comprehensive caption optimization framework tailored to the needs of T2V
models. We begin by analyzing caption content from a T2V perspective,
decomposing the essential elements required for video reconstruction into
multiple dimensions, and proposing a principled caption design methodology. To
support evaluation, we construct VC4VG-Bench, a new benchmark featuring
fine-grained, multi-dimensional, and necessity-graded metrics aligned with
T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a
strong correlation between improved caption quality and video generation
performance, validating the effectiveness of our approach. We release all
benchmark tools and code at https://github.com/qyr0403/VC4VG to support further
research.
comment: Accepted by EMNLP 2025
☆ Reinforcement Learning for Long-Horizon Multi-Turn Search Agents NeurIPS 2025
Large Language Model (LLM) agents can leverage multiple turns and tools to
solve complex tasks, with prompt-based approaches achieving strong performance.
This work demonstrates that Reinforcement Learning (RL) can push capabilities
significantly further by learning from experience. Through experiments on a
legal document search benchmark, we show that our RL-trained 14 Billion
parameter model outperforms frontier class models (85% vs 78% accuracy). In
addition, we explore turn-restricted regimes, during training and at test-time,
that show these agents achieve better results if allowed to operate over longer
multi-turn horizons.
comment: 4 pages plus references and appendices. Accepted into the First
Workshop on Multi-Turn Interactions in Large Language Models at NeurIPS 2025
☆ Squrve: A Unified and Modular Framework for Complex Real-World Text-to-SQL Tasks
Text-to-SQL technology has evolved rapidly, with diverse academic methods
achieving impressive results. However, deploying these techniques in real-world
systems remains challenging due to limited integration tools. To bridge this
gap, we introduce Squrve, a unified, modular, and extensive Text-to-SQL
framework designed to bring together research advances and real-world
applications. Squrve first establishes a universal execution paradigm that
standardizes invocation interfaces, then proposes a multi-actor collaboration
mechanism based on seven abstracted effective atomic actor components.
Experiments on widely adopted benchmarks demonstrate that the collaborative
workflows consistently outperform the original individual methods, thereby
opening up a new effective avenue for tackling complex real-world queries. The
codes are available at https://github.com/Satissss/Squrve.
☆ RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
Md. Rezuwan Hassan, Azmol Hossain, Kanij Fatema, Rubayet Sabbir Faruque, Tanmoy Shome, Ruwad Naswan, Trina Chakraborty, Md. Foriduzzaman Zihad, Tawsif Tashwar Dipto, Nazia Tasnim, Nazmuddoha Ansary, Md. Mehedi Hasan Shawon, Ahmed Imtiaz Humayun, Md. Golam Rabiul Alam, Farig Sadeque, Asif Sushmit
The Bengali language, spoken extensively across South Asia and among
diasporic communities, exhibits considerable dialectal diversity shaped by
geography, culture, and history. Phonological and pronunciation-based
classifications broadly identify five principal dialect groups: Eastern
Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Within Bangladesh, further
distinctions emerge through variation in vocabulary, syntax, and morphology, as
observed in regions such as Chittagong, Sylhet, Rangpur, Rajshahi, Noakhali,
and Barishal. Despite this linguistic richness, systematic research on the
computational processing of Bengali dialects remains limited. This study seeks
to document and analyze the phonetic and morphological properties of these
dialects while exploring the feasibility of building computational models,
particularly Automatic Speech Recognition (ASR) systems, tailored to regional
varieties. Such efforts hold potential for applications in virtual assistants
and broader language technologies, contributing to both the preservation of
dialectal diversity and the advancement of inclusive digital tools for
Bengali-speaking communities. The dataset created for this study is released
for public use.
comment: 26 pages
☆ Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Sadallah, Abeer Kashar, Abolade Daud, Abosede Grace Olanihun, Adamu Labaran Mohammed, Adeyemi Praise, Adhikarinayum Meerajita Sharma, Aditi Gupta, Afitab Iyigun, Afonso Simplício, Ahmed Essouaied, Aicha Chorana, Akhil Eppa, Akintunde Oladipo, Akshay Ramesh, Aleksei Dorkin, Alfred Malengo Kondoro, Alham Fikri Aji, Ali Eren Çetintaş, Allan Hanbury, Alou Dembele, Alp Niksarli, Álvaro Arroyo, Amin Bajand, Amol Khanna, Ana Chkhaidze, Ana Condez, Andiswa Mkhonto, Andrew Hoblitzell, Andrew Tran, Angelos Poulis, Anirban Majumder, Anna Vacalopoulou, Annette Kuuipolani Kanahele Wong, Annika Simonsen, Anton Kovalev, Ashvanth. S, Ayodeji Joseph Lana, Barkin Kinay, Bashar Alhafni, Benedict Cibalinda Busole, Bernard Ghanem, Bharti Nathani, Biljana Stojanovska Đurić, Bola Agbonile, Bragi Bergsson, Bruce Torres Fischer, Burak Tutar, Burcu Alakuş Çınar, Cade J. Kanoniakapueo Kane, Can Udomcharoenchaikit, Catherine Arnett, Chadi Helwe, Chaithra Reddy Nerella, Chen Cecilia Liu, Chiamaka Glory Nwokolo, Cristina España-Bonet, Cynthia Amol, DaeYeop Lee, Dana Arad, Daniil Dzenhaliou, Daria Pugacheva, Dasol Choi, Daud Abolade, David Liu, David Semedo, Deborah Popoola, Deividas Mataciunas, Delphine Nyaboke, Dhyuthy Krishna Kumar, Diogo Glória-Silva, Diogo Tavares, Divyanshu Goyal, DongGeon Lee, Ebele Nwamaka Anajemba, Egonu Ngozi Grace, Elena Mickel, Elena Tutubalina, Elias Herranen, Emile Anand, Emmanuel Habumuremyi, Emuobonuvie Maria Ajiboye, Eryawan Presma Yulianrifat, Esther Adenuga, Ewa Rudnicka, Faith Olabisi Itiola, Faran Taimoor Butt, Fathima Thekkekara, Fatima Haouari, Filbert Aurelian Tjiaranata, Firas Laakom, Francesca Grasso, Francesco Orabona, Francesco Periti, Gbenga Kayode Solomon, Gia Nghia Ngo, Gloria Udhehdhe-oze, Gonçalo Martins, Gopi Naga Sai Ram Challagolla, Guijin Son, Gulnaz Abdykadyrova, Hafsteinn Einarsson, Hai Hu, Hamidreza Saffari, Hamza Zaidi, Haopeng Zhang, Harethah Abu Shairah, Harry Vuong, Hele-Andra Kuulmets, Houda Bouamor, Hwanjo Yu, Iben Nyholm Debess, İbrahim Ethem Deveci, Ikhlasul Akmal Hanif, Ikhyun Cho, Inês Calvo, Inês Vieira, Isaac Manzi, Ismail Daud, Itay Itzhak, Iuliia, Alekseenko, Ivan Belashkin, Ivan Spada, Ivan Zhelyazkov, Jacob Brinton, Jafar Isbarov, Jaka Čibej, Jan Čuhel, Jan Kocoń, Jauza Akbar Krito, Jebish Purbey, Jennifer Mickel, Jennifer Za, Jenny Kunz, Jihae Jeong, Jimena Tena Dávalos, Jinu Lee, João Magalhães, John Yi, Jongin Kim, Joseph Chataignon, Joseph Marvin Imperial, Jubeerathan Thevakumar, Judith Land, Junchen Jiang, Jungwhan Kim, Kairit Sirts, Kamesh R, Kamesh V, Kanda Patrick Tshinu, Kätriin Kukk, Kaustubh Ponkshe, Kavsar Huseynova, Ke He, Kelly Buchanan, Kengatharaiyer Sarveswaran, Kerem Zaman, Khalil Mrini, Kian Kyars, Krister Kruusmaa, Kusum Chouhan, Lainitha Krishnakumar, Laura Castro Sánchez, Laura Porrino Moscoso, Leshem Choshen, Levent Sencan, Lilja Øvrelid, Lisa Alazraki, Lovina Ehimen-Ugbede, Luheerathan Thevakumar, Luxshan Thavarasa, Mahnoor Malik, Mamadou K. 
Keita, Mansi Jangid, Marco De Santis, Marcos García, Marek Suppa, Mariam D'Ciofalo, Marii Ojastu, Maryam Sikander, Mausami Narayan, Maximos Skandalis, Mehak Mehak, Mehmet İlteriş Bozkurt, Melaku Bayu Workie, Menan Velayuthan, Michael Leventhal, Michał Marcińczuk, Mirna Potočnjak, Mohammadamin Shafiei, Mridul Sharma, Mrityunjaya Indoria, Muhammad Ravi Shulthan Habibi, Murat Kolić, Nada Galant, Naphat Permpredanun, Narada Maugin, Nicholas Kluge Corrêa, Nikola Ljubešić, Nirmal Thomas, Nisansa de Silva, Nisheeth Joshi, Nitish Ponkshe, Nizar Habash, Nneoma C. Udeze, Noel Thomas, Noémi Ligeti-Nagy, Nouhoum Coulibaly, Nsengiyumva Faustin, Odunayo Kareemat Buliaminu, Odunayo Ogundepo, Oghojafor Godswill Fejiro, Ogundipe Blessing Funmilola, Okechukwu God'spraise, Olanrewaju Samuel, Olaoye Deborah Oluwaseun, Olasoji Akindejoye, Olga Popova, Olga Snissarenko, Onyinye Anulika Chiemezie, Orkun Kinay, Osman Tursun, Owoeye Tobiloba Moses, Oyelade Oluwafemi Joshua, Oyesanmi Fiyinfoluwa, Pablo Gamallo, Pablo Rodríguez Fernández, Palak Arora, Pedro Valente, Peter Rupnik, Philip Oghenesuowho Ekiugbo, Pramit Sahoo, Prokopis Prokopidis, Pua Niau-Puhipau, Quadri Yahya, Rachele Mignone, Raghav Singhal, Ram Mohan Rao Kadiyala, Raphael Merx, Rapheal Afolayan, Ratnavel Rajalakshmi, Rishav Ghosh, Romina Oji, Ron Kekeha Solis, Rui Guerra, Rushikesh Zawar, Sa'ad Nasir Bashir, Saeed Alzaabi, Sahil Sandeep, Sai Pavan Batchu, SaiSandeep Kantareddy, Salsabila Zahirah Pranida, Sam Buchanan, Samuel Rutunda, Sander Land, Sarah Sulollari, Sardar Ali, Saroj Sapkota, Saulius Tautvaisas, Sayambhu Sen, Sayantani Banerjee, Sebastien Diarra, SenthilNathan. M, Sewoong Lee, Shaan Shah, Shankar Venkitachalam, Sharifa Djurabaeva, Sharon Ibejih, Shivanya Shomir Dutta, Siddhant Gupta, Silvia Paniagua Suárez, Sina Ahmadi, Sivasuthan Sukumar, Siyuan Song, Snegha A., Sokratis Sofianopoulos, Sona Elza Simon, Sonja Benčina, Sophie Gvasalia, Sphurti Kirit More, Spyros Dragazis, Stephan P. Kaufhold, Suba. S, Sultan AlRashed, Surangika Ranathunga, Taiga Someya, Taja Kuzman Pungeršek, Tal Haklay, Tasi'u Jibril, Tatsuya Aoyama, Tea Abashidze, Terenz Jomar Dela Cruz, Terra Blevins, Themistoklis Nikas, Theresa Dora Idoko, Thu Mai Do, Tilek Chubakov, Tommaso Gargiani, Uma Rathore, Uni Johannesen, Uwuma Doris Ugwu, Vallerie Alexandra Putra, Vanya Bannihatti Kumar, Varsha Jeyarajalingam, Varvara Arzt, Vasudevan Nedumpozhimana, Viktoria Ondrejova, Viktoryia Horbik, Vishnu Vardhan Reddy Kummitha, Vuk Dinić, Walelign Tewabe Sewunetie, Winston Wu, Xiaojing Zhao, Yacouba Diarra, Yaniv Nikankin, Yash Mathur, Yixi Chen, Yiyuan Li, Yolanda Xavier, Yonatan Belinkov, Yusuf Ismail Abayomi, Zaid Alyafeai, Zhengyang Shan, Zhi Rui Tam, Zilu Tang, Zuzana Nadova, Baber Abbasi, Stella Biderman, David Stap, Duygu Ataman, Fabian Schmidt, Hila Gonen, Jiayi Wang, David Ifeoluwa Adelani
To date, there exist almost no culturally-specific evaluation benchmarks for
large language models (LLMs) that cover a large number of languages and
cultures. In this paper, we present Global PIQA, a participatory commonsense
reasoning benchmark for over 100 languages, constructed by hand by 335
researchers from 65 countries around the world. The 116 language varieties in
Global PIQA cover five continents, 14 language families, and 23 writing
systems. In the non-parallel split of Global PIQA, over 50% of examples
reference local foods, customs, traditions, or other culturally-specific
elements. We find that state-of-the-art LLMs perform well on Global PIQA in
aggregate, but they exhibit weaker performance in lower-resource languages (up
to a 37% accuracy gap, despite random chance at 50%). Open models generally
perform worse than proprietary models. Global PIQA highlights that in many
languages and cultures, everyday knowledge remains an area for improvement,
alongside more widely-discussed capabilities such as complex reasoning and
expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA
provides a glimpse into the wide diversity of cultures in which human language
is embedded.
comment: Preprint
☆ Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
Large Language Models (LLMs) have advanced machine translation but remain
vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not
capable of exposing failures in multilingual LLMs. To reveal hallucinations in
multilingual LLMs, we introduce a diagnostic framework with a taxonomy that
separates Instruction Detachment from Source Detachment. Guided by this
taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark
across 11 English-to-X directions. We employ four frontier LLMs to generate
candidates, which we then scrutinize with an ensemble of LLM judges and
expert validation. In this way, we curate 5,435 high-quality instances. We have
evaluated 17 LLMs on HalloMTBench. Results reveal distinct ``hallucination
triggers'' -- unique failure patterns reflecting model scale, source length
sensitivity, linguistic biases, and Reinforcement-Learning (RL) amplified
language mixing. HalloMTBench offers a forward-looking testbed for diagnosing
LLM translation failures. HalloMTBench is available at
https://huggingface.co/collections/AIDC-AI/marco-mt.
☆ Pie: A Programmable Serving System for Emerging LLM Applications SOSP 2025
Emerging large language model (LLM) applications involve diverse reasoning
strategies and agentic workflows, straining the capabilities of existing
serving systems built on a monolithic token generation loop. This paper
introduces Pie, a programmable LLM serving system designed for flexibility and
efficiency. Pie decomposes the traditional generation loop into fine-grained
service handlers exposed via an API and delegates control of the generation
process to user-provided programs, called inferlets. This enables applications
to implement new KV cache strategies and bespoke generation logic, and to
seamlessly integrate computation and I/O, entirely within the application, without
requiring modifications to the serving system. Pie executes inferlets using
WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows
Pie matches state-of-the-art performance on standard tasks (3-12% latency
overhead) while significantly improving latency and throughput (1.3x-3.4x
higher) on agentic workflows by enabling application-specific optimizations.
comment: SOSP 2025. Source code available at
https://github.com/pie-project/pie
☆ GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research
Xinqi Li, Yiqun Liu, Shan Jiang, Enrong Zheng, Huaijin Zheng, Wenhao Dai, Haodong Deng, Dianhai Yu, Yanjun Ma
We introduce GraphNet, a dataset of 2.7K real-world deep learning
computational graphs with rich metadata, spanning six major task categories
across multiple deep learning frameworks. To evaluate tensor compiler
performance on these samples, we propose the benchmark metric Speedup Score
S(t), which jointly considers runtime speedup and execution correctness under
tunable tolerance levels, offering a reliable measure of general optimization
capability. Furthermore, we extend S(t) to the Error-aware Speedup Score ES(t),
which incorporates error information and helps compiler developers identify key
performance bottlenecks. In this report, we benchmark the default tensor
compilers, CINN for PaddlePaddle and TorchInductor for PyTorch, on computer
vision (CV) and natural language processing (NLP) samples to demonstrate the
practicality of GraphNet. The full construction pipeline with graph extraction
and compiler evaluation tools is available at
https://github.com/PaddlePaddle/GraphNet .
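For intuition only, a hedged sketch of a speedup score that jointly considers runtime speedup and execution correctness under a tunable tolerance t. This is not the exact definition of S(t) or ES(t) in GraphNet; it only illustrates the kind of quantity such a metric combines.

```python
def speedup_score(baseline_time, compiled_time, max_abs_error, t):
    """Credit the runtime speedup only when the compiled outputs match the baseline
    within tolerance t; incorrect samples contribute no credit (assumed convention)."""
    if max_abs_error > t:
        return 0.0
    return baseline_time / compiled_time
```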
☆ Success and Cost Elicit Convention Formation for Efficient Communication
Humans leverage shared conversational context to become increasingly
successful and efficient at communicating over time. One manifestation of this
is the formation of ad hoc linguistic conventions, which allow people to
coordinate on short, less costly utterances that are understood using shared
conversational context. We present a method to train large multimodal models to
form conventions, enabling efficient communication. Our approach uses simulated
reference games between models, and requires no additional human-produced data.
In repeated reference games involving photographs and tangram images, our
method enables models to communicate efficiently with people: reducing the
message length by up to 41% while increasing success by 15% over the course of
the interaction. Human listeners respond faster when interacting with our model
that forms conventions. We also show that training based on success or cost
alone is insufficient - both are necessary to elicit convention formation.
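A minimal sketch of a training signal that combines communicative success with message cost, reflecting the finding above that neither alone suffices. The cost definition and weighting are assumptions, not the paper's exact objective.

```python
def reference_game_reward(listener_correct: bool, message_tokens: int,
                          cost_weight: float = 0.01) -> float:
    """Reward successful reference while penalizing longer (costlier) utterances."""
    success = 1.0 if listener_correct else 0.0
    return success - cost_weight * message_tokens
```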
☆ SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs
Knowledge Distillation (KD) has become a cornerstone technique for
compressing Large Language Models (LLMs) into smaller, more efficient student
models. However, conventional KD approaches typically apply the distillation
loss uniformly across all tokens, regardless of the teacher's confidence. This
indiscriminate mimicry can introduce noise, as the student is forced to learn
from the teacher's uncertain or high-entropy predictions, which may ultimately
harm student performance-especially when the teacher is much larger and more
powerful. To address this, we propose Speculative Knowledge Distillation
(SpecKD), a novel, plug-and-play framework that introduces a dynamic,
token-level gating mechanism inspired by the "propose-and-verify" paradigm of
speculative decoding. At each step, the student's token proposal is verified
against the teacher's distribution; the distillation loss is selectively
applied only to "accepted" tokens, while "rejected" tokens are masked out.
Extensive experiments on diverse text generation tasks show that SpecKD
consistently and significantly outperforms strong KD baselines, leading to more
stable training and more capable student models, and achieving state-of-the-art
results.
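A minimal PyTorch sketch of token-level gated distillation in the spirit of SpecKD: the student's proposed token at each position is verified against the teacher's distribution, and the KL distillation loss is applied only where the proposal is accepted. The acceptance rule (teacher probability of the proposal above a threshold) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def speculative_kd_loss(student_logits, teacher_logits, accept_threshold=0.1):
    # student_logits, teacher_logits: (batch, seq_len, vocab)
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_prob = F.softmax(teacher_logits, dim=-1)

    # The student "proposes" its argmax token at each position.
    proposals = student_logits.argmax(dim=-1)                                  # (batch, seq_len)
    # Verify: accept positions where the teacher assigns the proposal enough mass.
    prob_of_proposal = teacher_prob.gather(-1, proposals.unsqueeze(-1)).squeeze(-1)
    accept_mask = (prob_of_proposal >= accept_threshold).float()

    # Token-level KL(teacher || student), averaged over accepted tokens only.
    kl = (teacher_prob * (teacher_prob.clamp_min(1e-9).log() - student_logp)).sum(-1)
    return (kl * accept_mask).sum() / accept_mask.sum().clamp_min(1.0)

# Tiny usage example with random logits:
loss = speculative_kd_loss(torch.randn(2, 8, 100), torch.randn(2, 8, 100))
```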
☆ Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward
Mitigating hallucinations in Large Language Models (LLMs) is critical for
their reliable deployment. Existing methods typically fine-tune LLMs to abstain
from answering questions beyond their knowledge scope. However, these methods
often rely on coarse-grained signals to guide LLMs to abstain, such as overall
confidence or uncertainty scores on multiple sampled answers, which may result
in an imprecise awareness of the model's own knowledge boundaries. To this end,
we propose a novel reinforcement learning framework built on
Fine-grained Semantic Confidence Reward (FiSCoRe), which guides LLMs to abstain via sample-specific
confidence. Specifically, our method operates by sampling multiple candidate
answers and conducting semantic clustering, then training the LLM to retain
answers within high-confidence clusters and discard those within low-confidence
ones, thereby promoting accurate post-hoc abstention. Additionally, we propose
a new metric for evaluating the reliability of abstention fine-tuning tasks
more comprehensively. Our method significantly enhances reliability in both
in-domain and out-of-distribution benchmarks.
comment: 23 pages, 4 figures
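A rough sketch of the sample-specific confidence signal described above: sample several candidate answers, group semantically equivalent ones, and use the relative cluster size as the confidence for retaining or discarding each answer. The equivalence test same_meaning stands in for whatever semantic clustering the paper actually uses (for example, an NLI model); the RL training itself is not shown.

```python
def cluster_answers(answers, same_meaning):
    """Greedy clustering: put each answer into the first cluster it is equivalent to."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

def confidence_per_answer(answers, same_meaning):
    """Confidence of an answer = relative size of its semantic cluster."""
    clusters = cluster_answers(answers, same_meaning)
    conf = {}
    for cluster in clusters:
        c = len(cluster) / len(answers)
        for ans in cluster:
            conf[ans] = c
    return conf

# Answers in high-confidence clusters would be retained, and those in low-confidence
# clusters discarded (abstained from), when constructing the fine-tuning signal.
```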
☆ TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents ACL 2025
The task of information extraction (IE) is to extract structured knowledge
from text. However, it is often not straightforward to utilize IE output due to
the mismatch between the IE ontology and the downstream application needs. We
propose TEXT2DB, a new formulation of IE that emphasizes the integration of IE
output and the target database (or knowledge base). Given a user instruction, a
document set, and a database, our task requires the model to update the
database with values from the document set to satisfy the user instruction.
This task requires understanding user instructions for what to extract and
adapting to the given DB/KB schema for how to extract on the fly. To evaluate
this new task, we introduce a new benchmark featuring common demands such as
data infilling, row population, and column addition. In addition, we propose an
LLM agent framework, OPAL (Observe-Plan-Analyze LLM), which includes an Observer
component that interacts with the database, a Planner component that
generates a code-based plan with calls to IE models, and an Analyzer component
that provides feedback regarding code quality before execution. Experiments
show that OPAL can successfully adapt to diverse database schemas by generating
different code plans and calling the required IE models. We also highlight
difficult cases such as dealing with large databases with complex dependencies
and extraction hallucination, which we believe deserve further investigation.
Source code: https://github.com/yzjiao/Text2DB
comment: ACL 2025. Source code: https://github.com/yzjiao/Text2DB
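A schematic sketch of the Observer-Planner-Analyzer loop described above. All three components are placeholders for LLM-backed modules; the feedback convention and loop structure are assumptions, not the released implementation.

```python
def observe(database) -> str:
    """Summarize the DB/KB schema and relevant contents for the planner (placeholder)."""
    raise NotImplementedError

def plan(instruction: str, documents, db_summary: str) -> str:
    """Produce a code-based plan with calls to IE models (placeholder)."""
    raise NotImplementedError

def analyze(code_plan: str) -> str:
    """Return feedback on code quality before execution (placeholder)."""
    raise NotImplementedError

def opal_update(instruction, documents, database, max_rounds=3):
    db_summary = observe(database)
    code_plan = plan(instruction, documents, db_summary)
    for _ in range(max_rounds):
        feedback = analyze(code_plan)
        if feedback == "ok":      # assumed convention for "no issues found"
            break
        code_plan = plan(instruction, documents, db_summary + "\n" + feedback)
    return code_plan              # executed against the database afterwards
```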
☆ META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine
Evidence-based medicine (EBM) plays a crucial role in clinical practice. Given suitable medical articles, doctors can effectively reduce the incidence of misdiagnoses. Researchers have found large language model (LLM) techniques such as RAG efficient for EBM tasks. However, EBM maintains stringent
requirements for evidence, and RAG applications in EBM struggle to efficiently
distinguish high-quality evidence. Therefore, inspired by the meta-analysis
used in EBM, we provide a new method to re-rank and filter the medical
evidence. This method presents multiple principles to filter the best evidence
for LLMs to diagnose. We employ a combination of several EBM methods to emulate
the meta-analysis, which includes reliability analysis, heterogeneity analysis,
and extrapolation analysis. These processes allow the users to retrieve the
best medical evidence for the LLMs. Ultimately, we evaluate these high-quality articles and show an accuracy improvement of up to 11.4% in our experiments. Our method successfully enables RAG to extract higher-quality and more
reliable evidence from the PubMed dataset. This work can reduce the infusion of
incorrect knowledge into responses and help users receive more effective
replies.
☆ PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine
Evidence-based medicine (EBM) research has always been of paramount
importance. Finding appropriate medical theoretical support for the needs of physicians or patients is essential to reduce the occurrence of medical accidents. This process is often carried out by humans querying relevant literature databases, which lacks objectivity and efficiency. Therefore,
researchers utilize retrieval-augmented generation (RAG) to search for evidence
and generate responses automatically. However, current RAG methods struggle to
handle complex queries in real-world clinical scenarios. For example, when
queries lack certain information or use imprecise language, the model may
retrieve irrelevant evidence and generate unhelpful answers. To address this
issue, we present the PICOs-RAG to expand the user queries into a better
format. Our method can expand and normalize the queries into professional ones
and use the PICO format, a search strategy tool used in EBM, to extract the most important information for retrieval. This approach significantly enhances retrieval efficiency and relevance, resulting in up to an 8.8\% improvement over the baseline in our evaluation. PICOs-RAG thereby turns large language models into helpful and reliable medical assistants for EBM.
☆ M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems
Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing
medical question-answering systems through the integration of large language
models (LLMs) with external medical literature. LLMs can retrieve relevant
medical articles to generate more professional responses efficiently. However,
current RAG applications still face problems: they generate incorrect information, such as hallucinations, and fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This
method is inspired by the heterogeneity analysis approach used in
Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG
responses using evidence from multiple sources. First, we extract additional
medical literature from external knowledge bases. Then, we retrieve the
evidence documents generated by the RAG system. We use heterogeneity analysis
to check whether the evidence supports different viewpoints in the response. In
addition to verifying the accuracy of the response, we also assess the
reliability of the evidence provided by the RAG system. Our method shows an accuracy improvement of up to 23.31% across various LLMs. This work can help
detect errors in current RAG-based medical systems. It also makes the
applications of LLMs more reliable and reduces diagnostic errors.
☆ emg2speech: synthesizing speech from electromyography using self-supervised speech models
We present a neuromuscular speech interface that translates electromyographic
(EMG) signals collected from orofacial muscles during speech articulation
directly into audio. We show that self-supervised speech (SS) representations
exhibit a strong linear relationship with the electrical power of muscle action
potentials: SS features can be linearly mapped to EMG power with a correlation
of $r = 0.85$. Moreover, EMG power vectors corresponding to different
articulatory gestures form structured and separable clusters in feature space.
This relationship: $\text{SS features}$ $\xrightarrow{\texttt{linear mapping}}$
$\text{EMG power}$ $\xrightarrow{\texttt{gesture-specific clustering}}$
$\text{articulatory movements}$, highlights that SS models implicitly encode
articulatory mechanisms. Leveraging this property, we directly map EMG signals
to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech
generation without explicit articulatory models or vocoder training.
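A toy illustration of the reported linear relationship: fit a least-squares map from self-supervised features to EMG power and measure the Pearson correlation. The data here are synthetic; the $r \approx 0.85$ figure in the paper comes from real recordings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: T frames of self-supervised (SS) speech features and EMG power per channel.
T, d_ss, d_emg = 500, 32, 8
ss_feats = rng.normal(size=(T, d_ss))
true_W = rng.normal(size=(d_ss, d_emg))
emg_power = ss_feats @ true_W + 0.3 * rng.normal(size=(T, d_emg))  # synthetic "EMG power"

# Least-squares linear map from SS features to EMG power.
W, *_ = np.linalg.lstsq(ss_feats, emg_power, rcond=None)
pred = ss_feats @ W
r = np.corrcoef(pred.ravel(), emg_power.ravel())[0, 1]
print(f"Pearson r between predicted and actual EMG power: {r:.2f}")
```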
☆ Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs
A few studies have shown that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we switch the point of view to evaluation and address an additional blind spot that reveals itself when a multilingual LLM is fully fine-tuned on a parallel multilingual dataset before unlearning. Here, language confusion occurs, whereby the model responds in a language different from that of the input prompt. Language confusion is a problematic phenomenon in unlearning, causing standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) we introduce an N-gram-based Language-Mix (N-Mix) score to quantitatively show that language confusion is pervasive and consistent in multilingual LLMs, (2) we demonstrate that reference-based metrics produce false negatives when the N-Mix score is high, and (3) we argue for a new type of unlearning evaluation that directly assesses the content of the generated sentences. We call such metrics semantic-based metrics.
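The abstract does not give the exact N-Mix formula; the sketch below illustrates one plausible reading, the fraction of word n-grams whose detected language differs from the prompt language, with `detect_language` as a placeholder for a real language identifier.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def n_mix_score(response, prompt_lang, detect_language, n=2):
    """Fraction of word n-grams whose detected language differs from the prompt language.

    detect_language is a placeholder (e.g., a fastText or CLD3 wrapper); the actual
    N-Mix definition in the paper may differ from this sketch.
    """
    tokens = response.split()
    grams = ngrams(tokens, n) or [tuple(tokens)]
    mixed = sum(1 for g in grams if detect_language(" ".join(g)) != prompt_lang)
    return mixed / len(grams)

# toy detector: Hangul characters -> Korean, otherwise English
toy_detect = lambda text: "ko" if any("\uac00" <= ch <= "\ud7a3" for ch in text) else "en"
print(n_mix_score("The answer is 서울 입니다 today", prompt_lang="en",
                  detect_language=toy_detect, n=2))
```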
♻ ☆ Retrieval-Augmented Generation-based Relation Extraction
Information Extraction (IE) is a transformative process that converts
unstructured text data into a structured format by employing entity and
relation extraction (RE) methodologies. The identification of the relation
between a pair of entities plays a crucial role within this framework. Despite
the existence of various techniques for relation extraction, their efficacy
heavily relies on access to labeled data and substantial computational
resources. In addressing these challenges, Large Language Models (LLMs) emerge
as promising solutions; however, they might return hallucinated responses due to their own training data. To overcome these limitations, this work proposes Retrieval-Augmented Generation-based Relation Extraction (RAG4RE), offering a pathway to enhance the performance of relation extraction tasks. We evaluate the effectiveness of our RAG4RE approach with different LLMs on established benchmarks, namely the TACRED, TACREV, Re-TACRED, and SemEval RE datasets. In particular, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results demonstrate that our RAG4RE approach surpasses the performance of traditional RE approaches based solely on LLMs, with the improvement
particularly evident in the TACRED dataset and its variations. Furthermore, our
approach exhibits remarkable performance compared to previous RE methodologies
across both TACRED and TACREV datasets, underscoring its efficacy and potential
for advancing RE tasks in natural language processing.
comment: Published in the Semantic Web journal. The latest version is available at: https://doi.org/10.1177/22104968251385519
♻ ☆ Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
As Large Language Models (LLMs) expand across domains, LLM judges have become essential for system evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite, which integrates a tournament structure on top of head-to-head comparisons. The combination of a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and yields higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. These experiments collectively demonstrate that
Arena-Lite consistently achieves higher reliability with fewer comparisons,
even with smaller datasets or weaker judges. We release an easy-to-use web
demonstration and code to foster adoption of Arena-Lite, streamlining model
selection across research and industry communities. Arena-Lite demo and code
are available on
\href{https://huggingface.co/spaces/NCSOFT/ArenaLite}{https://huggingface.co/spaces/NCSOFT/ArenaLite}
comment: 8 pages for main body, 19 pages in total
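A bare-bones sketch of tournament-based direct comparison: systems are ranked by how long they survive a single-elimination bracket decided by a pairwise judge. The bye handling and the toy judge are illustrative choices, not the paper's exact procedure.

```python
import random

def knockout_rank(systems, outputs, judge, seed=0):
    """Rank systems by a single-elimination tournament over their outputs.

    judge(a, b) returns True if output a beats output b (e.g., an LLM judge call).
    Systems eliminated in later rounds are ranked higher.
    """
    rng = random.Random(seed)
    alive = list(systems)
    rng.shuffle(alive)
    eliminated_at, rnd = {}, 0
    while len(alive) > 1:
        rnd += 1
        nxt = []
        for i in range(0, len(alive) - 1, 2):
            a, b = alive[i], alive[i + 1]
            winner, loser = (a, b) if judge(outputs[a], outputs[b]) else (b, a)
            nxt.append(winner)
            eliminated_at[loser] = rnd
        if len(alive) % 2 == 1:          # odd one out gets a bye this round
            nxt.append(alive[-1])
        alive = nxt
    eliminated_at[alive[0]] = rnd + 1    # the champion survives longest
    return sorted(systems, key=lambda s: -eliminated_at[s])

# toy judge: the longer answer wins
outs = {"sysA": "short", "sysB": "a somewhat longer answer", "sysC": "mid answer", "sysD": "x"}
print(knockout_rank(list(outs), outs, judge=lambda a, b: len(a) > len(b)))
```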
♻ ☆ Says Who? Effective Zero-Shot Annotation of Focalization
Focalization describes the way in which access to narrative information is
restricted or controlled based on the knowledge available to the
narrator. It is encoded via a wide range of lexico-grammatical features and is
subject to reader interpretation. Even trained annotators frequently disagree
on correct labels, suggesting this task is both qualitatively and
computationally challenging. In this work, we test how well five contemporary
large language model (LLM) families and two baselines perform when annotating
short literary excerpts for focalization. Despite the challenging nature of the
task, we find that LLMs show comparable performance to trained human
annotators, with GPT-4o achieving an average F1 of 84.79%. Further, we
demonstrate that the log probabilities output by GPT-family models frequently
reflect the difficulty of annotating particular excerpts. Finally, we provide a
case study analyzing sixteen Stephen King novels, demonstrating the usefulness
of this approach for computational literary studies and the insights gleaned
from examining focalization at scale.
comment: Accepted at CHR 2025
♻ ☆ TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Large language models (LLMs) have demonstrated their effectiveness in
multivariate time series classification (MTSC). Effective adaptation of LLMs
for MTSC necessitates informative data representations. Existing LLM-based
methods directly encode embeddings for time series within the latent space of
LLMs from scratch to align with semantic space of LLMs. Despite their
effectiveness, we reveal that these methods conceal three inherent bottlenecks:
(1) they struggle to encode temporal and channel-specific information in a
lossless manner, both of which are critical components of multivariate time
series; (2) it is difficult to align the learned representation space with
the semantic space of the LLMs; (3) they require task-specific retraining,
which is both computationally expensive and labor-intensive. To bridge these
gaps, we propose TableTime, which reformulates MTSC as a table understanding
task. Specifically, TableTime introduces the following strategies: (1) convert
multivariate time series into a tabular form, thus minimizing information loss
to the greatest extent; (2) represent tabular time series in text format to
achieve natural alignment with the semantic space of LLMs; (3) design a
reasoning framework that integrates contextual text information, neighborhood
assistance, multi-path inference and problem decomposition to enhance the
reasoning ability of LLMs and realize zero-shot classification. Extensive
experiments on 10 representative publicly available datasets from the UEA archive verify the superiority of TableTime.
♻ ☆ BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
Litu Ou, Kuan Li, Huifeng Yin, Liwen Zhang, Zhongwang Zhang, Xixi Wu, Rui Ye, Zile Qiao, Pengjun Xie, Jingren Zhou, Yong Jiang
Confidence in LLMs is a useful indicator of model uncertainty and answer
reliability. Existing work has mainly focused on single-turn scenarios, while
research on confidence in complex multi-turn interactions is limited. In this
paper, we investigate whether LLM-based search agents have the ability to
communicate their own confidence through verbalized confidence scores after
long sequences of actions, a significantly more challenging task compared to
outputting confidence in a single interaction. Experimenting on open-source
agentic models, we first find that models exhibit much higher task accuracy at
high confidence while having near-zero accuracy when confidence is low. Based
on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality and encourage the model to try again until it reaches a satisfactory confidence level. Results show that our proposed
methods significantly reduce token consumption while demonstrating competitive
performance compared to baseline fixed budget TTS methods.
comment: 25 pages
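A minimal sketch of confidence-guided test-time scaling: re-query the agent until its verbalized confidence clears a threshold, keeping the most confident answer seen. `run_agent`, the threshold, and the stopping rule are assumptions for illustration.

```python
def confidence_gated_answer(run_agent, task, threshold=0.8, max_attempts=4):
    """Re-run the agent until its verbalized confidence clears a threshold.

    run_agent(task) is assumed to return (answer, confidence in [0, 1]); the paper's
    exact selection and budgeting rules may differ.
    """
    best_answer, best_conf = None, -1.0
    for _ in range(max_attempts):
        answer, conf = run_agent(task)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:          # confident enough: stop early and save tokens
            break
    return best_answer, best_conf

# toy agent whose confidence improves across attempts
import itertools
confs = itertools.count(start=5)
toy_agent = lambda task: (f"answer to {task}", min(1.0, next(confs) / 10))
print(confidence_gated_answer(toy_agent, "find the release year"))
```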
♻ ☆ The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness NeurIPS 2025
Reasoning-focused LLMs sometimes alter their behavior when they detect that
they are being evaluated, which can lead them to optimize for test-passing
performance or to comply more readily with harmful prompts if real-world
consequences appear absent. We present the first quantitative study of how such
"test awareness" impacts model behavior, particularly its performance on
safety-related tasks. We introduce a white-box probing framework that (i)
linearly identifies awareness-related activations and (ii) steers models toward
or away from test awareness while monitoring downstream performance. We apply
our method to different state-of-the-art open-weight reasoning LLMs across both
realistic and hypothetical tasks (denoting tests or simulations). Our results
demonstrate that test awareness significantly impacts safety alignment (such as
compliance with harmful requests and conforming to stereotypes) with effects
varying in both magnitude and direction across models. By providing control
over this latent effect, our work aims to provide a stress-test mechanism and
increase trust in how we perform safety evaluations.
comment: NeurIPS 2025 (Spotlight). Code is available at:
https://github.com/microsoft/Test_Awareness_Steering
♻ ☆ TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Accelerating the inference of large language models (LLMs) has been a
critical challenge in generative AI. Speculative decoding (SD) substantially
improves LLM inference efficiency. However, its utility is limited by a
fundamental constraint: the draft and target models must share the same
vocabulary, thus limiting the pool of available draft models and often
necessitating the training of a new model from scratch. Inspired by Dynamic
Time Warping (DTW), a classic algorithm for aligning time series, we propose
the algorithm TokenTiming for universal speculative decoding. It operates by
re-encoding the draft token sequence to get a new target token sequence, and
then uses DTW to build a mapping to transfer the probability distributions for
speculative sampling. Benefiting from this, our method accommodates mismatched
vocabularies and works with any off-the-shelf models without retraining or modification. We conduct comprehensive experiments on various tasks, demonstrating a 1.57x speedup. This work enables a universal approach to draft
model selection, making SD a more versatile and practical tool for LLM
acceleration.
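A small sketch of the DTW step, assuming a crude character-overlap cost between token strings; the real method re-encodes the draft sequence with the target tokenizer and uses the resulting alignment to transfer probability mass for speculative sampling.

```python
import numpy as np

def dtw_align(draft_tokens, target_tokens, cost):
    """Dynamic Time Warping alignment between two token sequences.

    Returns a list of (draft_idx, target_idx) pairs; cost(a, b) is any distance
    between token strings.
    """
    n, m = len(draft_tokens), len(target_tokens)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(draft_tokens[i - 1], target_tokens[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

char_dist = lambda a, b: 1.0 - len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)
draft = ["spec", "ul", "ative"]          # draft-tokenizer pieces
target = ["specul", "ative"]             # target-tokenizer pieces of the same text
print(dtw_align(draft, target, char_dist))
```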
♻ ☆ Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants have been extensively explored for automated scoring. However, the 512-token limit of these encoder-based models is a notable deficiency for automated scoring of long essays. This research therefore explores generative language models for automated scoring of long essays via summarization and prompting. The results reveal a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 tables, 7 figures. Presented at the Artificial Intelligence in Measurement and Education Conference (AIME-Con)
♻ ☆ AutoJudge: Judge Decoding Without Manual Annotation
Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin
We introduce AutoJudge, a method that accelerates large language model (LLM)
inference with task-specific lossy speculative decoding. Instead of matching
the original model output distribution token-by-token, we identify which of the
generated tokens affect the downstream quality of the response, relaxing the
distribution match guarantee so that the "unimportant" tokens can be generated
faster. Our approach relies on a semi-greedy search algorithm to test which of
the mismatches between target and draft models should be corrected to preserve
quality and which ones may be skipped. We then train a lightweight classifier
based on existing LLM embeddings to predict, at inference time, which
mismatching tokens can be safely accepted without compromising the final answer
quality. We evaluate the effectiveness of AutoJudge with multiple draft/target
model pairs on mathematical reasoning and programming benchmarks, achieving
significant speedups at the cost of a minor accuracy reduction. Notably, on
GSM8k with the Llama 3.1 70B target model, our approach achieves up to
$\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$
drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge
automatically detects programming-specific important tokens, accepting $\ge 25$ tokens per speculation cycle at a $2\%$ drop in Pass@1. Our approach requires no
human annotation and is easy to integrate with modern LLM inference frameworks.
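A sketch of the lightweight acceptance head, assuming mismatch-position embeddings and harmless/harmful labels have already been collected (here replaced by synthetic data); the actual features and classifier in AutoJudge may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for "LLM embeddings at positions where draft and target disagree".
# Labels: 1 = accepting the draft token did NOT hurt final answer quality, 0 = it did.
d = 64
X_train = rng.normal(size=(2000, d))
w_true = rng.normal(size=d)
y_train = (X_train @ w_true + 0.5 * rng.normal(size=2000) > 0).astype(int)

# Lightweight acceptance head trained on the collected mismatch data.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def accept_mismatch(embedding, threshold=0.9):
    """Accept a mismatching draft token only if the head is confident it is harmless."""
    p_safe = clf.predict_proba(embedding.reshape(1, -1))[0, 1]
    return p_safe >= threshold

print(accept_mismatch(rng.normal(size=d)))
```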
♻ ☆ Mano Technical Report
Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
Graphical user interfaces (GUIs) are the primary medium for human-computer
interaction, yet automating GUI interactions remains challenging due to the
complexity of visual elements, dynamic environments, and the need for
multi-step reasoning. Existing methods based on vision-language models (VLMs)
often suffer from limited resolution, domain mismatch, and insufficient
sequential decision-making capability. To address these issues, we propose Mano,
a robust GUI agent built upon a multi-modal foundation model pre-trained on
extensive web and computer system data. Our approach integrates a novel
simulated environment for high-fidelity data generation, a three-stage training
pipeline (supervised fine-tuning, offline reinforcement learning, and online
reinforcement learning), and a verification module for error recovery. Mano
demonstrates state-of-the-art performance on multiple GUI benchmarks, including
Mind2Web and OSWorld, achieving significant improvements in success rate and
operational accuracy. Our work provides new insights into the effective
integration of reinforcement learning with VLMs for practical GUI agent
deployment, highlighting the importance of domain-specific data, iterative
training, and holistic reward design.
♻ ☆ Are you sure? Measuring models bias in content moderation through uncertainty ACL
Automatic content moderation is crucial to ensuring safety in social media.
Language Model-based classifiers are being increasingly adopted for this task,
but it has been shown that they perpetuate racial and social biases. Even though several resources and benchmark corpora have been developed to address this issue, measuring the fairness of models in content moderation remains an open problem. In this work, we present an unsupervised approach that benchmarks models
on the basis of their uncertainty in classifying messages annotated by people
belonging to vulnerable groups. We use uncertainty, computed by means of the
conformal prediction technique, as a proxy to analyze the bias of 11 models
against women and non-white annotators and observe to what extent it diverges
from metrics based on performance, such as the $F_1$ score. The results show
that some pre-trained models predict with high accuracy the labels coming from
minority groups, even if the confidence in their prediction is low. Therefore,
by measuring the confidence of models, we are able to see which groups of
annotators are better represented in pre-trained models and can guide the debiasing process of these models before they are put to use.
comment: accepted at Findings of ACL: EMNLP 2025
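The general recipe, conformal prediction set sizes as an uncertainty proxy compared across annotator groups, can be sketched on synthetic scores as follows; the paper's actual nonconformity scores, models, and grouping are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_test, n_classes = 1000, 400, 3

# Toy stand-ins for a moderation classifier's softmax outputs and gold labels.
cal_probs = rng.dirichlet(np.ones(n_classes) * 2, size=n_cal)
cal_labels = rng.integers(0, n_classes, size=n_cal)
test_probs = rng.dirichlet(np.ones(n_classes) * 2, size=n_test)
annotator_group = rng.choice(["women", "non-white", "other"], size=n_test)

# Split conformal: nonconformity = 1 - p(true label); keep labels with 1 - p(y) <= q.
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
level = np.ceil((n_cal + 1) * 0.9) / n_cal
q = np.quantile(scores, level, method="higher")
set_sizes = ((1.0 - test_probs) <= q).sum(axis=1)   # larger set = more model uncertainty

# Compare average uncertainty across annotator groups, as a fairness proxy.
for g in np.unique(annotator_group):
    print(g, set_sizes[annotator_group == g].mean())
```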
♻ ☆ From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users
Sadia Sultana Chowa, Riasad Alvi, Subhey Sadi Rahman, Md Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam
The pursuit of human-level artificial intelligence (AI) has significantly
advanced the development of autonomous agents and Large Language Models (LLMs).
LLMs are now widely utilized as decision-making agents for their ability to
interpret instructions, manage sequential tasks, and adapt through feedback.
This review examines recent developments in employing LLMs as autonomous agents
and tool users and is organized around seven research questions. We only include papers published between 2023 and 2025 in A*- and A-ranked conferences and Q1 journals. We present a structured analysis of LLM agents' architectural design principles, divide their applications into single-agent and multi-agent systems, and review strategies for integrating external tools. In addition, we investigate the cognitive mechanisms of LLMs, including reasoning, planning, and memory, as well as the impact of prompting methods and fine-tuning procedures on agent performance. Furthermore, we evaluate current benchmarks and assessment protocols and provide an analysis of 68 publicly available datasets for assessing the performance of LLM-based agents on various tasks. In
conducting this review, we have identified critical findings on verifiable
reasoning of LLMs, the capacity for self-improvement, and the personalization
of LLM-based agents. Finally, we have discussed ten future research directions
to overcome these gaps.
comment: Submitted to Artificial Intelligence Review for peer review
♻ ☆ Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
This paper presents a novel task of extracting Latin fragments from
mixed-language historical documents with varied layouts. We benchmark and
evaluate the performance of large foundation models against a multimodal
dataset of 724 annotated pages. The results demonstrate that reliable Latin
detection with contemporary models is achievable. Our study provides the first
comprehensive analysis of these models' capabilities and limits for this task.
comment: Under review. Both the dataset and code will be published
♻ ☆ Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs NeurIPS 2025
The increasing deployment of Large Vision-Language Models (LVLMs) raises
safety concerns under potential malicious inputs. However, existing multimodal
safety evaluations primarily focus on model vulnerabilities exposed by static
image inputs, ignoring the temporal dynamics of video that may induce distinct
safety risks. To bridge this gap, we introduce Video-SafetyBench, the first
comprehensive benchmark designed to evaluate the safety of LVLMs under
video-text attacks. It comprises 2,264 video-text pairs spanning 48
fine-grained unsafe categories, each pairing a synthesized video with either a
harmful query, which contains explicit malice, or a benign query, which appears
harmless but triggers harmful behavior when interpreted alongside the video. To
generate semantically accurate videos for safety evaluation, we design a
controllable pipeline that decomposes video semantics into subject images (what
is shown) and motion text (how it moves), which jointly guide the synthesis of
query-relevant videos. To effectively evaluate uncertain or borderline harmful
outputs, we propose RJScore, a novel LLM-based metric that incorporates the
confidence of judge models and human-aligned decision threshold calibration.
Extensive experiments show that benign-query video composition achieves average
attack success rates of 67.2%, revealing consistent vulnerabilities to
video-induced attacks. We believe Video-SafetyBench will catalyze future
research into video-based safety evaluation and defense strategies.
comment: Accepted by NeurIPS 2025 Dataset and Benchmark Track, Project page:
https://liuxuannan.github.io/Video-SafetyBench.github.io/
♻ ☆ Zero-Shot Tokenizer Transfer NeurIPS 2024
Language models (LMs) are bound to their tokenizer, which maps raw text to a
sequence of vocabulary items (tokens). This restricts their flexibility: for
example, LMs trained primarily on English may still perform well in other
natural and programming languages, but have vastly decreased efficiency due to
their English-centric tokenizer. To mitigate this, we should be able to swap
the original LM tokenizer with an arbitrary one, on the fly, without degrading
performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer
Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for
the tokens in the vocabulary of the new tokenizer. Since prior heuristics for
initializing embeddings often perform at chance level in a ZeTT setting, we
propose a new solution: we train a hypernetwork taking a tokenizer as input and
predicting the corresponding embeddings. We empirically demonstrate that the
hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and
decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models'
performance in cross-lingual and coding tasks while markedly reducing the
length of the tokenized sequence. We also find that the remaining gap can be
quickly closed by continued training on less than 1B tokens. Finally, we show
that a ZeTT hypernetwork trained for a base (L)LM can also be applied to
fine-tuned variants without extra training. Overall, our results make
substantial strides toward detaching LMs from their tokenizer.
comment: NeurIPS 2024
♻ ☆ Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings
Natural Language Processing and Generation systems have recently shown the
potential to complement and streamline the costly and time-consuming job of
professional fact-checkers. In this work, we lift several constraints of
current state-of-the-art pipelines for automated fact-checking based on the
Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under
more realistic scenarios, RAG-based methods for the generation of verdicts -
i.e., short texts discussing the veracity of a claim - evaluating them on
stylistically complex claims and heterogeneous, yet reliable, knowledge bases.
Our findings show a complex landscape, where, for example, LLM-based retrievers
outperform other retrieval techniques, though they still struggle with
heterogeneous knowledge bases; larger models excel in verdict faithfulness,
while smaller models provide better context adherence, with human evaluations
favouring zero-shot and one-shot approaches for informativeness, and fine-tuned
models for emotional alignment.
comment: Code and data at https://github.com/drusso98/face-the-facts -
Accepted for publication at INLG 2025
♻ ☆ Provable Scaling Laws for the Test-Time Compute of Large Language Models NeurIPS 2025
We propose two simple, principled and practical algorithms that enjoy
provable scaling laws for the test-time compute of large language models
(LLMs). The first one is a two-stage knockout-style algorithm: given an input
problem, it first generates multiple candidate solutions and then aggregates them via a knockout tournament for the final output. Assuming that the LLM can
generate a correct solution with non-zero probability and do better than a
random guess in comparing a pair of correct and incorrect solutions, we prove
theoretically that the failure probability of this algorithm decays to zero
exponentially or by a power law (depending on the specific way of scaling) as
its test-time compute grows. The second one is a two-stage league-style
algorithm, where each candidate is evaluated by its average win rate against
multiple opponents, rather than eliminated upon loss to a single opponent.
Under analogous but more robust assumptions, we prove that its failure
probability also decays to zero exponentially with more test-time compute. Both
algorithms require a black-box LLM and nothing else (e.g., no verifier or
reward model) for a minimalistic implementation, which makes them appealing for
practical applications and easy to adapt for different tasks. Through extensive
experiments with diverse models and datasets, we validate the proposed theories
and demonstrate the outstanding scaling properties of both algorithms.
comment: NeurIPS 2025 camera-ready version
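A compact sketch of the league-style variant: each candidate is scored by its average win rate against sampled opponents, with `compare` standing in for a black-box pairwise LLM comparison (the knockout variant instead eliminates losers pairwise).

```python
import random

def league_select(candidates, compare, k_opponents=4, seed=0):
    """Pick the candidate with the highest average win rate against sampled opponents.

    compare(a, b) returns True if the LLM judges a better than b.
    """
    rng = random.Random(seed)
    win_rate = {}
    for i, cand in enumerate(candidates):
        opponents = [c for j, c in enumerate(candidates) if j != i]
        sampled = rng.sample(opponents, min(k_opponents, len(opponents)))
        wins = sum(compare(cand, opp) for opp in sampled)
        win_rate[i] = wins / max(len(sampled), 1)
    best = max(win_rate, key=win_rate.get)
    return candidates[best], win_rate[best]

# toy example: candidates are numeric "solutions", the judge prefers values closer to 42
cands = [40, 17, 43, 95, 41]
closer = lambda a, b: abs(a - 42) < abs(b - 42)
print(league_select(cands, closer))
```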
♻ ☆ Offline Learning and Forgetting for Reasoning with Large Language Models
Leveraging inference-time search in large language models has proven
effective in further enhancing a trained model's capability to solve complex
mathematical and reasoning problems. However, this approach significantly
increases computational costs and inference time, as the model must generate
and evaluate multiple candidate solutions to identify a viable reasoning path.
To address this, we propose an effective approach that integrates search
capabilities directly into the model by fine-tuning it on unpaired successful
(learning) and failed reasoning paths (forgetting) derived from diverse search
methods. A key challenge we identify is that naive fine-tuning can degrade the
model's search capability; we show this can be mitigated with a smaller
learning rate. Extensive experiments on the challenging Game-of-24 and
Countdown arithmetic puzzles show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around
23% over inference-time search baselines, while reducing inference time by
180$\times$. On top of this, our learning and forgetting objective consistently
outperforms both supervised fine-tuning and preference-based methods.
comment: Published in Transactions on Machine Learning Research (TMLR), 2025.
Code: https://github.com/twni2016/llm-reasoning-uft
♻ ☆ LittleBit: Ultra Low-Bit Quantization via Latent Factorization NeurIPS 2025
Deploying large language models (LLMs) often faces challenges from
substantial memory and computational costs. Quantization offers a solution, yet avoiding performance degradation in the sub-1-bit regime remains particularly difficult.
This paper introduces LittleBit, a novel method for extreme LLM compression. It
targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$
memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents
weights in a low-rank form using latent matrix factorization, subsequently
binarizing these factors. To counteract information loss from this extreme
precision, it integrates a multi-scale compensation mechanism. This includes
row and column scales, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual
Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware
training (QAT) initialization, and integrated Residual Compensation to mitigate
errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit
quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading
method's 0.7 BPW. LittleBit establishes a new, viable size-performance
trade-off--unlocking a potential 11.6$\times$ speedup over FP16 at the kernel
level--and makes powerful LLMs practical for resource-constrained environments.
comment: Accepted to NeurIPS 2025. Banseok Lee and Dongkyu Kim contributed
equally
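A rough, untrained sketch of the kind of parameterization described, binary low-rank factors with row, column, and per-rank scales; the Dual-SVID initialization, residual compensation, and QAT are not reproduced here, and the scale choices below are heuristic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, rank = 64, 128, 8

W = rng.normal(size=(out_dim, in_dim))

# Sketch of a LittleBit-style reconstruction (exact details differ):
# W_hat = diag(row_scale) @ (sign(U) @ diag(latent_scale) @ sign(V).T) @ diag(col_scale)
U = rng.normal(size=(out_dim, rank))
V = rng.normal(size=(in_dim, rank))
row_scale = np.abs(W).mean(axis=1)          # per-row compensation (placeholder)
col_scale = np.abs(W).mean(axis=0)          # per-column compensation (placeholder)
latent_scale = np.ones(rank)                # learned per-rank importance in the real method

low_rank = (np.sign(U) * latent_scale) @ np.sign(V).T           # binarized factors, [out, in]
W_hat = (row_scale[:, None] * col_scale[None, :]) * low_rank    # multi-scale compensation

err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
bpw = rank * (out_dim + in_dim) / (out_dim * in_dim)            # sign bits only, ignoring fp scales
print(f"relative error (toy, untrained): {err:.2f}; approx. bits per weight: {bpw:.2f}")
```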
♻ ☆ NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables NeurIPS 2025
Lanrui Wang, Mingyu Zheng, Hongyin Tang, Zheng Lin, Yanan Cao, Jingang Wang, Xunliang Cai, Weiping Wang
Processing structured tabular data, particularly large and lengthy tables,
constitutes a fundamental yet challenging task for large language models
(LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack
primarily focus on unstructured text, neglecting the challenge of diverse
structured tables. Meanwhile, previous tabular benchmarks mainly consider
downstream tasks that require high-level reasoning abilities, and overlook
models' underlying fine-grained perception of individual table cells, which is
crucial for practical and robust LLM-based table applications. To address this
gap, we introduce \textsc{NeedleInATable} (NIAT), a new long-context tabular
benchmark that treats each table cell as a ``needle'' and requires models to
extract the target cell based on cell locations or lookup questions. Our
comprehensive evaluation of various LLMs and multimodal LLMs reveals a
substantial performance gap between popular downstream tabular tasks and the
simpler NIAT task, suggesting that they may rely on dataset-specific
correlations or shortcuts to obtain better benchmark results but lack truly
robust long-context understanding towards structured tables. Furthermore, we
demonstrate that using synthesized NIAT training data can effectively improve
performance on both NIAT task and downstream tabular tasks, which validates the
importance of NIAT capability for LLMs' genuine table understanding ability.
comment: Accepted by NeurIPS 2025
♻ ☆ DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye
Recent studies on end-to-end (E2E) speech generation with large language
models (LLMs) have attracted significant community attention, with multiple
works extending text-based LLMs to generate discrete speech tokens. Existing
E2E approaches primarily fall into two categories: (1) Methods that generate
discrete speech tokens independently without incorporating them into the LLM's
autoregressive process, resulting in text generation being unaware of
concurrent speech synthesis. (2) Models that generate interleaved or parallel
speech-text tokens through joint autoregressive modeling, enabling mutual
modality awareness during generation. This paper presents DrVoice, a parallel
speech-text voice conversation model based on joint autoregressive modeling,
featuring dual-resolution speech representations. Notably, while current
methods utilize mainly 12.5Hz input audio representation, our proposed
dual-resolution mechanism reduces the input frequency for the LLM to 5Hz,
significantly reducing computational cost and alleviating the frequency
discrepancy between speech and text tokens, in turn better exploiting LLMs'
capabilities. Experimental results demonstrate that DRVOICE-7B establishes new
state-of-the-art (SOTA) on OpenAudioBench and Big Bench Audio benchmarks, while
achieving performance comparable to the SOTA on VoiceBench and UltraEval-Audio
benchmarks, making it a leading open-source speech foundation model among ~7B-scale models.
comment: Work in progress
♻ ☆ Evaluation of Geographical Distortions in Language Models
Language models now constitute essential tools for improving efficiency for
many professional tasks such as writing, coding, or learning. For this reason,
it is imperative to identify inherent biases. In the field of Natural Language
Processing, five sources of bias are well-identified: data, annotation,
representation, models, and research design. This study focuses on biases
related to geographical knowledge. We explore the connection between geography
and language models by highlighting their tendency to misrepresent spatial
information, thus leading to distortions in the representation of geographical
distances. This study introduces four indicators to assess these distortions,
by comparing geographical and semantic distances. Experiments are conducted
with these four indicators on ten widely used language models. Results
underscore the critical necessity of inspecting and rectifying spatial biases
in language models to ensure accurate and equitable representations.
comment: Accepted version. Published in Machine Learning (Springer) 114:263
(2025). Open access under a CC BY-NC-ND 4.0 license. DOI:
10.1007/s10994-025-06916-9
♻ ☆ Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views NeurIPS 2025
We introduce Look and Tell, a multimodal dataset for studying referential
communication across egocentric and exocentric perspectives. Using Meta Project
Aria smart glasses and stationary cameras, we recorded synchronized gaze,
speech, and video as 25 participants instructed a partner to identify
ingredients in a kitchen. Combined with 3D scene reconstructions, this setup
provides a benchmark for evaluating how different spatial representations (2D
vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67
hours of recordings, including 2,707 richly annotated referential expressions,
and is designed to advance the development of embodied agents that can
understand and engage in situated dialogue.
comment: 10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop
on SPACE in Vision, Language, and Embodied AI (SpaVLE). Dataset:
https://huggingface.co/datasets/annadeichler/KTH-ARIA-referential
♻ ☆ LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang
Retrieval-Augmented Generation (RAG) is widely used to mitigate
hallucinations of Large Language Models (LLMs) by leveraging external
knowledge. While effective for simple queries, traditional RAG systems struggle
with large-scale, unstructured corpora where information is fragmented. Recent
advances incorporate knowledge graphs to capture relational structures,
enabling more comprehensive retrieval for complex, multi-hop reasoning tasks.
However, existing graph-based RAG (GraphRAG) methods rely on unstable and
costly relation extraction for graph construction, often producing noisy graphs
with incorrect or inconsistent relations that degrade retrieval quality. In
this paper, we revisit the pipeline of existing GraphRAG systems and propose
LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient
framework that enables reliable graph construction and precise passage
retrieval. Specifically, LinearRAG constructs a relation-free hierarchical
graph, termed Tri-Graph, using only lightweight entity extraction and semantic
linking, avoiding unstable relation modeling. This new paradigm of graph
construction scales linearly with corpus size and incurs no extra token
consumption, providing an economical and reliable indexing of the original
passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant
entity activation via local semantic bridging, followed by (ii) passage
retrieval through global importance aggregation. Extensive experiments on four
datasets demonstrate that LinearRAG significantly outperforms baseline models.
Our code and datasets are available at https://github.com/DEEP-PolyU/LinearRAG.
♻ ☆ MATCH: Task-Driven Code Evaluation through Contrastive Learning
AI-based code generation is increasingly prevalent, with GitHub Copilot
estimated to generate 46% of the code on GitHub. Accurately evaluating how well
generated code aligns with developer intent remains a critical challenge.
Traditional evaluation methods, such as unit tests, are often unscalable and
costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code
functionality, and metrics like CodeBERTScore require reference code, which is
not always available. To address the gap in reference-free evaluation, with few
alternatives such as ICE-Score, this paper introduces MATCH, a novel
reference-free metric. MATCH uses Contrastive Learning to generate meaningful
embeddings for code and natural language task descriptions, enabling similarity
scoring that reflects how well generated code implements the task. We show that
MATCH achieves stronger correlations with functional correctness and human
preference than existing metrics across multiple programming languages.
♻ ☆ GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training
Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a
Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought
(CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models
(VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling
between thoughts and answers, sparse reward signals caused by limited parallel
sampling, and unstable advantage estimation. To mitigate these challenges, we
propose GRPO-MA, a simple yet theoretically grounded method that leverages
multi-answer generation from each thought process, enabling more robust and
efficient optimization. Theoretically, we show that the variance of thought
advantage decreases as the number of answers per thought increases.
Empirically, our gradient analysis confirms this effect, showing that GRPO-MA
reduces gradient spikes compared to GRPO. Experiments on math, code, and
diverse multimodal tasks demonstrate that GRPO-MA substantially improves
performance and training efficiency. Our ablation studies further reveal that
increasing the number of answers per thought consistently enhances model
performance.
comment: Under review
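The core variance-reduction idea can be sketched in a few lines: average the M answer rewards of each thought before computing group-normalized advantages (for i.i.d. answer rewards, averaging reduces the variance of each thought's estimate roughly by a factor of M).

```python
import numpy as np

def thought_advantages(rewards):
    """GRPO-MA-style advantages: average each thought's answer rewards, then normalize across thoughts.

    rewards: array of shape [num_thoughts, answers_per_thought] with per-answer rewards.
    """
    thought_rewards = rewards.mean(axis=1)                      # one estimate per thought
    mean, std = thought_rewards.mean(), thought_rewards.std() + 1e-8
    return (thought_rewards - mean) / std                       # group-normalized advantages

rng = np.random.default_rng(0)
G, M = 8, 4                                                     # thoughts per prompt, answers per thought
rewards = rng.binomial(1, p=0.4, size=(G, M)).astype(float)     # e.g., 0/1 correctness rewards
print(thought_advantages(rewards))
```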
♻ ☆ Context-level Language Modeling by Learning Predictive Context Embeddings
Next-token prediction (NTP) is the cornerstone of modern large language
models (LLMs) pretraining, driving their unprecedented capabilities in text
generation, reasoning, and instruction following. However, the token-level
prediction limits the model's capacity to capture higher-level semantic
structures and long-range contextual relationships. To overcome this
limitation, we introduce \textbf{ContextLM}, a framework that augments standard
pretraining with an inherent \textbf{next-context prediction} objective. This
mechanism trains the model to learn predictive representations of multi-token
contexts, leveraging error signals derived from future token chunks. Crucially,
ContextLM achieves this enhancement while remaining fully compatible with the
standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity).
Extensive experiments on the GPT2 and Pythia model families, scaled up to
$1.5$B parameters, show that ContextLM delivers consistent improvements in both
perplexity and downstream task performance. Our analysis indicates that
next-context prediction provides a scalable and efficient pathway to stronger
language modeling, yielding better long-range coherence and more effective
attention allocation with minimal computational overhead.
comment: 16 pages, 6 figures
♻ ☆ Surface Reading LLMs: Synthetic Text and its Styles
Despite a potential plateau in ML advancement, the societal impact of large
language models lies not in approaching superintelligence but in generating
text surfaces indistinguishable from human writing. While Critical AI Studies
provides essential material and socio-technical critique, it risks overlooking
how LLMs phenomenologically reshape meaning-making. This paper proposes a
semiotics of "surface integrity" as attending to the immediate plane where LLMs
inscribe themselves into human communication. I distinguish three knowledge
interests in ML research (epistemology, epistēmē, and epistemics) and argue
for integrating surface-level stylistic analysis alongside depth-oriented
critique. Through two case studies examining stylistic markers of synthetic
text, I show how attending to style as a semiotic phenomenon reveals LLMs as
cultural actors that transform the conditions of meaning emergence and
circulation in contemporary discourse, independent of questions about machine
consciousness.
comment: 12 pages, 1 figure
♻ ☆ SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture ACL 2025
Language Models (LMs) are indispensable tools shaping modern workflows, but
their global effectiveness depends on understanding local socio-cultural
contexts. To address this, we introduce SANSKRITI, a benchmark designed to
evaluate language models' comprehension of India's rich cultural diversity.
Comprising 21,853 meticulously curated question-answer pairs spanning 28 states
and 8 union territories, SANSKRITI is the largest dataset for testing Indian
cultural knowledge. It covers sixteen key attributes of Indian culture: rituals
and ceremonies, history, tourism, cuisine, dance and music, costume, language,
art, festivals, religion, medicine, transport, sports, nightlife, and
personalities, providing a comprehensive representation of India's cultural
tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic
Language Models (ILMs), and Small Language Models (SLMs), revealing significant
disparities in their ability to handle culturally nuanced queries, with many
models struggling in region-specific contexts. By offering an extensive,
culturally rich, and diverse dataset, SANSKRITI sets a new standard for
assessing and improving the cultural understanding of LMs.
comment: ACL 2025 Findings
♻ ☆ MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards
ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim
Distilling the tool-using capabilities of large language models (LLMs) into
smaller, more efficient small language models (SLMs) is a key challenge for
their practical application. The predominant approach, supervised fine-tuning
(SFT), suffers from poor generalization as it trains models to imitate a static
set of teacher trajectories rather than learn a robust methodology. While
reinforcement learning (RL) offers an alternative, standard RL with sparse rewards fails to effectively guide SLMs, causing them to struggle with
inefficient exploration and adopt suboptimal strategies. To address these
distinct challenges, we propose MENTOR, a framework that synergistically
combines RL with teacher-guided distillation. Instead of simple imitation,
MENTOR employs an RL-based process to learn a more generalizable policy through
exploration. In addition, to solve the problem of reward sparsity, it uses a
teacher's reference trajectory to construct a dense, composite teacher-guided
reward that provides fine-grained guidance. Extensive experiments demonstrate
that MENTOR significantly improves the cross-domain generalization and
strategic competence of SLMs compared to both SFT and standard sparse-reward RL
baselines.
♻ ☆ TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration NeurIPS 2025
Trajectory modeling, which includes research on trajectory data pattern
mining and future prediction, has widespread applications in areas such as life
services, urban transportation, and public administration. Numerous methods
have been proposed to address specific problems within trajectory modeling.
However, the heterogeneity of data and the diversity of trajectory tasks make
effective and reliable trajectory modeling an important yet highly challenging
endeavor, even for domain experts. In this paper, we propose TrajAgent, an
agent framework powered by large language models, designed to facilitate robust
and efficient trajectory modeling through automated modeling. This framework
leverages and optimizes diverse specialized models to address various
trajectory modeling tasks across different datasets effectively. In TrajAgent,
we first develop UniEnv, an execution environment with a unified data and model
interface, to support the execution and training of various models. Building on
UniEnv, we introduce an agentic workflow designed for automatic trajectory
modeling across various trajectory tasks and data. Furthermore, we introduce
a collaborative learning schema between LLM-based agents and small specialized models to enhance the performance of the whole framework effectively.
Extensive experiments on five tasks using four real-world datasets demonstrate
the effectiveness of TrajAgent in automated trajectory modeling, achieving a
performance improvement of 2.38%-69.91% over baseline methods. The codes and
data can be accessed via https://github.com/tsinghua-fib-lab/TrajAgent.
comment: Accepted by NeurIPS 2025,
https://github.com/tsinghua-fib-lab/TrajAgent
♻ ☆ The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability
An effective physician should possess a combination of empathy, expertise,
patience, and clear communication when treating a patient. Recent advances have
successfully endowed AI doctors with expert diagnostic skills, particularly the
ability to actively seek information through inquiry. However, other essential
qualities of a good doctor remain overlooked. To bridge this gap, we present
MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the
automatic and comprehensive evaluation of medical multi-turn questioning. It
features 3,000 realistically simulated patient agents that exhibit diverse
linguistic patterns, cognitive limitations, emotional responses, and tendencies
for passive disclosure. We also introduce a multi-faceted evaluation framework,
covering task success, inquiry proficiency, dialogue competence, inquiry
efficiency, and patient experience. Experiments on different LLMs reveal
substantial challenges across the evaluation aspects. Even state-of-the-art
models show significant room for improvement in their inquiry capabilities.
These models are highly sensitive to variations in realistic patient behavior,
which considerably impacts diagnostic accuracy. Furthermore, our fine-grained
metrics expose trade-offs between different evaluation perspectives,
highlighting the challenge of balancing performance and practicality in
real-world clinical settings.
♻ ☆ FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
While large language models (LLMs) excel at handling long-context sequences,
they require substantial prefill computation and key-value (KV) cache, which
can heavily burden computational efficiency and memory usage in both prefill
and decoding stages. Recent works that compress KV caches with prefill
acceleration reduce this cost but inadvertently tie the prefill compute
reduction to the decoding KV budget. This coupling arises from overlooking the
layer-dependent variation of critical context, often leading to accuracy
degradation. To address this issue, we introduce FastKV, a KV cache compression
framework designed to reduce latency in both prefill and decoding by leveraging
the stabilization of token importance in later layers. FastKV performs
full-context computation until a Token-Selective Propagation (TSP) layer, which
forwards only the most informative tokens to subsequent layers. From these
propagated tokens, FastKV independently selects salient KV entries for caching,
thereby decoupling KV budget from the prefill compute reduction based on the
TSP decision. This independent control of the TSP rate and KV retention rate
enables flexible optimization of efficiency and accuracy. Experimental results
show that FastKV achieves speedups of up to 1.82$\times$ in prefill and
2.87$\times$ in decoding compared to the full-context baseline, while matching
the accuracy of the baselines that only accelerate the decoding stage. Our code
is available at https://github.com/dongwonjo/FastKV.
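A simplified sketch of the two independent selections, which tokens to propagate past the TSP layer and which KV entries to cache, using summed attention as the importance score (FastKV's actual scoring and selection details may differ).

```python
import numpy as np

def tsp_select(attn_weights, tsp_keep_ratio=0.25):
    """Pick the most informative token positions at the TSP layer.

    attn_weights: [heads, q_len, k_len] attention of the TSP layer; tokens that receive
    the most attention (summed over heads and queries) are forwarded to later layers.
    """
    importance = attn_weights.sum(axis=(0, 1))                  # [k_len] per-token score
    k = max(1, int(len(importance) * tsp_keep_ratio))
    return np.sort(np.argsort(importance)[-k:])                 # keep original token order

def select_kv_budget(importance, kv_keep_ratio=0.1):
    """Independently choose which KV entries to cache, decoupled from the TSP rate."""
    k = max(1, int(len(importance) * kv_keep_ratio))
    return np.sort(np.argsort(importance)[-k:])

rng = np.random.default_rng(0)
attn = rng.random(size=(8, 1, 1024))        # e.g., attention from the last query position
kept_tokens = tsp_select(attn, tsp_keep_ratio=0.25)
kept_kv = select_kv_budget(attn.sum(axis=(0, 1)), kv_keep_ratio=0.1)
print(len(kept_tokens), len(kept_kv))
```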
♻ ☆ Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay NeurIPS 2025
Reinforcement learning (RL) has become an effective approach for fine-tuning
large language models (LLMs), particularly to enhance their reasoning
capabilities. However, RL fine-tuning remains highly resource-intensive, and
existing work has largely overlooked the problem of data efficiency. In this
paper, we propose two techniques to improve data efficiency in LLM RL
fine-tuning: difficulty-targeted online data selection and rollout replay. We
introduce the notion of adaptive difficulty to guide online data selection,
prioritizing questions of moderate difficulty that are more likely to yield
informative learning signals. To estimate adaptive difficulty efficiently, we
develop an attention-based framework that requires rollouts for only a small
reference set of questions. The adaptive difficulty of the remaining questions
is then estimated based on their similarity to this set. To further reduce
rollout cost, we introduce a rollout replay mechanism inspired by experience
replay in traditional RL. This technique reuses recent rollouts, lowering
per-step computation while maintaining stable updates. Experiments across 6
LLM-dataset combinations show that our method reduces RL fine-tuning time by
23% to 62% while reaching the same level of performance as the original GRPO
algorithm. Our code is available at
https://github.com/ASTRAL-Group/data-efficient-llm-rl.
comment: Accepted at NeurIPS 2025
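A minimal sketch of the two ingredients, difficulty-targeted selection (prefer questions whose estimated pass rate is near 0.5) and a rollout replay buffer; the attention-based difficulty estimator itself is not reproduced here, and the pass rates below are placeholders.

```python
import random
from collections import deque

def select_moderate_difficulty(questions, est_pass_rate, batch_size, target=0.5):
    """Prefer questions whose estimated pass rate is closest to `target` (moderate difficulty)."""
    ranked = sorted(questions, key=lambda q: abs(est_pass_rate[q] - target))
    return ranked[:batch_size]

class RolloutReplay:
    """Reuse recent rollouts to cut per-step generation cost, as in experience replay."""
    def __init__(self, capacity=4096):
        self.buffer = deque(maxlen=capacity)
    def add(self, rollouts):
        self.buffer.extend(rollouts)
    def sample(self, k, rng=random):
        return rng.sample(list(self.buffer), min(k, len(self.buffer)))

# toy usage: in the paper, pass rates come from an attention-based similarity estimator
questions = [f"q{i}" for i in range(10)]
est = {q: i / 10 for i, q in enumerate(questions)}
batch = select_moderate_difficulty(questions, est, batch_size=4)
replay = RolloutReplay()
replay.add([(q, f"rollout for {q}") for q in batch])
print(batch, replay.sample(2, rng=random.Random(0)))
```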
♻ ☆ ReCode: Unify Plan and Action for Universal Granularity Control
Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yingchao Li, Yuyu Luo, Bang Liu, Chenglin Wu
Real-world tasks require decisions at varying granularities, and humans excel
at this by leveraging a unified cognitive representation where planning is
fundamentally understood as a high-level form of action. However, current Large
Language Model (LLM)-based agents lack this crucial capability to operate
fluidly across decision granularities. This limitation stems from existing
paradigms that enforce a rigid separation between high-level planning and
low-level action, which impairs dynamic adaptability and limits generalization.
We propose ReCode (Recursive Code Generation), a novel paradigm that addresses
this limitation by unifying planning and action within a single code
representation. In this representation, ReCode treats high-level plans as
abstract placeholder functions, which the agent then recursively decomposes
into finer-grained sub-functions until reaching primitive actions. This
recursive approach dissolves the rigid boundary between plan and action,
enabling the agent to dynamically control its decision granularity.
Furthermore, the recursive structure inherently generates rich,
multi-granularity training data, enabling models to learn hierarchical
decision-making processes. Extensive experiments show ReCode significantly
surpasses advanced baselines in inference performance and demonstrates
exceptional data efficiency in training, validating our core insight that
unifying planning and action through recursive code generation is a powerful
and effective approach to achieving universal granularity control. The code is
available at https://github.com/FoundationAgents/ReCode.
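A toy sketch of the recursive expansion loop, with `llm_decompose` standing in for the LLM and string calls standing in for generated code; the real system emits and executes Python rather than strings.

```python
PRIMITIVES = {"move_to", "grasp", "release"}

def recursive_expand(call, llm_decompose, depth=0, max_depth=5):
    """Expand a high-level plan step into primitive actions by recursive decomposition.

    llm_decompose(call) is a placeholder for the LLM: given a non-primitive call such as
    "make_coffee()", it returns a list of finer-grained sub-calls. Placeholder functions
    are expanded until only primitive actions remain.
    """
    name = call.split("(")[0]
    if name in PRIMITIVES or depth >= max_depth:
        return [call]
    actions = []
    for sub_call in llm_decompose(call):
        actions.extend(recursive_expand(sub_call, llm_decompose, depth + 1, max_depth))
    return actions

# toy decomposer standing in for the LLM
plan = {
    "make_coffee()": ["fetch_cup()", "pour_coffee()"],
    "fetch_cup()": ["move_to('cupboard')", "grasp('cup')"],
    "pour_coffee()": ["move_to('machine')", "release('cup')"],
}
print(recursive_expand("make_coffee()", lambda c: plan[c]))
```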
♻ ☆ AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation EMNLP 2025
Prompting-based conversational query reformulation has emerged as a powerful
approach for conversational search, refining ambiguous user queries into
standalone search queries. Best-of-N reformulation over prompted candidates
shows impressive potential for scaling. However, neither previous tuning
methods (training time) nor adaptation approaches (test time) can fully
unleash its benefits. In this paper, we
propose AdaRewriter, a novel framework for query reformulation using an
outcome-supervised reward model via test-time adaptation. By training a
lightweight reward model with contrastive ranking loss, AdaRewriter selects the
most promising reformulation during inference. Notably, it can operate
effectively in black-box systems, including commercial LLM APIs. Experiments on
five conversational search datasets show that AdaRewriter significantly
outperforms the existing methods across most settings, demonstrating the
potential of test-time adaptation for conversational query reformulation.
comment: Accepted by EMNLP 2025
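To make the test-time adaptation recipe concrete, here is a minimal,
hypothetical Python sketch of outcome-supervised best-of-N selection. The
generate_rewrites and reward_model callables stand in for a (possibly
black-box) LLM API and the lightweight scorer, and the pairwise loss is a
generic contrastive ranking objective rather than AdaRewriter's exact
formulation:

    import torch
    import torch.nn.functional as F

    def pairwise_ranking_loss(score_better, score_worse):
        """Reward model training signal: candidates with better downstream
        retrieval outcomes should receive higher scores."""
        return -F.logsigmoid(score_better - score_worse).mean()

    def select_reformulation(conversation, generate_rewrites, reward_model, n=8):
        """Sample N candidate reformulations and keep the highest-scoring one."""
        candidates = generate_rewrites(conversation, n=n)   # e.g., prompting an API
        with torch.no_grad():
            scores = torch.tensor([float(reward_model(conversation, c))
                                   for c in candidates])
        return candidates[int(scores.argmax())]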
♻ ☆ MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Kailin Jiang, Ning Jiang, Yuntao Du, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, Yunpu Ma, Qingqing Liu, Xianhao Wang, Yifan Jia, Hongbo Jiang, Yaocong Hu, Bin Li, Lei Liu
Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal
pre-training, yet their static representations struggle to maintain an accurate
understanding of time-sensitive factual knowledge. Existing benchmarks remain
constrained by static designs, inadequately evaluating LMMs' ability to
understand time-sensitive knowledge. To address this gap, we propose MINED, a
comprehensive benchmark that evaluates temporal awareness along six key
dimensions (cognition, awareness, trustworthiness, understanding, reasoning,
and robustness) across 11 challenging tasks. MINED is constructed from Wikipedia
by two professional annotators, containing 2,104 time-sensitive knowledge
samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED
shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07,
while most open-source LMMs still lack time understanding ability. Meanwhile,
LMMs perform best on organization knowledge, whereas their performance is
weakest on sports knowledge. To address these challenges, we investigate the
feasibility
of updating time-sensitive knowledge in LMMs through knowledge editing methods
and observe that LMMs can effectively update knowledge via knowledge editing
methods in single editing scenarios.
comment: project page: https://mined-lmm.github.io/
♻ ☆ Navigation with VLM framework: Towards Going to Any Language
Navigating towards fully open language goals and exploring open scenes in an
intelligent way have always raised significant challenges. Recently, Vision
Language Models (VLMs) have demonstrated remarkable capabilities to reason with
both language and visual data. Although many works have focused on leveraging
VLMs for navigation in open scenes, they often require high computational cost,
rely on object-centric approaches, or depend on environmental priors in
detailed human instructions. We introduce Navigation with VLM (NavVLM), a
training-free framework that harnesses open-source VLMs to enable robots to
navigate effectively, even toward human-friendly language goals such as
abstract places, actions, or specific objects in open scenes. NavVLM uses the
VLM as its cognitive core to perceive environmental information and to
continuously provide exploration guidance, achieving intelligent navigation
from only a concise target rather than a detailed instruction with
environmental priors. We evaluated and validated NavVLM in both simulation and
real-world experiments. In simulation, our framework achieves state-of-the-art
Success weighted by Path Length (SPL) on object-specific tasks in richly
detailed environments from Matterport 3D (MP3D), Habitat Matterport 3D (HM3D),
and Gibson. The reported navigation episodes demonstrate NavVLM's ability to
navigate toward arbitrary open-set language goals. In real-world validation,
we confirmed the framework's effectiveness on a physical robot in indoor
scenes.
comment: under review
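A training-free loop of this kind can be pictured as repeatedly asking the VLM
for exploration guidance. The sketch below assumes hypothetical robot and
vlm_query interfaces and a dictionary-style VLM response; it is not the NavVLM
codebase:

    def navigate(goal, robot, vlm_query, max_steps=500):
        """Show each observation to the VLM and follow its suggested direction
        until it judges the open-language goal to be reached."""
        for _ in range(max_steps):
            obs = robot.get_rgbd()                      # current camera observation
            answer = vlm_query(
                image=obs,
                prompt=f"Goal: {goal}. Is the goal reached? "
                       f"If not, which direction should I explore next?")
            if answer.get("reached"):
                return True
            robot.step(answer.get("direction", "forward"))  # low-level controller
        return False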
♻ ☆ Discourse Features Enhance Detection of Document-Level Machine-Generated Content IJCNN 2025
The availability of high-quality APIs for Large Language Models (LLMs) has
facilitated the widespread creation of Machine-Generated Content (MGC), posing
challenges such as academic plagiarism and the spread of misinformation.
Existing MGC detectors often focus solely on surface-level information,
overlooking implicit and structural features. This makes them susceptible to
deception by surface-level sentence patterns, particularly for longer texts and
in texts that have been subsequently paraphrased. To overcome these challenges,
we introduce novel methodologies and datasets. Besides the publicly available
dataset Plagbench, we developed the paraphrased Long-Form Question and Answer
(paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and
DIPPER, a discourse paraphrasing tool, by extending artifacts from their
original versions. To better capture the structure of longer texts at document
level, we propose DTransformer, a model that integrates discourse analysis
through PDTB preprocessing to encode structural features. It results in
substantial performance gains across both datasets - 15.5% absolute improvement
on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvement
on M4 compared to SOTA approaches. The data and code are available at:
https://github.com/myxp-lyp/Discourse-Features-Enhance-Detection-of-Document-Level-Machine-Generated-Content.git.
comment: Accepted by IJCNN 2025
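For intuition, a detector that fuses document-level discourse-relation
features with a text encoder might look like the sketch below. The relation
vocabulary size, pooling, and layer sizes are illustrative assumptions, not
the paper's DTransformer architecture:

    import torch
    import torch.nn as nn

    class DiscourseAwareDetector(nn.Module):
        def __init__(self, text_dim=768, num_relations=16, hidden=256):
            super().__init__()
            self.rel_embed = nn.Embedding(num_relations, 32)  # PDTB-style relations
            self.classifier = nn.Sequential(
                nn.Linear(text_dim + 32, hidden), nn.ReLU(), nn.Linear(hidden, 2))

        def forward(self, text_emb, relation_ids):
            # pool the discourse-relation sequence into one structural feature vector
            rel_feat = self.rel_embed(relation_ids).mean(dim=1)
            return self.classifier(torch.cat([text_emb, rel_feat], dim=-1))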
♻ ☆ BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Richard Wyss, Rishi J Desai, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Sebastian Schneeweiss, Jonathan H. Chen, Santiago Romero-Brufau, Kueiyu Joshua Lin, Jie Yang
Large language models (LLMs) hold great promise for medical applications and
are evolving rapidly, with new models being released at an accelerated pace.
However, benchmarking on large-scale real-world data such as electronic health
records (EHRs) is critical, as clinical decisions are directly informed by
these sources, yet current evaluations remain limited. Most existing benchmarks
rely on medical exam-style questions or PubMed-derived text, failing to capture
the complexity of real-world clinical data. Others focus narrowly on specific
application scenarios, limiting their generalizability across broader clinical
use. To address this gap, we present BRIDGE, a comprehensive multilingual
benchmark comprising 87 tasks sourced from real-world clinical data sources
across nine languages. It covers eight major task types spanning the entire
continuum of patient care across six clinical stages and 20 representative
applications, including triage and referral, consultation, information
extraction, diagnosis, prognosis, and billing coding, and involves 14 clinical
specialties. We systematically evaluated 95 LLMs (including DeepSeek-R1,
GPT-4o, Gemini series, and Qwen3 series) under various inference strategies.
Our results reveal substantial performance variation across model sizes,
languages, natural language processing tasks, and clinical specialties.
Notably, we demonstrate that open-source LLMs can achieve performance
comparable to proprietary models, while medically fine-tuned LLMs based on
older architectures often underperform compared with newer general-purpose
models. BRIDGE and its corresponding leaderboard serve as a foundational resource
and a unique reference for the development and evaluation of new LLMs in
real-world clinical text understanding.
The BRIDGE leaderboard:
https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
♻ ☆ SEER: The Span-based Emotion Evidence Retrieval Benchmark
We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to
test Large Language Models' (LLMs) ability to identify the specific spans of
text that express emotion. Unlike traditional emotion recognition tasks that
assign a single label to an entire sentence, SEER targets the underexplored
task of emotion evidence detection: pinpointing which exact phrases convey
emotion. This span-level approach is crucial for applications like empathetic
dialogue and clinical support, which need to know how emotion is expressed, not
just what the emotion is. SEER includes two tasks: identifying emotion evidence
within a single sentence, and identifying evidence across a short passage of
five consecutive sentences. It contains new annotations for both emotion and
emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs
and find that, while some models approach average human performance on
single-sentence inputs, their accuracy degrades in longer passages. Our error
analysis reveals key failure modes, including overreliance on emotion keywords
and false positives in neutral text.
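The abstract does not state SEER's exact scoring metric; span-level tasks like
this are, however, commonly scored with character-overlap F1, and the generic
snippet below is shown only as an illustration of that kind of metric:

    def span_f1(pred_spans, gold_spans):
        """pred_spans / gold_spans: iterables of (start, end) character offsets."""
        pred_chars = {i for s, e in pred_spans for i in range(s, e)}
        gold_chars = {i for s, e in gold_spans for i in range(s, e)}
        if not pred_chars or not gold_chars:
            return float(pred_chars == gold_chars)   # both empty counts as a match
        overlap = len(pred_chars & gold_chars)
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_chars)
        recall = overlap / len(gold_chars)
        return 2 * precision * recall / (precision + recall)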
♻ ☆ PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings ACL 2025
Visual persuasion, which uses visual elements to influence cognition and
behaviors, is crucial in fields such as advertising and political
communication. With recent advancements in artificial intelligence, there is
growing potential to develop persuasive systems that automatically generate
persuasive images tailored to individuals. However, a significant bottleneck in
this area is the lack of comprehensive datasets that connect the persuasiveness
of images with the personal information about those who evaluated the images.
To address this gap and facilitate technological advancements in personalized
visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset,
comprising 28,454 persuasive images across 596 messages and 9 persuasion
strategies. Importantly, the PVP dataset provides persuasiveness scores of
images evaluated by 2,521 human annotators, along with their demographic and
psychological characteristics (personality traits and values). We demonstrate
the utility of our dataset by developing a persuasive image generator and an
automated evaluator, and establish benchmark baselines. Our experiments reveal
that incorporating psychological characteristics enhances the generation and
evaluation of persuasive images, providing valuable insights for personalized
visual persuasion.
comment: ACL 2025 Main. Code and dataset are released at:
https://github.com/holi-lab/PVP_Personalized_Visual_Persuasion