Computation and Language
☆ BitNet b1.58 2B4T Technical Report
We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large
Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4
trillion tokens, the model has been rigorously evaluated across benchmarks
covering language understanding, mathematical reasoning, coding proficiency,
and conversational ability. Our results demonstrate that BitNet b1.58 2B4T
achieves performance on par with leading open-weight, full-precision LLMs of
similar size, while offering significant advantages in computational
efficiency, including substantially reduced memory footprint, energy
consumption, and decoding latency. To facilitate further research and adoption,
the model weights are released via Hugging Face along with open-source
inference implementations for both GPU and CPU architectures.
comment: Work in progress
☆ Dysarthria Normalization via Local Lie Group Transformations for Robust ASR
We present a geometry-driven method for normalizing dysarthric speech using
local Lie group transformations of spectrograms. Time, frequency, and amplitude
distortions are modeled as smooth, invertible deformations, parameterized by
scalar fields and applied via exponential maps. A neural network is trained to
infer these fields from synthetic distortions of typical speech, without using
any pathological data. At test time, the model applies an approximate inverse
to real dysarthric inputs. Despite zero-shot generalization, we observe
substantial ASR gains, including up to 16 percentage points WER reduction on
challenging TORGO samples, with no degradation on clean speech. This work
introduces a principled, interpretable approach for robust speech recognition
under motor speech disorders.
comment: Preprint. 11 pages, 3 figures, 2 tables, 8 appendices. Code and data
available upon request
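A minimal numpy sketch of the core idea, assuming a time-axis deformation parameterized by a smooth scalar field; the field shape, smoothing, and inversion-by-negation are illustrative assumptions, not the authors' exact parameterization:

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def warp_time(spec: np.ndarray, field: np.ndarray) -> np.ndarray:
    """Displace each (freq, time) bin along the time axis by `field`."""
    f_bins, t_bins = spec.shape
    freq_idx, time_idx = np.mgrid[0:f_bins, 0:t_bins].astype(float)
    coords = np.stack([freq_idx, time_idx + field])  # warped sample positions
    return map_coordinates(spec, coords, order=1, mode="nearest")

rng = np.random.default_rng(0)
spec = rng.random((80, 200))  # stand-in for a log-mel spectrogram
field = gaussian_filter(rng.normal(0, 3, spec.shape), sigma=10)  # smooth field

distorted = warp_time(spec, field)       # synthetic distortion for training
restored = warp_time(distorted, -field)  # approximate inverse (valid for small fields)
print(np.abs(restored - spec).mean())    # residual of the approximate inversion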
☆ Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning
Mahmoud Salhab, Marwan Elghitany, Shameed Sait, Syed Sibghat Ullah, Mohammad Abusheikh, Hasan Abusheikh
Automatic speech recognition (ASR) is crucial for human-machine interaction
in diverse applications like conversational agents, industrial robotics, call
center automation, and automated subtitling. However, developing
high-performance ASR models remains challenging, particularly for low-resource
languages like Arabic, due to the scarcity of large, labeled speech datasets,
which are costly and labor-intensive to produce. In this work, we employ weakly
supervised learning to train an Arabic ASR model using the Conformer
architecture. Our model is trained from scratch on 15,000 hours of weakly
annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal
Arabic (DA), eliminating the need for costly manual transcriptions. Despite the
absence of human-verified labels, our approach attains state-of-the-art (SOTA)
performance, exceeding all previous efforts in the field of Arabic ASR on the
standard benchmarks. These results demonstrate the effectiveness of weak
supervision as a scalable, cost-efficient alternative to traditional supervised
approaches, paving the way for improved ASR systems in low-resource settings.
☆ Watermarking Needs Input Repetition Masking
Recent advancements in Large Language Models (LLMs) raised concerns over
potential misuse, such as spreading misinformation. In response, two
countermeasures emerged: machine learning-based detectors that predict whether text is
synthetic, and LLM watermarking, which subtly marks generated text for
identification and attribution. Meanwhile, humans are known to adjust language
to their conversational partners both syntactically and lexically. By
implication, it is possible that humans or unwatermarked LLMs could
unintentionally mimic properties of LLM-generated text, making countermeasures
unreliable. In this work we investigate the extent to which such conversational
adaptation happens. We call the concept $\textit{mimicry}$ and demonstrate that
both humans and LLMs end up mimicking, including the watermarking signal, even
in seemingly improbable settings. This challenges current academic assumptions
and suggests that for long-term watermarking to be reliable, the likelihood of
false positives needs to be significantly lower, while longer word sequences
should be used for seeding watermarking mechanisms.
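To make the suggested remedy concrete, here is an illustrative green-list detector (not the paper's code) in which the green set at each position is seeded by the previous k tokens; the abstract's recommendation corresponds to raising k and tightening the z-score threshold so that accidental mimicry rarely triggers a false positive:

import hashlib
import math

GREEN_FRACTION = 0.5

def is_green(context: tuple, token: int) -> bool:
    """Deterministically assign `token` to the green list, seeded by context."""
    seed = hashlib.sha256(repr(context).encode()).digest()
    h = hashlib.sha256(seed + token.to_bytes(4, "big")).digest()
    return int.from_bytes(h[:4], "big") / 2**32 < GREEN_FRACTION

def z_score(tokens: list, k: int) -> float:
    """Standardized green-token count; higher means 'looks watermarked'."""
    n = len(tokens) - k
    hits = sum(is_green(tuple(tokens[i - k:i]), tokens[i])
               for i in range(k, len(tokens)))
    mu = n * GREEN_FRACTION
    sigma = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - mu) / sigma

# A longer seeding window makes it less likely that ordinary (human or
# unwatermarked) text lands in the green list by imitation or chance.
sample = list(range(100))
print(z_score(sample, k=1), z_score(sample, k=4))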
☆ d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Recent large language models (LLMs) have demonstrated strong reasoning
capabilities that benefit from online reinforcement learning (RL). These
capabilities have primarily been demonstrated within the left-to-right
autoregressive (AR) generation paradigm. In contrast, non-autoregressive
paradigms based on diffusion generate text in a coarse-to-fine manner. Although
recent diffusion-based large language models (dLLMs) have achieved competitive
language modeling performance compared to their AR counterparts, it remains
unclear if dLLMs can also leverage recent advances in LLM reasoning. To this
end, we propose d1, a framework to adapt pre-trained masked dLLMs into
reasoning models via a combination of supervised finetuning (SFT) and RL.
Specifically, we develop and extend techniques to improve reasoning in
pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge
and instill self-improvement behavior directly from existing datasets, and (b)
we introduce a novel critic-free, policy-gradient based RL algorithm called
diffu-GRPO. Through empirical studies, we investigate the performance of
different post-training recipes on multiple mathematical and logical reasoning
benchmarks. We find that d1 yields the best performance and significantly
improves performance of a state-of-the-art dLLM.
comment: 25 pages, project page at https://dllm-reasoning.github.io/
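The abstract does not spell out diffu-GRPO, but GRPO-style methods share a critic-free, group-relative advantage; a minimal sketch of that shared core follows, with the dLLM-specific sequence log-likelihood estimation abstracted behind a placeholder tensor:

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: normalize rewards across samples that
    share the same prompt, so no learned critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def policy_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate with the group mean as baseline."""
    adv = grpo_advantages(rewards).detach()
    return -(adv * seq_logprobs).mean()

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])       # e.g., answer correctness
seq_logprobs = torch.randn(4, requires_grad=True)  # placeholder for the dLLM's
                                                   # estimated sequence log-probs
loss = policy_loss(seq_logprobs, rewards)
loss.backward()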
☆ What Do Large Language Models Know? Tacit Knowledge as a Potential Causal-Explanatory Structure
It is sometimes assumed that Large Language Models (LLMs) know language, or
for example that they know that Paris is the capital of France. But what -- if
anything -- do LLMs actually know? In this paper, I argue that LLMs can acquire
tacit knowledge as defined by Martin Davies (1990). Whereas Davies himself
denies that neural networks can acquire tacit knowledge, I demonstrate that
certain architectural features of LLMs satisfy the constraints of semantic
description, syntactic structure, and causal systematicity. Thus, tacit
knowledge may serve as a conceptual framework for describing, explaining, and
intervening on LLMs and their behavior.
comment: Accepted for publication in Philosophy of Science
☆ SALAD: Improving Robustness and Generalization through Contrastive Learning with Structure-Aware and LLM-Driven Augmented Data NAACL 2025
In various natural language processing (NLP) tasks, fine-tuning Pre-trained
Language Models (PLMs) often leads to the issue of spurious correlations, which
negatively impacts performance, particularly when dealing with
out-of-distribution data. To address this problem, we propose SALAD
(Structure-Aware and LLM-Driven Augmented Data), a novel approach designed to enhance
model robustness and generalization by generating structure-aware and
counterfactually augmented data for contrastive learning. Our method leverages
a tagging-based approach to generate structure-aware positive samples and
utilizes large language models (LLMs) to generate counterfactual negative
samples with diverse sentence patterns. By applying contrastive learning, SALAD
enables the model to focus on learning the structural relationships between key
sentence components while minimizing reliance on spurious correlations. We
validate our approach through experiments on three tasks: Sentiment
Classification, Sexism Detection, and Natural Language Inference. The results
demonstrate that SALAD not only improves model robustness and performance
across different environments but also enhances generalization to
out-of-distribution datasets and cross-domain scenarios.
comment: Accepted to NAACL 2025 main. 15 pages, 4 figures
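A schematic InfoNCE-style objective in the spirit of SALAD, with placeholder embeddings standing in for the structure-aware positive and the LLM-generated counterfactual negatives; the paper's exact loss may differ:

import torch
import torch.nn.functional as F

def salad_loss(anchor, positive, negatives, tau: float = 0.1):
    """anchor/positive: (d,); negatives: (n, d). InfoNCE with the
    structure-aware positive at index 0."""
    anchor = F.normalize(anchor, dim=-1)
    cands = F.normalize(torch.cat([positive.unsqueeze(0), negatives]), dim=-1)
    logits = cands @ anchor / tau          # similarity of each candidate to anchor
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

d = 128
anchor = torch.randn(d, requires_grad=True)  # encoded input sentence
positive = torch.randn(d)      # structure-aware augmentation of the anchor
negatives = torch.randn(4, d)  # LLM-generated counterfactual negatives
salad_loss(anchor, positive, negatives).backward()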
☆ Trusting CHATGPT: how minor tweaks in the prompts lead to major differences in sentiment classification
Jaime E. Cuellar, Oscar Moreno-Martinez, Paula Sofia Torres-Rodriguez, Jaime Andres Pavlich-Mariscal, Andres Felipe Mican-Castiblanco, Juan Guillermo Torres-Hurtado
One fundamental question for the social sciences today is: how much can we
trust highly complex predictive models like ChatGPT? This study tests the
hypothesis that subtle changes in the structure of prompts do not produce
significant variations in the classification results of sentiment polarity
analysis generated by the Large Language Model GPT-4o mini. Using a dataset of
100,000 comments in Spanish about four Latin American presidents, the model
classified the comments as positive, negative, or neutral on 10 occasions,
varying the prompts slightly each time. The experimental methodology included
exploratory and confirmatory analyses to identify significant discrepancies
among classifications.
The results reveal that even minor modifications to prompts, such as lexical,
syntactic, or modal changes, or even a lack of structure, impact the
classifications. In certain cases, the model produced inconsistent responses,
such as mixing categories, providing unsolicited explanations, or using
languages other than Spanish. Statistical analysis using Chi-square tests
confirmed significant differences in most comparisons between prompts, except
in one case where linguistic structures were highly similar.
These findings challenge the robustness and trust of Large Language Models
for classification tasks, highlighting their vulnerability to variations in
instructions. Moreover, it was evident that the lack of structured grammar in
prompts increases the frequency of hallucinations. The discussion underscores
that trust in Large Language Models is based not only on technical performance
but also on the social and institutional relationships underpinning their use.
comment: in Spanish language
☆ Mapping Controversies Using Artificial Intelligence: An Analysis of the Hamas-Israel Conflict on YouTube
This article analyzes the Hamas-Israel controversy through 253,925
Spanish-language YouTube comments posted between October 2023 and January 2024,
following the October 7 attack that escalated the conflict. Adopting an
interdisciplinary approach, the study combines the analysis of controversies
from Science and Technology Studies (STS) with advanced computational
methodologies, specifically Natural Language Processing (NLP) using the BERT
(Bidirectional Encoder Representations from Transformers) model. Using this
approach, the comments were automatically classified into seven categories,
reflecting pro-Palestinian, pro-Israeli, anti-Palestinian, and anti-Israeli
positions, among others. The results show a predominance of pro-Palestinian
comments, although pro-Israeli and anti-Palestinian comments received more
"likes." This study also applies the agenda-setting theory to demonstrate how
media coverage significantly influences public perception, observing a notable
shift in public opinion, transitioning from a pro-Palestinian stance to a more
critical position towards Israel. This work highlights the importance of
combining social science perspectives with technological tools in the analysis
of controversies, presenting a methodological innovation by integrating
computational analysis with critical social theories to address complex public
opinion phenomena and media narratives.
comment: in Spanish language
☆ Poem Meter Classification of Recited Arabic Poetry: Integrating High-Resource Systems for a Low-Resource Task
Arabic poetry is an essential and integral part of Arabic language and
culture. It has been used by the Arabs to spotlight their major events,
such as depicting brutal battles and conflicts. They also used it, as in many
other languages, for various purposes such as romance, pride, lamentation, etc.
Arabic poetry has received major attention from linguists over the decades.
One of the main characteristics of Arabic poetry is its special rhythmic
structure as opposed to prose. This structure is referred to as a meter.
Meters, along with other poetic characteristics, are intensively studied in an
Arabic linguistic field called "\textit{Aroud}". Identifying these meters for a
verse is a lengthy and complicated process. It also requires technical
knowledge in \textit{Aroud}. For recited poetry, it adds an extra layer of
processing. Developing systems for the automatic identification of meters in
recited poems requires large amounts of labelled data. In this study, we propose a
state-of-the-art framework to identify the poem meters of recited Arabic
poetry, where we integrate two separate high-resource systems to perform the
low-resource task. To ensure generalization of our proposed architecture, we
publish a benchmark for this task for future research.
☆ Multilingual Contextualization of Large Language Models for Document-Level Machine Translation
Large language models (LLMs) have demonstrated strong performance in
sentence-level machine translation, but scaling to document-level translation
remains challenging, particularly in modeling long-range dependencies and
discourse phenomena across sentences and paragraphs. In this work, we propose a
method to improve LLM-based long-document translation through targeted
fine-tuning on high-quality document-level data, which we curate and introduce
as DocBlocks. Our approach supports multiple translation paradigms, including
direct document-to-document and chunk-level translation, by integrating
instructions both with and without surrounding context. This enables models to
better capture cross-sentence dependencies while maintaining strong
sentence-level translation performance. Experimental results show that
incorporating multiple translation paradigms improves document-level
translation quality and inference speed compared to prompting and agent-based
methods.
comment: 9 pages, work-in-progress
☆ Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models -
Despite recent advances in Large Vision Language Models (LVLMs), these models
still suffer from generating hallucinatory responses that do not align with the
visual input provided. To mitigate such hallucinations, we introduce Efficient
Contrastive Decoding (ECD), a simple method that leverages probabilistic
hallucination detection to shift the output distribution towards contextually
accurate answers at inference time. By contrasting token probabilities and
hallucination scores, ECD subtracts hallucinated concepts from the original
distribution, effectively suppressing hallucinations. Notably, our proposed
method can be applied to any open-source LVLM and does not require additional
LVLM training. We evaluate our method on several benchmark datasets and across
different LVLMs. Our experiments show that ECD effectively mitigates
hallucinations, outperforming state-of-the-art methods with respect to
performance on LVLM benchmarks and computation time.
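A schematic of the decoding-time adjustment, assuming a per-token hallucination score from the paper's probabilistic detector (here a placeholder): flagged tokens are demoted before sampling, with no retraining.

import torch

def ecd_adjust(logprobs: torch.Tensor, hallu_scores: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Subtract weighted hallucination scores, then renormalize."""
    return torch.log_softmax(logprobs - alpha * hallu_scores, dim=-1)

vocab = 8
logprobs = torch.log_softmax(torch.randn(vocab), dim=-1)  # LVLM next-token dist.
hallu_scores = torch.rand(vocab)   # placeholder detector output (higher = worse)
next_token = torch.argmax(ecd_adjust(logprobs, hallu_scores))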
☆ Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation
The rapid development of Large Language Models (LLMs) has intensified
concerns about content traceability and potential misuse. Existing watermarking
schemes for sampled text often face trade-offs between maintaining text quality
and ensuring robust detection against various attacks. To address these issues,
we propose a novel watermarking scheme that improves both detectability and
text quality by introducing a cumulative watermark entropy threshold. Our
approach is compatible with and generalizes existing sampling functions,
enhancing adaptability. Experimental results across multiple LLMs show that our
scheme significantly outperforms existing methods, achieving over 80\%
improvements on widely-used datasets, e.g., MATH and GSM8K, while maintaining
high detection accuracy.
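One plausible reading of the cumulative-entropy threshold (an interpretation of the abstract, not the authors' code): accumulate per-step entropy and apply the green-list bias only once the running total crosses the threshold, sparing low-entropy, quality-critical steps.

import torch

def step(logits: torch.Tensor, cum_entropy: float, threshold: float,
         green_mask: torch.Tensor, delta: float = 2.0):
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum().item()
    cum_entropy += entropy
    if cum_entropy >= threshold:          # enough slack: watermark this step
        logits = logits + delta * green_mask
        cum_entropy = 0.0                 # spend the accumulated budget
    token = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
    return token, cum_entropy

vocab = 16
green = (torch.arange(vocab) % 2 == 0).float()  # toy green list
tok, budget = step(torch.randn(vocab), cum_entropy=0.0, threshold=1.5,
                   green_mask=green)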
☆ Gauging Overprecision in LLMs: An Empirical Study
Recently, overconfidence in large language models (LLMs) has garnered
considerable attention due to its fundamental importance in quantifying the
trustworthiness of LLM generation. However, existing approaches prompt the
\textit{black box LLMs} to produce their confidence (\textit{verbalized
confidence}), which can be subject to many biases and hallucinations. Inspired
by a different aspect of overconfidence in cognitive science called
\textit{overprecision}, we designed a framework for its study in black box
LLMs. This framework contains three main phases: 1) generation, 2) refinement
and 3) evaluation. In the generation phase we prompt the LLM to generate
answers to numerical questions in the form of intervals with a certain level of
confidence. This confidence level is imposed in the prompt rather than
generated by the LLM, as in previous approaches. We use various prompting
techniques and use the same prompt multiple times to gauge the effects of
randomness in the generation process. In the refinement phase, answers from the
previous phase are refined to generate better answers. The LLM answers are
evaluated and studied in the evaluation phase to understand its internal
workings. This study allowed us to gain various insights into LLM
overprecision: 1) LLMs are highly uncalibrated for numerical tasks; 2) there
is no correlation between the length of the interval and the imposed
confidence level, which can be symptomatic of a) a lack of understanding of
the concept of confidence or b) an inability to adjust self-confidence by
following instructions; 3) LLM numerical precision differs depending on the
task, scale of answer, and prompting technique; 4) refinement of answers does
not improve precision in most cases. We believe this study offers new
perspectives on LLM
overconfidence and serves as a strong baseline for overprecision in LLMs.
comment: 16 pages
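The calibration check underlying insight 1) can be stated in a few lines: compare the empirical coverage of the model's intervals with the confidence level imposed in the prompt. The data below is fabricated for illustration only.

def coverage(intervals: list, truths: list) -> float:
    """Fraction of ground-truth values falling inside the model's intervals."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)

imposed_confidence = 0.90
intervals = [(10, 20), (5, 8), (100, 150), (0, 3)]  # model answers
truths = [15, 9, 120, 2]                            # ground-truth values
emp = coverage(intervals, truths)
print(f"imposed {imposed_confidence:.0%}, empirical {emp:.0%}")  # gap = miscalibration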
☆ Selective Demonstration Retrieval for Improved Implicit Hate Speech Detection
Hate speech detection is a crucial area of research in natural language
processing, essential for ensuring online community safety. However, detecting
implicit hate speech, where harmful intent is conveyed in subtle or indirect
ways, remains a major challenge. Unlike explicit hate speech, implicit
expressions often depend on context, cultural subtleties, and hidden biases,
making them more challenging to identify consistently. Additionally, the
interpretation of such speech is influenced by external knowledge and
demographic biases, resulting in varied detection results across different
language models. Furthermore, Large Language Models often show heightened
sensitivity to toxic language and references to vulnerable groups, which can
lead to misclassifications. This over-sensitivity results in false positives
(incorrectly identifying harmless statements as hateful) and false negatives
(failing to detect genuinely harmful content). Addressing these issues requires
methods that not only improve detection precision but also reduce model biases
and enhance robustness. To address these challenges, we propose a novel method,
which utilizes in-context learning without requiring model fine-tuning. By
adaptively retrieving demonstrations that focus on similar groups or those with
the highest similarity scores, our approach enhances contextual comprehension.
Experimental results show that our method outperforms current state-of-the-art
techniques. Implementation details and code are available at TBD.
☆ Bayesian dynamic borrowing considering semantic similarity between outcomes for disproportionality analysis in FAERS
We present a Bayesian dynamic borrowing (BDB) approach to enhance the
quantitative identification of adverse events (AEs) in spontaneous reporting
systems (SRSs). The method embeds a robust meta-analytic predictive (MAP) prior
within a Bayesian hierarchical model and incorporates semantic similarity
measures (SSMs) to enable weighted information sharing from MedDRA Preferred
Terms (PTs) that are clinically similar to the target PT. This continuous
similarity-based borrowing addresses the limitations of rigid hierarchical
grouping in current disproportionality analysis (DPA).
Using data from the FDA Adverse Event Reporting System (FAERS) between 2015
and 2019, we evaluate this approach, termed IC SSM, against standard
Information Component (IC) analysis and IC with borrowing at the MedDRA
high-level group term (HLGT) level. A novel reference set (PVLens), derived
from FDA product label updates, enabled prospective evaluation of method
performance in identifying AEs prior to official labeling.
The IC SSM approach demonstrated improved sensitivity compared to both
traditional IC and HLGT-based borrowing, with minor trade-offs in F1 scores and
Youden's index. IC SSM consistently identified more true positives and detected
signals over 5 months sooner than traditional IC. Despite a marginally lower
aggregate Youden's index, IC SSM showed higher performance in the early
post-marketing period, providing more stable and relevant estimates than
HLGT-based borrowing and traditional IC.
These findings support the use of SSM-informed Bayesian borrowing as a
scalable and context-aware enhancement to traditional DPA methods. Future
research should validate this approach across other datasets and explore
additional similarity metrics and Bayesian inference strategies using
case-level data.
comment: 30 pages, 7 figures, 5 supplementary figures
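As a simplified sketch of the borrowing idea (the full method embeds a robust MAP prior in a Bayesian hierarchical model), neighboring Preferred Terms can contribute observed and expected counts weighted by semantic similarity before the Information Component IC = log2((N_obs + 0.5) / (N_exp + 0.5)) is computed:

import math

def ic(n_obs: float, n_exp: float) -> float:
    """Standard shrunk Information Component."""
    return math.log2((n_obs + 0.5) / (n_exp + 0.5))

def ic_ssm(target: tuple, neighbors: list, sims: list) -> float:
    """target/neighbors: (observed, expected) counts; sims in [0, 1]."""
    n_obs = target[0] + sum(w * o for w, (o, _) in zip(sims, neighbors))
    n_exp = target[1] + sum(w * e for w, (_, e) in zip(sims, neighbors))
    return ic(n_obs, n_exp)

# A target PT with sparse counts borrows from two clinically similar PTs.
print(ic(3, 1.2))                                          # standard IC
print(ic_ssm((3, 1.2), [(8, 2.5), (5, 4.0)], [0.8, 0.3]))  # SSM-informed IC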
☆ Language Models as Quasi-Crystalline Thought: Structure, Constraint, and Emergence in Generative Systems
This essay proposes an analogy between large language models (LLMs) and
quasicrystals: systems that exhibit global coherence without periodic
repetition and that are generated through local constraints. While LLMs are
often evaluated in terms of predictive accuracy, factuality, or alignment, this
structural perspective suggests that their most characteristic behavior is the
production of internally resonant linguistic patterns. Just as quasicrystals
forced a redefinition of order in physical systems, viewing LLMs as generators
of quasi-structured language opens new paths for evaluation and design:
privileging propagation of constraint over token-level accuracy, and coherence
of form over fixed meaning. LLM outputs should be read not only for what they
say, but for the patterns of constraint and coherence that organize them. This
shift reframes generative language as a space of emergent patterning: LLMs are
neither fully random nor strictly rule-based, but defined by a logic of
constraint, resonance, and structural depth.
☆ SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes SemEval-2025
Raúl Vázquez, Timothee Mickus, Elaine Zosa, Teemu Vahtola, Jörg Tiedemann, Aman Sinha, Vincent Segonne, Fernando Sánchez-Vega, Alessandro Raganato, Jindřich Libovický, Jussi Karlgren, Shaoxiong Ji, Jindřich Helcl, Liane Guillou, Ona de Gibert, Jaione Bengoetxea, Joseph Attieh, Marianna Apidianaki
We present the Mu-SHROOM shared task, which focuses on detecting
hallucinations and other overgeneration mistakes in the output of
instruction-tuned large language models (LLMs). Mu-SHROOM addresses
general-purpose LLMs in 14 languages, and frames the hallucination detection
problem as a span-labeling task. We received 2,618 submissions from 43
participating teams employing diverse methodologies. The large number of
submissions underscores the interest of the community in hallucination
detection. We present the results of the participating systems and conduct an
empirical analysis to identify key factors contributing to strong performance
in this task. We also emphasize relevant current challenges, notably the
varying degree of hallucinations across languages and the high annotator
disagreement when labeling hallucination spans.
comment: Mu-SHROOM is part of SemEval-2025 (Task 3). TBP: Proceedings of the
19th International Workshop on Semantic Evaluation (SemEval-2025)
☆ LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Extractive reading comprehension question answering (QA) datasets are
typically evaluated using Exact Match (EM) and F1-score, but these metrics
often fail to fully capture model performance. With the success of large
language models (LLMs), they have been employed in various tasks, including
serving as judges (LLM-as-a-judge). In this paper, we reassess the performance
of QA models using LLM-as-a-judge across four reading comprehension QA
datasets. We examine different families of LLMs and various answer types to
evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show
that LLM-as-a-judge is highly correlated with human judgments and can replace
traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human
judgments improves significantly, from 0.17 (EM) and 0.36 (F1-score) to 0.85.
These findings confirm that EM and F1 metrics underestimate the true
performance of the QA models. While LLM-as-a-judge is not perfect for more
difficult answer types (e.g., job), it still outperforms EM/F1, and we observe
no bias issues, such as self-preference, when the same model is used for both
the QA and judgment tasks.
comment: 17 pages; code and data are available at
https://github.com/Alab-NII/llm-judge-extract-qa
☆ Robust and Fine-Grained Detection of AI Generated Texts ACL 2025
Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Drishti Sharma, Siddhant Gupta, Jebish Purbey, Ashay Srivastava, Subhasya TippaReddy, Arvind Reddy Bobbili, Suraj Telugara Chandrashekhar, Modabbir Adeeb, Srinadh Vura, Hamza Farooq
An ideal detection system for machine-generated content should work well on
any generator, as ever more advanced LLMs come into existence daily. Existing
systems often struggle to accurately identify AI-generated content in shorter
texts. Further, not all texts are entirely authored by a human or an LLM, so
we focus on partial cases, i.e., human-LLM co-authored texts. Our paper
introduces a set of token-classification models trained on an extensive
collection of human-machine co-authored texts, which perform well on texts
from unseen domains, unseen generators, texts by non-native speakers, and
adversarial inputs. We also introduce a new dataset of over 2.4M such texts,
mostly co-authored by several popular proprietary LLMs across 23 languages,
and present our models' performance on each domain and generator. Additional
findings include comparisons of performance across adversarial methods and
input-text lengths, and of the characteristics of generated texts against the
original human-authored texts.
comment: ACL 2025 Feb ARR Submission
☆ ADAT: Time-Series-Aware Adaptive Transformer Architecture for Sign Language Translation
Current sign language machine translation systems rely on recognizing hand
movements, facial expressions, and body postures, combined with natural
language processing, to convert signs into text. Recent approaches use Transformer
architectures to model long-range dependencies via positional encoding.
However, they lack accuracy in recognizing fine-grained, short-range temporal
dependencies between gestures captured at high frame rates. Moreover, their
high computational complexity leads to inefficient training. To mitigate these
issues, we propose an Adaptive Transformer (ADAT), which incorporates
components for enhanced feature extraction and adaptive feature weighting
through a gating mechanism to emphasize contextually relevant features while
reducing training overhead and maintaining translation accuracy. To evaluate
ADAT, we introduce MedASL, the first public medical American Sign Language
dataset. In sign-to-gloss-to-text experiments, ADAT outperforms the
encoder-decoder transformer, improving BLEU-4 accuracy by 0.1% while reducing
training time by 14.33% on PHOENIX14T and 3.24% on MedASL. In sign-to-text
experiments, it improves accuracy by 8.7% and reduces training time by 2.8% on
PHOENIX14T and achieves 4.7% higher accuracy and 7.17% faster training on
MedASL. Compared to encoder-only and decoder-only baselines in sign-to-text,
ADAT is at least 6.8% more accurate despite being up to 12.1% slower due to its
dual-stream structure.
☆ An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Gender-neutral translation (GNT) aims to avoid expressing the gender of human
referents when the source text lacks explicit cues about the gender of those
referents. Evaluating GNT automatically is particularly challenging, with
current solutions being limited to monolingual classifiers. Such solutions are
not ideal because they do not factor in the source sentence and require
dedicated data and fine-tuning to scale to new languages. In this work, we
address such limitations by investigating the use of large language models
(LLMs) as evaluators of GNT. Specifically, we explore two prompting approaches:
one in which LLMs generate sentence-level assessments only, and another, akin
to a chain-of-thought approach, where they first produce detailed phrase-level
annotations before a sentence-level judgment. Through extensive experiments on
multiple languages with five models, both open and proprietary, we show that
LLMs can serve as evaluators of GNT. Moreover, we find that prompting for
phrase-level annotations before sentence-level assessments consistently
improves the accuracy of all models, providing a better and more scalable
alternative to current solutions.
comment: Accepted at GITT 2025
☆ Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection
Stories are a fundamental aspect of human experience. Engaging deeply with
stories and spotting plot holes -- inconsistencies in a storyline that break
the internal logic or rules of a story's world -- requires nuanced reasoning
skills, including tracking entities and events and their interplay, abstract
thinking, pragmatic narrative understanding, commonsense and social reasoning,
and theory of mind. As Large Language Models (LLMs) increasingly generate,
interpret, and modify text, rigorously assessing their narrative consistency
and deeper language understanding becomes critical. However, existing
benchmarks focus mainly on surface-level comprehension. In this work, we
propose plot hole detection in stories as a proxy to evaluate language
understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel
algorithm to controllably and carefully synthesize plot holes in human-written
stories. Using this algorithm, we construct a benchmark to evaluate LLMs' plot
hole detection abilities in stories -- FlawedFictions -- , which is robust to
contamination, with human filtering ensuring high quality. We find that
state-of-the-art LLMs struggle to solve FlawedFictions accurately regardless
of the reasoning effort allowed, with performance significantly degrading as
story length increases. Finally, we show that LLM-based story summarization and
story generation are prone to introducing plot holes, with more than 50% and
100% increases in plot hole detection rates with respect to human-written
originals.
comment: Preprint
☆ Rethinking LLM-Based Recommendations: A Query Generation-Based, Training-Free Approach
Existing large language model (LLM)-based recommendation methods face several
challenges, including inefficiency in handling large candidate pools,
sensitivity to item order within prompts (the "lost in the middle" phenomenon),
poor scalability, and unrealistic evaluation due to random negative sampling. To
address these issues, we propose a Query-to-Recommendation approach that
leverages LLMs to generate personalized queries for retrieving relevant items
from the entire candidate pool, eliminating the need for candidate
pre-selection. This method can be integrated into an ID-based recommendation
system without additional training, enhances recommendation performance and
diversity through LLMs' world knowledge, and performs well even for less
popular item groups. Experiments on three datasets show up to 57 percent
improvement, with an average gain of 31 percent, demonstrating strong zero-shot
performance and further gains when ensembled with existing models.
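A minimal sketch of the training-free pipeline, with llm and embed as stand-ins for any chat model and text encoder (both are assumptions, not a specific API): generate a personalized query from the user's history, then retrieve nearest items from the entire pool by cosine similarity.

import numpy as np

def recommend(history, item_texts, llm, embed, top_k=5):
    prompt = ("A user liked these items:\n- " + "\n- ".join(history) +
              "\nWrite a short search query for what they might want next.")
    query_vec = embed(llm(prompt))                        # personalized query
    item_vecs = np.stack([embed(t) for t in item_texts])  # entire candidate pool
    sims = item_vecs @ query_vec / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:top_k].tolist()

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
_cache = {}
embed = lambda text: _cache.setdefault(text, rng.normal(size=16))
llm = lambda prompt: "cozy mystery novels"                # placeholder reply
print(recommend(["The Hobbit", "Gone Girl"], ["item A", "item B", "item C"],
                llm, embed, top_k=2))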
☆ Evaluating the Goal-Directedness of Large Language Models
Tom Everitt, Cristina Garbacea, Alexis Bellot, Jonathan Richens, Henry Papadatos, Siméon Campos, Rohin Shah
To what extent do LLMs use their capabilities towards their given goal? We
take this as a measure of their goal-directedness. We evaluate
goal-directedness on tasks that require information gathering, cognitive
effort, and plan execution, where we use subtasks to infer each model's
relevant capabilities. Our evaluations of LLMs from Google DeepMind, OpenAI,
and Anthropic show that goal-directedness is relatively consistent across
tasks, differs from task performance, and is only moderately sensitive to
motivational prompts. Notably, most models are not fully goal-directed. We hope
our goal-directedness evaluations will enable better monitoring of LLM
progress, and enable more deliberate design choices of agentic properties in
LLMs.
☆ FiSMiness: A Finite State Machine Based Paradigm for Emotional Support Conversations
Emotional support conversation (ESC) aims to alleviate the emotional distress
of individuals through effective conversations. Although large language models
(LLMs) have made remarkable progress on ESC, most of these studies do not
approach the dialogue from a state-model perspective, and therefore provide a
suboptimal solution for long-term satisfaction. To address this issue, we
leverage the Finite State Machine (FSM) on LLMs, and propose a framework called
FiSMiness. Our framework allows a single LLM to bootstrap the planning during
ESC, and self-reason the seeker's emotion, support strategy and the final
response at each conversational turn. Extensive experiments on ESC datasets
suggest that FiSMiness outperforms many baselines, including direct inference,
self-refine, chain of thought, finetuning, and external-assisted methods, even
those with many more parameters.
comment: accepted by CMCL
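A minimal finite-state scaffold in the spirit of FiSMiness; the states, events, and prompts are illustrative guesses rather than the paper's actual machine, and llm is a placeholder for any chat model call.

TRANSITIONS = {
    "explore": {"emotion_identified": "comfort"},
    "comfort": {"distress_reduced": "suggest"},
    "suggest": {"plan_accepted": "close"},
}

def esc_turn(state: str, seeker_msg: str, llm):
    """One conversational turn: reason about emotion, reply, transition."""
    analysis = llm(f"State: {state}. Seeker says: {seeker_msg}. "
                   "Name the seeker's emotion and pick a support strategy.")
    reply = llm(f"Using the strategy from: {analysis}, write a supportive reply.")
    # An event classifier (here: another LLM call) picks the transition.
    event = llm(f"Given: {analysis}, output one event of "
                f"{list(TRANSITIONS.get(state, {}))} or 'stay'.")
    next_state = TRANSITIONS.get(state, {}).get(event, state)
    return next_state, reply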
☆ Could Thinking Multilingually Empower LLM Reasoning?
Previous work indicates that large language models exhibit a significant
"English bias", i.e. they often perform better when tasks are presented in
English. Interestingly, we have observed that using certain other languages in
reasoning tasks can yield better performance than English. However, this
phenomenon remains under-explored. In this paper, we explore the upper bound of
harnessing multilingualism in reasoning tasks, suggesting that multilingual
reasoning promises significantly (by nearly 10 Acc@$k$ points) and robustly
(tolerance for variations in translation quality and language choice) higher
upper bounds than English-only reasoning. Besides analyzing the reason behind
the upper bound and challenges in reaching it, we also find that common answer
selection methods cannot achieve this upper bound, due to their limitations and
biases. These insights could pave the way for future research aimed at fully
harnessing the potential of multilingual reasoning in LLMs.
☆ Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
Generation capabilities and language coverage of multilingual large language
models (mLLMs) are advancing rapidly. However, evaluation practices for
generative abilities of mLLMs are still lacking comprehensiveness, scientific
rigor, and consistent adoption across research labs, which undermines their
potential to meaningfully guide mLLM development. We draw parallels with
machine translation (MT) evaluation, a field that faced similar challenges and
has, over decades, developed transparent reporting standards and reliable
evaluations for multilingual generative models. Through targeted experiments
across key stages of the generative evaluation pipeline, we demonstrate how
best practices from MT evaluation can deepen the understanding of quality
differences between models. Additionally, we identify essential components for
robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are
rigorously assessed. We distill these insights into a checklist of actionable
recommendations for mLLM research and development.
☆ ARWI: Arabic Write and Improve
Although Arabic is spoken by over 400 million people, advanced Arabic writing
assistance tools remain limited. To address this gap, we present ARWI, a new
writing assistant that helps learners improve essay writing in Modern Standard
Arabic. ARWI is the first publicly available Arabic writing assistant to
include a prompt database for different proficiency levels, an Arabic text
editor, state-of-the-art grammatical error detection and correction, and
automated essay scoring aligned with the Common European Framework of Reference
standards for language attainment. Moreover, ARWI can be used to gather a
growing auto-annotated corpus, facilitating further research on Arabic grammar
correction and essay scoring, as well as profiling patterns of errors made by
native speakers and non-native learners. A preliminary user study shows that
ARWI provides actionable feedback, helping learners identify grammatical gaps,
assess language proficiency, and guide improvement.
☆ Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture
Simultaneous speech translation (SimulST) produces translations incrementally
while processing partial speech input. Although large language models (LLMs)
have showcased strong capabilities in offline translation tasks, applying them
to SimulST poses notable challenges. Existing LLM-based SimulST approaches
either incur significant computational overhead due to repeated encoding with a
bidirectional speech encoder, or depend on a fixed read/write policy,
limiting efficiency and performance. In this work, we introduce Efficient
and Adaptive Simultaneous Speech Translation (EASiST) with fully unidirectional
architecture, including both speech encoder and LLM. EASiST includes a
multi-latency data curation strategy to generate semantically aligned SimulST
training samples and redefines SimulST as an interleaved generation task with
explicit read/write tokens. To facilitate adaptive inference, we incorporate a
lightweight policy head that dynamically predicts read/write actions.
Additionally, we employ a multi-stage training strategy to align speech-text
modalities and optimize both translation and policy behavior. Experiments on
the MuST-C En$\rightarrow$De and En$\rightarrow$Es datasets demonstrate that
EASiST offers superior latency-quality trade-offs compared to several strong
baselines.
☆ Selective Attention Federated Learning: Improving Privacy and Efficiency for Clinical Text Classification
Federated Learning (FL) faces major challenges regarding communication
overhead and model privacy when training large language models (LLMs),
especially in healthcare applications. To address these, we introduce Selective
Attention Federated Learning (SAFL), a novel approach that dynamically
fine-tunes only those transformer layers identified as attention-critical. By
employing attention patterns to determine layer importance, SAFL significantly
reduces communication bandwidth and enhances differential privacy resilience.
Evaluations on clinical NLP benchmarks (i2b2 Clinical Concept Extraction and
MIMIC-III discharge summaries) demonstrate that SAFL achieves competitive
performance with centralized models while substantially improving communication
efficiency and privacy preservation.
☆ Enhancing Web Agents with Explicit Rollback Mechanisms
With recent advancements in large language models, web agents have been
greatly improved. However, dealing with complex and dynamic web environments
requires more advanced planning and search abilities. Previous studies usually
adopt a greedy one-way search strategy, which may struggle to recover from
erroneous states. In this work, we enhance web agents with an explicit rollback
mechanism, enabling the agent to revert to a previous state in its
navigation trajectory. This mechanism gives the model the flexibility to
directly control the search process, leading to an effective and efficient web
navigation method. We conduct experiments on two live web navigation benchmarks
with zero-shot and fine-tuning settings. The results demonstrate the
effectiveness of our proposed approach.
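A compact sketch of what an explicit rollback mechanism can look like, with the action name and environment interface assumed for illustration: the agent keeps a stack of visited states, and a ROLLBACK action pops back to the previous one.

class RollbackAgent:
    def __init__(self, env, policy):
        self.env, self.policy = env, policy
        self.stack = [env.snapshot()]          # navigation trajectory so far

    def step(self, observation: str) -> None:
        action = self.policy(observation, depth=len(self.stack))
        if action == "ROLLBACK" and len(self.stack) > 1:
            self.stack.pop()                   # discard the erroneous state
            self.env.restore(self.stack[-1])   # revert to the previous page
        else:
            self.env.execute(action)
            self.stack.append(self.env.snapshot())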
☆ Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters
Cross-linguistically, native words and loanwords follow different
phonological rules. In English, for example, words of Germanic and Latinate
origin exhibit different stress patterns, and a certain syntactic structure is
exclusive to Germanic verbs. When viewed as a cognitive model, however,
such etymology-based generalizations face challenges in terms of learnability,
since the historical origins of words are presumably inaccessible information
for general language learners. In this study, we present computational evidence
indicating that the Germanic-Latinate distinction in the English lexicon is
learnable from the phonotactic information of individual words. Specifically,
we performed an unsupervised clustering on corpus-extracted words, and the
resulting word clusters largely aligned with the etymological distinction. The
model-discovered clusters also recovered various linguistic generalizations
documented in the previous literature regarding the corresponding etymological
classes. Moreover, our findings uncovered previously unrecognized features
of the quasi-etymological clusters, offering novel hypotheses for future
experimental studies.
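A small reproduction-in-spirit, using character n-grams as a crude proxy for phonotactics (the paper clusters phonological representations of corpus-extracted words; the word list here is illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

words = ["begin", "forgive", "understand", "withdraw",            # Germanic-like
         "communicate", "residual", "fraternity", "imagination"]  # Latinate-like
X = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(words)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for word, cluster in zip(words, labels):
    print(cluster, word)  # inspect whether clusters echo the etymological split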
☆ Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?
Recent supervised fine-tuning (SFT) approaches have significantly improved
language models' performance on mathematical reasoning tasks, even when models
are trained at a small scale. However, the specific capabilities enhanced
through such fine-tuning remain poorly understood. In this paper, we conduct a
detailed analysis of model performance on the AIME24 dataset to understand how
reasoning capabilities evolve. We discover a ladder-like structure in problem
difficulty, categorize questions into four tiers (Easy, Medium, Hard, and
Extremely Hard (Exh)), and identify the specific requirements for advancing
between tiers. We find that progression from Easy to Medium tier requires
adopting an R1 reasoning style with minimal SFT (500-1K instances), while
Hard-level questions suffer from frequent model errors at each step of the
reasoning chain, with accuracy plateauing at around 65% despite logarithmic
scaling. Exh-level questions present a fundamentally different challenge; they
require unconventional problem-solving skills that current models uniformly
struggle with. Additional findings reveal that carefully curated small-scale
datasets offer limited advantage; scaling dataset size proves far more
effective. Our analysis provides a clearer roadmap for advancing language model
capabilities in mathematical reasoning.
☆ The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation CVPR2025
The evolution of Text-to-video (T2V) generative models, trained on
large-scale datasets, has been marked by significant progress. However, the
sensitivity of T2V generative models to input prompts highlights the critical
role of prompt design in influencing generative outcomes. Prior research has
predominantly relied on Large Language Models (LLMs) to align user-provided
prompts with the distribution of training prompts, albeit without tailored
guidance encompassing prompt vocabulary and sentence structure nuances. To this
end, we introduce \textbf{RAPO}, a novel \textbf{R}etrieval-\textbf{A}ugmented
\textbf{P}rompt \textbf{O}ptimization framework. To address potential
inaccuracies and ambiguous details in LLM-generated prompts, RAPO
refines the naive prompts through dual optimization branches, selecting the
superior prompt for T2V generation. The first branch augments user prompts with
diverse modifiers extracted from a learned relational graph, refining them to
align with the format of training prompts via a fine-tuned LLM. Conversely, the
second branch rewrites the naive prompt using a pre-trained LLM following a
well-defined instruction set. Extensive experiments demonstrate that RAPO can
effectively enhance both the static and dynamic dimensions of generated videos,
demonstrating the significance of prompt optimization for user-provided
prompts. Project website:
\href{https://whynothaha.github.io/Prompt_optimizer/RAPO.html}{GitHub}.
comment: accepted by CVPR2025
☆ Higher-Order Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions
Large language models (LLMs) are increasingly capable of simulating human
behavior, offering cost-effective ways to estimate user responses during the
early phases of survey design. While previous studies have examined whether
models can reflect individual opinions or attitudes, we argue that a
\emph{higher-order} binding of virtual personas requires successfully
approximating not only the opinions of a user as an identified member of a
group, but also the nuanced ways in which that user perceives and evaluates
those outside the group. In particular, faithfully simulating how humans
perceive different social groups is critical for applying LLMs to various
political science studies, including timely topics on polarization dynamics,
inter-group conflict, and democratic backsliding. To this end, we propose a
novel methodology for constructing virtual personas with synthetic user
``backstories'' generated as extended, multi-turn interview transcripts. Our
generated backstories are longer, rich in detail, and consistent in
authentically describing a singular individual, compared to previous methods.
We show that virtual personas conditioned on our backstories closely replicate
human response distributions (up to an 87\% improvement as measured by
Wasserstein Distance) and produce effect sizes that closely match those
observed in the original studies. Altogether, our work extends the
applicability of LLMs beyond estimating individual self-opinions, enabling
their use in a broader range of human studies.
♻ ☆ ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement
Post-training Large Language Models (LLMs) with explicit reasoning
trajectories can enhance their reasoning abilities. However, acquiring such
high-quality trajectory data typically demands meticulous supervision from
humans or superior models, which can be either expensive or
license-constrained. In this paper, we explore how far an LLM can improve its
reasoning by self-synthesizing reasoning paths as training data without any
additional supervision. Existing self-synthesizing methods, such as STaR,
suffer from poor generalization to out-of-domain (OOD) reasoning tasks. We
hypothesize this is because their self-synthesized reasoning paths are too
task-specific, lacking general task-agnostic reasoning guidance. To address
this, we propose Reasoning Generalist via Self-Improvement (ReGenesis), a
method to self-synthesize reasoning paths as post-training data by progressing
from abstract to concrete. More specifically, ReGenesis self-synthesizes
reasoning paths by converting general reasoning guidelines into task-specific
ones, generating reasoning structures, and subsequently transforming these
structures into reasoning paths, without the need for human-designed
task-specific examples used in existing methods. We show that ReGenesis
achieves superior performance on all in-domain and OOD settings tested compared
to existing methods. For six OOD tasks specifically, while previous methods
exhibited an average performance decrease of approximately 4.6% after
post-training, ReGenesis delivers around a 6.1% performance improvement. We also
conduct in-depth analysis of our framework and show ReGenesis is effective
across various LLMs and design choices.
♻ ☆ Taming Data and Transformers for Audio Generation
The scalability of ambient sound generators is hindered by data scarcity,
insufficient caption quality, and limited scalability in model architecture.
This work addresses these challenges by advancing both data and model scaling.
First, we propose an efficient and scalable dataset collection pipeline
tailored for ambient audio generation, resulting in AutoReCap-XL, the largest
ambient audio-text dataset with over 47 million clips. To provide high-quality
textual annotations, we propose AutoCap, a high-quality automatic audio
captioning model. By adopting a Q-Former module and leveraging audio metadata,
AutoCap substantially enhances caption quality, reaching a CIDEr score of
$83.2$, a $3.2\%$ improvement over previous captioning models. Finally, we
propose GenAu, a scalable transformer-based audio generation architecture that
we scale up to 1.25B parameters. We demonstrate its benefits from data scaling
with synthetic captions as well as model size scaling. When compared to
baseline audio generators trained at similar size and data scale, GenAu obtains
significant improvements of $4.7\%$ in FAD score, $11.1\%$ in IS, and $13.5\%$
in CLAP score. Our code, model checkpoints, and dataset are publicly available.
comment: Project Webpage: https://snap-research.github.io/GenAU/
♻ ☆ How Inclusively do LMs Perceive Social and Moral Norms? NAACL 2025
This paper discusses and contains offensive content. Language models (LMs)
are used in decision-making systems and as interactive assistants. However, how
well do the judgements these models make align with the diversity of human
values, particularly regarding social and moral norms? In this work, we
investigate how inclusively LMs perceive norms across demographic groups (e.g.,
gender, age, and income). We prompt 11 LMs on rules-of-thumb (RoTs) and compare
their outputs with the existing responses of 100 human annotators. We introduce
the Absolute Distance Alignment Metric (ADA-Met) to quantify alignment on
ordinal questions. We find notable disparities in LM responses, with younger,
higher-income groups showing closer alignment, raising concerns about the
representation of marginalized perspectives. Our findings highlight the
importance of further efforts to make LMs more inclusive of diverse human
values. The code and prompts are available on GitHub under the CC BY-NC 4.0
license.
comment: Accepted at NAACL 2025 Findings
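The abstract names ADA-Met without giving its formula; one plausible reading (an assumption, not the paper's definition) is the mean absolute distance between the model's ordinal rating and a group's human ratings, with smaller values indicating closer alignment:

def ada_met(model_rating: int, human_ratings: list) -> float:
    """Mean absolute distance on an ordinal scale; lower = closer alignment."""
    return sum(abs(model_rating - h) for h in human_ratings) / len(human_ratings)

# 5-point agreement scale: the model answers 4; two demographic groups differ.
younger = [4, 5, 4, 4]
older = [2, 1, 3, 2]
print(ada_met(4, younger))  # 0.25 -> close alignment
print(ada_met(4, older))    # 2.0  -> weak alignment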
♻ ☆ RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)
Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, Ping Luo
In the rapidly advancing field of robotics, dual-arm coordination and complex
object manipulation are essential capabilities for developing advanced
autonomous systems. However, the scarcity of diverse, high-quality
demonstration data and real-world-aligned evaluation benchmarks severely limits
such development. To address this, we introduce RoboTwin, a generative digital
twin framework that uses 3D generative foundation models and large language
models to produce diverse expert datasets and provide a real-world-aligned
evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates
varied digital twins of objects from single 2D images, generating realistic and
interactive scenarios. It also introduces a spatial relation-aware code
generation framework that combines object annotations with large language
models to break down tasks, determine spatial constraints, and generate precise
robotic movement code. Our framework offers a comprehensive benchmark with both
simulated and real-world data, enabling standardized evaluation and better
alignment between simulated training and real-world performance. We validated
our approach using the open-source COBOT Magic Robot platform. Policies
pre-trained on RoboTwin-generated data and fine-tuned with limited real-world
samples improve the success rate by over 70% for single-arm tasks and over 40%
for dual-arm tasks compared to models trained solely on real-world data. This
significant improvement demonstrates RoboTwin's potential to enhance the
development and evaluation of dual-arm robotic manipulation systems. Project
Page: https://robotwin-benchmark.github.io/early-version/.
comment: Project page: https://robotwin-benchmark.github.io/early-version/
♻ ☆ BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving
Teng Wang, Wing-Yin Yu, Zhenqi He, Zehua Liu, Hailei Gong, Han Wu, Xiongwei Han, Wei Shi, Ruifeng She, Fangzhou Zhu, Tao Zhong
LLMs exhibit advanced reasoning capabilities, offering the potential to
transform natural language questions into mathematical models. However,
existing open-source datasets in the operations research domain lack detailed
annotations of the modeling process, such as variable definitions, focusing
solely on objective values, which hinders reinforcement learning applications.
To address this, we release the StructuredOR dataset, annotated with
comprehensive labels that capture the complete mathematical modeling process.
We further propose BPP-Search, an algorithm that integrates reinforcement
learning into a tree-of-thought structure using Beam search, a Process reward
model, and a pairwise Preference algorithm. This approach enables efficient
exploration of tree structures, avoiding exhaustive search while improving
accuracy. Extensive experiments on StructuredOR, NL4OPT, and MAMO-ComplexLP
datasets show that BPP-Search significantly outperforms state-of-the-art
methods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency,
enabling faster retrieval of correct solutions. The StructuredOR dataset is
available at https://github.com/tengwang0318/StructuredOR.
♻ ☆ Science Out of Its Ivory Tower: Improving Accessibility with Reinforcement Learning
Haining Wang, Jason Clark, Hannah McKelvey, Leila Sterman, Zheng Gao, Zuoyu Tian, Sandra Kübler, Xiaozhong Liu
A vast amount of scholarly work is published daily, yet much of it remains
inaccessible to the general public due to dense jargon and complex language. To
address this challenge in science communication, we introduce a reinforcement
learning framework that fine-tunes a language model to rewrite scholarly
abstracts into more comprehensible versions. Guided by a carefully balanced
combination of word- and sentence-level accessibility rewards, our language
model effectively substitutes technical terms with more accessible
alternatives, a task that models trained with supervised fine-tuning or guided
by conventional readability measures struggle to accomplish. Our best model
adjusts the readability level of scholarly abstracts by approximately six U.S.
grade levels -- in other words, from a postgraduate to a high school level.
This translates to roughly a 90% relative boost over the supervised fine-tuning
baseline, all while maintaining factual accuracy and high-quality language. An
in-depth analysis of our approach shows that balanced rewards lead to
systematic modifications in the base model, likely contributing to smoother
optimization and superior performance. We envision this work as a step toward
bridging the gap between scholarly research and the general public,
particularly younger readers and those without a college degree.
♻ ☆ Automatic Item Generation for Personality Situational Judgment Tests with Large Language Models
Personality assessment, particularly through situational judgment tests
(SJTs), is a vital tool for psychological research, talent selection, and
educational evaluation. This study explores the potential of GPT-4, a
state-of-the-art large language model (LLM), to automate the generation of
personality situational judgment tests (PSJTs) in Chinese. Traditional SJT
development is labor-intensive and prone to biases, while GPT-4 offers a
scalable, efficient alternative. Two studies were conducted: Study 1 evaluated
the impact of prompt design and temperature settings on content validity,
finding that optimized prompts with a temperature of 1.0 produced creative and
accurate items. Study 2 assessed the psychometric properties of GPT-4-generated
PSJTs, revealing that they demonstrated satisfactory reliability and validity,
surpassing the performance of manually developed tests in measuring the Big
Five personality traits. This research highlights GPT-4's effectiveness in
developing high-quality PSJTs, providing a scalable and innovative method for
psychometric test development. These findings expand the possibilities of
automatic item generation and the application of LLMs in psychology, and offer
practical implications for streamlining test development processes in
resource-limited settings.
comment: Submitted to Psychological Methods. 56 pages (main text), 12 pages
(appendix), and 5 figures
♻ ☆ FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models
Improving training efficiency remains one of the most significant
challenges in large-scale reinforcement learning. In this paper, we investigate
how the model's context length and the complexity of the training dataset
influence the training process of R1-like models. Our experiments reveal three
key insights: (1) adopting longer context lengths may not necessarily result in
better performance; (2) selecting an appropriate context length helps mitigate
entropy collapse; and (3) appropriately controlling the model's context length
and curating training data based on input prompt length can effectively improve
RL training efficiency, achieving better performance with shorter thinking
length. Inspired by these insights, we propose FastCuRL, a curriculum
reinforcement learning framework with a progressive context extension strategy
that successfully accelerates the training of RL models.
Experimental results demonstrate that FastCuRL-1.5B-Preview surpasses
DeepScaleR-1.5B-Preview across all five benchmarks while only utilizing 50\% of
training steps. Furthermore, all training stages for FastCuRL-1.5B-Preview are
completed using a single node with 8 GPUs.
comment: Ongoing Work
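A sketch of what a progressive context-extension curriculum could look like in practice; the stage values below are illustrative, not the paper's actual schedule.

def build_curriculum(prompts, stages=((1024, 256), (2048, 512), (4096, 1024))):
    """Each stage pairs a context budget with prompts filtered by length:
    (context_length, max_prompt_tokens)."""
    curriculum = []
    for ctx_len, max_prompt in stages:
        subset = [p for p in prompts if len(p.split()) <= max_prompt]
        curriculum.append({"context_length": ctx_len, "data": subset})
    return curriculum

# RL training would then run the stages in order, warm-starting each stage
# from the previous stage's checkpoint while the context budget grows.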
♻ ☆ Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, Wentao Zhang
Document parsing is essential for converting unstructured and semi-structured
documents such as contracts, academic papers, and invoices into structured,
machine-readable data. Document parsing reliably extracts structured data from
unstructured inputs, providing great convenience for numerous applications.
Especially with recent achievements in Large Language Models, document parsing
plays an indispensable role in both knowledge base construction and training
data generation. This survey presents a comprehensive review of the current
state of document parsing, covering key methodologies, from modular pipeline
systems to end-to-end models driven by large vision-language models. Core
components such as layout detection, content extraction (including text,
tables, and mathematical expressions), and multi-modal data integration are
examined in detail. Additionally, this paper discusses the challenges faced by
modular document parsing systems and vision-language models in handling complex
layouts, integrating multiple modules, and recognizing high-density text. It
outlines future research directions and emphasizes the importance of developing
larger and more diverse datasets.
♻ ☆ Task Memory Engine (TME): A Structured Memory Framework with Graph-Aware Extensions for Multi-Step LLM Agent Tasks
Large Language Models (LLMs) are increasingly used as autonomous agents for
multi-step tasks. However, most existing frameworks fail to maintain a
structured understanding of the task state, often relying on linear prompt
concatenation or shallow memory buffers. This leads to brittle performance,
frequent hallucinations, and poor long-range coherence. In this work, we
propose the Task Memory Engine (TME), a lightweight and structured memory
module that tracks task execution using a hierarchical Task Memory Tree (TMT).
Each node in the tree corresponds to a task step, storing relevant input,
output, status, and sub-task relationships. We introduce a prompt synthesis
method that dynamically generates LLM prompts based on the active node path,
significantly improving execution consistency and contextual grounding. Through
case studies and comparative experiments on multi-step agent tasks, we
demonstrate that TME leads to better task completion accuracy and more
interpretable behavior with minimal implementation overhead. A reference
implementation of the core TME components is available at
https://github.com/biubiutomato/TME-Agent, including basic examples and
structured memory integration. While the current implementation uses a
tree-based structure, TME is designed to be graph-aware, supporting reusable
substeps, converging task paths, and shared dependencies. This lays the
groundwork for future DAG-based memory architectures.
comment: 14 pages, 5 figures. Preprint prepared for future submission.
Includes implementation and token-efficiency analysis. Code at
https://github.com/biubiutomato/TME-Agent
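A minimal sketch of the Task Memory Tree and path-based prompt synthesis as described above; field names are illustrative rather than taken from the reference implementation.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskNode:
    step: str
    input: str = ""
    output: str = ""
    status: str = "pending"          # pending | running | done | failed
    children: List["TaskNode"] = field(default_factory=list)
    parent: Optional["TaskNode"] = None

    def add_child(self, child: "TaskNode") -> "TaskNode":
        child.parent = self
        self.children.append(child)
        return child

def synthesize_prompt(active: TaskNode) -> str:
    # Walk from the root to the active node and serialize only that path,
    # rather than concatenating the full interaction history.
    path = []
    node = active
    while node is not None:
        path.append(node)
        node = node.parent
    lines = []
    for n in reversed(path):
        lines.append(f"[{n.status}] {n.step}: in={n.input!r} out={n.output!r}")
    return "Task context so far:\n" + "\n".join(lines) + "\nContinue the active step."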
♻ ☆ LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks
Large language model unlearning has become a critical challenge in ensuring
safety and controlled model behavior by removing undesired data-model
influences from the pretrained model while preserving general utility.
Significant recent efforts have been dedicated to developing LLM unlearning
benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine
Unlearning Six-way Evaluation), facilitating standardized unlearning
performance assessment and method comparison. Despite their usefulness, we
uncover, for the first time, a coreset effect within these benchmarks.
Specifically, we find that LLM unlearning achieved with the original (full)
forget set can be effectively maintained using a significantly smaller subset
(functioning as a "coreset"), e.g., as little as 5% of the forget set, even
when selected at random. This suggests that LLM unlearning in these benchmarks
can be performed surprisingly easily, even in an extremely low-data regime. We
demonstrate that this coreset effect remains strong, regardless of the LLM
unlearning method used, such as NPO (Negative Preference Optimization) and RMU
(Representation Misdirection Unlearning), the popular ones in these benchmarks.
The surprisingly strong coreset effect is also robust across various data
selection methods, ranging from random selection to more sophisticated
heuristic approaches. We explain the coreset effect in LLM unlearning through a
keyword-based perspective, showing that keywords extracted from the forget set
alone contribute significantly to unlearning effectiveness and indicating that
current unlearning is driven by a compact set of high-impact tokens rather than
the entire dataset. We further justify the faithfulness of coreset-unlearned
models along additional dimensions, such as mode connectivity and robustness to
jailbreaking attacks. Codes are available at
https://github.com/OPTML-Group/MU-Coreset.
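The random-coreset setup is simple to reproduce; a sketch, with unlearn() left abstract since the effect is reported to hold across methods such as NPO and RMU.

import random

def random_coreset(forget_set, fraction=0.05, seed=0):
    # Select a small random subset of the forget set, e.g. 5% of it.
    rng = random.Random(seed)
    k = max(1, int(len(forget_set) * fraction))
    return rng.sample(forget_set, k)

# One would then compare unlearn(model, random_coreset(forget_set))
# against unlearn(model, forget_set) on the benchmark's unlearning metrics.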
♻ ☆ Automated Python Translation
Python is one of the most commonly used programming languages in industry and
education. Its English keywords and built-in functions/modules allow it to come
close to pseudo-code in terms of its readability and ease of writing. However,
those who do not speak English may not experience these advantages. In fact,
they may even be hindered in their ability to understand Python code, as the
English nature of its terms creates an additional layer of overhead. To that
end, we introduce the task of automatically translating Python's natural
modality (keywords, error types, identifiers, etc.) into other human languages.
This presents a unique challenge, considering the abbreviated nature of these
forms, as well as potential untranslatability of advanced
mathematical/programming concepts across languages. We therefore create an
automated pipeline to translate Python into other human languages, comparing
strategies using machine translation and large language models. We then use
this pipeline to acquire translations from five common Python libraries
(pytorch, pandas, tensorflow, numpy, and random) in seven languages, and do a
quality test on a subset of these terms in French, Greek, and Bengali. We hope
this will provide a clearer path forward towards creating a universal Python,
accessible to anyone regardless of nationality or language background.
comment: 15 pages, 4 figures, 17 tables
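The surface-substitution step of such a pipeline can be illustrated with the standard tokenize module, which leaves strings and program structure intact; the French mapping below is a toy example, whereas the paper derives its mappings via machine translation and LLMs.

import io
import tokenize

FR = {"if": "si", "else": "sinon", "for": "pour", "while": "tantque",
      "def": "definir", "return": "retourner", "print": "afficher"}

def translate_source(src: str, mapping=FR) -> str:
    # Replace only NAME tokens, so string literals and comments are untouched.
    toks = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        text = mapping.get(tok.string, tok.string) if tok.type == tokenize.NAME else tok.string
        toks.append((tok.type, text))
    return tokenize.untokenize(toks)

print(translate_source("def greet(name):\n    return 'hi ' + name\n"))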
♻ ☆ StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
Text-to-video diffusion models enable the generation of high-quality videos
that follow text instructions, making it easy to create diverse and individual
content. However, existing approaches mostly focus on high-quality short video
generation (typically 16 or 24 frames), ending up with hard-cuts when naively
extended to the case of long video synthesis. To overcome these limitations, we
introduce StreamingT2V, an autoregressive approach for long video generation of
80, 240, 600, 1200 or more frames with smooth transitions. The key components
are: (i) a short-term memory block called conditional attention module (CAM),
which conditions the current generation on the features extracted from the
previous chunk via an attentional mechanism, leading to consistent chunk
transitions, (ii) a long-term memory block called appearance preservation
module, which extracts high-level scene and object features from the first
video chunk to prevent the model from forgetting the initial scene, and (iii) a
randomized blending approach that enables applying a video enhancer
autoregressively to infinitely long videos without inconsistencies between
chunks. Experiments show that StreamingT2V generates videos with a high amount
of motion, whereas all competing image-to-video methods are prone to video
stagnation when applied naively in an autoregressive manner. StreamingT2V is
thus a high-quality, seamless text-to-long-video generator that outperforms
competitors in both consistency and motion. Our code will be available
at: https://github.com/Picsart-AI-Research/StreamingT2V
comment: https://github.com/Picsart-AI-Research/StreamingT2V
♻ ☆ Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings
Miguel Moura Ramos, Tomás Almeida, Daniel Vareta, Filipe Azevedo, Sweta Agrawal, Patrick Fernandes, André F. T. Martins
Reinforcement learning (RL) has been proven to be an effective and robust
method for training neural machine translation systems, especially when paired
with powerful reward models that accurately assess translation quality.
However, most research has focused on RL methods that use sentence-level
feedback, leading to inefficient learning signals due to the reward sparsity
problem -- the model receives a single score for the entire sentence. To
address this, we propose a novel approach that leverages fine-grained,
token-level quality assessments along with error severity levels using RL
methods. Specifically, we use xCOMET, a state-of-the-art quality estimation
system, as our token-level reward model. We conduct experiments on small and
large translation datasets with both standard encoder-decoder and large
language model-based machine translation systems, comparing the impact of
sentence-level versus fine-grained reward signals on translation quality. Our
results show that training with token-level rewards improves translation
quality across language pairs over baselines according to both automatic and
human evaluation. Furthermore, token-level reward optimization improves
training stability, evidenced by a steady increase in mean rewards over
training epochs.
comment: 12 pages, work-in-progress
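A sketch of the contrast between a single sentence-level scalar and dense token-level rewards, assuming span-level error annotations with severities in the style that a quality estimation system like xCOMET produces; the severity weights are illustrative.

SEVERITY = {"minor": -1.0, "major": -5.0, "critical": -10.0}

def token_rewards(tokens, error_spans):
    """error_spans: list of (start_idx, end_idx, severity) over token indices."""
    rewards = [0.0] * len(tokens)
    for start, end, sev in error_spans:
        for i in range(start, min(end, len(tokens))):
            rewards[i] += SEVERITY[sev]
    return rewards

tokens = "the cat sat on the mat".split()
spans = [(1, 2, "major")]            # suppose "cat" is a mistranslation
print(token_rewards(tokens, spans))  # dense signal: [0.0, -5.0, 0.0, 0.0, 0.0, 0.0]
# A sentence-level reward would instead collapse this to a single scalar (-5.0),
# leaving the model to guess which tokens caused the penalty.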
♻ ☆ Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation
Steffen Eger, Yong Cao, Jennifer D'Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, Tristan Miller
With the advent of large multimodal language models, science now stands at the
threshold of an AI-based technological transformation. Recently, a plethora of
new AI models and tools has been proposed, promising to empower researchers and
academics worldwide to conduct their research more effectively and efficiently.
This includes all aspects of the research cycle, especially (1) searching for
relevant literature; (2) generating research ideas and conducting
experimentation; generating (3) text-based and (4) multimodal content (e.g.,
scientific figures and diagrams); and (5) AI-based automatic peer review. In
this survey, we provide an in-depth overview of these exciting recent
developments, which promise to fundamentally alter the scientific research
process for good. Our survey covers the five aspects outlined above, indicating
relevant datasets, methods and results (including evaluation) as well as
limitations and scope for future research. Ethical concerns regarding
shortcomings of these tools and potential for misuse (fake science, plagiarism,
harms to research integrity) take a particularly prominent place in our
discussion. We hope that our survey will not only become a reference guide for
newcomers to the field but also a catalyst for new AI-based initiatives in the
area of "AI4Science".
comment: 44 pages, 7 figures, 8 tables
♻ ☆ Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation
Sequential Recommendation (SeqRec) aims to predict the next item by capturing
sequential patterns from users' historical interactions, playing a crucial role
in many real-world recommender systems. However, existing approaches
predominantly adopt a direct forward computation paradigm, where the final
hidden state of the sequence encoder serves as the user representation. We
argue that this inference paradigm, due to its limited computational depth,
struggles to model the complex evolving nature of user preferences and lacks a
nuanced understanding of long-tail items, leading to suboptimal performance. To
address this issue, we propose \textbf{ReaRec}, the first inference-time
computing framework for recommender systems, which enhances user
representations through implicit multi-step reasoning. Specifically, ReaRec
autoregressively feeds the sequence's last hidden state into the sequential
recommender while incorporating special reasoning position embeddings to
decouple the original item encoding space from the multi-step reasoning space.
Moreover, we introduce two lightweight reasoning-based learning methods,
Ensemble Reasoning Learning (ERL) and Progressive Reasoning Learning (PRL), to
further effectively exploit ReaRec's reasoning potential. Extensive experiments
on five public real-world datasets and different SeqRec architectures
demonstrate the generality and effectiveness of our proposed ReaRec.
Remarkably, post-hoc analyses reveal that ReaRec significantly elevates the
performance ceiling of multiple sequential recommendation backbones by
approximately 30\%-50\%. Thus, we believe this work can open a new and
promising avenue for future research in inference-time computing for sequential
recommendation.
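A minimal sketch of the implicit multi-step reasoning loop described above, assuming a generic sequence encoder that maps (batch, length, dim) to hidden states of the same shape; module names and shapes here are ours, not the paper's.

import torch
import torch.nn as nn

class ReasoningHead(nn.Module):
    def __init__(self, encoder: nn.Module, d_model: int, k_steps: int = 3):
        super().__init__()
        self.encoder = encoder                        # any (B, L, D) -> (B, L, D) encoder
        self.reason_pos = nn.Embedding(k_steps, d_model)
        self.k_steps = k_steps

    def forward(self, item_emb: torch.Tensor) -> torch.Tensor:
        # item_emb: (batch, seq_len, d_model) sequence of item embeddings.
        h = self.encoder(item_emb)[:, -1:, :]         # last hidden state
        seq = item_emb
        for k in range(self.k_steps):
            # Reasoning position embeddings decouple the reasoning space
            # from the original item encoding space.
            step = h + self.reason_pos.weight[k]
            seq = torch.cat([seq, step], dim=1)       # autoregressively re-feed
            h = self.encoder(seq)[:, -1:, :]
        return h.squeeze(1)                           # final user representation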
♻ ☆ Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
We introduce Orthus, an autoregressive (AR) transformer that excels in
generating images given textual prompts, answering questions based on visual
inputs, and even crafting lengthy image-text interleaved contents. Unlike prior
arts on unified multimodal modeling, Orthus simultaneously copes with discrete
text tokens and continuous image features under the AR modeling principle. The
continuous treatment of visual signals minimizes the information loss for both
image understanding and generation while the fully AR formulation renders the
characterization of the correlation between modalities straightforward. The key
mechanism enabling Orthus to leverage these advantages lies in its
modality-specific heads -- one regular language modeling (LM) head predicts
discrete text tokens and one diffusion head generates continuous image features
conditioned on the output of the backbone. We devise an efficient strategy for
building Orthus -- by substituting the Vector Quantization (VQ) operation in
the existing unified AR model with a soft alternative, introducing a diffusion
head, and tuning the added modules to reconstruct images, we can create an
Orthus-base model effortlessly (e.g., within a mere 72 A100 GPU hours).
Orthus-base can further embrace post-training to better model interleaved
images and texts. Empirically, Orthus surpasses competing baselines including
Show-o and Chameleon across standard benchmarks, achieving a GenEval score of
0.58 and an MME-P score of 1265.8 using 7B parameters. Orthus also shows
exceptional mixed-modality generation capabilities, reflecting the potential
for handling intricate practical generation tasks.
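The modality-specific-head mechanism can be sketched as a shared autoregressive backbone with two output heads; the diffusion head is reduced here to a conditioning MLP stub, so this illustrates only the dispatch, not the diffusion process itself.

import torch
import torch.nn as nn

class TwoHeadAR(nn.Module):
    def __init__(self, backbone: nn.Module, d_model: int, vocab: int, img_dim: int):
        super().__init__()
        self.backbone = backbone
        self.lm_head = nn.Linear(d_model, vocab)      # predicts discrete text tokens
        self.diff_head = nn.Sequential(               # stand-in for a diffusion head
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, img_dim)
        )

    def forward(self, hidden: torch.Tensor, modality: str) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) mixed-modality input features.
        h = self.backbone(hidden)
        last = h[:, -1, :]
        # Route the backbone output to the head matching the next token's modality.
        return self.lm_head(last) if modality == "text" else self.diff_head(last)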
♻ ☆ Local Grammar-Based Coding Revisited
In the setting of minimal local grammar-based coding, the input string is
represented as a grammar with the minimal output length defined via simple
symbol-by-symbol encoding. This paper discusses four contributions to this
field. First, we invoke a simple harmonic bound on ranked probabilities, which
is reminiscent of Zipf's law and simplifies universality proofs for minimal local
grammar-based codes. Second, we refine known bounds on the vocabulary size,
showing its partial power-law equivalence with mutual information and
redundancy. These bounds are relevant for linking Zipf's law with the neural
scaling law for large language models. Third, we develop a framework for
universal codes with fixed infinite vocabularies, recasting universal coding as
matching ranked patterns that are independent of empirical data. Finally, we
analyze grammar-based codes whose finite vocabularies are empirical rank
lists, proving that such codes are also universal. These results extend
foundations of universal grammar-based coding and reaffirm previously stated
connections to power laws for human language and language models.
comment: 41 pages, no figures
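The simplest harmonic bound of this kind follows from monotonicity alone (stated here on our own account; the paper's actual bound may be sharper): if $p_1 \ge p_2 \ge \dots$ denote probabilities sorted in decreasing order, then for every rank $k$,

$$k\,p_k \;\le\; \sum_{i=1}^{k} p_i \;\le\; 1 \quad\Longrightarrow\quad p_k \;\le\; \frac{1}{k},$$

so ranked probabilities are bounded above by the $1/k$ envelope posited by Zipf's law, which is what makes such bounds useful in universality proofs.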
♻ ☆ Natural Language Outlines for Code: Literate Programming in the LLM Era
Kensen Shi, Deniz Altınbüken, Saswat Anand, Mihai Christodorescu, Katja Grünwedel, Alexa Koenings, Sai Naidu, Anurag Pathak, Marc Rasi, Fredde Ribeiro, Brandon Ruffin, Siddhant Sanyam, Maxim Tabachnyk, Sara Toth, Roy Tu, Tobias Welp, Pengcheng Yin, Manzil Zaheer, Satish Chandra, Charles Sutton
We propose using natural language outlines as a novel modality and
interaction surface for providing AI assistance to developers throughout the
software development process. An NL outline for a code function comprises
multiple statements written in concise prose, which partition the code and
summarize its main ideas in the style of literate programming. Crucially, we
find that modern LLMs can generate accurate and high-quality NL outlines in
practice. Moreover, NL outlines enable a bidirectional sync between code and
NL: a developer can change one and the LLM automatically updates the other. We
discuss many use cases for NL outlines: they can accelerate understanding and
navigation of code and diffs, simplify code maintenance, augment code search,
steer code generation, and more. We then propose and compare multiple LLM
prompting techniques for generating outlines and ask professional developers to
judge outline quality. Finally, we present two case studies applying NL
outlines toward code review and malware detection.
comment: Accepted to FSE'25 Industry Track
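An illustrative NL outline (ours, not from the paper): concise prose statements that partition a function and summarize each section, here rendered as comments so that editing either side could drive the bidirectional code/NL sync described above.

def deduplicate_records(records):
    # Normalize keys so near-duplicate entries compare equal.
    normalized = [(r["email"].strip().lower(), r) for r in records]
    # Keep the first occurrence of each normalized key, preserving order.
    seen, result = set(), []
    for key, rec in normalized:
        if key not in seen:
            seen.add(key)
            result.append(rec)
    # Report how many duplicates were dropped.
    print(f"removed {len(records) - len(result)} duplicates")
    return result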
♻ ☆ Leveraging Social Determinants of Health in Alzheimer's Research Using LLM-Augmented Literature Mining and Knowledge Graphs
Tianqi Shang, Shu Yang, Weiqing He, Tianhua Zhai, Dawei Li, Bojian Hou, Tianlong Chen, Jason H. Moore, Marylyn D. Ritchie, Li Shen
Growing evidence suggests that social determinants of health (SDoH), a set of
nonmedical factors, affect individuals' risks of developing Alzheimer's disease
(AD) and related dementias. Nevertheless, the etiological mechanisms underlying
such relationships remain largely unclear, mainly due to difficulties in
collecting relevant information. This study presents a novel, automated
framework that leverages recent advancements of large language model (LLM) and
natural language processing techniques to mine SDoH knowledge from extensive
literature and integrate it with AD-related biological entities extracted from
the general-purpose knowledge graph PrimeKG. Utilizing graph neural networks,
we performed link prediction tasks to evaluate the resultant SDoH-augmented
knowledge graph. Our framework shows promise for enhancing knowledge discovery
in AD and can be generalized to other SDoH-related research areas, offering a
new tool for exploring the impact of social determinants on health outcomes.
Our code is available at: https://github.com/hwq0726/SDoHenPKG
comment: Accepted by AMIA-IS'25: AMIA Informatics Summit
♻ ☆ Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models
Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen
Predicting phenotypes with complex genetic bases based on a small,
interpretable set of variant features remains a challenging task.
Conventionally, data-driven approaches are utilized for this task, yet the high
dimensional nature of genotype data makes the analysis and prediction
difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and
their success in processing complex biomedical concepts, we set out to examine the
ability of LLMs in feature selection and engineering for tabular genotype data,
with a novel knowledge-driven framework. We develop FREEFORM, Free-flow
Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling,
designed with chain-of-thought and ensembling principles, to select and
engineer features with the intrinsic knowledge of LLMs. Evaluated on two
distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing
loss, we find this framework outperforms several data-driven methods,
particularly in low-shot regimes. FREEFORM is available as an open-source
framework on GitHub: https://github.com/PennShenLab/FREEFORM.
comment: accepted by AMIA-IS'25: AMIA Informatics Summit [Marco Ramoni
Distinguished Paper Award for Translational Bioinformatics]
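The ensembling principle is easy to sketch: sample several LLM feature-selection runs and keep features chosen by a majority. The function ask_llm_for_features below is a hypothetical stand-in for a chain-of-thought-prompted selection call, not the framework's API.

from collections import Counter

def ensemble_select(ask_llm_for_features, candidates, n_samples=5, min_votes=3):
    votes = Counter()
    for _ in range(n_samples):
        # Each call returns a subset of `candidates` (e.g., variant identifiers),
        # reasoned over with chain-of-thought prompting.
        for feat in ask_llm_for_features(candidates):
            votes[feat] += 1
    # Majority vote across samples stabilizes the stochastic LLM selections.
    return [f for f, v in votes.items() if v >= min_votes]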
♻ ☆ Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution
Chenghao Li, Chaoning Zhang, Yi Lu, Jiaquan Zhang, Qigan Sun, Xudong Wang, Jiwei Wei, Guoqing Wang, Yang Yang, Heng Tao Shen
Chain-of-Thought (CoT) prompting enhances the reasoning of large language
models (LLMs) by decomposing problems into sequential steps, mimicking human
logic and reducing errors. However, complex tasks with vast solution spaces and
vague constraints often exceed the capacity of a single reasoning chain.
Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic
geometry, we propose Syzygy of Thoughts (SoT)-a novel framework that extends
CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper
logical dependencies, enabling more robust and structured problem-solving. MFR
decomposes a module into a sequence of free modules with minimal rank,
providing a structured analytical approach to complex systems. This method
introduces the concepts of "Module", "Betti numbers", "Freeness", "Mapping",
"Exactness" and "Minimality", enabling the systematic decomposition of the
original complex problem into logically complete minimal subproblems while
preserving key problem features and reducing reasoning length. We tested SoT
across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini,
Qwen2.5), achieving inference accuracy that matches or surpasses mainstream
CoT standards. Additionally, by aligning the sampling process with algebraic
constraints, our approach enhances the scalability of inference time in LLMs,
ensuring both transparent reasoning and high performance. Our code will be
publicly available at https://github.com/dlMARiA/Syzygy-of-thoughts.
♻ ☆ What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Niklas Muennighoff, Irwin King, Xue Liu, Chen Ma
As enthusiasm for scaling computation (data and parameters) in the
pretraining era gradually diminished, test-time scaling (TTS), also referred to
as ``test-time computing'', has emerged as a prominent research focus. Recent
studies demonstrate that TTS can further elicit the problem-solving
capabilities of large language models (LLMs), enabling significant
breakthroughs not only in specialized reasoning tasks, such as mathematics and
coding, but also in general tasks like open-ended Q&A. However, despite the
explosion of recent efforts in this area, there remains an urgent need for a
comprehensive survey offering a systematic understanding. To fill this gap, we
propose a unified, multidimensional framework structured along four core
dimensions of TTS research: what to scale, how to scale, where to scale, and
how well to scale. Building upon this taxonomy, we conduct an extensive review
of methods, application scenarios, and assessment aspects, and present an
organized decomposition that highlights the unique functional roles of
individual techniques within the broader TTS landscape. From this analysis, we
distill the major developmental trajectories of TTS to date and offer hands-on
guidelines for practical deployment. Furthermore, we identify several open
challenges and offer insights into promising future directions, including
further scaling, clarifying the functional essence of techniques, generalizing
to more tasks, and more attributions. Our repository is available on
https://github.com/testtimescaling/testtimescaling.github.io/
comment: v2: Creating the GitHub repository, Citing some missed works,
Incorporating two new domains (agentic and evaluation) in where to scale,
Incorporating one direction (thoughtology research) in challenge and future
work
♻ ☆ Sequence-Level Leakage Risk of Training Data in Large Language Models
This work quantifies the risk of training data leakage from LLMs (Large
Language Models) using sequence-level probabilities. Computing extraction
probabilities for individual sequences provides finer-grained information than
has been studied in prior benchmarking work. We re-analyze the effects of
decoding schemes, model sizes, prefix lengths, partial sequence leakages, and
token positions to uncover new insights that were not possible in previous
works due to their choice of metrics. We perform this study on two pre-trained
models, Llama and OPT, trained on the Common Crawl and The Pile respectively.
We discover that 1) Extraction Rate, the predominant metric used in prior
quantification work, underestimates the threat of leakage of training data in
randomized LLMs by as much as 2.14X. 2) Although on average, larger models and
longer prefixes can extract more data, this is not true for a substantial
portion of individual sequences. 30.4-41.5% of our sequences are easier to
extract with either shorter prefixes or smaller models. 3) Contrary to previous
beliefs, partial leakage in commonly used decoding schemes like top-k and top-p
is not easier than leaking verbatim training data. The aim of this work is to
encourage the adoption of this metric for future work on quantification of
training data extraction.
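The underlying quantity is straightforward: under a randomized decoding scheme, the probability of emitting a target suffix verbatim given a prefix is the product of the per-token sampling probabilities. A sketch, with the model interface left abstract:

import math

def sequence_extraction_logprob(token_logprob, prefix, suffix):
    """token_logprob(context, token) -> log P(token | context) under the
    decoding scheme (i.e., after top-k/top-p renormalization)."""
    context = list(prefix)
    total = 0.0
    for tok in suffix:
        total += token_logprob(context, tok)
        context.append(tok)
    return total  # log-probability that the whole suffix is emitted verbatim

# P(extraction) = math.exp(total). Aggregating only a binary "extracted or not"
# outcome, as the Extraction Rate does, discards exactly this per-sequence
# fine-grained information.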
♻ ☆ ChaosEater: Fully Automating Chaos Engineering with Large Language Models
Chaos Engineering (CE) is an engineering technique aimed at improving the
resiliency of distributed systems. It involves artificially injecting specific
failures into a distributed system and observing its behavior in response.
Based on the observation, the system can be proactively improved to handle
those failures. Recent CE tools implement the automated execution of predefined
CE experiments. However, defining these experiments and improving the system
based on the experimental results still remain manual. To reduce the costs of
the manual operations, we propose ChaosEater, a system for automating the
entire CE process with Large Language Models (LLMs). It predefines the
agentic workflow according to a systematic CE cycle and assigns subdivided
operations within the workflow to LLMs. ChaosEater targets CE for Kubernetes
systems, which are managed through code (i.e., Infrastructure as Code).
Therefore, the LLMs in ChaosEater perform software engineering tasks to
complete CE cycles, including requirement definition, code generation,
debugging, and testing. We evaluate ChaosEater through case studies on both
small and large Kubernetes systems. The results demonstrate that it stably
completes reasonable single CE cycles with significantly low time and monetary
costs. The CE cycles are also qualitatively validated by human engineers and
LLMs.
comment: 114 pages (7 main), 11 figures. Project page:
https://ntt-dkiku.github.io/chaos-eater
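A hedged sketch of one such cycle with subdivided operations assigned to an LLM; llm is a hypothetical completion function and the phase prompts are ours, not ChaosEater's.

def ce_cycle(llm, system_description: str) -> dict:
    # Requirement definition: what should stay true under failure.
    hypothesis = llm(f"Define a steady-state hypothesis for: {system_description}")
    # Code generation: a fault-injection experiment as code.
    experiment = llm(f"Write a fault-injection manifest testing: {hypothesis}")
    # Testing: what to observe and what counts as passing.
    observation = llm(f"Given manifest {experiment!r}, list metrics to watch and pass criteria")
    # Improvement: patch the Infrastructure-as-Code if the hypothesis fails.
    fix = llm(f"If the hypothesis {hypothesis!r} is violated, propose an IaC patch")
    return {"hypothesis": hypothesis, "experiment": experiment,
            "observation": observation, "improvement": fix}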
♻ ☆ UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Recent advancements in Large Vision-Language Models are accelerating the
development of Graphical User Interface (GUI) agents that utilize human-like
vision perception capabilities to enhance productivity on digital devices.
Compared to approaches predicated on GUI metadata, which are platform-dependent
and vulnerable to implementation variations, vision-based approaches offer
broader applicability. In this vision-based paradigm, GUI instruction
grounding, which maps a user instruction to the location of the corresponding
element on a given screenshot, remains a critical challenge, particularly due
to limited public training datasets and the resource-intensive nature of
manual instruction annotation. In this paper, we delve into unexplored
challenges in this task, including element-to-screen ratio, unbalanced element
types, and implicit instructions. To address these challenges, we introduce
UI-E2I-Synth, a large-scale data synthesis pipeline for generating instruction
datasets of varying complexity using GPT-4o instead of human annotators.
Furthermore, we propose a
new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to
address the limitations of existing benchmarks by incorporating diverse
annotation aspects. Our model, trained on the synthesized data, achieves
superior performance in GUI instruction grounding, demonstrating the
advancements of proposed data synthesis pipeline. The proposed benchmark,
accompanied by extensive analyses, provides practical insights for future
research in GUI grounding. We will release corresponding artifacts at
https://colmon46.github.io/i2e-bench-leaderboard/ .
♻ ☆ KPC-cF: Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering ICML 2024
Investigations into Aspect-Based Sentiment Analysis (ABSA) for Korean
industrial reviews are notably lacking in the existing literature. Our research
proposes an intuitive and effective framework for ABSA in low-resource
languages such as Korean. It optimizes prediction labels by integrating
translated benchmark and unlabeled Korean data. Using a model fine-tuned on
translated data, we pseudo-labeled the actual Korean NLI set. Subsequently, we
applied LaBSE- and MSP-based filtering to this pseudo-NLI set as an implicit
feature, enhancing Aspect Category Detection and Polarity determination through
additional training. Incorporating dual filtering, this model bridges dataset
gaps and facilitates feature alignment with minimal resources. By implementing
alignment pipelines, our approach aims to leverage high-resource datasets to
develop reliable predictive and refined models within corporate or individual
communities in low-resource language countries. Compared to English ABSA, our
framework showed an approximately 3\% difference in F1 scores and accuracy. We
will release our dataset and code for Korean ABSA at this link.
comment: Work in Progress, DMLR@ICML 2024
♻ ☆ Exploring the Role of Knowledge Graph-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs
Large language models (LLMs) perform well in medical QA, but their
effectiveness in Japanese contexts is limited due to privacy constraints that
prevent the use of commercial models like GPT-4 in clinical settings. As a
result, recent efforts focus on instruction-tuning open-source LLMs, though the
potential of combining them with retrieval-augmented generation (RAG) remains
underexplored. To bridge this gap, we are the first to explore a knowledge
graph-based (KG) RAG framework for Japanese medical QA with small-scale open-source
LLMs. Experimental results show that KG-based RAG has only a limited impact on
Japanese medical QA using small-scale open-source LLMs. Further case studies
reveal that the effectiveness of the RAG is sensitive to the quality and
relevance of the external retrieved content. These findings offer valuable
insights into the challenges and potential of applying RAG in Japanese medical
QA, while also serving as a reference for other low-resource languages.
comment: 10 pages
♻ ☆ Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection
Large visual-language models (LVLMs) exhibit exceptional performance in
visual-language reasoning across diverse cross-modal benchmarks. Despite these
advances, recent research indicates that Large Language Models (LLMs), like
GPT-3.5-turbo, underachieve compared to well-trained smaller models, such as
BERT, in Fake News Detection (FND), prompting inquiries into LVLMs' efficacy in
FND tasks. Although performance could improve through fine-tuning LVLMs, the
substantial parameters and requisite pre-trained weights render it a
resource-heavy endeavor for FND applications. This paper initially assesses the
FND capabilities of two notable LVLMs, CogVLM and GPT4V, in comparison to a
smaller yet adeptly trained CLIP model in a zero-shot context. The findings
demonstrate that LVLMs can attain performance competitive with that of the
smaller model. Next, we integrate standard in-context learning (ICL) with
LVLMs, noting improvements in FND performance, though limited in scope and
consistency. To address this, we introduce the \textbf{I}n-context
\textbf{M}ultimodal \textbf{F}ake \textbf{N}ews \textbf{D}etection (IMFND)
framework, enriching in-context examples and test inputs with predictions and
corresponding probabilities from a well-trained smaller model. This strategic
integration directs the LVLMs' focus towards news segments associated with
higher probabilities, thereby improving their analytical accuracy. The
experimental results suggest that the IMFND framework significantly boosts the
FND efficiency of LVLMs, achieving enhanced accuracy over the standard ICL
approach across three publicly available FND datasets.
comment: Withdrawn for new experiments
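The prompt-construction step of such a framework is easy to sketch; the template below is our illustration of enriching in-context examples and the test input with a smaller model's predictions and probabilities, not the paper's exact format.

def build_imfnd_prompt(examples, test_item, small_model_predict):
    """small_model_predict(text, image) -> (label, probability),
    e.g., from a well-trained CLIP-based classifier."""
    lines = []
    for ex in examples:
        label, prob = small_model_predict(ex["text"], ex["image"])
        lines.append(f"News: {ex['text']}\n"
                     f"Auxiliary model: {label} (p={prob:.2f})\n"
                     f"Answer: {ex['gold']}")
    # The test input gets the same enrichment, steering the LVLM's attention
    # toward segments the auxiliary model scores with high probability.
    label, prob = small_model_predict(test_item["text"], test_item["image"])
    lines.append(f"News: {test_item['text']}\n"
                 f"Auxiliary model: {label} (p={prob:.2f})\n"
                 f"Answer:")
    return "\n\n".join(lines)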