
SusBench: An Online Benchmark for Evaluating Dark Pattern Susceptibility of Computer-Use Agents

Benchmarking dark pattern susceptibility of computer-use agents in realistic UI environments.


Asking the Missing Piece: Context-Driven Clarification for Ambiguous VQA

Context-driven clarification strategies for ambiguous VQA, presented at a reasoning-focused NeurIPS workshop.


Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Benchmarking LVLM robustness to spurious correlations and studying generalization beyond the SpuriVerse.


MARVEL: Modular Abstention for Reliable and Versatile Expert LLMs

A modular abstention framework for reliable expert LLMs that enables selective abstention from uncertain questions.


Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs

Exploring psychological insights to address overconfidence in LLMs by comparing with human confidence patterns.


AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

Automatic prediction of compute-optimal data composition for efficient LLM training.


Mitigating Overconfidence in Large Language Models: A Behavioral Lens on Confidence Estimation and Calibration

Behavioral analysis and mitigation strategies for overconfidence in large language models.


Characterizing LLM Abstention Behavior in Science QA with Context Perturbations

Characterizing LLM abstention behavior in science QA with context perturbations.


Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings

Demonstrating that open-weight models can match ChatGPT in low-resource, laboratory-scale AI deployments.


OmniMotionGPT: Animal Motion Generation with Limited Data

A generative model for animal motion synthesis under limited data constraints.
