Evaluation

SusBench: An Online Benchmark for Evaluating Dark Pattern Susceptibility of Computer-Use Agents

Benchmarking dark pattern susceptibility of computer-use agents in realistic UI environments.

MMMG: A Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

Evaluation suite for diverse multitask multimodal generation with large multimodal models.

ExpScore: Learning metrics for recommendation explanation

Learning metrics for evaluating recommendation explanations.
