MobileDev-Bench: A Comprehensive Benchmark for Evaluating Large Language Models on Mobile Application Development

Under Review (EMNLP), 2026

Mobile software development poses unique challenges for LLM-based code agents: multi-platform codebases, platform-specific APIs, and complex build toolchains. This paper introduces MobileDev-Bench, a benchmark for evaluating LLMs on realistic mobile app issue resolution.

Abstract

We present MobileDev-Bench, a benchmark of 384 real-world issue-resolution tasks sourced from 18 production mobile applications spanning Android Native, React Native, and Flutter. Each task is paired with a compilation-aware automated evaluation pipeline that verifies whether a proposed fix produces a buildable artifact. We evaluate four frontier LLMs — GPT, Claude, Gemini, and Qwen — and find that all four achieve resolution rates of only 3.39–5.21%, exposing a critical and systematic gap in LLM capability for mobile software engineering. Our analysis reveals consistent failure modes in fault localization across multi-file, multi-artifact changes.

Key Contributions

  • A benchmark of 384 real-world issue-resolution tasks across 18 production mobile apps (Android, React Native, Flutter)
  • A compilation-aware automated evaluation pipeline for assessing LLM-generated fixes
  • Empirical evaluation of four frontier LLMs (GPT, Claude, Gemini, Qwen) revealing 3.39–5.21% resolution rates
  • Systematic analysis of failure modes in fault localization for multi-file, multi-artifact mobile changes
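The core of a compilation-aware check like the one described above can be sketched in a few lines. The names and build commands below are illustrative assumptions, not the paper's actual implementation: each platform is mapped to its standard build invocation, and a fix is deemed "buildable" if that invocation exits with status 0 in the patched repository.

```python
import subprocess
from pathlib import Path

# Illustrative platform-to-build-command mapping (an assumption, not from
# the paper). React Native Android builds also go through Gradle, so it
# shares the Gradle invocation here.
BUILD_COMMANDS = {
    "android": ["./gradlew", "assembleDebug"],
    "react-native": ["./gradlew", "assembleDebug"],
    "flutter": ["flutter", "build", "apk", "--debug"],
}

def build_command(platform: str) -> list[str]:
    """Return the build invocation for a supported platform."""
    try:
        return BUILD_COMMANDS[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform}")

def is_buildable(repo: Path, platform: str) -> bool:
    """Run the platform's build inside `repo` with the candidate fix
    already applied; exit status 0 means the fix produced a buildable
    artifact."""
    result = subprocess.run(
        build_command(platform), cwd=repo, capture_output=True
    )
    return result.returncode == 0
```

In a full pipeline this check would sit after patch application and before any functional testing, since a fix that does not compile can be rejected cheaply without running the app.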

Available on arXiv.