Flutterby™! : Beyond Code Snippets 2026-04-22 03:01:26.485033+02

Beyond Code Snippets

2026-04-22 03:01:26.485033+02 by Dan Lyke 0 comments

Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering March 2026

Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository- scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning.

DOI:10.48550/arXiv.2603.26567

Via

Comment policy

We will not edit your comments. However, we may delete your comments, or cause them to be hidden behind another link, if we feel they detract from the conversation. Commercial plugs are fine, if they are relevant to the conversation, and if you don't try to pretend to be a consumer. Annoying endorsements will be deleted if you're lucky, if you're not a whole bunch of people smarter and more articulate than you will ridicule you, and we will leave such ridicule in place.

Flutterby™ is a trademark claimed by