Monday, 13 October 2025 @ 13:30–14:30 CEST
On-site:
University of Vienna
Seminarraum 8 (OG01)
Kolingasse 14–16
1090 Vienna
Online:
https://univienna.zoom.us/j/67032386717?pwd=g8HOG2oRrWK6T5cvmRA7bv17QRzq72.1
Meeting ID: 670 3238 6717
Passcode: 440328
Evaluating Coding Agents for Data Science and Machine Learning Research
Abstract:
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress in developing agents that can perform parts of the research pipeline in data science, machine learning, and the natural sciences. However, the ability of these agents to reliably produce code that yields accurate research results has not yet been adequately assessed.
In this talk, I will introduce a new benchmark called "REXBench" that evaluates the ability of LLM-based coding agents to autonomously implement novel research extensions. I will argue that research extensions are an ideal testing ground for evaluating such agents and explain how our benchmark circumvents common data contamination issues. I will also present results from evaluating nine recent LLM-based agents and discuss their implications for using LLM agents to write research code.
Bio:
Sebastian Schuster is an assistant professor at the Faculty of Computer Science at the University of Vienna, where he heads a WWTF-funded Vienna Research Group on natural language processing. His research focuses on evaluating large language models, developing sophisticated natural language understanding models, and using machine learning models to uncover the processes involved in human language processing. Before returning to Vienna this year, he was an assistant professor at University College London and a postdoc at New York University and Saarland University. He holds a PhD in computational linguistics and an MS in computer science from Stanford University, and a BSc in computer science from the University of Vienna.
