Microsoft study claims AI is still struggling to debug software

AI promises a huge revolution for developers, but is it just for code creation?
Popular AI models from Anthropic and OpenAI aren’t great at debugging
Microsoft’s researchers are open-sourcing their tools to facilitate research

Although generative AI is increasingly being integrated into programming workflows, new research from Microsoft claims some large language models still aren’t quite up to scratch when it comes to debugging.

The research suggests that even some advanced models still struggle with debugging tasks that are fairly simple for experienced developers, highlighting the continued importance of human programmers.

AI does appear to have a solid use case, though, with Google now claiming that around 25% of new code is AI-generated. Meta has also noted the wide deployment of AI for coding.

AI is good for code creation, but not for debugging

The report explores how 11 Microsoft researchers tested nine AI models on SWE-bench Lite – a popular debugging benchmark. Claude 3.7 Sonnet offered the highest success rate at a far-from-perfect 48.4%. OpenAI’s o1 and o3-mini posted lower success rates of 30.2% and 22.1% respectively.

“Even with debugging tools, our simple prompt-based agent rarely solves more than half of the SWE-bench Lite issues,” the researchers wrote, blaming the suboptimal performance on a lack of data representing sequential decision-making behavior.

All hope is not lost, though. “We believe that training or fine-tuning LLMs can enhance their interactive debugging abilities,” they added.

The researchers intend to fine-tune an info-seeking model that specializes in gathering the information needed to resolve bugs. In the meantime, they promise to open-source debug-gym to make it easier for others to conduct similar research.

Debug-gym is described as an “environment that allows code-repairing agents to access tools for active information-seeking behavior.”

However, for now, artificial intelligence might not be bringing as much value to developers’ lives as AI companies suggest.

“Most developers spend the majority of their time debugging code,” the researchers wrote, suggesting that even if developers are benefiting from code generation, it might not be saving them much time.

With several years’ experience freelancing in tech and automotive circles, Craig’s specific interests lie in technology that is designed to better our lives, including AI and ML, productivity aids, and smart fitness. He is also passionate about cars and the decarbonisation of personal transportation. Craig is an avid bargain-hunter, so you can be sure that any deal he finds is top value!
