A comparative test of 11 language models showed that structured outputs improve working with code but still do not guarantee reliable results without human supervision.

* Even with structured outputs such as JSON, XML, and Markdown, large language models still struggle to maintain accuracy and consistency, especially in visual tasks such as websites, images, and video.
New research from the University of Waterloo shows that artificial intelligence still struggles with some of the most basic tasks of software development, raising questions about how reliably such systems can assist developers. As large language models become increasingly integrated into software development, developers face growing pressure to ensure that AI-generated answers are accurate, consistent, and easy to integrate into broader workflows.
The study, "StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs," was published in the journal Transactions on Machine Learning Research, and presented at ICLR 2026.
In the past, large language models responded to software-development prompts with free-form natural-language text, which is difficult for other programs to parse and process reliably. To address this problem, several AI companies, including OpenAI, Google, and Anthropic, have introduced “structured outputs.” These constrain language models to predefined formats, such as JSON, XML, or Markdown, making it easier for both humans and software systems to read and process the output, as the sketch below illustrates.
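As a concrete illustration (this example is not from the study; the schema and the model reply are invented), the core idea is that the developer specifies a machine-checkable format and the model's reply either conforms or fails. The minimal Python sketch below uses the widely available jsonschema package to perform that check:

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema a developer might attach to a structured-output request:
# the model must return a bug report with exactly these typed fields.
BUG_REPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        "steps_to_reproduce": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "severity", "steps_to_reproduce"],
    "additionalProperties": False,
}

# Stand-in for a model reply; in practice this string would come from an LLM API.
model_reply = (
    '{"title": "Crash on save", "severity": "high",'
    ' "steps_to_reproduce": ["Open a file", "Press Ctrl+S"]}'
)

try:
    data = json.loads(model_reply)  # syntactic check: is it valid JSON at all?
    validate(instance=data, schema=BUG_REPORT_SCHEMA)  # structural check: does it fit the schema?
    print("Reply conforms to the schema:", data["title"])
except (json.JSONDecodeError, ValidationError) as err:
    print("Reply violated the expected structure:", err)
```

The same pattern applies to XML or Markdown outputs, with the validator swapped accordingly.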
However, the new comparative study from Waterloo shows that this technology is still not as reliable as many developers had hoped. Even the most advanced models achieved only about 75% accuracy in the tests, while open-source models scored around 65%.
The study examined 11 large language models across 18 structured output formats and 44 tasks designed to test how closely the systems adhere to defined structural rules.
"In this type of research, we want to measure not only the syntax of the code, that is, whether it actually obeys the established rules, but also whether the outputs generated for various tasks are accurate," said Dongfu Jiang, a doctoral student in computer science and one of the two lead authors of the study.
"We found that while the models do quite well on text-related tasks, they have a very hard time on tasks that involve creating images, video, or websites."
The study was a collaborative effort involving University of Waterloo undergraduate student Jialin Yang and Dr. Wenhu Chen, assistant professor of computer science. It also drew on annotation and quality-control work by 17 other researchers from Waterloo and around the world.
“We’ve had quite a few similar benchmarking projects in our labs recently,” Chen said. “At Waterloo, students sometimes start as annotators, then move on to organizing projects, and finally design their own benchmarking studies. They’re not just using AI in their research; they’re also building it, exploring it, and evaluating it.”
The researchers say that while structured outputs from large language models are an intriguing step for software development, the systems are not yet reliable enough to operate without human supervision.
“Developers may have such agents working for them, but they will still need significant human oversight,” Jiang said.
Link to the scientific article: https://arxiv.org/abs/2505.20139
2 comments
"Core tasks"? Who defines what "core tasks" are?
Everyone has their own perspectives. For programmers, for example, the creation of images and videos is much less important, and simple and medium-sized tasks in the software are performed relatively reliably, and if not for a slight correction at the prompts, everything is fine.
The title is a bit confusing.
Was it that difficult to reach 75%?
Because reaching 92.5% will be even harder.
And below 98%, the quality is probably too poor to rely on.
One more prediction, with a twist: progress is exponential. If it took 10 years to reach 75%, it will take only 5 to reach 92.5%.