A new study from the Massachusetts Institute of Technology found label errors in 10 of the most-cited artificial intelligence test datasets. Researchers estimated an average error rate of 3.4% across the datasets, cautioning that this could destabilize machine learning benchmarks.

“Researchers rely on benchmark test datasets to assess and measure progress in the state-of-the-art and to confirm theoretical findings,” wrote the study authors.

“If label errors occurred profusely, they could sabotage the framework by which we measure progress in machine learning,” they continued.


As MIT Technology Review senior AI reporter Karen Hao noted in a write-up about the study, researchers use a core set of datasets to evaluate ML models and track AI capabilities over time.

There are known issues with a number of these sets, Hao wrote, such as racist and sexist labels. However, the new study finds that a number of the labels are simply wrong as well.

For instance, researchers found that a photo of a frog in CIFAR-10, an image dataset, was erroneously labeled as a cat. In the commonly used ImageNet validation set, a lion was labeled as a patas monkey, a dock was labeled as a paper towel, and giant pandas were repeatedly labeled as red pandas.

And in QuickDraw, a collection of 50 million drawings across 345 categories, an eye was labeled as a tiger, a lightbulb was labeled as a tiger, and an apple was labeled as a T-shirt.

In total, researchers found 2,916 label errors in the ImageNet validation set, or 6%, and estimated an error rate of more than 10% in QuickDraw.

“Traditionally, ML professionals choose which model to deploy based on test accuracy; our findings advise caution here, proposing that judging models over properly labeled test sets might be more useful, especially for noisy real-world datasets,” read the study. Researchers noted that after the labels were corrected, the models that had not performed as well on the incorrect labels were among the best performers.

“Specifically, the simpler models appeared to fare better on the corrected data than the more complicated models that are used by technology giants like Google for image recognition and presumed to be the best in the field,” wrote Hao.

“Quite simply, we may have an inflated sense of how great these complicated models are due to faulty testing data.”
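
To illustrate how corrected labels can reorder models, here is a minimal Python sketch. It is not the researchers' method or data: the ten-item test set, the two sets of predictions, and the names `complex_preds` and `simple_preds` are all hypothetical, constructed so that the "complex" model reproduces two mislabeled items while the "simple" model does not.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical test set: the original labels contain two errors
# (items 0 and 7 are frogs mislabeled as "cat" and "dog").
original_labels = ["cat", "cat", "dog", "cat", "frog", "dog", "cat", "dog", "dog", "cat"]
corrected_labels = ["frog", "cat", "dog", "cat", "frog", "dog", "cat", "frog", "dog", "cat"]

# Hypothetical predictions. The "complex" model agrees with the two
# erroneous labels; the "simple" model predicts the true class there.
complex_preds = ["cat", "cat", "dog", "cat", "dog", "dog", "cat", "dog", "dog", "cat"]
simple_preds = ["frog", "cat", "cat", "cat", "frog", "dog", "cat", "frog", "dog", "cat"]

# Share of test labels that had to be corrected.
label_error_rate = sum(
    o != c for o, c in zip(original_labels, corrected_labels)
) / len(original_labels)

print(f"label error rate: {label_error_rate:.0%}")
for name, preds in [("complex", complex_preds), ("simple", simple_preds)]:
    print(
        f"{name}: original={accuracy(preds, original_labels):.0%}, "
        f"corrected={accuracy(preds, corrected_labels):.0%}"
    )
```

In this toy example the complex model leads on the original labels and trails once the two labels are fixed. The sizes of the gaps are artifacts of the made-up data and say nothing about any real benchmark.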


It’s long been a truism that AI tools are only as good as the data on which they’re trained, but using best practices for testing machine learning is also crucial.

“From our experience, most healthcare organizations do not evaluate algorithms in the context of their intended use,” said Dr. Sujay Kakarmath, an electronic health scientist at Partners Healthcare, in an interview with Healthcare IT News in 2018.

“The technical performance of an algorithm for a given job is far from being the only metric that determines its potential impact,” Kakarmath added.


“Whereas train set labels in a small number of machine learning datasets, e.g. from the ImageNet dataset, are well-known to contain errors, labeled data in test sets is often considered ‘correct’ as long as it is drawn from the same distribution as the train set,” wrote the research team in the study.

“This is a fallacy: machine learning test sets can, and do, include pervasive errors and these errors can destabilize ML benchmarks,” they added.



Kat Jercich is senior editor of Healthcare IT News.
Twitter: @kjercich
Healthcare IT News is a HIMSS Media publication.
