By Elad Walach, MedCity News | August 5, 2019
Understanding what determines accuracy is confusing, albeit necessary, when it comes to evaluating artificial intelligence in radiology.
The term accuracy is becoming increasingly ambiguous. Some vendors report different metrics under that name, making comparisons difficult. Others cherry-pick their evaluation data, which skews reported performance. And some simply use confusing metrics. As a result, AI performance can be misinterpreted, leading to erroneous conclusions.
AUC – the most damned lie of them all
One of the most common metrics cited by AI companies is the area under the ROC curve (AUC). This way of measuring performance actually originated in signal detection work on radar during World War II, and it has since become standard in the machine learning community.
Without getting too technical, AUC measures the probability that the AI will assign a higher score to a randomly chosen positive case (say, a scan that truly contains a pulmonary embolism) than to a randomly chosen negative one, summarizing performance across all possible decision thresholds.
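To make that concrete, here is a minimal Python sketch (using scikit-learn, with made-up labels and scores purely for illustration) showing that AUC is exactly the probability that a positive case outranks a negative one:

```python
# Minimal illustration of what AUC measures. Labels and scores are made up.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1])                   # 1 = finding present
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.9])  # model's confidence

# Standard AUC computation.
print(roc_auc_score(y_true, y_score))                           # 0.875

# Equivalent view: the fraction of (positive, negative) pairs where the
# positive case receives the higher score.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
print((pos[:, None] > neg[None, :]).mean())                     # 0.875
```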
Let me start with my personal belief: AUC is a bad metric. While it is scientifically important, it’s confusing to physicians. For most clinical users, AUC is difficult to understand and to weigh appropriately, and as a result it’s grossly overemphasized. Consider the following example:
I recently read an AI research paper showing an ‘impressive’ AUC of 0.95. Since 0.95 is close to 1 (which would be perfect performance), it seems like it must be an excellent solution.
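To see why a headline AUC can mislead, consider the following sketch (entirely synthetic data, with an assumed 1% disease prevalence). A model with an AUC of roughly 0.95 can still produce mostly false alarms at a clinically plausible operating point:

```python
# Synthetic demonstration: AUC ~0.95 does not guarantee useful predictions
# at low prevalence. All numbers here are illustrative assumptions.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
prevalence = 0.01                        # assume 1 in 100 scans is positive
y = rng.random(n) < prevalence

# Positive cases score higher on average, but the distributions overlap.
scores = np.where(y, rng.normal(2.3, 1.0, n), rng.normal(0.0, 1.0, n))
print(f"AUC: {roc_auc_score(y, scores):.3f}")        # ~0.95

# Operate the model at a threshold chosen for ~90% sensitivity.
threshold = np.quantile(scores[y], 0.10)
flagged = scores >= threshold
sensitivity = (flagged & y).sum() / y.sum()
ppv = (flagged & y).sum() / flagged.sum()
print(f"Sensitivity: {sensitivity:.2f}")             # ~0.90
print(f"PPV: {ppv:.2f}")                             # roughly 0.05-0.06
```

In this illustrative scenario, only about one in twenty flagged scans would be a true positive, even though the AUC looks stellar.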