Before receiving her PhD in computer science from MIT in 2017, Marzyeh Ghassemi had already begun to wonder whether the use of AI techniques might exacerbate existing biases in health care. She was one of the early researchers to take up the subject, and she has been exploring it ever since. In a new paper, Ghassemi, now an assistant professor in MIT's Department of Electrical Engineering and Computer Science (EECS), and three collaborators based at the Computer Science and Artificial Intelligence Laboratory have investigated the roots of the disparities that can arise in machine learning, which often cause models that perform well overall to falter on subgroups for which relatively little data was collected and used during training. The paper, written by two MIT PhD students, Yuzhe Yang and Haoran Zhang, EECS computer scientist Dina Katabi (the Thuan and Nicole Pham Professor), and Ghassemi, was presented last month at the 40th International Conference on Machine Learning in Honolulu, Hawaii.
In their analysis, the researchers focused on "subpopulation shifts": differences in how machine learning models perform for one subgroup compared to another. "We wanted the model to be fair and work equally well for all groups, but instead we consistently observed shifts between different groups that could lead to inferior medical diagnoses and treatments," says Yang, who along with Zhang was one of the lead authors of the paper. The main goal of their investigation was to determine the kinds of subpopulation shifts that can occur and to uncover the mechanisms behind them so that, ultimately, more equitable models can be developed.
The new paper "significantly advances our understanding" of the subpopulation shift phenomenon, says Stanford University computer scientist Sanmi Koyejo. "This research provides valuable insights for future advances in the performance of machine learning models on underrepresented subgroups."
Camels and cows
The MIT group has identified four principal types of shifts: spurious correlations, attribute imbalance, class imbalance, and attribute generalization. These, according to Yang, "were never unified into a coherent framework. We've come up with one equation that shows you where the bias comes from."
Bias can, in fact, stem from what researchers call the class, from attributes, or from both. To take a simple example, suppose the task given to a machine learning model is to sort images of objects, animals in this case, into two classes: cows and camels. Attributes are descriptors that are not specifically tied to the class itself. It might turn out, for example, that all the images used in the analysis show cows standing on grass and camels on sand, with grass and sand serving as the attributes here. Given the data available to it, the model could reach the wrong conclusion, namely that cows can only be found on grass and camels only on sand. Such a conclusion would be erroneous, however, and it gives rise to a spurious correlation, which, Yang explains, is a "special case" among subpopulation shifts, "where you have bias in both the class and the attribute."
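To make the camels-and-cows scenario concrete, here is a minimal synthetic sketch, not drawn from the paper's experiments, in which a classifier trained on data where the background attribute almost always matches the class ends up relying on the background; the features, noise levels, and correlation strength are invented purely for illustration.

```python
# Toy illustration (invented numbers) of a spurious correlation:
# class: cow (0) vs. camel (1); attribute: grass (0) vs. sand (1).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y_train = rng.integers(0, 2, n)                                      # class labels
attr_train = np.where(rng.random(n) < 0.95, y_train, 1 - y_train)    # attribute matches class 95% of the time
X_train = np.column_stack([
    y_train + rng.normal(0, 2.0, n),      # weak, noisy "animal" signal
    attr_train + rng.normal(0, 0.3, n),   # strong, clean "background" signal
])
clf = LogisticRegression().fit(X_train, y_train)

# At test time, break the correlation: cows on sand, camels on grass.
y_test = rng.integers(0, 2, n)
attr_test = 1 - y_test
X_test = np.column_stack([
    y_test + rng.normal(0, 2.0, n),
    attr_test + rng.normal(0, 0.3, n),
])
# Accuracy collapses (typically far below 50%) because the model leaned on the background.
print("accuracy when the background flips:", clf.score(X_test, y_test))
```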
In a medical setting, one might rely on machine learning models to determine whether a person has pneumonia based on X-ray images. There would be two classes in this situation: one for people with the lung infection, another for those who are infection-free. A relatively simple case would involve just two attributes: the people being X-rayed are either female or male. If, in this particular dataset, there were 100 males diagnosed with pneumonia for every female diagnosed with pneumonia, that could lead to an attribute imbalance, and the model would likely do a better job of correctly detecting pneumonia in males than in females. Similarly, having 1,000 times more healthy (pneumonia-free) subjects than sick ones would lead to a class imbalance, with the model biased toward healthy cases. Attribute generalization is the final shift highlighted in the new study: if the sample contains 100 male patients with pneumonia but zero female patients with the same illness, you still want the model to be able to generalize and make predictions about female patients, even though no women with pneumonia appear in the training data.
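These shift types can be read directly off how the training examples are distributed across (class, attribute) subgroups. The short sketch below uses hypothetical counts loosely based on the pneumonia example to show how a simple tabulation surfaces attribute imbalance, class imbalance, and a missing subgroup that would require attribute generalization.

```python
# Hypothetical subgroup counts (not from the study) for a chest X-ray dataset.
counts = {
    ("pneumonia", "male"): 100,       # sick males
    ("pneumonia", "female"): 0,       # no sick females at all: attribute generalization needed
    ("healthy", "male"): 50_000,      # healthy cases vastly outnumber sick ones: class imbalance
    ("healthy", "female"): 50_000,    # within the sick class, males vastly outnumber females: attribute imbalance
}

for (cls, attr), n in counts.items():
    flag = "  <- empty subgroup: model must generalize to it" if n == 0 else ""
    print(f"{cls:>9} / {attr:<6}: {n:>6} examples{flag}")
```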
The team then took 20 state-of-the-art algorithms, designed to carry out classification tasks, and tested them on a dozen datasets to see how they performed across different population groups. They reached some unexpected conclusions: by improving the "classifier," the last layer of the neural network, they were able to reduce the effects of spurious correlations and class imbalance, but the other shifts were unaffected. Improvements to the "encoder," the part of the network that produces the feature representations fed to the classifier, can reduce the problem of attribute imbalance. "However, no matter what we do to the encoder or the classifier, we don't see any improvement in terms of attribute generalization," says Yang, "and we don't know how to address that yet."
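One common way to improve the classifier without touching the rest of the network is to retrain only the final linear layer on data resampled so that subgroups are balanced. The PyTorch outline below is an illustrative sketch of that general idea, not the authors' exact procedure; the stand-in encoder, the feature dimensions, and the balanced_batches helper are all placeholders.

```python
# Illustrative sketch only: retrain just the final linear "classifier" layer
# on group-balanced batches while keeping the pretrained "encoder" frozen.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU())  # stand-in for a pretrained feature extractor
classifier = nn.Linear(256, 2)                           # the last layer that actually gets retrained

for p in encoder.parameters():
    p.requires_grad = False                              # freeze the encoder

opt = torch.optim.SGD(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def balanced_batches(num_batches=100, batch_size=64):
    """Placeholder for a sampler that draws equally from every (class, attribute) subgroup."""
    for _ in range(num_batches):
        yield torch.randn(batch_size, 512), torch.randint(0, 2, (batch_size,))

for x, y in balanced_batches():
    with torch.no_grad():
        feats = encoder(x)            # features from the frozen encoder
    logits = classifier(feats)
    loss = loss_fn(logits, y)
    opt.zero_grad()
    loss.backward()                   # gradients flow only into the classifier
    opt.step()
```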
Precisely accurate
There is also the question of how to assess whether a model actually works fairly across different population groups. A commonly used metric, called worst-group accuracy or WGA, rests on the assumption that if you improve accuracy, say, of medical diagnoses, for the group with the worst model performance, you will have improved the model as a whole. "WGA is considered the gold standard in subpopulation evaluation," the authors note, but they made a surprising discovery: boosting worst-group accuracy results in a decrease in what they call "worst-case precision." In medical decision-making of any kind, one needs both accuracy, which speaks to the validity of the findings, and precision, which relates to the reliability of the methodology. "Precision and accuracy are both very important metrics in classification tasks, and that is especially true in medical diagnostics," Yang explains. "You should never trade precision for accuracy. You always need to balance the two."
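In code, both evaluation views amount to computing a metric within each subgroup and reporting the worst value. The sketch below is a simplified illustration with made-up labels and groups; the paper's exact definition of worst-case precision may differ from this straightforward per-group minimum.

```python
# Simplified sketch: worst-group accuracy (WGA) and worst-case precision,
# each taken as the minimum over subgroups. Data below is made up.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

def worst_group_metrics(y_true, y_pred, groups):
    accs, precs = [], []
    for g in np.unique(groups):
        mask = groups == g
        accs.append(accuracy_score(y_true[mask], y_pred[mask]))
        # precision within the subgroup; zero_division guards against no positive predictions
        precs.append(precision_score(y_true[mask], y_pred[mask], zero_division=0))
    return min(accs), min(precs)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1])
groups = np.array(["f", "f", "f", "f", "m", "m", "m", "m"])
wga, worst_prec = worst_group_metrics(y_true, y_pred, groups)
print(f"worst-group accuracy: {wga:.2f}, worst-case precision: {worst_prec:.2f}")
```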
The MIT scientists put their theory into practice. In a study conducted with medical centers, they looked at public datasets covering tens of thousands of patients and hundreds of thousands of chest X-rays, trying to see whether machine learning models could work in an unbiased way for all populations. That is far from the case, Yang said, though more awareness has been drawn to the issue. "We found large differences across age, sex, ethnicity, and intersectional groups."
Yang and his colleagues agree on the ultimate goal, which is to achieve equity in health care for all populations. But before we can reach that point, they argue, we still need a better understanding of the sources of inequity and how they permeate the current system. Reforming the system as a whole will not be easy, they acknowledge. In fact, the title of the paper they presented at the Honolulu conference, "Change is Hard," gives some indication of the challenges that they and like-minded researchers face.
This research was funded by the MIT-IBM Watson AI Lab.