BACKGROUND
Course Listing: COMP_SCI 349 – Machine Learning
Course Description: Introducing students to concepts, tools, and applications of different machine learning models, with a focus on topics including Bayesian Learning, Decision Trees, Genetic Algorithms, and Neural Networks. Satisfies the AI Breadth requirement of the Northwestern Computer Science major.
Collaborators: Frankie Lucco, Shubhanshi Gaudani, Fai Poungpeth, Zach Wood-Doughty
Timeline: development in Fall 2023 and Winter 2024; implementation in Winter and Spring 2024 and beyond
Overview: The ethics curriculum will introduce students to the basics of ethical thinking in the context of machine learning (ML) by teaching terms and ideas most relevant to specific ML models, and by providing resources that let students reflect further on the ethical implications from their own perspectives. Students could be asked to evaluate real-world scenarios, hypothetical scenarios, and their own code using these ethical ideas. To help students learn how to analyze these issues meaningfully, examples of existing analysis could be provided, such as asking students to read a paper or opinion piece.
Goals:
- Introduce specific concepts which intersect ML with ethics, and encourage students to reflect on the benefits and limitations of each
- Demonstrate to students how to use these concepts to analyze problems involving ethical issues in technological contexts
- Promote self-reflection and social awareness without pushing a single moral perspective, so that students have more agency to express their own opinions
OVERARCHING ETHICAL CONCEPTS
Demographic bias
Various studies document underdiagnosis and overdiagnosis across minority groups. The distribution of stroke symptom profiles varies by sex, for example, which has led to underdiagnosis of acute strokes in women. The consensus seems to be that diagnostic criteria were built largely from studies of men having strokes, combined with the assumption that the distribution of symptoms was independent of sex, which turns out not to be the case. The following article discusses how women, minorities, and people under 45 are more likely to be misdiagnosed; misdiagnosed patients disproportionately presented with a symptom profile that included dizziness and headaches.
An important point to consider is that simply having an equal number of cases from different demographics (e.g., equal male and female cases in the dataset) is not sufficient to address this issue. It is also essential to analyze potential bias in the labeling itself. Because of the medical bias historically present in the diagnostic criteria for stroke, female patients whose symptom profiles resemble typical male profiles may have been labeled by humans as "having a stroke" more often than female patients whose symptom profiles are less common in men.
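As a concrete (and entirely hypothetical) illustration of this point, the sketch below simulates a dataset with perfectly balanced group counts but labels that were historically assigned from a "classic" symptom more typical of one group's presentation; the classifier's recall against the true outcome then turns out much lower for the other group. All feature names and numbers here are invented for illustration, not drawn from any real study.

```python
# Hypothetical simulation: group counts are balanced, but labels were
# historically assigned from the "classic" symptom, which is more typical of
# group A's presentation. All values here are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000  # cases per group, so the dataset is balanced by count

def make_group(classic_weight):
    stroke = rng.random(n) < 0.3                               # true outcome
    classic = stroke * classic_weight + rng.normal(scale=0.3, size=n)
    atypical = stroke * (1 - classic_weight) + rng.normal(scale=0.3, size=n)
    label = classic > 0.5           # historical label keys on the classic symptom
    return np.column_stack([classic, atypical]), stroke, label

X_a, truth_a, y_a = make_group(classic_weight=0.9)  # presentation matches the criteria
X_b, truth_b, y_b = make_group(classic_weight=0.4)  # presentation is often atypical

clf = LogisticRegression().fit(np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]))

for name, X_g, truth in [("group A", X_a, truth_a), ("group B", X_b, truth_b)]:
    recall = (clf.predict(X_g) & truth).sum() / truth.sum()
    print(f"{name}: recall against true strokes = {recall:.2f}")
```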
Transparency and Accountability
As a developer, it is important to be able to explain your model to all relevant stakeholders who wish to apply it in an effective and responsible manner. Points to consider include informed consent (since sensitive data may be involved, potentially coming from patients, clients, etc.), harm prevention (i.e., how to minimize dangerous effects), and accountability (i.e., who takes responsibility in scenarios where the model is less accurate and/or less fair).
ETHICAL CONCEPTS RELEVANT TO SPECIFIC COURSE TOPICS
Loans
When deciding whether a loan should be granted, a decision tree model relies on the patterns that exist in the data, which historically suggest that men have higher incomes and more stable employment than women. The model is therefore likely to learn a tree that splits on male versus female, or on features correlated with sex, and to approve loans for more men than women. Although this may yield high accuracy, since men retain similar economic advantages today, it can end up denying highly qualified female applicants.
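The sketch below illustrates the idea on synthetic loan data in which income, employment history, and historical approvals are constructed to favor men; whether the learned tree splits on sex directly or on correlated proxies, the approval rates it produces differ by group. The features, coefficients, and thresholds are all made up for illustration.

```python
# Hypothetical sketch: a decision tree fit to synthetic loan data in which
# income, employment history, and past approvals were constructed to favor men.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 4000
sex = rng.integers(0, 2, size=n)                               # 0 = female, 1 = male
income = rng.normal(loc=40 + 15 * sex, scale=10, size=n)       # historical income gap
years_employed = rng.normal(loc=4 + 2 * sex, scale=2, size=n)
# Historical approvals reflect both the income gap and direct bias by sex.
approved = (income + 3 * years_employed + 10 * sex + rng.normal(scale=5, size=n)) > 80

X = np.column_stack([sex, income, years_employed])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, approved)
print(export_text(tree, feature_names=["sex", "income", "years_employed"]))

pred = tree.predict(X)
print("approval rate (female):", round(pred[sex == 0].mean(), 2))
print("approval rate (male):  ", round(pred[sex == 1].mean(), 2))
```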
BMI
One setting in which decision trees may perpetuate systemic issues is the medical field, where these models could be used to classify whether someone has a particular disease or meets a specific health criterion. This article from the AMA provides an example in the context of using BMI to classify someone as obese, where a decision tree could be implemented to make that determination. However, having a BMI over a certain threshold does not mean that someone should be clinically diagnosed with obesity; BMI is only correlated with a clinical diagnosis. In practice, this has led to an underestimation of Asian women's risk for obesity and an overestimation of African American women's risk. A large factor behind these patterns is that the data used to train such models come from a majority-white population and are therefore not representative of minority groups.
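A hedged sketch of how such a disparity could be audited follows: a simple BMI > 30 rule is compared against a synthetic clinical "ground truth" whose relationship to BMI is deliberately shifted for different groups, producing underestimation (a high false negative rate) for one group and overestimation (a high false positive rate) for another. The means and offsets are invented solely to show the structure of the audit.

```python
# Hypothetical audit of a BMI-threshold rule against a synthetic clinical
# "ground truth"; the offsets and means are invented purely to show the
# structure of a per-group error analysis.
import numpy as np

rng = np.random.default_rng(2)

def audit(name, bmi_mean, risk_offset, n=2000):
    bmi = rng.normal(loc=bmi_mean, scale=4, size=n)
    # "True" clinical risk depends on more than BMI; the offset encodes how far
    # the BMI proxy is shifted for this group (an assumption, not real data).
    truly_at_risk = (bmi + risk_offset + rng.normal(scale=2, size=n)) > 30
    flagged = bmi > 30                               # the simple threshold rule
    fnr = ((~flagged) & truly_at_risk).mean() / max(truly_at_risk.mean(), 1e-9)
    fpr = (flagged & ~truly_at_risk).mean() / max((~truly_at_risk).mean(), 1e-9)
    print(f"{name}: false negative rate {fnr:.2f}, false positive rate {fpr:.2f}")

audit("group whose risk appears at lower BMI (underestimated)", bmi_mean=26, risk_offset=+3)
audit("group whose risk appears at higher BMI (overestimated)", bmi_mean=29, risk_offset=-3)
audit("reference group", bmi_mean=27, risk_offset=0)
```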
Data Poisoning and Backdoor Attacks
Compared to other supervised learning techniques such as decision trees and deep neural networks, KNN models are especially prone to data poisoning and backdoor attacks, which aim to corrupt machine learning models by manipulating classification predictions to benefit the attacker. This is partly because KNN is commonly used for classification tasks such as anomaly detection, recommendation systems, and document categorization, and the data collected for these purposes tends to come from online repositories or crowdsourcing, making it highly vulnerable to contamination. With certain values of k, however, the model may still maintain its robustness despite this manipulation. It is therefore important to consider the tradeoff between robustness and accuracy of a KNN model, as well as the real-world impact of this tradeoff, when deciding how to tune the hyperparameters.
Source: https://chaowang-vt.github.io/pubDOC/LiWW23_ISSTA_KNNfalsify.pdf
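As a simple illustration (a generic label-flipping attack, not the specific falsification attack studied in the paper above), the sketch below poisons a fraction of the training labels for scikit-learn's KNeighborsClassifier and compares clean versus poisoned accuracy across several values of k; larger neighborhoods tend to blunt the attack, at the cost of a coarser decision boundary.

```python
# Hedged sketch of a generic label-flipping poisoning attack on KNN; this is a
# simple illustration, not the specific attack analyzed in the paper above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
poisoned = y_tr.copy()
flip = rng.random(len(poisoned)) < 0.15      # attacker flips 15% of training labels
poisoned[flip] = 1 - poisoned[flip]

for k in (1, 5, 25, 101):
    clean = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    dirty = KNeighborsClassifier(n_neighbors=k).fit(X_tr, poisoned).score(X_te, y_te)
    print(f"k={k:3d}  clean accuracy={clean:.3f}  poisoned accuracy={dirty:.3f}")
```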
Facial Recognition in Predictive Policing
KNN classifiers make predictions by majority vote: they look at the k nearest neighbors and take the most common class among them. This can pose problems when the model is fit to historical data. For instance, training a facial recognition classifier on historical arrest data may end up perpetuating racial discrimination in law enforcement, leading to further systemic issues in the communities where the model is deployed. In addition, identification data is often sensitive, so collecting and using it may violate privacy or civil liberties regardless of how accurate the model is at mitigating crime. These are all factors that would need to be considered when deciding whether to implement such a system.
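For reference, the majority-vote mechanism itself is only a few lines, as in the hypothetical sketch below; whatever biases are present in the historical labels of a query point's neighbors are passed directly into the prediction.

```python
# Minimal sketch of the majority-vote mechanism on hypothetical data: whatever
# bias is present in the neighbors' historical labels is passed straight into
# the prediction.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)        # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Points near the query were historically labeled "flagged" (e.g., due to
# over-policing of that area), so the vote simply reproduces that history.
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array(["flagged", "flagged", "flagged", "not flagged", "not flagged"])
print(knn_predict(X_train, y_train, np.array([0.05, 0.05]), k=3))   # -> flagged
```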
Bias in NLP models: [Unintended demographic bias in ML for NLP]
Word embeddings encode words as vectors such that the "distance" between words reflects semantic similarity. This allows analogies such as "man is to woman as king is to queen" to be computed. However, it has been found that word embeddings trained on Google News articles can produce analogies like "man is to computer programmer as woman is to homemaker." Applying such a model can therefore perpetuate systemic biases and encourage the continuation of historical inequalities in society.
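The analogy arithmetic behind these findings can be reproduced with standard tooling. The sketch below assumes the gensim package and its downloadable Google News word2vec vectors ("word2vec-google-news-300", roughly 1.6 GB); the exact neighbors returned depend on those pretrained vectors.

```python
# Hedged sketch, assuming the gensim package and its downloadable Google News
# word2vec vectors ("word2vec-google-news-300", roughly 1.6 GB). The exact
# neighbors returned depend on those pretrained vectors.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# The classic analogy: king - man + woman is close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same vector arithmetic applied to occupation words can surface gendered
# associations learned from the news corpus.
print(vectors.most_similar(positive=["programmer", "woman"], negative=["man"], topn=3))
```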
In addition, methods that assign sentiment labels to words based on their embeddings can also end up being biased. For instance, using a logistic regression classifier to predict sentiment may end up assigning negative sentiment to identity words because of the characteristics of the dataset used to train the model, which can cause downstream unfairness. Ideally, there would be a balanced correlation between the positive and negative sentiment subspaces for each identity group to help prevent the effects of bias; adversarial learning algorithms can be implemented to help mitigate this issue.
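One lightweight way to probe for this kind of bias is to score template sentences that differ only in an identity term, as in the hedged sketch below; the toy training corpus and identity terms here are made up, and a real audit would use the model and data actually being deployed.

```python
# Hedged sketch of a bias probe: score template sentences that differ only in
# an identity term. The toy corpus and terms below are made up; a real audit
# would use the actual model and training data being deployed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I love this movie", "what a wonderful day", "great food and service",
    "I hate this movie", "what a terrible day", "awful food and service",
]
labels = [1, 1, 1, 0, 0, 0]                      # 1 = positive, 0 = negative
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

templates = [f"I had dinner with my {w} friend"
             for w in ("american", "mexican", "muslim", "gay")]
for sentence, p_negative in zip(templates, clf.predict_proba(templates)[:, 0]):
    print(f"{p_negative:.2f}  {sentence}")
# Large gaps between these probabilities would mean the identity term alone is
# shifting the model's predicted sentiment.
```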
An example of how racial or identity biases may arise: a company develops an MLP to detect sensitive or obscene content in images (nudity, blood, etc.) on a social media website. After deploying the model, they find that pictures of women of color are more likely to be marked obscene than pictures of white women in similar clothing and settings. The model thus perpetuates racial discrimination and reinforces stereotypes that contribute to negative perceptions of women of color.
Material taken from: https://ocw.mit.edu/courses/res-ec-001-exploring-fairness-in-machine-learning-for-international-development-spring-2020/pages/module-four-case-studies/case-study-on-nlp/ (More resources on the website: Slides, short video presentation: 13:02 minutes in total)
LIME and its Ethical Implications
Deep neural networks are very complex and their decisions can be hard to interpret. The LIME technique approximates the classification behavior of a deep neural network using a simpler, more interpretable model, such as a regression tree. Interpreting the decisions of this simpler model provides insight into the decisions of the neural network. The simple model is used to determine the importance of features of the input data, as a proxy for the importance of the features to the deep neural network. When a particular feature is very important to a deep network’s classification decision, removing that feature significantly affects the classification score. That feature is therefore important to the simple model too.
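In practice this can be done with the authors' lime package (assuming it is installed, e.g. via pip install lime); the sketch below uses the tabular explainer on a scikit-learn model for brevity, though the same workflow applies to images as in the wolf/husky example that follows.

```python
# Hedged sketch assuming the `lime` package (pip install lime); a tabular
# dataset is used here for brevity, though the same workflow applies to
# images as in the wolf/husky example below.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
# Fit a local surrogate around one instance and list its most influential features.
explanation = explainer.explain_instance(data.data[0], clf.predict_proba, num_features=5)
print(explanation.as_list())
```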
A well-known example of LIME in use is the wolf versus husky prediction. Researchers trained a neural network to distinguish between wolves and huskies, and the model appeared quite accurate. When they used LIME to evaluate it, however, they found that predictions were being made based on the presence of snow rather than the characteristics of the animals themselves, due to the strong correlation between snow and the type of animal in the dataset's photos. If a model relying on spurious correlations like this were applied in a real-world setting, its errors could fall unevenly on groups of people whose data differ from the training distribution.
One case in which someone may have to weigh accuracy against explainability is a medical setting where a model determines whether a tumor is benign or malignant. On one hand, you could use a post-hoc interpretation tool like LIME to help doctors better understand how the model works; on the other hand, you could take the longer route of building a new model that is intrinsically interpretable. With limited resources, it is important to be able to justify why one approach is better than the other and to understand the implications of that decision.
Source: Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. San Francisco California USA: ACM, 2016. https://doi.org/10.1145/2939672.2939778.
Reading: a blog post that breaks down the paper and introduces the library
Fairness in Statistical Parity
An example of a scenario where statistical parity works well is any situation where the basic assumptions of the two-naive-Bayes (2NB) approach hold, or are "close enough" to true. The original paper proposing 2NB suggests bank loans, with gender as the sensitive attribute.
Statistical parity causes problems when its base assumptions don't hold: when the likelihood of the positive outcome is not "truly equal" across groups, when the observed likelihood won't be equal even if the "true" likelihood is (as with recidivism), or when you're dealing with intersecting, non-independent group attributes like race and gender. For instance, misclassification in facial recognition isn't well covered by statistical parity because it can't easily be framed in terms of a single positive/negative response.
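For reference, statistical parity itself reduces to comparing positive-prediction rates across groups, as in the short sketch below (with hypothetical predictions and group labels).

```python
# Hedged sketch: statistical parity reduces to comparing positive-prediction
# rates across groups (hypothetical predictions and group labels below).
import numpy as np

def statistical_parity_difference(y_pred, sensitive):
    """P(y_hat = 1 | sensitive = 1) - P(y_hat = 1 | sensitive = 0)."""
    y_pred, sensitive = np.asarray(y_pred), np.asarray(sensitive)
    return y_pred[sensitive == 1].mean() - y_pred[sensitive == 0].mean()

# Example: loan approvals (1 = approve); sensitive marks the protected group.
y_pred    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
sensitive = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(round(statistical_parity_difference(y_pred, sensitive), 2))   # 0.6 - 0.4 = 0.2
```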
IMPLEMENTATION
Explore various module types and frameworks for inspiration and ways to carry out these learning goals in your classroom.
LESSONS LEARNED FROM DEVELOPMENT TEAM
- A simple way to think about "ethics" in ML for those without a strong philosophical background is to focus on tradeoffs. For instance, exploring the tradeoffs between fairness and accuracy, privacy and utility, or explainability and accuracy can be far more productive than trying to characterize aspects or effects of technology as entirely "good" or "bad"
- It is helpful to have specific examples and scenarios to demonstrate the applicability of ethics in a particular context. However, these examples don’t necessarily need to be real life examples—hypothetical examples can be just as helpful in getting the point across.
- To maximize engagement with the course material, having assignments count toward students' grades can make a difference. However, it is important not to underestimate students' interest in the ethics behind machine learning. In our experience, students were very thoughtful and engaged with the content; the main obstacle was limited knowledge of the relevant background information.
- It is important to find a good balance between the technical and critical-thinking components of a module. While developing content, be open to regularly changing which case studies are incorporated based on student feedback; the struggles students have in connecting with or understanding a case study can be difficult to anticipate.
- Encouraging students to make connections between the ethical conversations and the technical course content is very important. It will not only help keep discussions on topic, but will also help students see the relevance of the discussion more clearly.
- To ensure that students don't come out of the discussions feeling powerless or hopeless, have rebuttals and examples ready that illustrate both the negative and positive sides of the story.
- It is okay to have awkward silence in discussions! Be comfortable letting it happen, and give students time to think through concepts and participate.