Effectiveness of Minimax Group Fairness Algorithms

Jimmy Ren
11 min read · Jun 16, 2021

See original paper here

This paper proposes two algorithms:
1. The first finds a minimax group-fair model from a given statistical class.
2. The second explores the tradeoff between a relaxed notion of minimax fairness and overall accuracy.

Both algorithms converge and are oracle-efficient, which lets us plug in multiple types of models and see how they perform on various datasets. The two algorithms are named MinimaxFair and MinimaxFairRelaxed. So, what's the difference?

Compared to the first algorithm, the relaxed variant introduces γ, a target bound on the maximum group error the model may incur. We may want to try various values of gamma to map out what is possible in terms of the tradeoff between fairness and accuracy. As a stakeholder, you might want your model to focus solely on fairness, you may prefer the most accurate model possible, or you may want a mix of both. This is exactly what the two-step procedure below supports:

First, we run MinimaxFair to determine two quantities: γ_min, the maximum group error of the randomized minimax model, and γ_max, the maximum group error of the population-error-minimizing model. The former is the minimum feasible value for γ, and the latter is the largest value for γ we would ever want to accept.

Next, we run MinimaxFairRelaxed over a selection of gamma values in [γ_min, γ_max] to trace the achievable tradeoff between fairness and accuracy and plot the associated Pareto curve.
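To make the procedure concrete, here is a minimal sketch of the outer sweep in Python. The numeric endpoints are placeholders, and the call to MinimaxFairRelaxed is left as a comment, since this only illustrates the loop structure, not the repository's actual API:

import numpy as np

# Placeholder endpoints: in practice these come from phase 1
# (MinimaxFair and the unweighted population-error minimizer).
gamma_min, gamma_max = 0.10, 0.15

# Phase 2: sweep evenly spaced gammas across [gamma_min, gamma_max]
# and record one (max group error, population error) point per run.
pareto_points = []
for gamma in np.linspace(gamma_min, gamma_max, num=10):
    # model = MinimaxFairRelaxed(..., gamma=gamma)  # repository entry point
    # pareto_points.append((model.max_group_error, model.population_error))
    pass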

Perfect. To verify our understanding of the model, we will first recreate the plots from the paper. Then, we'll move on to some models of our own and check the results. What parameters does this GitHub repository provide us?

Some important settings are highlighted below:

models = {1: 'LinearRegression', 2: 'LogisticRegression', 3: 'Perceptron', 4: 'PairedRegressionClassifier', 5: 'MLPClassifier'}  # WARNING: MLPClassifier is not GPU optimized and may run slowly

numsteps = 2000  # number of steps for learning/game

error_type = '0/1 Loss'  # 'MSE', '0/1 Loss', 'FP', 'FN', 'Log-Loss', 'FP-Log-Loss', 'FN-Log-Loss'

# NOTE: eta = a * t^(-b) on the t-th round of the game

a = 1  # multiplicative coefficient on the parameterized learning rate

b = 1 / 2  # negative exponent on the parameterized learning rate

gamma = 0.0  # max group error if using the relaxed variant

Here, we can set the number of steps for the game, the type of loss function we want to use, the parameterized adaptive learning rate, and the maximum group error in the form of gamma. Our goal is to test different configurations of these settings and to use datasets that do not appear in the paper.
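To make the game dynamics and the eta = a * t^(-b) schedule concrete, here is a simplified, self-contained sketch of the MinimaxFair loop for regression. It is not the repository's implementation: I use an exponentiated-gradient update for the group weights where the paper uses projected gradient ascent, and I report average group errors rather than returning the full mixture model.

import numpy as np
from sklearn.linear_model import LinearRegression

def minimax_fair_sketch(X, y, groups, num_steps=2000, a=1.0, b=0.5):
    # The learner best-responds to the current sample weights; the
    # adversary shifts weight toward the worst-off group using the
    # parameterized learning rate eta = a * t^(-b).
    group_ids = np.unique(groups)
    lam = np.ones(len(group_ids)) / len(group_ids)  # uniform group weights
    avg_errs = np.zeros(len(group_ids))

    for t in range(1, num_steps + 1):
        # Learner: weighted least squares, each group sharing its weight
        w = np.zeros(len(y), dtype=float)
        for i, g in enumerate(group_ids):
            mask = groups == g
            w[mask] = lam[i] / mask.sum()
        pred = LinearRegression().fit(X, y, sample_weight=w).predict(X)

        # Adversary: per-group MSE, then an exponentiated-gradient step
        errs = np.array([np.mean((y[groups == g] - pred[groups == g]) ** 2)
                         for g in group_ids])
        eta = a * t ** (-b)
        lam = lam * np.exp(eta * errs)
        lam /= lam.sum()  # stay on the probability simplex
        avg_errs += (errs - avg_errs) / t  # running average over rounds

    return dict(zip(group_ids, avg_errs))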

Regression

This experiment uses the Linear Regression model with MSE as the loss function. The algorithm determines γ_min and γ_max on its own; they come into play when running MinimaxFairRelaxed, which picks the gamma values to use within that range. We'll start off by using the Communities dataset.

The exact range is [0.10795892100437832, 0.1434128617155597].

After running our algorithm, we now have our Pareto curves for the selection of the gamma values. These gamma values fall into the range we defined above. See figure below:

Pareto Curve of Fairness vs. Accuracy

Note that the initial population-error-minimizing model is marked by a large pink dot, and each trajectory ends at the final randomized model achieving minimax group fairness, colored yellow. Just as in the paper, the relationship between expected population error and maximum group error is decreasing and convex (both within trajectories and between their endpoints), illustrating a clear tradeoff between the two objectives. Any differences from the paper's plots come down to small differences in the learning rate and in the datasets, which differ from the paper's for privacy reasons.

Classification

Let's now attempt to replicate their classification results. Ideally, we would use the 0–1 indicator function for error. However, gradient-based convex optimization requires a convex loss function, and the 0–1 indicator is not convex, so we cannot efficiently find a solution under 0–1 loss directly. For the sake of a direct comparison between the equal-error model and minimax, we will first use 0/1 loss. Then, we will switch to log-loss to determine whether it is an acceptable heuristic.
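As a quick illustration of why we reach for a surrogate (plain NumPy with labels in {-1, +1}; this is the textbook definition, not code from the repository):

import numpy as np

def zero_one_loss(y_true, margin):
    # Non-convex: 1 whenever the thresholded prediction is wrong
    return (np.sign(margin) != y_true).astype(float)

def logistic_loss(y_true, margin):
    # Convex surrogate: smooth, so gradient descent can minimize it
    return np.log1p(np.exp(-y_true * margin))

y_true = np.array([1, 1, -1])
margin = np.array([2.0, -0.5, -1.5])      # raw scores from a linear model
print(zero_one_loss(y_true, margin))      # [0. 1. 0.]
print(logistic_loss(y_true, margin))      # [0.127 0.974 0.201] (approx)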

For the COMPAS dataset, we use the standard settings, except we switch to a Paired Regression Classifier (PRC) due to its ability to handle negative weights. We want to show that our minimax solution Pareto-dominates the equal-error model, further strengthening the results in the paper. Note that our graphs may differ slightly from the paper's because of differences in the train/test split.

Group Error weighted and trained on 0/1 Loss for Equal-Error Model

Each race and sex group in the equal-error model is Pareto-dominated by the minimax model. The equal-error model artificially inflates the error of all groups in order to satisfy its equality constraints. This is exactly the result we see in the paper.

False Positives/False Negatives

This setup also gives us an opportunity to see how the model handles false-positive (FP) rates. Below, we again use the COMPAS dataset, where false-positive rates matter enormously: decisions with heavy consequences, such as wrongly convicting an innocent person, must be handled with caution. Note that a model can trivially achieve a false-positive rate of zero by blindly predicting negative on all inputs, so we instead examine the tradeoff between overall error and the maximum group false-positive rate. The same analysis applies symmetrically to false-negative rates.
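A small sketch of the per-group false-positive computation, including the degenerate constant classifier (my own illustration, with made-up toy labels):

import numpy as np

def group_fp_rates(y_true, y_pred, groups):
    # False-positive rate per group: among a group's true negatives,
    # the fraction that the model predicted positive.
    rates = {}
    for g in np.unique(groups):
        neg = (groups == g) & (y_true == 0)
        rates[g] = y_pred[neg].mean() if neg.any() else 0.0
    return rates

y_true = np.array([0, 0, 1, 0, 1, 0])
groups = np.array(["A", "A", "A", "B", "B", "B"])

# A constant classifier that always predicts negative trivially gets a
# false-positive rate of 0 for every group, which is why we study the
# tradeoff against overall error instead of minimizing FP rate alone.
always_negative = np.zeros_like(y_true)
print(group_fp_rates(y_true, always_negative, groups))  # {'A': 0.0, 'B': 0.0}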

We can see that the resulting points trace a nearly convex tradeoff curve, matching the paper.

Generalization

We also want to show that the model generalizes well. We will first replicate this in regression. (The paper's classification dataset was not available to replicate, so we move on to our own testing later.) As in the paper, we hold out 25% of the data for testing. We are looking for the test and training errors to come out similar, with small margins between them. By comparing the unweighted model against the minimax model, we can check whether those small margins are simply a generic property of linear models on this dataset.
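A sketch of this check using scikit-learn; the plain unweighted LinearRegression here is a stand-in for whichever model (unweighted or minimax-trained) is being evaluated:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def group_mse_gaps(X, y, groups, test_size=0.25, seed=0):
    # Hold out a test set, fit the model on the rest, and compare each
    # group's train vs. test MSE; small gaps indicate good generalization.
    X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
        X, y, groups, test_size=test_size, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    gaps = {}
    for g in np.unique(groups):
        tr, te = g_tr == g, g_te == g
        mse_tr = np.mean((y_tr[tr] - model.predict(X_tr[tr])) ** 2)
        mse_te = np.mean((y_te[te] - model.predict(X_te[te])) ** 2)
        gaps[g] = mse_te - mse_tr  # the margin we hope stays small
    return gaps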

Regression

Here, we first replicate the paper's experiment on the Bike dataset with a 75/25 train/test split.

Training (left) vs Test (right)

Here, we can see how marginal the difference between the unweighted model and the minimax model is. Every group's error remains lower from training to testing except winter, and even then the gap is in the thousandths. We also see that as the number of steps increases, the gap between the two errors shrinks, further supporting what the paper proves: the generalization performance of the minimax algorithm is strong on sufficiently large datasets, for both classification and regression.

My Testing

Now that we have a good grasp of the settings and the model, let's try to show similar results for the other datasets provided in the GitHub repository.

Going through a similar process, I will label my graphs and settings below with a comparison to both the paper and my reproductions of their work.

First, we’ll consider the COMPAS dataset, Fire dataset, and the Student dataset for classification and regression tasks.

Regression on Student Dataset

The Student dataset covers features such as the mother's and father's education, family relationships, and student behaviors. The regression target is the final student grade, on a scale from 0 to 20. We'll use the Linear Regression model with MSE as the loss function, set the learning-rate coefficient to a = 1, 5, and 10 with b = 0.5, and run for 5000 steps.

a = 1
a = 5
a = 10

In these graphs, group error decreases as population error increases, showing the tradeoff between population error and maximum group error. We continue to accumulate evidence of how the model traces a tradeoff between the two objectives. We can also see that a = 10 produces the graph that looks closest to converging, suggesting a higher a may be closer to the best learning rate for this model. Testing values in increments of 25 from 1 to 100, I found that a = 25 lets the model's group errors settle best.

Group Error for Student for A = 25
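A sketch of the sweep I used: train_fn is a stand-in for the repository's training entry point, assumed here to return the final maximum group error of the run.

import numpy as np

def tune_a(train_fn, candidate_as=(1, 25, 50, 75, 100), b=0.5, numsteps=5000):
    # Try each coefficient and keep the one whose run ends with the
    # lowest maximum group error.
    best_a, best_err = None, np.inf
    for a in candidate_as:
        err = train_fn(a=a, b=b, numsteps=numsteps)
        if err < best_err:
            best_a, best_err = a, err
    return best_a, best_err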

As noted, the sample weights must never become negative for a meaningful comparison between the minimax and the equal-error model. Since that holds only for the Bike dataset, we instead check whether the tradeoff between fairness and overall accuracy exists using MinimaxFairRelaxed.
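That condition is trivial to verify if the per-round sample weights are recorded (the names here are mine, not the repository's):

import numpy as np

def weights_stay_nonnegative(weight_history):
    # weight_history: (num_rounds, num_samples) array of weights per round
    return bool((np.asarray(weight_history) >= 0).all())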

Pareto Curve of Fairness vs Accuracy

Here, we get a nicely convex curve for 10 gamma values. This is further evidence that the model traces a clean tradeoff between population error and maximum group error.

Regression on Fire Dataset

The Fire dataset contains features such as location and weather, with the target being the area of forest burned. We'll follow a similar procedure as before to determine a good value of a. After running a = 25, I realized a smaller range of values was needed, so I instead tried a = 1, 5, 10, 15, and 20. Settling on 20, we get the following results.

a = 20

Interestingly, the maximum group error initially drops significantly for only a slight increase in population error. One might well accept a population error of around 0.0034 for the sake of this drop in maximum group error. We can see that the largest group error, September at 0.006, decreases as the algorithm runs. Some groups' errors increase dramatically (May and April), but this is easily explained by the low number of samples in those groups, which also helps explain the gap between group error and population error. The group weights directly correspond to these changes in errors.

Classification on COMPAS dataset

To do more classification testing, we'll start with the COMPAS dataset, due to its large number of samples per group. The other datasets were either too large to process in a reasonable amount of time or too small for significant results.

For further comparison between the equal-error and minimax models, we use the 0/1 loss function with the Paired Regression Classifier. Our minimax model continues to Pareto-dominate the equal-error model: each group has a lower error than its counterpart. The group errors in the equal-error model are, as expected, inflated to around 0.365, significantly higher than every group error we see in the minimax model.
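The dominance claim itself is easy to check mechanically. The numbers below are illustrative stand-ins, apart from the 0.365 figure from our run:

import numpy as np

def pareto_dominates(minimax_errors, equal_errors):
    # True if minimax is no worse for every group and strictly
    # better for at least one; inputs map group name -> error.
    groups = minimax_errors.keys()
    no_worse = all(minimax_errors[g] <= equal_errors[g] for g in groups)
    strictly = any(minimax_errors[g] < equal_errors[g] for g in groups)
    return no_worse and strictly

minimax = {"White": 0.31, "Black": 0.33, "Male": 0.32, "Female": 0.30}
equal = {g: 0.365 for g in minimax}   # equal-error model inflates all groups
print(pareto_dominates(minimax, equal))  # True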

COMPAS with model weighted on Log-Loss and measured by log-loss (left) and 0/1 Loss (right)

Training the model on log-loss produces a convex plot when loss is also measured in log-loss. However, when measured with 0/1 loss, we get a complex, non-convex shape. Here we clearly see the gap between theory and practice, since log-loss violates the paper's first assumption.

I want to dedicate a section to false-positive rates. In credit decisions, it is more harmful to say someone has good credit when in reality they do not. We want to minimize the number of false positives here, so we'll attempt to show what the model can do.

Generalization

We now want to show generalization on the other datasets. First, we'll use the Student and Fire datasets for regression, and the Credit dataset for classification. We'll use an 80/20 training/testing split and see what happens.

Regression

For the Student dataset, we display our results below. We actually get quite poor results, which can be explained by the low number of samples available.

Instead, we tried the Wine dataset, whose groups are significantly bigger than those of the Student dataset.

a = 100 with 3000 steps, training (left) vs test (right)

While there are only two groups here, the two group errors are only marginally apart. Between the training and test sets, the margin for each group was roughly 0.00025. Although the white wine group's error decreased subtly, this graph further supports the paper's generalization claims.

Classification

Due to the nature of the datasets provided, it was difficult to get any to work: some had group sizes that were too small, while others would take days to run.

Conclusion

From our work with multiple datasets, whether reproducing the paper's results or testing the other datasets, we conclude that the theoretical results presented in the paper hold up empirically as well. We worked through various settings, exploring the effect of each of these "knobs" to better understand how to get the most out of these models.

With machine learning present in so many important decisions and predictions, it would be remiss not to consider the unfairness that may become embedded in these models. This paper shows an effective way of balancing the two objectives through a minimax model, allowing stakeholders to decide on the tradeoff between getting more right and choosing what is fair.
