"Learning Decision Trees and Forests with Algorithmic Recourse" presented on ICML 2024

Introduction

Hi! This is Kentaro Kanamori from the Artificial Intelligence Laboratory. We are conducting research and development on an "XAI-based Decision-Making Assistant" and trying to apply our technologies to real decision-making tasks (e.g., discovering actions for improving employees' productivity from engagement survey data).
Recently, our research paper entitled "Learning Decision Trees and Forests with Algorithmic Recourse" was accepted at ICML 2024 as a spotlight paper (top 3.5% of the submitted papers!). This article briefly introduces the contents of the paper.

Paper

  • Title: Learning Decision Trees and Forests with Algorithmic Recourse
  • Authors: Kentaro Kanamori (Fujitsu Limited), Takuya Takagi (Fujitsu Limited), Ken Kobayashi (Tokyo Institute of Technology), Yuichi Ike (Kyushu University)
  • Conference: The Forty-First International Conference on Machine Learning (ICML 2024)
  • Link (OpenReview): https://openreview.net/forum?id=blGpu9aGs6
  • Link (arXiv): https://arxiv.org/abs/2406.01098

Overview

TL;DR

In this paper, we focus on decision trees and tree ensemble models as machine learning models, and on action explanations based on counterfactual explanation as the explanation method, and propose a method to efficiently learn a model that achieves both accurate prediction and realistic action explanations.

Background

Action Explanation and Algorithmic Recourse Based on Counterfactual Explanation

With the dramatic development of AI technologies such as deep learning, foundation models, and generative AI, data-driven decision making, which uses predictions from machine learning models to streamline various decision-making tasks, is becoming increasingly popular. On the other hand, for users to accept the predictions output by machine learning models with confidence in real-world decision making, it is important to provide explainability, that is, additional information about the prediction results, and various explanation techniques have been studied and proposed [1].

Among them, counterfactual explanation [2] is a method that presents, as an explanation, how a user should change their feature values to obtain a desired prediction result; we call such a change an action. For example, if we apply a counterfactual explanation method to a machine learning model that performs loan screening using information such as a user's monthly income and credit history as input, we can obtain an action for passing the loan screening as an explanation, such as “reduce the number of outstanding loans by one” or “increase monthly income by $2K” (Figure 1). Because it provides a more constructive explanation of the prediction result in the form of actions than conventional explanation techniques that merely provide the basis for the prediction, counterfactual explanation has attracted much attention in recent years [3].

In addition, counterfactual explanation is regarded as a fundamental technology for realizing algorithmic recourse, a new aspect of AI reliability. Algorithmic recourse aims to ensure that "if a human user takes an appropriate action, a desired result can be obtained from the AI without fail," and it has been actively studied in recent years due to its importance [4].

Figure 1. Example of Action Explanation Based on Counterfactual Explanation

Issues with Existing Methods

Much of the previous work on counterfactual explanation has focused on developing methods for extracting actions from models that have already been trained. For example, our group has researched and developed several action explanation techniques, such as a realistic action explanation technique that considers the data distribution [5], a technique for optimizing an appropriate execution order of actions by considering causal relationships between features [6], and an explanation technique that summarizes the actions presented to multiple user groups in an interpretable form [7].

However, for a given trained model, there may not always be an action that the user can perform and that produces the desired result. Figure 2 shows an example: the decision regions of a loan screening model learned from the two features "Age" and "Income." Here, the red and blue regions correspond to the regions where the machine learning model rejects and approves the user's loan application, respectively. The current feature values of a user are represented by a point in the feature space, and we want to find an action that moves the user from the red region to the blue region (i.e., how to change the feature values). For the machine learning model shown in Figure 2, the user can move to the blue region by decreasing the value of the feature "Age"; however, this is an infeasible action. On the other hand, increasing the feature "Income" seems like a reasonable and feasible action, but as far as Figure 2 shows, such an action would not move the user to the blue region under this model. Thus, there is no guarantee that a realistic action always exists for a given trained model that has been optimized only for predictive performance.

To solve this issue, a new learning framework is necessary, because whether a valid and executable action exists for each user must be considered at the time of model training. In fact, this issue has already been pointed out in a previous study by Ross et al. [8], and a learning framework considering actions has been proposed. However, this existing technique is a gradient-based learning algorithm that assumes a deep neural network as the machine learning model, and it cannot be applied to decision trees and tree ensemble models (such as Random Forest [9]), which are widely used for the tabular data frequently encountered in prediction tasks where action explanation is important (such as loan screening).

Figure 2. Demonstration of Issues of Existing Methods

Contributions of This Paper

In this paper, we focus on decision trees and tree ensemble models as machine learning models, and aim to propose an efficient learning algorithm for decision trees that takes into account the ratio of input data having at least one valid and executable action. For that purpose, we introduce the recourse risk, which evaluates the ratio of training samples having no valid and executable action. Our contributions can be summarized as follows:

  1. By extending the standard top-down decision tree learning algorithm, we propose an efficient learning algorithm for decision trees that takes our recourse risk into account. In addition, we show that our algorithm has the same time complexity as standard decision tree learning algorithms (such as CART [10]) and can be easily extended to the framework of random forests.
  2. We introduce a new task of editing a learned decision tree under constraints on our recourse risk and show that this task can be reduced to the weighted partial cover problem, which is a variant of the minimum set-cover problem. We also show that we can give a probabilistic theoretical guarantee on our recourse risk for unseen test instances through the PAC learning framework.
  3. Numerical experiments on real datasets in the financial and judicial fields showed that our method can significantly improve the action guarantee rate without degrading prediction accuracy or learning efficiency. Specifically, we confirmed that our method improved the action guarantee rate by up to 50% while keeping the degradation in prediction accuracy to around 1%.

After introducing some definitions, we describe our proposed method and outline our results. Please refer to the original paper for details.

Technical Details

Formulation of Our Learning Problem

Decision tree

Decision trees are one of the most popular machine learning models; they represent a set of if-then-else rules as a binary tree. For an input  x \in \mathbb{R}^D, a decision tree  h \colon \mathbb{R}^D \to \{ \pm 1\} starts at its root node, follows its branching condition  x_d \leq b at each internal node, and outputs the predictive label  \hat{y} \in \{ \pm 1\} of the leaf it finally reaches as the prediction for  x.

To learn a decision tree from a given training sample, we need to determine the structure of the tree, the branching condition of each internal node, and the predictive label of each leaf. However, obtaining the exact solution efficiently is difficult, because finding the decision tree that minimizes the training error is a computationally hard combinatorial optimization problem. Therefore, top-down greedy algorithms such as CART [10] are widely used.
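As a concrete illustration of this prediction procedure, below is a minimal Python sketch (not the paper's implementation) of a binary decision tree that follows conditions of the form  x_d \leq b from the root to a leaf; the node structure and the toy tree are purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal node: split on feature `feature` with threshold `threshold`.
    # Leaf node: `label` holds the predictive label in {+1, -1}.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None

def predict(node: Node, x: list[float]) -> int:
    """Follow branching conditions x[d] <= b from the root to a leaf."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

# Toy tree: "if Income <= 3.0 then -1 else +1" (feature 1 = Income, assumed).
tree = Node(feature=1, threshold=3.0,
            left=Node(label=-1), right=Node(label=+1))
print(predict(tree, [35.0, 2.5]))  # -> -1 (rejected)
print(predict(tree, [35.0, 4.0]))  # -> +1 (approved)
```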

Figure 3. Example of Decision Tree


Action Explanation

Now assume that the desired prediction result is  y = +1. For a given input  x \in \mathbb{R}^D and model  h \colon \mathbb{R}^D \to \{ \pm 1 \}, action explanation based on counterfactual explanation [2] aims to find an action  a^\ast \in \mathbb{R}^D that is an optimal solution to the following optimization problem:


{\min}_{a \in \mathcal{A}} \; c(a) \;\;\; \text{subject to} \;\; h(x + a) = +1

where  \mathcal{A} \subseteq \mathbb{R}^D is a set of feasible actions, and  c \colon \mathcal{A} \to \mathbb{R}_{\geq 0} is a cost function that evaluates the execution cost of an action  a \in \mathcal{A}.

While the definition of the set of feasible actions depends on the prediction task, we assume that it is expressed in the form  \mathcal{A} = [l_1, u_1] \times \dots \times [l_D, u_D] \subseteq \mathbb{R}^D. For example, for a feature  d \in [ D ] that cannot be changed, such as age or gender, you can specify  l_d = u_d = 0 as the corresponding lower and upper limits to restrict the feature so that its value cannot be changed.

Because cost functions play an important role in obtaining realistic actions, various cost functions have been proposed. In this paper, we assume a cost function of  \ell_{\infty}-norm type expressed as  c(a) = \max_{d \in [ D ]} c_d (a_d), where  c_d \colon \mathbb{R} \to \mathbb{R}_{\geq 0} is a cost measure for the feature  d. This form of the cost function includes some major cost functions such as max percentile shift [4].
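To make these definitions concrete, here is a small sketch of how the box-shaped action set and the  \ell_{\infty}-type cost could be evaluated, assuming simple per-feature cost measures  c_d(a_d) = w_d \cdot |a_d|; the bounds and weights below are made-up values, not ones from the paper.

```python
import numpy as np

# Box-shaped feasible action set A = [l_1, u_1] x ... x [l_D, u_D].
# Feature 0 ("Age") is immutable (l = u = 0); feature 1 ("Income") may change by -1 to +3.
lower = np.array([0.0, -1.0])
upper = np.array([0.0,  3.0])

def is_feasible(a: np.ndarray) -> bool:
    """Check a in A (each component within its interval)."""
    return bool(np.all((lower <= a) & (a <= upper)))

def cost(a: np.ndarray, weights: np.ndarray) -> float:
    """l_inf-type cost: max_d c_d(a_d), here with c_d(a_d) = w_d * |a_d| as a stand-in."""
    return float(np.max(weights * np.abs(a)))

a = np.array([0.0, 2.0])            # "increase Income by 2"
w = np.array([1.0, 0.5])            # per-feature cost weights (assumed)
print(is_feasible(a), cost(a, w))   # True 1.0
```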

Figure 4. Image of Action Optimization Problem


Formulation

The goal of this paper is to efficiently learn an accurate decision tree while ensuring valid and executable actions for as many training instances as possible. For that purpose, we first introduce a new loss function named the recourse loss:


l_\mathrm{rec}(x; h) = \min_{a \in \mathcal{A}_{\varepsilon} } l_{01}(+1, h(x + a))

where  l_{01}(y, \hat{y}) = \mathbb{I}[y \not = \hat{y}] is the 0-1 loss function and  \mathcal{A}_{\varepsilon} = \{a \in \mathcal{A} \mid c(a) \leq \varepsilon \} is the set of feasible actions whose cost is less than or equal to a given cost budget  \varepsilon > 0. For an input  x and model  h, our recourse loss takes  0 if there exists at least one action that is valid (i.e.,  h (x + a) = +1) and executable (i.e.,  a \in \mathcal{A} and  c(a) \leq \varepsilon); otherwise it takes  1. Let us define the empirical recourse risk as the average value  \hat{\Omega}_{\varepsilon} (h) = \frac{1}{N} \sum_{n=1}^{N} l_\mathrm{rec} (x_n; h) of this recourse loss on a given training sample  S = \{ (x_n, y_n) \}_{n=1}^{N}. By definition, our empirical recourse risk can be interpreted as the ratio of training instances that do not have valid and executable actions.
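The following is a minimal sketch of how the recourse loss and the empirical recourse risk  \hat{\Omega}_{\varepsilon}(h) could be evaluated for a given classifier, assuming a finite set of candidate actions; the brute-force enumeration of actions is an illustrative simplification, not the paper's algorithm.

```python
import numpy as np

def recourse_loss(x, predict, candidate_actions, cost, eps):
    """0 if some action a with cost(a) <= eps yields predict(x + a) == +1, else 1."""
    for a in candidate_actions:
        if cost(a) <= eps and predict(x + a) == +1:
            return 0
    return 1

def empirical_recourse_risk(X, predict, candidate_actions, cost, eps):
    """Fraction of training instances with no valid and executable action."""
    losses = [recourse_loss(x, predict, candidate_actions, cost, eps) for x in X]
    return float(np.mean(losses))

# Toy example: approve (+1) if Income > 3.0; feature 0 = Age is immutable (assumed).
predict = lambda z: +1 if z[1] > 3.0 else -1
cost = lambda a: float(np.max(np.abs(a)))                           # assumed l_inf cost
actions = [np.array([0.0, d]) for d in np.arange(0.0, 2.1, 0.5)]    # change Income only
X = np.array([[30.0, 2.5], [40.0, 0.5]])                            # two rejected applicants
print(empirical_recourse_risk(X, predict, actions, cost, eps=1.0))  # 0.5
```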

Using our empirical recourse risk, we formulate our learning problem as follows:


{\min}_{h \in \mathcal{H}} \; \hat{R}(h) \;\;\; \text{subject to} \;\; \hat{\Omega}_{\varepsilon}(h) \leq \delta

where  \mathcal{H} is a set of decision trees,  \hat{R}(h) = \frac{1}{N} \sum_{n=1}^{N} l_{01}(y_n, h (x_n)) is the empirical risk, and  \delta \geq 0 is a given hyperparameter. By solving the above learning problem, we expect to obtain an accurate decision tree  h while guaranteeing the existence of valid and executable actions  a for more than  100 \cdot (1 - \delta)\% of the instances  x in the training sample  S.

Proposed algorithm

As mentioned above, the learning problem of decision trees is a combinatorial optimization problem that is difficult to solve exactly, so top-down greedy algorithms are widely used. In a standard learning algorithm, branching conditions and predictive labels are determined in a top-down manner based on the information gain (roughly speaking, how much the prediction error can be reduced by adding a new branching condition).

Our proposed algorithm is a simple extension of this top-down greedy learning algorithm. Suppose that an internal node with a new branching condition  x_d \leq b and left and right leaf nodes with predictive labels  \hat{y}_{\mathrm{L}}, \hat{y}_{\mathrm{R}} are to be added to the current decision tree  h, and let  h' denote the resulting decision tree. In this case, our proposed algorithm determines the parameters of the branching condition  d \in [ D ], b \in \mathbb{R} and the predictive labels  \hat{y}_{\mathrm{L}}, \hat{y}_{\mathrm{R}} \in \{ \pm 1\} that minimize the following objective function:


\Phi_\lambda(d, b, \hat{y}_{\mathrm{L}}, \hat{y}_{\mathrm{R}}) = \hat{R}(h') + \lambda \cdot \hat{\Omega}_{\varepsilon}(h')

where  \lambda \geq 0 is a given trade-off parameter. The second term  \lambda \cdot \hat{\Omega}_{\varepsilon}(h') of  \Phi_\lambda can be viewed as a relaxation of the constraint  \hat{\Omega}_{\varepsilon} (h) \leq \delta on the empirical recourse risk in the original learning problem. The branching condition  x_d \leq b and the predictive labels  \hat{y}_{\mathrm{L}}, \hat{y}_{\mathrm{R}} are determined by solving the problem of minimizing the above objective function  \Phi_\lambda (the greedy splitting problem) at each node. By recursively repeating this procedure in a top-down manner, we train a decision tree.
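The toy sketch below conveys the idea of the greedy splitting problem by scoring every candidate split  (d, b, \hat{y}_{\mathrm{L}}, \hat{y}_{\mathrm{R}}) of a depth-one tree with  \Phi_\lambda by brute force; the paper's algorithm evaluates the same objective far more efficiently (Proposition 1 below), and the data, costs, and candidate actions here are made up.

```python
import numpy as np
from itertools import product

def stump_predict(x, d, b, y_left, y_right):
    """Predict with a single split: label y_left if x[d] <= b, else y_right."""
    return y_left if x[d] <= b else y_right

def phi(X, y, d, b, y_left, y_right, lam, actions, cost, eps):
    """Phi_lambda = empirical risk + lambda * empirical recourse risk of the stump."""
    preds = np.array([stump_predict(x, d, b, y_left, y_right) for x in X])
    emp_risk = float(np.mean(preds != y))
    no_action = [
        0 if any(cost(a) <= eps and stump_predict(x + a, d, b, y_left, y_right) == +1
                 for a in actions) else 1
        for x in X
    ]
    return emp_risk + lam * float(np.mean(no_action))

def best_split(X, y, lam, actions, cost, eps):
    """Brute-force search over (d, b, y_L, y_R) minimizing Phi_lambda."""
    best_val, best_params = np.inf, None
    for d in range(X.shape[1]):
        for b in np.unique(X[:, d]):
            for y_left, y_right in product((-1, +1), repeat=2):
                val = phi(X, y, d, b, y_left, y_right, lam, actions, cost, eps)
                if val < best_val:
                    best_val, best_params = val, (d, b, y_left, y_right)
    return best_params, best_val

# Toy data: feature 0 = Age (immutable), feature 1 = Income.
X = np.array([[25.0, 1.0], [35.0, 2.0], [45.0, 5.0], [50.0, 6.0]])
y = np.array([-1, -1, +1, +1])
cost = lambda a: float(np.max(np.abs(a)))                  # assumed l_inf-type cost
actions = [np.array([0.0, s]) for s in (0.0, 1.0, 2.0)]    # Income may increase by up to 2
print(best_split(X, y, lam=1.0, actions=actions, cost=cost, eps=2.0))
```

With the recourse term included, the search prefers the split on "Income" (for which every instance can still reach the positive leaf by a feasible action) over an equally accurate split on the immutable feature "Age."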

Figure 5. Example of How Our Algorithm Works

Since the proposed method is a simple extension of the standard decision tree learning algorithm, it is easily applicable to the framework of Random Forest [9], a popular method for learning tree ensemble models. In the random forest learning algorithm, multiple distinct decision trees are learned through feature subsampling and bagging. The proposed algorithm can therefore be extended to random forests by replacing the learner of each individual decision tree with our algorithm, as sketched below.
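A minimal sketch of this extension: a standard bagging loop with bootstrap sampling and feature subsampling, where `fit_tree` is a deliberately simplified stand-in (a one-level stump on a random feature) for the recourse-aware tree learner described above.

```python
import numpy as np

def fit_tree(X, y, feature_subset, rng):
    """Hypothetical stand-in for the recourse-aware tree learner: a one-level stump."""
    d = int(rng.choice(feature_subset))
    b = float(np.median(X[:, d]))
    left_label = +1 if np.mean(y[X[:, d] <= b] == +1) >= 0.5 else -1
    return (d, b, left_label, -left_label)

def fit_forest(X, y, n_trees=10, seed=0):
    """Bagging + feature subsampling; each base learner would be replaced
    by the recourse-aware tree learner to obtain a recourse-aware forest."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)                              # bootstrap sample
        feats = rng.choice(D, size=max(1, int(np.sqrt(D))), replace=False)
        trees.append(fit_tree(X[idx], y[idx], feats, rng))
    return trees

def forest_predict(trees, x):
    """Majority vote over the trees."""
    votes = [(yl if x[d] <= b else yr) for d, b, yl, yr in trees]
    return +1 if sum(votes) >= 0 else -1

X = np.array([[25.0, 1.0], [35.0, 2.0], [45.0, 5.0], [50.0, 6.0]])
y = np.array([-1, -1, +1, +1])
forest = fit_forest(X, y, n_trees=5)
print(forest_predict(forest, np.array([48.0, 5.5])))   # ensemble prediction in {+1, -1}
```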

Theoretical Results

As mentioned above, the proposed algorithm solves the greedy splitting problem under an objective function that considers our empirical recourse risk in addition to the empirical risk. If our empirical recourse risk is ignored (that is,  \lambda = 0), it is roughly equivalent to the problem solved by standard decision tree learning algorithms, which is known to be solvable in  \mathcal{O}(D \cdot N) time with appropriate preprocessing (such as sorting the training instances). In this paper, we show that the greedy splitting problem can be solved with the same time complexity even when our empirical recourse risk is considered (that is, there is no computational overhead for our additional constraint).

Proposition 1.
The proposed algorithm solves the greedy splitting problem with our empirical recourse risk ( \lambda > 0) in  \mathcal{O}(D \cdot N) time.

On the other hand, since the proposed algorithm learns a decision tree by relaxing the constraint  \hat{\Omega}_{\varepsilon}(h) \leq \delta of our original learning problem, the obtained decision tree may not satisfy this constraint. Therefore, we introduce a new task of editing the predictive labels of some leaf nodes of a learned decision tree so as to satisfy the constraint (the relabeling problem). In this paper, we show that this task can be reduced to the weighted partial cover problem, which is a variant of the minimum set-cover problem. For the weighted partial cover problem, a polynomial-time algorithm with an approximation guarantee is known to exist. In other words, by solving the reduced problem, we can efficiently modify a learned decision tree so that it satisfies the constraint on our recourse risk.

Proposition 2.
The relabeling problem is reduced to the weighted partial cover problem.
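For intuition about the reduction, the sketch below runs the classic greedy heuristic on a toy weighted partial cover instance: each candidate leaf relabeling covers a set of instances (those that would gain a valid action) at a weight (an assumed proxy for the accuracy lost by flipping that leaf), and sets are picked greedily until a target fraction of instances is covered. The sets and weights are illustrative; the paper's exact construction is not reproduced here.

```python
import math

def greedy_partial_cover(sets, weights, universe, target_fraction):
    """Greedy heuristic for weighted partial cover: repeatedly pick the set with
    the best (newly covered elements) / weight ratio until the target coverage."""
    need = math.ceil(target_fraction * len(universe))
    covered, chosen = set(), []
    while len(covered) < need:
        best_i, best_ratio = None, 0.0
        for i, s in enumerate(sets):
            gain = len(s - covered)
            if gain == 0 or i in chosen:
                continue
            ratio = gain / weights[i]
            if ratio > best_ratio:
                best_i, best_ratio = i, ratio
        if best_i is None:        # no remaining set adds coverage; target infeasible
            break
        chosen.append(best_i)
        covered |= sets[best_i]
    return chosen, covered

# Toy instance: 3 candidate leaf flips, 5 training instances needing recourse.
sets = [{0, 1}, {1, 2, 3}, {3, 4}]
weights = [1.0, 2.0, 1.0]         # assumed accuracy-loss proxies
chosen, covered = greedy_partial_cover(sets, weights, universe=set(range(5)),
                                       target_fraction=0.8)
print(chosen, covered)            # [0, 2] {0, 1, 3, 4}
```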

Furthermore, while our empirical recourse risk  \hat{\Omega}_{\varepsilon} evaluates the action guarantee rate on a given training sample  S, it does not necessarily guarantee the existence of a valid and executable action for an unseen test instance. In this paper, using the framework of PAC learning (probably approximately correct learning), which is fundamental in computational learning theory, we show an upper bound on the estimation error of our empirical recourse risk. Combining this result with the relabeling task described above, we can also provide a theoretical guarantee on the action guarantee rate for test data.

Proposition 3.
We define the expected recourse risk by  \Omega_{\varepsilon}(h) = \mathbb{E}_{x} [l_\mathrm{rec}(x; h)]. Then, for any  \alpha > 0, the following inequality holds with probability at least  1-\alpha:  \Omega_{\varepsilon}(h) \leq \hat{\Omega}_{\varepsilon}(h) + \sqrt{\frac{\ln |\mathcal{H}| -\ln \alpha}{2 \cdot |S|}}
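As a quick numerical illustration of Proposition 3, the bound can be evaluated directly; the values of  \hat{\Omega}_{\varepsilon}(h),  |\mathcal{H}|,  |S|, and  \alpha below are made-up numbers, not results from the paper.

```python
import math

def recourse_risk_bound(emp_risk, hypothesis_count, n_samples, alpha):
    """Upper bound from Proposition 3:
    Omega <= Omega_hat + sqrt((ln|H| - ln(alpha)) / (2 * |S|))."""
    return emp_risk + math.sqrt((math.log(hypothesis_count) - math.log(alpha)) / (2 * n_samples))

# Assumed toy values: empirical recourse risk 0.05, |H| = 10**6, |S| = 10000, alpha = 0.05.
print(round(recourse_risk_bound(0.05, 10**6, 10_000, 0.05), 4))   # ~0.079
```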

Experimental Results

In numerical experiments, we compared our method with baseline methods on four tabular datasets (FICO, COMPAS, Credit, Bail) that are widely used as benchmarks in the explainability field. Here, FICO and Credit are real datasets on loan screening in the financial sector, and COMPAS and Bail are real datasets on recidivism prediction in the legal sector. As baseline methods, we used a standard learning method without constraints (Vanilla) and a learning method using only actionable features (OAF). We performed 10-fold cross-validation on each dataset and measured prediction accuracy, action guarantee rate, and computation time.

Figure 6. Experimental Results of Baseline Comparison

Figure 6 shows the experimental results of our baseline comparison. The upper part shows the prediction accuracy, the middle part shows the action guarantee rate, and the lower part shows the computation time. Based on the experimental results, the proposed method (recourse-aware classification tree; RACT) achieved a higher action guarantee rate than the baseline methods. On the other hand, there was no significant difference in prediction accuracy or computation time among the methods. From these results, we confirmed that the proposed method (RACT) achieves a high action guarantee rate without degrading prediction accuracy or computational efficiency.

Figure 7. Experimental Results of Sensitivity Analysis for  \lambda

Figure 7 shows the experimental results of our sensitivity analysis of the trade-off parameter  \lambda for a random forest. As the value of  \lambda decreases, the AUC (upper row) improves; as the value of  \lambda increases, the action guarantee rate (lower row) improves. These results confirm that the proposed method can adjust the trade-off between prediction accuracy and action guarantee rate by controlling the trade-off parameter  \lambda. In real decision-making situations, it is not always necessary to guarantee actions for all users at the expense of predictive accuracy. Our experimental results suggest that the proposed method can be flexibly adapted to such realistic use cases.

Conclusion

This article introduced our research on "Learning Decision Trees and Forests with Algorithmic Recourse," which was accepted as a spotlight paper at ICML 2024. Future work includes extending our method to gradient boosting frameworks such as XGBoost and LightGBM, another popular family of tree ensemble models. Applying our method to actual decision-making tasks is also part of our future work.

Resources

  • [1] Satoshi Hara. Explainable AI (my bookmarks). Journal of the Japanese Society for Artificial Intelligence, Vol. 34, No. 4, pp. 577–582, 2018.
  • [2] S. Wachter, B. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, Vol. 31, No. 2, pp. 841–887, 2018.
  • [3] K. Kanamori and T. Takagi. Mathematical Optimization for Explainable Machine Learning ― Counterfactual Explanation Based on Mixed-Integer Linear Optimization ―. Bulletin of the Japan Society for Industrial and Applied Mathematics, Vol. 33, No. 4, pp. 207–212, 2023.
  • [4] B. Ustun, A. Spangher, and Y. Liu. Actionable recourse in linear classification. In Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency (FAT* 2019), pp. 10–19, 2019.
  • [5] K. Kanamori, T. Takagi, K. Kobayashi, and H. Arimura. DACE: Distribution-aware counterfactual explanation by mixed-integer linear optimization. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), pp. 2855–2862, 2020.
  • [6] K. Kanamori, T. Takagi, K. Kobayashi, Y. Ike, K. Uemura, and H. Arimura. Ordered counterfactual explanation by mixed-integer linear optimization. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), pp. 11564–11574, 2021.
  • [7] K. Kanamori, T. Takagi, K. Kobayashi, and Y. Ike. Counterfactual explanation trees: Transparent and consistent actionable recourse with decision trees. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS 2022), pp. 1846–1870, 2022.
  • [8] A. Ross, H. Lakkaraju, and O. Bastani. Learning models for actionable recourse. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NeurIPS 2021), pp. 18734–18746, 2021.
  • [9] L. Breiman. Random forests. Machine Learning, Vol. 45, No. 1, pp. 5–32, 2001.
  • [10] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.