Relative Importance Analysis for Psychological Research

Multiple linear regression is widely used by psychological researchers to answer research questions about causal relationships. Once a well-fitting model has been found, a primary interest is exploring the relative importance of the independent variables in explaining the total variation in the dependent variable. This paper considers two popular methods for obtaining relative importance, namely Shapley value regression and relative weight analysis. Both decompose the R² of the full model into the proportional contribution of each independent variable while accounting for the correlations between independent variables, and thus offer easily interpretable effect size measures for regression. Empirical data from Kaggle's World Happiness 2019 dataset illustrate the theoretical concepts behind these methods.


Introduction
The next step after finding a well-fitting multiple linear regression model is discovering which of the independent variables explain the most variability in the dependent variable. For example, a study investigating the impact of attitude and motivation on foreign language learning may examine the relative importance of each of the two variables in explaining language learning. This determines whether the learner's motivation or the learner's attitude better explains success in learning a foreign language.
According to Shabuz and Garthwaite (2019), the term importance can have multiple meanings. The first definition ties importance to the statistical significance of the corresponding regression coefficient. A second definition bases importance on a variable's practical impact on the dependent variable. Healy (1990) argued that relative importance is not merely a question of statistical significance.
Numerous methods have been proposed to evaluate relative importance in regression models, such as standardized regression coefficients, R²-change, semi-partial correlations, and zero-order correlations (Stadler et al., 2017; Johnson & LeBreton, 2004). Methods based on standardized coefficients are highly susceptible to multicollinearity, which can inflate regression coefficients and flip their signs, leading to inaccurate results (Lipovetsky & Conklin, 2015). In addition, when the independent variables correlate with each other, these methods fail to evaluate relative importance because they cannot properly divide the explained proportion of variance among the different independent variables (Darlington, 1968). This situation is common in psychological research when measuring constructs that consist of correlated facets.
This multicollinearity problem has driven the development of relative importance analysis in recent years, with methods that measure the importance of independent variables while accounting for the correlations between them. This paper elaborates the theoretical aspects of two popular methods, Shapley value regression and relative weight analysis, for assessing relative importance alongside multiple regression analysis. Although the two methods were developed in very different ways, Groemping (2015) showed that they provide similar scores. Furthermore, both have the desirable property that the individual contributions of the independent variables sum to the proportion of variability in the dependent variable explained by the model (R²), even in the presence of multicollinearity. An empirical study using a dataset from Kaggle illustrates how the importance of each independent variable can be identified with Shapley value regression and relative weight analysis.

Methods
Consider a multiple regression model of the form

y = β₀ + β₁x₁ + β₂x₂ + … + β_p x_p + ε.

If all independent variables are uncorrelated with each other, the standardized regression coefficients can be used to measure relative importance: the sum of the squared standardized coefficients equals the total R², so each squared coefficient measures the proportion of total variability in the dependent variable accounted for by that individual variable. However, this situation almost never occurs in psychological research. Therefore, using standardized coefficients to assess the contribution of each independent variable to the total R² is not advisable.
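The exact decomposition in the uncorrelated case can be checked numerically. The sketch below uses hypothetical simulated data in which the second predictor is made exactly uncorrelated with the first in the sample; the variable names and coefficients are illustrative, not from the paper's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: two predictors made exactly uncorrelated in the sample
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x1 = x1 - x1.mean()
x2 = x2 - x2.mean()
x2 = x2 - (x1 @ x2 / (x1 @ x1)) * x1   # remove any sample correlation with x1
y = 0.5 * x1 + 0.3 * x2 + rng.standard_normal(n)

def standardize(v):
    return (v - v.mean()) / v.std()

# Standardized regression: fit z-scored y on z-scored predictors
Z = np.column_stack([np.ones(n), standardize(x1), standardize(x2)])
yz = standardize(y)
beta, *_ = np.linalg.lstsq(Z, yz, rcond=None)

resid = yz - Z @ beta
r2 = 1 - (resid ** 2).sum() / (yz ** 2).sum()

# With uncorrelated predictors, the squared standardized coefficients
# decompose R^2 exactly
print(round(float((beta[1:] ** 2).sum()), 6), round(float(r2), 6))
```

With any correlation between the predictors, the two printed numbers would no longer agree, which is precisely why standardized coefficients fail as importance measures under multicollinearity.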
R²-change is another common approach for determining which independent variable contributes most to explaining the dependent variable. It is also called analysis of variance, in which the proportion of variance accounted for by each independent variable is recorded as the variable is added to the regression model. This method also suffers from multicollinearity, because the resulting proportions are arbitrary: they depend on which independent variable enters the model first.
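The order dependence of R²-change can be demonstrated directly. This minimal sketch uses hypothetical correlated predictors (names and coefficients are illustrative) and shows that the increment credited to a variable differs sharply depending on whether it enters the model first or last.

```python
import numpy as np

def r2(X, y):
    """R^2 of an OLS fit of y on the columns of X (intercept included)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Hypothetical correlated predictors
rng = np.random.default_rng(3)
x1 = rng.standard_normal(500)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(500)   # correlated with x1
y = x1 + x2 + rng.standard_normal(500)

# R^2-change credited to x1 depends on entry order
first = r2(x1[:, None], y)                                    # x1 entered first
last = r2(np.column_stack([x1, x2]), y) - r2(x2[:, None], y)  # x1 entered last
print(round(float(first), 3), round(float(last), 3))  # the two credits differ
```

Entering x1 first credits it with all of the variance it shares with x2; entering it last strips that shared variance away, so the same variable receives very different importance scores.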
In this paper, we introduce two methods, Shapley value regression and relative weight analysis, that address the situation described above and determine the relative importance of each independent variable.

Shapley Value Regression
The Shapley value is a solution concept from cooperative game theory introduced by Shapley (1953). It aims to fairly estimate the importance of each collaborating player to the total profit gained, given that players may contribute different amounts. Regression analysis borrowed this idea to estimate the importance of independent variables when high multicollinearity exists in the data. The marginal contribution of independent variable k to the total variability of the dependent variable in a multiple linear regression model can be expressed as a Shapley value (Joseph, 2019; Strumbelj and Kononenko, 2010):

φ_k(v) = Σ_{S ⊆ P\{k}} [ |S|! (p − |S| − 1)! / p! ] [ v(S ∪ {k}) − v(S) ],    (2)

where p is the number of independent variables included in the multiple regression model, P\{k} generates the set of all possible models that exclude the k-th variable, |S| is the number of variables included in a model, and P is the set of all p independent variables. For the R² decomposition, v(S) ≡ R²(S), i.e. the R² of a regression model including only the variables in S, and v(S ∪ {k}) ≡ R²(S ∪ {k}) is the R² from the same model including the k-th variable. From equation (2), φ_k(v) is the marginal contribution of the k-th variable to the overall R² (Coleman, 2017).
For illustration, consider a multiple linear regression with two independent variables x₁ and x₂ regressed on the dependent variable y. The Shapley value, or marginal contribution, of x₁ to the overall R² is computed by considering S ∈ { β₀, β₀ + β₂x₂ }, where β₀ represents an intercept-only model without any independent variables and β₀ + β₂x₂ is a simple regression model in which x₂ is the only independent variable. Thus,

φ₁(v) = ½ [R²(x₁) − R²(∅)] + ½ [R²(x₁, x₂) − R²(x₂)].

Similarly, the marginal contribution of x₂ to the overall R² is computed by considering S ∈ { β₀, β₀ + β₁x₁ }, and thus

φ₂(v) = ½ [R²(x₂) − R²(∅)] + ½ [R²(x₁, x₂) − R²(x₁)].

It is clear that φ₁(v) + φ₂(v) = R²(β₀ + β₁x₁ + β₂x₂), showing that each independent variable receives a unique share of the overall R². This means that when multicollinearity exists in the data, Shapley value regression decomposes the overall R² into marginal contributions and allows us to determine which independent variable contributes most to the variability of the dependent variable. A regression model containing more than two independent variables is handled by the generalized Shapley value formula in equation (2). The R package 'relaimpo' developed by Groemping (2006) is available for users to fit Shapley value regression.
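The decomposition above can be sketched in a few lines without any specialized package. This is a minimal, brute-force illustration of equation (2), assuming hypothetical simulated data with correlated predictors; a production analysis would use the 'relaimpo' package mentioned in the text.

```python
import numpy as np
from itertools import combinations
from math import factorial

def r2(X, y, cols):
    """R^2 of OLS of y on the predictor columns listed in `cols`."""
    Z = (np.ones((len(y), 1)) if not cols else
         np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols]))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def shapley_r2(X, y):
    """Decompose the total R^2 into one Shapley value per predictor."""
    n, p = X.shape
    phi = np.zeros(p)
    for k in range(p):
        others = [j for j in range(p) if j != k]
        for m in range(p):                       # size of the subset S
            w = factorial(m) * factorial(p - m - 1) / factorial(p)
            for S in combinations(others, m):
                phi[k] += w * (r2(X, y, list(S) + [k]) - r2(X, y, list(S)))
    return phi

# Hypothetical correlated data
rng = np.random.default_rng(1)
n = 500
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)   # correlated with x1
y = x1 + x2 + rng.standard_normal(n)
X = np.column_stack([x1, x2])

phi = shapley_r2(X, y)
total = r2(X, y, [0, 1])
print(np.round(phi, 3), round(float(total), 3))  # the phi's sum to the full-model R^2
```

Because the marginal contributions telescope across subsets, the Shapley values always sum exactly to the full-model R², which is the additivity property the text describes.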

Relative Weight Analysis
Relative weight (RW) analysis, introduced by Johnson (2000), is an alternative to Shapley value regression for deriving marginal contributions in multiple linear regression models. The basic idea is to transform the correlated independent variables (x) into a new set of orthogonal variables (z) that are uncorrelated with each other. Consider a regression model with two correlated independent variables and a dependent variable (y), as depicted in Figure 1. The association between either independent variable and the dependent variable can then be represented by two different regression equations (Tonidandel et al., 2009). The first equation relates the original independent variables (x) to the orthogonal variables (z):

x_j = Σ_k λ_jk z_k,

where λ_jk denotes the standardized regression coefficient linking the j-th original independent variable with the k-th orthogonal variable, which may also be interpreted as the correlation between x_j and z_k. The second equation describes the relationship between the orthogonal variables and the dependent variable:

ŷ = Σ_k β_k z_k,

where β_k denotes the standardized regression coefficient that links orthogonal variable z_k with the dependent variable. Combining these two equations, the relative weight, i.e. the variance in the dependent variable that can be explained by independent variable j, is calculated as the sum of the squared products of the two regression coefficients (λ_jk, β_k):

RW_j = Σ_k λ_jk² β_k².    (6)

The squared product λ_jk² β_k² describes the proportion of variability in the dependent variable associated with x_j through z_k. Summing these terms over all k yields the total proportion of variance attributed to x_j. Relative weight may therefore be used to measure the total variability in the dependent variable explained by x₁ independently of x₂, as shown in equation (6).
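Equation (6) can be computed from the correlation matrices alone. The sketch below follows the common construction in which Λ is taken as the symmetric square root of the predictor correlation matrix (so the z's are the closest orthogonal counterparts of the x's); the simulated data are hypothetical and purely illustrative.

```python
import numpy as np

def relative_weights(X, y):
    """Relative weights in the style of Johnson (2000).

    Returns one weight per predictor; the weights sum to the model R^2.
    """
    # Work with z-scored variables so everything is in correlation metric
    Xz = (X - X.mean(0)) / X.std(0)
    yz = (y - y.mean()) / y.std()
    n = len(y)
    Rxx = Xz.T @ Xz / n          # predictor correlation matrix
    rxy = Xz.T @ yz / n          # predictor-criterion correlations

    # Lambda = Rxx^(1/2): loadings of the x's on the orthogonal z's
    evals, P = np.linalg.eigh(Rxx)
    Lam = P @ np.diag(np.sqrt(evals)) @ P.T
    # beta: standardized coefficients regressing y on the orthogonal z's
    beta = np.linalg.solve(Lam, rxy)
    return (Lam ** 2) @ (beta ** 2)   # equation (6), one weight per x_j

# Hypothetical correlated predictors
rng = np.random.default_rng(2)
n = 500
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.standard_normal(n)

rw = relative_weights(X, y)
# Check: the weights sum to the squared multiple correlation R^2
Xz = (X - X.mean(0)) / X.std(0)
yz = (y - y.mean()) / y.std()
r2 = yz @ Xz @ np.linalg.solve(Xz.T @ Xz, Xz.T @ yz) / n
print(np.round(rw, 3), round(float(r2), 3))
```

Because each row of Λ² sums to the unit diagonal of Rxx, the weights add up to Σ β_k² = R², giving the same additive decomposition property as the Shapley approach.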

Results and Discussion
This paper used Kaggle's World Happiness data to demonstrate Shapley value regression and relative weight analysis for calculating relative importance. The dataset contains the happiness level of 156 countries measured in 2019. The happiness index acts as the dependent variable and is regressed on five independent variables: GDP per capita (x1), social support (x2), healthy life expectancy (x3), freedom to make life choices (x4), and perception of corruption (x5). The overall test indicates that the multiple regression model fits the data (F(5,150)=105, p<0.0001), and R²=0.777 shows that the five independent variables explain 77.7% of the variability in happiness. After obtaining a well-fitting model, the next step is to assess the contribution of each independent variable to the total variation. Table 1 provides the calculation of the marginal contribution of GDP per capita (x1) to the total variability in happiness using Shapley value regression. In total, 16 possible regression models are fitted, formed by including or excluding the remaining variables alongside x1. The term |S|!(p−|S|−1)!/p! in equation (2) is denoted by the weight (w) in the table. For example, the partial contribution of x1 when fitting the regression model with x1 and x2 is 0.05 × (0.704 − 0.604) = 0.005. Adding these values over all 16 possible models gives a marginal contribution for x1 of 0.213; the marginal contributions of the other variables follow from similar calculations. This method found marginal contributions for social support, healthy life expectancy, freedom to make life choices, and perception of corruption of 0.211, 0.201, 0.108, and 0.044, respectively.
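The weight and the partial-contribution step quoted above can be reproduced directly from the formula in equation (2); the 0.704 and 0.604 values are the R² figures cited in the text.

```python
from math import factorial

def shapley_weight(s, p):
    """Weight |S|!(p-|S|-1)!/p! attached to a subset of size s out of p predictors."""
    return factorial(s) * factorial(p - s - 1) / factorial(p)

# With p = 5 predictors, a one-variable subset such as {x2} gets weight
# 1! * 3! / 5! = 0.05, so the step reported in the text is
# 0.05 * (0.704 - 0.604) = 0.005.
w = shapley_weight(1, 5)
print(w, round(w * (0.704 - 0.604), 3))  # 0.05 0.005
```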
To find the marginal contributions of the five independent variables using relative weight analysis, we need estimates of the standardized regression coefficients linking the original independent variables with the orthogonal variables (λ) and those linking the orthogonal variables with the dependent variable (β). Table 2 provides these estimates. Using equation (6), the marginal contribution of x1 to the total variation of the dependent variable is

0.818²(0.460²) + 0.353²(0.469²) + 0.419²(0.442²) + 0.129²(0.332²) + 0.119²(0.202²) = 0.206.

By similar calculations, the marginal contributions of the other variables (x2, x3, x4, and x5) under relative weight analysis are 0.212, 0.196, 0.114, and 0.049, respectively. Table 3 compares the relative importance results obtained from Shapley value regression and relative weight analysis, together with the correlations between the independent variables. Multicollinearity is present in the data owing to the very high correlation between GDP per capita and life expectancy (correlation coefficient 0.835). The unstandardized regression coefficients show that all variables have a significant effect on happiness, but they cannot identify the most influential variable because the scales used to measure the variables differ. The order of importance from Shapley value regression closely resembles that obtained from relative weight analysis, whereas assessing contributions to the total variation from the regression coefficients themselves is not viable due to multicollinearity. An advanced bootstrapping method could be used to test the statistical difference between GDP and social support; this paper does not examine the matter and leaves it as a recommendation for future research.
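The x1 relative weight quoted above follows directly from equation (6) applied to the Table 2 coefficients cited in the text:

```python
# Reproduce the x1 relative weight from the Table 2 coefficients in the text
lam = [0.818, 0.353, 0.419, 0.129, 0.119]   # loadings of x1 on z1..z5
beta = [0.460, 0.469, 0.442, 0.332, 0.202]  # coefficients of y on z1..z5
rw_x1 = sum((l ** 2) * (b ** 2) for l, b in zip(lam, beta))
print(round(rw_x1, 3))  # 0.206
```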
According to several studies, relative weight analysis and Shapley value regression consistently yield similar results, indicating that the two methods appraise a common underlying construct (Shabuz & Garthwaite, 2019). The findings in this study also confirmed a drawback of Shapley value regression: it becomes computationally intensive, because the time needed to compute the Shapley values grows exponentially as the number of independent variables increases, and it becomes cumbersome with more than 10 variables (Aas et al., 2020). Relative weight analysis, by contrast, offers the flexibility to use as many independent variables as the researcher wants. It places no constraint on the number of independent variables included in the regression model and can therefore be considered a strong alternative to Shapley value regression.

Conclusion
Shapley value regression and relative weight analysis are among the most widely recommended methods for finding the relative importance of variables. Both methods usually produce similar evaluations and are able to split the total variation in the dependent variable (R²) into the individual contributions made by each independent variable while accounting for multicollinearity in the data. Psychological researchers can consider these approaches valuable supplements to their primary multiple linear regression analysis for identifying the independent variable that contributes most to their variable of interest. For future research, evaluating the statistical significance of relative importance is highly advised, for example testing whether an estimated Shapley value or relative weight differs significantly from zero, or whether two estimates differ significantly from each other. Such evaluations will require advanced statistical methods such as bootstrapping.