Understanding Logistic Regression and Its Assumptions
Logistic regression is a versatile statistical method used for modeling the probability of a binary outcome, such as whether a patient has a disease or whether an email is spam or not spam. It can also be extended to handle ordinal and multinomial outcomes through similar methodologies, but we will focus on binary logistic regression in this post. A few examples of each outcome type:
- Binary: disease vs. no disease; true vs. false
- Ordinal: bad < neutral < good; cold < mild < hot
- Multinomial (nominal): red vs. blue vs. green; fluid vs. gas vs. solid
How Logistic Regression Works
Unlike linear regression, which predicts continuous numeric values, logistic regression predicts the probability that a given observation belongs to a particular category. It accomplishes this by applying the logistic function (also known as the sigmoid function) to a linear combination of the predictor variables.
The logistic function maps any real-valued number into a range between 0 and 1, making it suitable for representing probabilities:

$$P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)}}$$

Where:
- $P(Y=1 \mid X)$ is the probability of the outcome being 1 given the predictors $x_1, x_2, \dots, x_k$.
- $\beta_0, \beta_1, \dots, \beta_k$ are the coefficients associated with the predictors ($\beta_0$ is the intercept).
- $x_1, x_2, \dots, x_k$ are the predictor variables.
The logistic function can also be written in matrix notation as:

$$P(Y=1 \mid X) = \frac{1}{1 + e^{-X\beta}}$$

Where $X$ represents the matrix of predictor variables (with a leading column of ones for the intercept) and $\beta$ represents the vector of coefficients.
Proof for $0 < P(Y=1 \mid X) < 1$: The logistic function constrains its output to the range (0, 1) because the term $e^{-X\beta}$ in the denominator is strictly positive for every real value of $X\beta$, so the denominator $1 + e^{-X\beta}$ is always greater than 1 and the fraction never reaches 1. As $X\beta$ goes to $+\infty$, $e^{-X\beta}$ approaches 0 and the probability approaches 1; as $X\beta$ goes to $-\infty$, $e^{-X\beta}$ grows without bound and the probability approaches 0.
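To make the bound concrete, here is a minimal numerical sketch (Python and NumPy, not part of the original derivation) that evaluates the logistic function at a few linear-predictor values and shows the outputs staying strictly between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Even extreme values of the linear predictor stay strictly inside (0, 1).
z = np.array([-50.0, -2.0, 0.0, 2.0, 50.0])
print(sigmoid(z))  # approx [0.0000, 0.1192, 0.5000, 0.8808, 1.0000]
```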
Furthermore, the coefficients in logistic regression represent the change in the log odds of the outcome for a one-unit change in the corresponding predictor variable, holding all other predictors constant. Taking the exponential of the coefficients yields the odds ratios associated with each predictor variable. Mathematically, the interpretation of coefficients as odds ratios can be derived from the logistic function:

$$\frac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}
\quad\Longrightarrow\quad
\log\left(\frac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

so increasing $x_j$ by one unit multiplies the odds by $e^{\beta_j}$.
For example, if the coefficient for a predictor variable is 0.5, it means that for every one-unit increase in that predictor variable, the odds of the outcome occurring increase by a factor of $e^{0.5}$, or approximately 1.65 times.
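As an illustration, here is a small hypothetical sketch using statsmodels: we simulate a predictor whose true log-odds coefficient is 0.5, fit a logistic regression, and exponentiate the estimated coefficients to read them as odds ratios (the data, seed, and variable names are invented for the example):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: one predictor x with a true log-odds coefficient of 0.5.
rng = np.random.default_rng(42)
x = rng.normal(size=1000)
p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * x)))
y = rng.binomial(1, p)

X = sm.add_constant(x)              # adds the intercept column
model = sm.Logit(y, X).fit(disp=0)  # disp=0 silences the optimizer output

print(model.params)          # estimated coefficients on the log-odds scale
print(np.exp(model.params))  # odds ratios; the slope should land near e^0.5 ≈ 1.65
```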
Assumptions of Logistic Regression
Despite its effectiveness, logistic regression relies on certain assumptions for its validity. Here are the key assumptions:
- Binary Outcome: Logistic regression assumes that the dependent variable is binary; that is, it has only two possible outcomes.
- Independence of Observations / Errors: Each observation should be independent of the others. In other words, there should be no correlation between the observations in the dataset, such as duplicate responses.
- Linearity of Independent Variables and Log Odds: The relationship between the independent variables and the log odds of the dependent variable should be linear. This assumption is crucial because logistic regression works on the log-odds scale. It holds automatically for binary predictors (simple binary variables, dummy-coded categorical variables, etc.), but you should check it for ordinal and continuous variables and consider splines or polynomial terms if a variable violates it. It is also possible to convert a continuous variable into a categorical one, but this has potential downsides, such as loss of information and an increased number of predictor variables. Methods for detecting non-linearity include the Box-Tidwell test, which assesses the assumption by adding interaction terms between the continuous predictors and their natural logarithms, and binning the variable, calculating the log odds for every bin, and plotting them to visually inspect linearity (see the binning sketch after this list).
- No Multicollinearity: Multicollinearity, the phenomenon in which independent variables are highly correlated with each other, can lead to unstable coefficient estimates. It can be detected using correlation heatmaps (though these do not scale well to large numbers of predictors), variance inflation factors (VIF), tolerance, or generalized variance inflation factors (GVIF) for categorical variables. Common cutoff values for flagging multicollinearity are a VIF greater than 5 or a tolerance less than 0.2 (a VIF sketch follows this list).
- Exclusion of Influential Outliers: Influential outliers can significantly distort the results of a logistic regression model. They can be identified using measures such as standardized residuals and Cook's distance: standardized residuals exceeding 2 in absolute value are commonly treated as outliers, and Cook's distance values greater than 4/n, where n is the number of observations, are often considered indicative of influential samples (see the influence sketch after this list). However, when excluding influential outliers, it is essential to examine the nature of each point carefully; outliers may reflect errors in the data or rare but valid instances that should not be excluded without justification.
- Adequate Variable Size: There should be at least 10 observations per independent variable in the smaller of the two outcome groups. For example, if there are 150 people with the disease and 850 without, the model can support at most 150 / 10 = 15 predictors. This ensures that there are enough data points to estimate the model parameters and reduces the risk of overfitting (which often shows up as inflated coefficients).
- Large Sample Size: While logistic regression doesn't have strict sample size requirements like some other statistical methods, a larger sample size generally leads to more stable estimates and better model performance.
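To make the linearity check from the third assumption concrete, here is a rough binning sketch (hypothetical data and variable names; Python with pandas): split a continuous predictor into quantile bins, compute the empirical log odds in each bin, and check whether they move roughly linearly with the bin means. The Box-Tidwell interaction-term approach mentioned above is the more formal alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: outcome y generated so that its log odds are linear in x.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
p = 1.0 / (1.0 + np.exp(-(0.3 + 0.7 * x)))
y = rng.binomial(1, p)

df = pd.DataFrame({"x": x, "y": y})
df["bin"] = pd.qcut(df["x"], q=10)  # ten equal-sized quantile bins

summary = df.groupby("bin", observed=True).agg(x_mean=("x", "mean"), event_rate=("y", "mean"))
summary["log_odds"] = np.log(summary["event_rate"] / (1 - summary["event_rate"]))
print(summary)  # log_odds should change roughly linearly with x_mean
```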
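For the multicollinearity assumption, a common check is the variance inflation factor. The sketch below (hypothetical data; statsmodels' `variance_inflation_factor`) deliberately builds a predictor that is nearly a copy of another, so it should be flagged:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: x3 is almost a copy of x1, so both should show high VIFs.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["x3"] = df["x1"] + rng.normal(scale=0.1, size=300)

X = sm.add_constant(df)  # include the intercept so the VIFs match the fitted model
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # VIF > 5 (here x1 and x3) suggests problematic multicollinearity
```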
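For influential outliers, Cook's distance and standardized residuals can be pulled from a fitted model's influence diagnostics. A hedged sketch with statsmodels, again on invented data (the cutoffs 4/n and |2| are the rules of thumb from the list above):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data and model fit.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
p = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x)))
y = rng.binomial(1, p)
X = sm.add_constant(x)

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
influence = fit.get_influence()

cooks_d = influence.cooks_distance[0]  # Cook's distance per observation
# Standardized Pearson residuals: raw Pearson residuals scaled by leverage.
std_resid = fit.resid_pearson / np.sqrt(1.0 - influence.hat_matrix_diag)
n = len(y)

flagged = np.where((cooks_d > 4.0 / n) | (np.abs(std_resid) > 2.0))[0]
print(flagged)  # indices worth inspecting before deciding whether to exclude anything
```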
Conclusion
Logistic regression is a valuable tool for binary classification tasks, but it's important to understand and adhere to its assumptions for reliable results. By ensuring that these assumptions are met, practitioners can make informed decisions and draw meaningful insights from their logistic regression models.