Logistic regression models that regress on an ordinal dependent variable (think: Likert scale responses, course grades, categories that have a natural sequence, etc.) will use an ordered logit. Ordered logits are often run in R using polr in the MASS package, however the vglm function in the VGAM package offers a more advanced suite of options. This tutorial walks through setting up ordered logits with each function.
First, let’s load in our packages.
For this example, suppose we are interested in the question: What drives farmers’ interest in technology x? We have data on farmers’ education and income, their farm size, as well as their attitudes towards things like earning money, protecting the environment, as well has how connected they are to different information sources. We are interested in the dependent variable: Farmer interest in the technology, scored as no interest (None), some interest (Somewhat), or very interested (Very) (1-3). Let’s inspect these data.
head(df)
interest acres income education econ.value envt.value info.count
1 Not 4.84 NA NA 4.33 3.6 1
2 Somewhat 3.01 1 5 4.00 3.0 1
3 Somewhat 4.17 3 2 NA NA 1
4 <NA> 4.40 3 3 5.00 NA 3
5 Not 2.44 4 7 4.67 3.6 9
6 Very 6.64 2 7 2.00 4.2 5
# Make sure your dependent variable is in the right orderdf$interest =factor(df$interest, c('Not', 'Somewhat', 'Very'))
Note: These data have been inspired by farmer survey data sets, but have simulated values and therefore do not reflect any actual farmers.
polr
Most ordered logit models, such as those run with the defaults of the polr function, assume a cumulative link function. A cumulative link function looks at the probability of being at one level, compared to all of the levels below them (e.g. P(Y>=2)). This is the only kind of link function available in polr, which is important to keep in mind when later comparing it to vglm.
Now, let’s take a look at our model, where we are interested in seeing how farm operator/operation variables, attitudes, and information affect a farmer’s interest in a technology.
polr.model <-polr(interest ~ acres + income + education + econ.value + envt.value + info.count, data = df, Hess = T) # We use Hess = T in order to return the Hessian matrix, which allows us to look at the model's summarysummary(polr.model)
Call:
polr(formula = interest ~ acres + income + education + econ.value +
envt.value + info.count, data = df, Hess = T)
Coefficients:
Value Std. Error t value
acres 0.06972 0.06798 1.0256
income -0.18702 0.07160 -2.6120
education -0.02324 0.06123 -0.3795
econ.value -0.50213 0.10715 -4.6861
envt.value 0.57938 0.09664 5.9950
info.count 0.14728 0.03291 4.4759
Intercepts:
Value Std. Error t value
Not|Somewhat -0.8801 0.5639 -1.5606
Somewhat|Very 1.0120 0.5668 1.7855
Residual Deviance: 1002.928
AIC: 1018.928
(106 observations deleted due to missingness)
The summary of the polr function returns the coefficients, standard errors and t-values. These results indicate that economic attitudes and income are significantly negatively correlated with technology interest, and environmental attitudes and information sources is positively correlated. Ordered logit coefficient values, in their raw form, is their logit value, where the strength of the effect is hard to interpret. Typically, estimates are exponentiated to be read as odds.
We can also calculate the p-value manually for an easier read of results, and bind it to the summary table.
# Save an object as the coefficient tablectable <-coef(summary(polr.model))# Calculate and store p valuesp <-round((pnorm(abs(ctable[, "t value"]), lower.tail =FALSE) *2),3)# Bind these objects into one tabletable <-cbind(ctable, "p value"= p)table
Value Std. Error t value p value
acres 0.06972332 0.06798285 1.0256016 0.305
income -0.18701627 0.07159954 -2.6119757 0.009
education -0.02323590 0.06122978 -0.3794869 0.704
econ.value -0.50213351 0.10715380 -4.6861009 0.000
envt.value 0.57937593 0.09664368 5.9949696 0.000
info.count 0.14728163 0.03290579 4.4758571 0.000
Not|Somewhat -0.88005549 0.56391926 -1.5606055 0.119
Somewhat|Very 1.01204550 0.56681362 1.7854996 0.074
Before relying too much on these results, however, one assumption of ordered logits that ought to be tested is the proportional odds assumption. What is assumed here is that the slopes from level to level of the dependent variable are relatively similar, allowing us to use only one coefficient per independent variable. The Brant test was developed to test this assumption, and the brant package and function allow us to quickly test this with polr model objects.
brant(polr.model)
--------------------------------------------
Test for X2 df probability
--------------------------------------------
Omnibus 33.3 6 0
acres 0.43 1 0.51
income 0.35 1 0.55
education 1.47 1 0.22
econ.value 19.17 1 0
envt.value 1 1 0.32
info.count 1.46 1 0.23
--------------------------------------------
H0: Parallel Regression Assumption holds
This test evaluates whether or not the proportional odds assumption has been met, where the null hypothesis is that the assumption has been met. In this case, a p-value < 0.05 is cause to reject the null hypothesis, meaning that reject the proportional odds assumption. Based on these outputs, we see that the overall (Omnibus) model fails the proportional odds test, where probability = 0, and this is particularly driven by economic values, where the p-value is \(1.2^{-5}\).
A failed proportional odds assumption calls for a non-proportional ordered logit, or at least partial proportional model. However, polr doesn’t allow us to do this, which is where we switch to vglm.
vglm
The vglm function can model more advanced ordinal regressions – non-proportional and partial proportional, as well as different link functions. However, it requires a few more model specifications than the polr model. We will explore these using the same data as above.
First, you need to specify the family of the link function. As noted above, the default family for polr and most people thinking about ordered logits is the cumulative link function, so for now we will leave it at that with family = cumulative. (See my post on the sequential ordered logit using the cratio link function). We also get to specify whether or not the parallel odds assumption, also called the proportional odds assumption, has been met with parallel = T/F. Last, the default in vglm is to reverse the order of the ordered dependent variable, so most intuitive interpretations require that the order be reversed with reverse = T. Below are examples of running the proportional, non-proportional, and partial proportional models. Note that the partial proportional model is specified by using a ~ followed by the variable(s) that you want to be left not proportional.
# a proportional modelvgam.model.prop <-vglm(ordered(interest) ~ acres + income + education + econ.value + envt.value + info.count, data = df, family=cumulative(parallel = T, reverse = T))# a non-proportional modelvgam.model.nonprop <-vglm(ordered(interest) ~ acres + income + education + econ.value + envt.value + info.count, data = df, family=cumulative(parallel = F, reverse = T))# a partial proportional modelvgam.model.partialprop <-vglm(ordered(interest) ~ acres + income + education + econ.value + envt.value + info.count, data = df, family=cumulative(parallel = F ~ econ.value, reverse = T))
While polr model objects of the S3 class, as are most objects in R, vglm models are S4. In this case, while some functions operate the same, others vary. For instance, you can use the same summary() function.
# vagm models provide p-values immediatelysummary(vgam.model.partialprop)
Call:
vglm(formula = ordered(interest) ~ acres + income + education +
econ.value + envt.value + info.count, family = cumulative(parallel = F ~
econ.value, reverse = T), data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -0.46902 0.59324 -0.791 0.42918
(Intercept):2 -0.20922 0.57118 -0.366 0.71415
acres 0.06150 0.06709 0.917 0.35934
income -0.19036 0.07184 -2.650 0.00806 **
education -0.01948 0.06099 -0.319 0.74947
econ.value:1 -0.14429 0.11887 -1.214 0.22481
econ.value:2 -0.66601 0.11258 -5.916 3.30e-09 ***
envt.value 0.54720 0.09669 5.659 1.52e-08 ***
info.count 0.14946 0.03316 4.507 6.57e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: logitlink(P[Y>=2]), logitlink(P[Y>=3])
Residual deviance: 982.8936 on 1009 degrees of freedom
Log-likelihood: -491.4468 on 1009 degrees of freedom
Number of Fisher scoring iterations: 8
No Hauck-Donner effect found in any of the estimates
Exponentiated coefficients:
acres income education econ.value:1 econ.value:2 envt.value
1.0634294 0.8266641 0.9807110 0.8656360 0.5137545 1.7284071
info.count
1.1612031
At the bottom of this summary, coefficients are already presented in exponentaited form. Still, it is useful to know how to manipulate this model object to get out the values of interest. In a polr model you can use a $ sign after the model object to get just the coefficients, but in vglm you use the @ sign after the model summary object to pull out a variety of the model parts.
vgam.summary <-summary(vgam.model.partialprop)# coef3 pulls out the whole model output tablevgam.summary@coef3
If you want to neatly compile a table, including exponentiated coefficients, confidence intervals, and p-values, the following code helps compile this.
# Calculate exponentiated coefficients (odds) using the model summary objectodds <-data.frame(exp(vgam.summary@coefficients))# Extrac p-valuep <-data.frame(vgam.summary@coef3[,4])# Calculate the confidence intervals using the original model objectci <-data.frame(exp(confint(vgam.model.partialprop)))# Combine and round vglm.table <-round(cbind(odds, ci, p),3)colnames(vglm.table) <-c("Estimate", "LCI", "UCI", "p-value")vglm.table