Ordered logits introduction

A quick look at polr and vglm

Logistic regression models that regress on an ordinal dependent variable (think: Likert scale responses, course grades, categories that have a natural sequence, etc.) will use an ordered logit. Ordered logits are often run in R using polr in the MASS package, however the vglm function in the VGAM package offers a more advanced suite of options. This tutorial walks through setting up ordered logits with each function.

First, let’s load in our packages.

For this example, suppose we are interested in the question: What drives farmers’ interest in technology x? We have data on farmers’ education and income, their farm size, as well as their attitudes towards things like earning money, protecting the environment, as well has how connected they are to different information sources. We are interested in the dependent variable: Farmer interest in the technology, scored as no interest (None), some interest (Somewhat), or very interested (Very) (1-3). Let’s inspect these data.

head(df)

  interest acres income education econ.value envt.value info.count
1      Not  4.84     NA        NA       4.33        3.6          1
2 Somewhat  3.01      1         5       4.00        3.0          1
3 Somewhat  4.17      3         2         NA         NA          1
4     <NA>  4.40      3         3       5.00         NA          3
5      Not  2.44      4         7       4.67        3.6          9
6     Very  6.64      2         7       2.00        4.2          5

# Make sure your dependent variable is in the right order
df$interest = factor(df$interest, c('Not', 'Somewhat', 'Very'))

Note: These data have been inspired by farmer survey data sets, but have simulated values and therefore do not reflect any actual farmers.

`polr`

Most ordered logit models, such as those run with the defaults of the polr function, assume a cumulative link function. A cumulative link function looks at the probability of being at one level, compared to all of the levels below them (e.g. P(Y>=2)). This is the only kind of link function available in polr, which is important to keep in mind when later comparing it to vglm.

Now, let’s take a look at our model, where we are interested in seeing how farm operator/operation variables, attitudes, and information affect a farmer’s interest in a technology.

polr.model <- polr(interest ~ acres + income + education + econ.value + 
                     envt.value + info.count, data = df, Hess = T) 
# We use Hess = T in order to return the Hessian matrix, which allows us to look at the model's summary
summary(polr.model)

Call:
polr(formula = interest ~ acres + income + education + econ.value + 
    envt.value + info.count, data = df, Hess = T)

Coefficients:
              Value Std. Error t value
acres       0.06972    0.06798  1.0256
income     -0.18702    0.07160 -2.6120
education  -0.02324    0.06123 -0.3795
econ.value -0.50213    0.10715 -4.6861
envt.value  0.57938    0.09664  5.9950
info.count  0.14728    0.03291  4.4759

Intercepts:
              Value   Std. Error t value
Not|Somewhat  -0.8801  0.5639    -1.5606
Somewhat|Very  1.0120  0.5668     1.7855

Residual Deviance: 1002.928 
AIC: 1018.928 
(106 observations deleted due to missingness)

The summary of the polr function returns the coefficients, standard errors and t-values. These results indicate that economic attitudes and income are significantly negatively correlated with technology interest, and environmental attitudes and information sources is positively correlated. Ordered logit coefficient values, in their raw form, is their logit value, where the strength of the effect is hard to interpret. Typically, estimates are exponentiated to be read as odds.

polr.model$coefficients

      acres      income   education  econ.value  envt.value  info.count 
 0.06972332 -0.18701627 -0.02323590 -0.50213351  0.57937593  0.14728163

We can also calculate the p-value manually for an easier read of results, and bind it to the summary table.

# Save an object as the coefficient table
ctable <- coef(summary(polr.model))
# Calculate and store p values
p <- round((pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2),3)
# Bind these objects into one table
table <- cbind(ctable, "p value" = p)

table

                    Value Std. Error    t value p value
acres          0.06972332 0.06798285  1.0256016   0.305
income        -0.18701627 0.07159954 -2.6119757   0.009
education     -0.02323590 0.06122978 -0.3794869   0.704
econ.value    -0.50213351 0.10715380 -4.6861009   0.000
envt.value     0.57937593 0.09664368  5.9949696   0.000
info.count     0.14728163 0.03290579  4.4758571   0.000
Not|Somewhat  -0.88005549 0.56391926 -1.5606055   0.119
Somewhat|Very  1.01204550 0.56681362  1.7854996   0.074

Before relying too much on these results, however, one assumption of ordered logits that ought to be tested is the proportional odds assumption. What is assumed here is that the slopes from level to level of the dependent variable are relatively similar, allowing us to use only one coefficient per independent variable. The Brant test was developed to test this assumption, and the brant package and function allow us to quickly test this with polr model objects.

brant(polr.model)

-------------------------------------------- 
Test for    X2  df  probability 
-------------------------------------------- 
Omnibus     33.3    6   0
acres       0.43    1   0.51
income      0.35    1   0.55
education   1.47    1   0.22
econ.value  19.17   1   0
envt.value  1   1   0.32
info.count  1.46    1   0.23
-------------------------------------------- 

H0: Parallel Regression Assumption holds

This test evaluates whether or not the proportional odds assumption has been met, where the null hypothesis is that the assumption has been met. In this case, a p-value < 0.05 is cause to reject the null hypothesis, meaning that reject the proportional odds assumption. Based on these outputs, we see that the overall (Omnibus) model fails the proportional odds test, where probability = 0, and this is particularly driven by economic values, where the p-value is $1.2^{-5}$.

A failed proportional odds assumption calls for a non-proportional ordered logit, or at least partial proportional model. However, polr doesn’t allow us to do this, which is where we switch to vglm.

`vglm`

The vglm function can model more advanced ordinal regressions – non-proportional and partial proportional, as well as different link functions. However, it requires a few more model specifications than the polr model. We will explore these using the same data as above.

First, you need to specify the family of the link function. As noted above, the default family for polr and most people thinking about ordered logits is the cumulative link function, so for now we will leave it at that with family = cumulative. (See my post on the sequential ordered logit using the cratio link function). We also get to specify whether or not the parallel odds assumption, also called the proportional odds assumption, has been met with parallel = T/F. Last, the default in vglm is to reverse the order of the ordered dependent variable, so most intuitive interpretations require that the order be reversed with reverse = T. Below are examples of running the proportional, non-proportional, and partial proportional models. Note that the partial proportional model is specified by using a ~ followed by the variable(s) that you want to be left not proportional.

# a proportional model
vgam.model.prop <- vglm(ordered(interest) ~ acres + income + education + 
                         econ.value + envt.value + info.count, data = df, 
                       family=cumulative(parallel = T, reverse = T))

# a non-proportional model
vgam.model.nonprop <- vglm(ordered(interest) ~ acres + income + education + 
                            econ.value + envt.value + info.count,  data = df, 
                          family=cumulative(parallel = F, reverse = T))

# a partial proportional model
vgam.model.partialprop <- vglm(ordered(interest) ~ acres + income + education + 
                                econ.value + envt.value + info.count, data = df, 
                              family=cumulative(parallel = F ~ econ.value, reverse = T))

While polr model objects of the S3 class, as are most objects in R, vglm models are S4. In this case, while some functions operate the same, others vary. For instance, you can use the same summary() function.

# vagm models provide p-values immediately
summary(vgam.model.partialprop)


Call:
vglm(formula = ordered(interest) ~ acres + income + education + 
    econ.value + envt.value + info.count, family = cumulative(parallel = F ~ 
    econ.value, reverse = T), data = df)

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)    
(Intercept):1 -0.46902    0.59324  -0.791  0.42918    
(Intercept):2 -0.20922    0.57118  -0.366  0.71415    
acres          0.06150    0.06709   0.917  0.35934    
income        -0.19036    0.07184  -2.650  0.00806 ** 
education     -0.01948    0.06099  -0.319  0.74947    
econ.value:1  -0.14429    0.11887  -1.214  0.22481    
econ.value:2  -0.66601    0.11258  -5.916 3.30e-09 ***
envt.value     0.54720    0.09669   5.659 1.52e-08 ***
info.count     0.14946    0.03316   4.507 6.57e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y>=2]), logitlink(P[Y>=3])

Residual deviance: 982.8936 on 1009 degrees of freedom

Log-likelihood: -491.4468 on 1009 degrees of freedom

Number of Fisher scoring iterations: 8 

No Hauck-Donner effect found in any of the estimates


Exponentiated coefficients:
       acres       income    education econ.value:1 econ.value:2   envt.value 
   1.0634294    0.8266641    0.9807110    0.8656360    0.5137545    1.7284071 
  info.count 
   1.1612031

At the bottom of this summary, coefficients are already presented in exponentaited form. Still, it is useful to know how to manipulate this model object to get out the values of interest. In a polr model you can use a $ sign after the model object to get just the coefficients, but in vglm you use the @ sign after the model summary object to pull out a variety of the model parts.

vgam.summary <- summary(vgam.model.partialprop)
# coef3 pulls out the whole model output table
vgam.summary@coef3

                 Estimate Std. Error    z value     Pr(>|z|)
(Intercept):1 -0.46901843 0.59324098 -0.7906036 4.291754e-01
(Intercept):2 -0.20922021 0.57118110 -0.3662940 7.141457e-01
acres          0.06149894 0.06709280  0.9166250 3.593392e-01
income        -0.19035684 0.07184225 -2.6496501 8.057517e-03
education     -0.01947749 0.06099434 -0.3193327 7.494742e-01
econ.value:1  -0.14429075 0.11887181 -1.2138349 2.248108e-01
econ.value:2  -0.66600982 0.11257994 -5.9158837 3.300984e-09
envt.value     0.54720023 0.09669496  5.6590357 1.522259e-08
info.count     0.14945660 0.03316010  4.5071221 6.571281e-06

# coefficient pulls out only the coefficients
vgam.summary@coefficients

(Intercept):1 (Intercept):2         acres        income     education 
  -0.46901843   -0.20922021    0.06149894   -0.19035684   -0.01947749 
 econ.value:1  econ.value:2    envt.value    info.count 
  -0.14429075   -0.66600982    0.54720023    0.14945660

If you want to neatly compile a table, including exponentiated coefficients, confidence intervals, and p-values, the following code helps compile this.

# Calculate exponentiated coefficients (odds) using the model summary object
odds <- data.frame(exp(vgam.summary@coefficients))
# Extrac p-value
p <- data.frame(vgam.summary@coef3[,4])
# Calculate the confidence intervals using the original model object
ci <- data.frame(exp(confint(vgam.model.partialprop)))
# Combine and round 
vglm.table <- round(cbind(odds, ci, p),3)
colnames(vglm.table) <- c("Estimate", "LCI", "UCI", "p-value")

vglm.table

              Estimate   LCI   UCI p-value
(Intercept):1    0.626 0.196 2.001   0.429
(Intercept):2    0.811 0.265 2.485   0.714
acres            1.063 0.932 1.213   0.359
income           0.827 0.718 0.952   0.008
education        0.981 0.870 1.105   0.749
econ.value:1     0.866 0.686 1.093   0.225
econ.value:2     0.514 0.412 0.641   0.000
envt.value       1.728 1.430 2.089   0.000
info.count       1.161 1.088 1.239   0.000

For more on ordered logit model, see posts on: