8.2 Multinomial Logistic Regression
"Never tell me the odds!" - Han Solo
Logistic regression is most frequently used to model the relationship between a dichotomous
response variable and a set of predictor variables.
On occasion, however, the response variable may have more than two levels.
Logistic regression can still be employed by means of a polytomous (or multinomial) logistic regression model.
Multinomial logistic regression models are used in many fields. In business, for instance, a market researcher may wish to relate a consumer's choice of product (product A, product B, product C) to the consumer's age, gender, geographic location, and several other potential explanatory variables.
This is an example of nominal multinomial regression because the response categories are purely qualitative and not ordered in any way.
Ordinal response categories can
also be modeled using multinomial regression.
For example, the relation between severity of disease measured on an ordinal scale (mild, moderate, severe) and age of patient, gender of patient, and some other explanatory variables may be of interest.
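As an aside, when the categories are ordered, a proportional-odds model is often preferred to the nominal model used below. A minimal sketch in R, assuming a hypothetical data frame disease with an ordered factor severity and predictors age and gender (all names here are illustrative, not from the study below):
library(MASS)
# proportional-odds (ordinal) logistic regression;
# disease, severity, age, and gender are hypothetical names used only for illustration
fit.ord = polr(severity ~ age + gender, data = disease)
summary(fit.ord)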
The response variable $Y$ must be coded in the same way as the indicator variables we have used for predictors, except that we need an indicator variable for each category of $Y$.
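For orientation, the model being fit is the baseline-category logit model, stated here in its standard form without derivation: with category 1 as the baseline and $k$ predictors, there is a separate set of coefficients for each non-baseline category $j$ of the $J$ response categories. \begin{align*} \log\left(\frac{P(Y_{i}=j)}{P(Y_{i}=1)}\right) = \beta_{j0} + \beta_{j1}X_{i1} + \cdots + \beta_{jk}X_{ik}, \qquad j = 2, \ldots, J \end{align*}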
Instead of going through the details, we will illustrate with R.
A study was undertaken to determine the strength of association between several risk factors
and the duration of pregnancies.
The risk factors considered were mother's age, nutritional status, history of tobacco use, and history of alcohol use.
The response of interest, pregnancy duration, is a three-category variable that was coded as follows: \begin{align*} Y_{i} &\quad \textbf{Pregnancy Duration Category}\\ 1 &\quad \text{Preterm (less than 36 weeks)}\\ 2 &\quad \text{Intermediate term (36 to 37 weeks)}\\ 3 &\quad \text{Full term (38 weeks or greater)} \end{align*} Relevant data for 102 women who had recently given birth at a large metropolitan hospital were obtained.
The X variables include:
- Nutritional status ($X_1$) is an index of nutritional status (higher score denotes better nutritional status).
- Age was categorized into three groups: 20 years of age or younger (coded 1), from 21 to 30 years of age (coded 2), and greater than 30 years of age (coded 3). It is represented by two indicator variables ($X_2$ and $X_3$): \begin{align*} X_{2} &\quad X_{3} & \textbf{Class}\\ 1 &\quad 0 & \text{Less than or equal to 20}\\ 0 &\quad 0 & \text{21 to 30}\\ 0 &\quad 1 & \text{Greater than 30} \end{align*} - Alcohol ($X_4$) and smoking history ($X_5$) were also qualitative predictors; the categories were "Yes" (coded 1) and "No" (coded 0).
dat = read.table("http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%2014%20Data%20Sets/CH14TA13.txt")
names(dat) = c("id", "Y", "Y1", "Y2", "Y3", "X1", "X2", "X3", "X4", "X5")
head(dat)
id Y Y1 Y2 Y3 X1 X2 X3 X4 X5
1 1 1 1 0 0 150 0 0 0 1
2 2 1 1 0 0 124 1 0 0 0
3 3 1 1 0 0 128 0 0 0 1
4 4 1 1 0 0 128 1 0 0 1
5 5 1 1 0 0 133 0 0 1 1
6 6 1 1 0 0 130 0 0 1 1
Note that the Y variable is already coded with the variables Y1, Y2, and Y3 in the dataset. We do not have to do this ourselves; R will create the indicator variables for us if we tell R the variable is a factor.
library(nnet)
reg = multinom(factor(Y)~X1+X2+X3+X4+X5, data=dat)
summary(reg)
Call:
multinom(formula = factor(Y) ~ X1 + X2 + X3 + X4 + X5, data = dat)
Coefficients:
(Intercept) X1 X2 X3 X4 X5
2 -1.517009 0.01897219 -0.04355881 -0.1721573 -0.9758857 -0.2218629
3 -5.475353 0.06542052 -2.95693397 -2.0596355 -2.0428964 -2.4522828
Std. Errors:
(Intercept) X1 X2 X3 X4 X5
2 2.047984 0.01631570 0.7373261 0.6943946 0.5869073 0.5856117
3 2.271691 0.01823932 0.9644844 0.8947673 0.7097457 0.7315046
Residual Deviance: 168.6754
AIC: 192.6754
Note that the summary contains two regression fits. What we have actually done is model the log odds of being in category 2 relative to category 1,
and likewise the log odds of being in category 3 relative to category 1.
Category 1 is therefore called the baseline category.
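The multinom() summary does not report p-values. A common sketch, relying on the usual large-sample normal approximation, computes Wald z statistics from the coefficients and standard errors shown above:
# Wald z statistics and approximate two-sided p-values
# (large-sample normal approximation)
z = summary(reg)$coefficients / summary(reg)$standard.errors
p = 2 * pnorm(abs(z), lower.tail = FALSE)
round(p, 4)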
To get the multiplicative effect of each coefficient on the odds, exponentiate the estimated coefficients:
exp(coef(reg))
(Intercept) X1 X2 X3 X4 X5
2 0.219366942 1.019153 0.95737625 0.8418467 0.3768584 0.80102519
3 0.004188751 1.067608 0.05197804 0.1275004 0.1296526 0.08609682
We can use a different baseline category, such as full term (category 3):
dat$Y = as.factor(dat$Y)
dat$Y = relevel(dat$Y, ref = "3")
reg.nom = multinom(factor(Y)~X1+X2+X3+X4+X5, data=dat)
summary(reg.nom)
Call:
multinom(formula = factor(Y) ~ X1 + X2 + X3 + X4 + X5, data = dat)
Coefficients:
(Intercept) X1 X2 X3 X4 X5
1 5.475147 -0.06541919 2.957028 2.059662 2.042900 2.452362
2 3.958370 -0.04644903 2.913475 1.887550 1.067001 2.230492
Std. Errors:
(Intercept) X1 X2 X3 X4 X5
1 2.271677 0.01823916 0.9644921 0.8947727 0.7097461 0.7315106
2 1.941063 0.01488581 0.8575544 0.8088255 0.6495262 0.6681955
Residual Deviance: 168.6754
AIC: 192.6754
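As a sanity check, the coefficients comparing category 1 to the new baseline (category 3) should be the negatives of the category 3 row from the original fit, up to numerical error in the optimizer:
# rows of the coefficient matrix are named by the non-baseline categories
coef(reg.nom)["1", ] + coef(reg)["3", ]  # should be approximately zero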
# to get the predicted probability of each category for each observation
preds.nom = fitted(reg.nom)
preds.nom[1:10,]
3 1 2
1 0.62078219 0.094217907 0.28499991
2 0.18455857 0.254210377 0.56123105
3 0.34297651 0.219535548 0.43748794
4 0.02716468 0.334553819 0.63828151
5 0.13335384 0.474690286 0.39195587
6 0.11480795 0.497290719 0.38790133
7 0.95144461 0.009569423 0.03898596
8 0.09738287 0.335195611 0.56742152
9 0.02775016 0.658559402 0.31369044
10 0.40441329 0.186642234 0.40894447
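Each row of the fitted matrix is a probability distribution over the three categories, so the rows sum to 1:
# quick check: fitted probabilities in each row sum to 1
rowSums(preds.nom[1:5, ])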
# if we predict the category with the highest probability:
ind = apply(preds.nom, 1, which.max)
preds.nom.cat = colnames(preds.nom)[ind]
preds.nom.cat
[1] "3" "2" "2" "2" "1" "1" "3" "2" "1" "2" "3" "1" "2" "1" "1" "3" "2" "2" "1" "1" "1"
[22] "1" "1" "2" "1" "2" "3" "3" "2" "2" "2" "2" "2" "2" "3" "2" "2" "2" "2" "3" "2" "2"
[43] "2" "2" "2" "2" "3" "2" "1" "2" "2" "1" "2" "3" "2" "2" "2" "1" "3" "1" "3" "3" "3"
[64] "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3"
[85] "2" "3" "3" "3" "1" "3" "3" "3" "3" "3" "2" "2" "2" "2" "1" "1" "3" "1"
# compare predicted to observed categories:
table(preds.nom.cat, dat$Y)
preds.nom.cat 3 1 2
1 4 12 4
2 5 10 23
3 32 4 8
The rows are the predicted categories and the columns the observed categories (in the releveled order 3, 1, 2), so the model classifies $12 + 23 + 32 = 67$ of the 102 women correctly.
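The overall misclassification rate follows directly (note this is computed on the same data used to fit the model, so it is an optimistic, training-set error rate):
# proportion of women whose predicted category differs from the observed one
mean(preds.nom.cat != dat$Y)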