8.4 Poisson Regression
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality."
- Albert Einstein
We now consider another nonlinear regression model, one in which the response outcomes are discrete.
Poisson regression is useful when the outcome is a count, with large-count outcomes being rare events.
For instance, the number of times a household shops at a particular supermarket in a week is a count, with a large number of shopping trips to the store during the week being a rare event. A researcher may wish to study the relation between a family's number of shopping trips to the store during a particular week and the family's income, number of children, distance from the store, and some other explanatory variables.
As another example, the relation between the number of hospitalizations of a member of a health maintenance organization during the past year and the member's age, income, and previous health status may be of interest.
The Poisson distribution can be utilized for outcomes that are counts ($Y_i = 0, 1,2, ...$ ), with a large count or frequency being a rare event.
The Poisson probability distribution is
$$
f(Y) = \frac{\mu^Y \exp(-\mu)}{Y!}
$$
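As a quick numerical check of this formula, the sketch below evaluates the probability function by hand and compares it with R's built-in `dpois`. The mean `mu = 2` and the range of counts are arbitrary illustrative choices, not values from the text.

```r
# Illustrative mean; any positive value works (assumed, not from the text)
mu <- 2
y  <- 0:5

# Poisson probability function written out from the formula
f_manual <- mu^y * exp(-mu) / factorial(y)

# Same probabilities from R's built-in Poisson density
f_builtin <- dpois(y, lambda = mu)

print(round(cbind(y, f_manual, f_builtin), 4))
```

The two columns should agree, and the probabilities shrink quickly as the count grows, which is the sense in which large counts are rare events.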
The mean and variance of a Poisson distribution are
$$
\begin{align*}
E\{Y\} &= \mu\\
\sigma^2\{Y\} &=\mu
\end{align*}
$$
Note that the variance is the same as the mean.
Hence, if the number of store trips follows the Poisson distribution and the mean number of store trips for a family with three children is larger than the mean number of trips for a family with no children, the variances of the distributions of outcomes for the two families will also differ.
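The equality of mean and variance can be checked numerically. The sketch below computes both moments directly from the probability function for two hypothetical families; the means 3 and 5 are assumptions chosen only for illustration, and `pois_moments` is a helper defined here, not a built-in function.

```r
# Mean and variance computed directly from the Poisson probability function.
# The support is truncated at y_max, beyond which the remaining probability
# is negligible for the small means used here.
pois_moments <- function(mu, y_max = 200) {
  y <- 0:y_max
  p <- dpois(y, lambda = mu)
  m <- sum(y * p)
  v <- sum((y - m)^2 * p)
  c(mean = m, variance = v)
}

# Hypothetical mean trip counts for two families (illustrative values only)
no_children    <- pois_moments(3)
three_children <- pois_moments(5)

print(rbind(no_children, three_children))
```

Within each row the mean and variance coincide, so a family with a larger mean number of trips also has a larger variance. This is why the constant-variance assumption of ordinary linear regression cannot hold for Poisson outcomes.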
We start with the regression model
\begin{align*}
Y_{i} & =E\left\{ Y_{i}\right\} +\varepsilon_{i}\qquad i=1,2,\ldots,n
\end{align*}
The mean response for the $i$th case, now denoted by $\mu_{i}$
for simplicity, is assumed as always to be a function of the set of
predictor variables $X_{1},\ldots,X_{p-1}$.
We use the notation $\mu\left(\textbf{X}_{i},\boldsymbol{\beta}\right)$ to denote the function that relates the mean response $\mu_{i}$ to $\textbf{X}_{i}$, the values of the predictor variables for case $i$, and $\boldsymbol{\beta}$, the values of the regression coefficients.
Some commonly used functions for Poisson regression are
\begin{align*}
\mu_{i} &= \mu\left(\textbf{X}_{i},\boldsymbol{\beta}\right)=\textbf{X}_{i}^{\prime}\boldsymbol{\beta}\\
\mu_{i} &= \mu\left(\textbf{X}_{i},\boldsymbol{\beta}\right)=\exp\left(\textbf{X}_{i}^{\prime}\boldsymbol{\beta}\right)\\
\mu_{i} &= \mu\left(\textbf{X}_{i},\boldsymbol{\beta}\right)=\ln\left(\textbf{X}_{i}^{\prime}\boldsymbol{\beta}\right)
\end{align*}
In all three cases, the mean response $\mu_{i}$ must be nonnegative.
Since the distribution of the error terms $\varepsilon_{i}$ for Poisson regression is a function of the distribution of the response $Y_{i}$, which is Poisson, it is easiest to state the Poisson regression model in the following form:
$Y_{i}$ are independent Poisson random variables with expected values $\mu_{i}$, where
\begin{align*}
\mu_{i} & =\mu\left(\textbf{X}_{i},\boldsymbol{\beta}\right)
\end{align*}
The most commonly used response function is $\mu_{i}=\exp\left(\textbf{X}_{i}^{\prime}\boldsymbol{\beta}\right)$.
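To see the model in action, the following sketch simulates counts from a Poisson regression with the exponential response function and then estimates the coefficients with `glm`. The sample size, the single predictor, and the coefficient values are all assumptions made for illustration.

```r
set.seed(1)                      # for reproducibility

n    <- 500                      # assumed sample size
x    <- runif(n, 0, 2)           # a single hypothetical predictor
beta <- c(0.5, 0.8)              # assumed true coefficients (intercept, slope)

# Exponential response function: mu_i = exp(X_i' beta), always positive
mu <- exp(beta[1] + beta[2] * x)

# Independent Poisson responses with means mu_i
y <- rpois(n, lambda = mu)

# family = poisson uses the log link by default, which corresponds to
# the exponential response function
fit <- glm(y ~ x, family = poisson)
print(coef(fit))
```

The estimated coefficients should land close to the assumed values in `beta`. The exponential form is convenient because it keeps every $\mu_{i}$ positive without placing any constraint on $\boldsymbol{\beta}$.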
The Miller Lumber Company is a large retailer of lumber and paint, as well as of plumbing,
electrical, and other household supplies.
During a representative two-week period, in-store surveys were conducted and addresses of customers were obtained. The addresses were then used to identify the metropolitan area census tracts in which the customers reside.
At the end of the survey period, the total number of customers who visited the store from each census tract within a 10-mile radius was determined and relevant demographic information for each tract (average income, number of housing units, etc.) was obtained.
Several other variables expected to be related to customer counts were constructed from maps, including distance from census tract to nearest competitor and distance to store.
Initial screening of the potential predictor variables led to the retention of five predictors, listed below together with the response:
- $X_1$: Number of housing units
- $X_2$: Average income, in dollars
- $X_3$: Average housing unit age, in years
- $X_4$ : Distance to nearest competitor, in miles
- $X_5$: Distance to store, in miles
- $Y_i$: Number of customers who visited the store from census tract $i$ (the response variable)
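# Read the customer counts and the five predictors; the URL is built with
# paste0() so that no line break ends up inside the string
dat = read.table(paste0("http://users.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/",
  "textdatasets/KutnerData/Chapter%2014%20Data%20Sets/CH14TA14.txt"))
names(dat) = c("Y", "X1", "X2", "X3", "X4", "X5")
# Fit the Poisson regression model (log link by default) on all five predictors
reg = glm(Y~., family = "poisson", data=dat)
summary(reg)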
Call:
glm(formula = Y ~ ., family = "poisson", data = dat)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.93195  -0.58868  -0.00009   0.59269   2.23441

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.942e+00  2.072e-01  14.198  < 2e-16 ***
X1           6.058e-04  1.421e-04   4.262 2.02e-05 ***
X2          -1.169e-05  2.112e-06  -5.534 3.13e-08 ***
X3          -3.726e-03  1.782e-03  -2.091   0.0365 *
X4           1.684e-01  2.577e-02   6.534 6.39e-11 ***
X5          -1.288e-01  1.620e-02  -7.948 1.89e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 422.22  on 109  degrees of freedom
Residual deviance: 114.99  on 104  degrees of freedom
AIC: 571.02

Number of Fisher Scoring iterations: 4
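Because the model was fit with the exponential response function, each coefficient acts multiplicatively on the mean: a one-unit increase in $X_k$ multiplies the estimated mean count by $\exp(b_k)$, holding the other predictors fixed. The sketch below applies this to the estimates in the summary output and also computes the drop in deviance from the null model. The numbers are typed in by hand from the output (an assumption made so the block runs without downloading the data).

```r
# Coefficient estimates copied from the summary output above
b <- c(X1 = 6.058e-04, X2 = -1.169e-05, X3 = -3.726e-03,
       X4 = 1.684e-01, X5 = -1.288e-01)

# Multiplicative change in the mean count per one-unit increase in each predictor
rate_ratio <- exp(b)
print(round(rate_ratio, 4))

# Drop in deviance from the null model to the fitted model,
# using the deviances and degrees of freedom reported in the output
null_dev  <- 422.22; null_df  <- 109
resid_dev <- 114.99; resid_df <- 104
g2    <- null_dev - resid_dev          # likelihood ratio statistic
df    <- null_df - resid_df            # number of predictors tested
p_val <- pchisq(g2, df = df, lower.tail = FALSE)
print(c(G2 = g2, df = df, p_value = p_val))
```

For example, the factor for $X_5$ is below 1: each additional mile from the store lowers the estimated mean customer count by roughly 12 percent. With the fitted object in hand, the same quantities are available as `exp(coef(reg))` and `reg$null.deviance - reg$deviance`.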
predict(reg)
       1        2        3        4        5        6        7
2.512666 2.171006 3.336689 2.129078 1.982460 2.184004 1.458190
       8        9       10
2.397791 2.670277 2.453965
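By default, `predict` on a `glm` object returns values on the scale of the linear predictor, so the numbers above are $\textbf{X}_{i}^{\prime}\textbf{b}$, the log of the estimated mean, rather than customer counts. The sketch below converts the ten printed values to estimated means; the values are copied by hand from the output above.

```r
# Linear predictor values for the ten census tracts shown above
eta <- c(2.512666, 2.171006, 3.336689, 2.129078, 1.982460,
         2.184004, 1.458190, 2.397791, 2.670277, 2.453965)

# Exponential response function: estimated mean customer count per tract
mu_hat <- exp(eta)
print(round(mu_hat, 2))
```

With the fitted object, `predict(reg, type = "response")` returns these estimated means directly.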