Just spent a bit of time trying to work out why arima() and lm() were giving different estimates on some dengue data, and on solving the problem, thought I’d share what I found via simulated data.

pars=c(0.5,0.3,0.2,0.1,-0.1);sigma=1

set.seed(666) #for luck

n=10000 #large sample size

errors=rnorm(n,0,sigma)

y=c(1,2,3,4) #initial values

for(i in 5:n)

{

meany=sum(pars*c(1,y[(i-1):(i-4)]))

y[i]=meany+errors[i]

}

plot(y,type=’l’)

arima(y,order=c(4,0,0))

fit=lm(y[5:n]~y[4:(n-1)]+y[3:(n-2)]+y[2:(n-3)]+y[1:(n-4)])

summary(fit)

Calling arima() gives the following coefficients:

- ar1 0.29 [should be 0.3, ok]
- ar2 0.19 [should be 0.2, ok]
- ar3 0.09 [should be 0.1, ok]
- ar4 -0.09 [should be -0.1, ok]
- intercept 1.0 [should be 0.5, wrong!]

while lm() gives

- ar1 0.29
- ar2 0.19
- ar3 0.09
- ar4 -0.09
- intercept 0.53 [ok!]

An explanation can be found via the website of Shumway and Stoffer: what R provides in arima() is the mean of the stationary distribution, not the “intercept” that you’d expect having learned basic regression. I wonder how many people have been caught out by this?

PS Just noticed in the original version I wrote anova in place of arima. Doh!

## Leave a Reply