Just spent a bit of time trying to work out why arima() and lm() were giving different estimates on some dengue data, and on solving the problem, thought I’d share what I found via simulated data.
pars=c(0.5,0.3,0.2,0.1,-0.1);sigma=1
set.seed(666) #for luck
n=10000 #large sample size
errors=rnorm(n,0,sigma)
y=c(1,2,3,4) #initial values
for(i in 5:n)
{
meany=sum(pars*c(1,y[(i-1):(i-4)]))
y[i]=meany+errors[i]
}
plot(y,type=’l’)
arima(y,order=c(4,0,0))
fit=lm(y[5:n]~y[4:(n-1)]+y[3:(n-2)]+y[2:(n-3)]+y[1:(n-4)])
summary(fit)

Calling arima() gives the following coefficients:

• ar1 0.29 [should be 0.3, ok]
• ar2 0.19 [should be 0.2, ok]
• ar3 0.09 [should be 0.1, ok]
• ar4 -0.09 [should be -0.1, ok]
• intercept 1.0 [should be 0.5, wrong!]

while lm() gives

• ar1 0.29
• ar2 0.19
• ar3 0.09
• ar4 -0.09
• intercept 0.53 [ok!]

An explanation can be found via the website of Shumway and Stoffer: what R provides in arima() is the mean of the stationary distribution, not the “intercept” that you’d expect having learned basic regression.  I wonder how many people have been caught out by this?

PS Just noticed in the original version I wrote anova in place of arima. Doh!