The following code will attempt to replicate the results of the `lm()` function in R. For this exercise, we will be using a cross-sectional data set provided by R called "women", which has height and weight data for 15 individuals.

The OLS regression equation:

weight_i = β0 + β1*height_i + ε_i

where ε_i is a white noise error term. For this example the dependent variable is weight and the regressor is height; the coefficient β1 is the marginal impact a one unit change in height has on weight.

[sourcecode language="r"]

## This is the OLS regression we will manually calculate:

reg = lm(weight ~ height, data=women)

summary(reg)

[/sourcecode]

Recall that the following matrix equation is used to calculate the vector of estimated coefficients of an OLS regression:

β-hat = (X'X)^(-1)X'Y

where X is the matrix of regressor data (the first column is all 1's for the intercept), and Y is the vector of the dependent variable data.

## Matrix operators in R

- `as.matrix()` coerces an object into the matrix class.
- `t()` transposes a matrix.
- `%*%` is the operator for matrix multiplication.
- `solve()` takes the inverse of a matrix. Note: the matrix must be invertible.

For a more complete introduction to doing matrix operations in R, check out this page.
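As a quick sketch of these operators in action on a small 2 x 2 matrix (the values here are arbitrary):

```r
## Small demo of the matrix operators above
A = as.matrix(cbind(c(2, 1), c(0, 3)))  # coerce a 2 x 2 matrix

t(A)            # transpose: rows become columns
A %*% solve(A)  # a matrix times its inverse returns the identity matrix
```
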

## Back to OLS

The following code calculates the 2 x 1 matrix of estimated coefficients, β-hat:

[sourcecode language="r"]

## Create X and Y matrices for this specific regression

X = as.matrix(cbind(1,women$height))

Y = as.matrix(women$weight)

## Choose beta-hat to minimize the sum of squared residuals

## resulting in matrix of estimated coefficients:

bh = round(solve(t(X)%*%X)%*%t(X)%*%Y, digits=2)

## Label and organize results into a data frame

beta.hat = as.data.frame(cbind(c("Intercept","Height"),bh))

names(beta.hat) = c("Coeff.","Est")

beta.hat

[/sourcecode]

## Calculating Standard Errors

To calculate the standard errors, you must first calculate the variance-covariance (VCV) matrix, as follows:

VCV = (1/(n-k)) * (e'e) * (X'X)^(-1)

where e is the vector of residuals. The VCV matrix will be a square k x k matrix. Standard errors for the estimated coefficients are found by taking the square root of the diagonal elements of the VCV matrix.

[sourcecode language="r"]

## Calculate vector of residuals

res = as.matrix(women$weight-bh[1]-bh[2]*women$height)

## Define n and k parameters

n = nrow(women)

k = ncol(X)

## Calculate Variance-Covariance Matrix

VCV = 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X)

## Standard errors of the estimated coefficients

StdErr = sqrt(diag(VCV))

## Calculate p-value for a t-test of coefficient significance

P.Value = rbind(2*pt(abs(bh[1]/StdErr[1]), df=n-k, lower.tail=FALSE),
                2*pt(abs(bh[2]/StdErr[2]), df=n-k, lower.tail=FALSE))

## concatenate into a single data.frame

beta.hat = cbind(beta.hat,StdErr,P.Value)

beta.hat

[/sourcecode]

## A Scatterplot with OLS line

[sourcecode language="r"]

## Plot results

plot(women$height,women$weight, xlab = "Height", ylab = "Weight",

main = "OLS: Height and Weight")

abline(a = bh[1], b = bh[2], col = "red", lwd = 2, lty = "dashed")

[/sourcecode]

Now you can check the results above using the canned `lm()` function:

[sourcecode language="r"]

summary(lm(weight ~ height, data = women))

[/sourcecode]

Hi.

Thank you very much for this interesting blog about econometrics topics. I liked this contribution, but I think there is a mistake when you calculate the p-values, because I think the code should be:

P.Value = rbind(2*pt(abs(bh[1]/StdErr[1]), df=n-k,lower.tail= FALSE),2*pt(abs(bh[2]/StdErr[2]), df=n-k,lower.tail= FALSE))

Hi Vicente – Great catch; I have updated the code to reflect your comment. Thanks for reading!

Hi Kevin, this is really great! Thanks for making it available. I'm new to R and having my first crack at building the OLS model. Just curious how one might extend this to the multivariate case? Alternatively, might there be a way to examine the 'source code' for the lm() function? I say 'source code' because I'm thinking in Stata mode…I want to see what lm() does behind the scenes!

Kyle – that’s the beauty of matrix notation, it’s exactly the same for the multivariate case: (X’X)^(-1)X’Y. Note that you could think of the example as a 2-variable regression, with one regressor that doesn’t vary (the intercept). So the X matrix has a column of 1’s and then a column with the data women$height. See the code: X = as.matrix(cbind(1,women$height)).

For additional regressors, just extend that line of code by appending new columns of data: X = as.matrix(cbind(1,women$height, newdata$x2, newdata$x3, … )) and the rest should work as shown.
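To make that concrete, here is a minimal sketch of the two-regressor case using the built-in mtcars data set (a stand-in for illustration, since newdata above is hypothetical):

```r
## Same formula, beta-hat = (X'X)^(-1)X'Y, with two regressors plus intercept
## mtcars is a built-in R data set used here purely as an example
X = as.matrix(cbind(1, mtcars$wt, mtcars$hp))
Y = as.matrix(mtcars$mpg)
bh = solve(t(X) %*% X) %*% t(X) %*% Y

## Check against the canned lm() function
cbind(bh, coef(lm(mpg ~ wt + hp, data = mtcars)))
```
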

To see the inner workings of any function in R, just execute the function without the parentheses, e.g. lm. I don’t think it will be too much help, though, because R uses a “QR decomposition” to do OLS, which basically is a different approach that is more computationally efficient. Cheers-
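For the curious, here is a rough sketch of that QR route using base R's qr() and qr.coef() on the X and Y matrices defined earlier in the post; the coefficients come out the same without ever forming (X'X)^(-1):

```r
## OLS via QR decomposition, the approach lm() uses internally
X = as.matrix(cbind(1, women$height))
Y = as.matrix(women$weight)

qx = qr(X)       # factor X into Q (orthogonal) and R (upper triangular)
qr.coef(qx, Y)   # back-solve R beta = Q'Y for the coefficients
```
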

Fantastic!!! I’ve done it! Man, the learning curve is huge but well worth it. Thanks very much!

And yes, it turns out the source code is useless for these purposes! Cheers, Kyle

Dear Kevin,

great code! I wonder what n and k would be in the case of a fixed-effects panel data model? I have 744 observations in 24 countries over 31 years using 6 continuous variables. Not sure how to compute the p-values in such a case, given the standard errors and beta.hat.

Hello, this post is really great.

This makes me understand what’s going on in detail.

I would like to know if it’s possible to make a similar post for a logistic regression.

Thanks.