Calculate OLS regression manually using matrix algebra in R

The following code will attempt to replicate the results of the lm() function in R. For this exercise, we will be using a cross sectional data set provided by R called “women”, that has height and weight data for 15 individuals.

The OLS regression equation:

Y = X\beta + \varepsilon

where \varepsilon = a white noise error term. For this example Y = weight, and X = height. \beta = the marginal impact a one unit change in height has on weight.

[sourcecode language=”r”]
## This is the OLS regression we will manually calculate:
reg = lm(weight ~ height, data=women)

Recall that the following matrix equation is used to calculate the vector of estimated coefficients \hat{\beta} of an OLS regression:

\hat{\beta} = (X'X)^{-1}X'Y

where X = the matrix of regressor data (the first column is all 1’s for the intercept), and Y = the vector of the dependent variable data.

Matrix operators in R

  • as.matrix() coerces an object into the matrix class.
  • t() transposes a matrix.
  • %*% is the operator for matrix multiplication.
  • solve() takes the inverse of a matrix. Note, the matrix must be invertible.

For a more complete introduction to doing matrix operations in R, check out this page.

Back to OLS

The following code calculates the 2 x 1 matrix of coefficients, \hat{\beta}:

[sourcecode language=”r”]
## Create X and Y matrices for this specific regression
X = as.matrix(cbind(1,women$height))
Y = as.matrix(women$weight)

## Choose beta-hat to minimize the sum of squared residuals
## resulting in matrix of estimated coefficients:
bh = round(solve(t(X)%*%X)%*%t(X)%*%Y, digits=2)

## Label and organize results into a data frame
beta.hat ="Intercept","Height"),bh))
names(beta.hat) = c("Coeff.","Est")

Calculating Standard Errors

To calculate the standard errors, you must first calculate the variance-covariance (VCV) matrix, as follows:

Var(\hat{\beta}|X) = \frac{1}{n-k}\hat{\varepsilon}'\hat{\varepsilon}(X'X)^{-1}

The VCV matrix will be a square k x k matrix. Standard errors for the estimated coefficients \hat{\beta} are found by taking the square root of the diagonal elements of the VCV matrix.

[sourcecode language=”r”]
## Calculate vector of residuals
res = as.matrix(women$weight-bh[1]-bh[2]*women$height)

## Define n and k parameters
n = nrow(women)
k = ncol(X)

## Calculate Variance-Covariance Matrix
VCV = 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X)

## Standard errors of the estimated coefficients
StdErr = sqrt(diag(VCV))

## Calculate p-value for a t-test of coefficient significance
P.Value = rbind(2*pt(abs(bh[1]/StdErr[1]), df=n-k,lower.tail= FALSE),
2*pt(abs(bh[2]/StdErr[2]), df=n-k,lower.tail= FALSE))

## concatenate into a single data.frame
beta.hat = cbind(beta.hat,StdErr,P.Value)

A Scatterplot with OLS line

Women's height vs. weight using plot() and abline() functions in R.

[sourcecode language=”r”]
## Plot results
plot(women$height,women$weight, xlab = "Height", ylab = "Weight",
main = "OLS: Height and Weight")
abline(a = bh[1], b = bh[2], col = ‘red’, lwd = 2, lty="dashed")
Now you can check the results above using the canned lm() function:
[sourcecode language=”r”]
summary(lm(weight ~ height, data = women))


7 thoughts on “Calculate OLS regression manually using matrix algebra in R

  1. Hi.

    Thank you very much for this interesting blog about econometrics topics. I liked this conti¡ribution but I think ther is a mistake when you calculate the p-values of standard errors, because I think the code would be:

    P.Value = rbind(2*pt(abs(bh[1]/StdErr[1]), df=n-k,lower.tail= FALSE),2*pt(abs(bh[2]/StdErr[2]), df=n-k,lower.tail= FALSE))

  2. Hi Kevin, this is really great! Thanks for making it available. I’m new to R and having my first crack trying to build the OLS model. Just curious how one might extend this to the multivariate case? Alternatively, might there be a way to examine the ‘souce code’ for the lm() function. I say ‘souce code’ because I’m thinking in Stata mode…I want to see what the lm() does behind the scenes!

    1. Kyle – that’s the beauty of matrix notation, it’s exactly the same for the multivariate case: (X’X)^(-1)X’Y. Note that you could think of the example as a 2-variable regression, with one regressor that doesn’t vary (the intercept). So the X matrix has a column of 1’s and then a column with the data women$height. See the code: X = as.matrix(cbind(1,women$height)).

      For additional regressors, just extend that line of code by appending new columns of data: X = as.matrix(cbind(1,women$height, newdata$x2, newdata$x3, … )) and the rest should work as shown.

      To see the inner workings of any function in R, just execute the function without the parentheses, e.g. lm. I don’t think it will be too much help, though, because R uses a “QR decomposition” to do OLS, which basically is a different approach that is more computationally efficient. Cheers-

  3. Fantastic!!! I’ve done it! Man, the learning curve is huge but well worth it. Thanks very much!

    And yes, it turns out the source code is useless for these purposes! Cheers, Kyle

  4. Dear Kevin,
    great code! I wonder what n and k would be in the case of a fixed-effects panel data model? I have 744 observations in 24 countries over 31 years using 6 continuous variables. Not sure how to compute the p-values in such a case, given the standard errors and beta.hat.

  5. Hello, this post is really great.
    This makes me understand what’s going on in detail.
    I would like to know if it’s possible to make a similar post for a logistic regression.

Leave a Reply