Simple Example of Linear Regression in R

In this practical example of linear regression in R, we will learn how to predict the fuel efficiency of a car based on its weight. We will import a dataset, fit a linear regression with the lm() function, interpret the results, plot the regression line, and make predictions with the predict() function.

We will use the mtcars demo dataset that ships with R, but the same steps work for any dataset that contains a predictor variable and a response variable.

Without further ado, launch R or RStudio on your computer and let's get started.

Step 1: Import a Dataset in R

To get started, we need a dataset to work with. We will use the mtcars dataset, which contains the weight and fuel efficiency (in miles per gallon) of 32 cars, along with several other measurements. This dataset is built into R and can be loaded using the data() function. Type the following in the R console:

data(mtcars)

You can take a look at the data by using the head() function, which will show you the first few rows of the dataset:

head(mtcars)

The output should look something like this:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

As you can see, each row is a car model and the columns are mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear and carb. In this example we only need mpg (miles per gallon) and wt (weight in units of 1,000 lbs).
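Before fitting anything, it can help to look at the two columns of interest on their own. The following is a minimal sketch that assumes the built-in mtcars data; the object name cars_subset is only an illustrative choice and is not used elsewhere in this article.

```r
# Keep only the response (mpg) and the predictor (wt)
cars_subset <- mtcars[, c("mpg", "wt")]

# How many cars are there, and how are the two columns distributed?
nrow(cars_subset)
summary(cars_subset)

# Correlation between weight and fuel efficiency
wt_mpg_cor <- cor(cars_subset$wt, cars_subset$mpg)
wt_mpg_cor
```

A strongly negative correlation here is a first hint that heavier cars tend to have lower mpg, which is what the regression in the next step quantifies.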

Step 2: Calculate Linear Regression in R

Now that we have our data loaded, we can start performing linear regression. To perform linear regression, we use the lm() function.

model <- lm(mpg ~ wt, data = mtcars)

The first argument of the function is a formula that specifies the model. In this case, the model is predicting mpg (fuel efficiency) using wt (weight). The data = mtcars argument specifies that the data set to use is mtcars.
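As a quick sanity check after fitting, the estimated intercept and slope can be pulled straight from the model object. This sketch uses only base R helpers (coef(), fitted(), residuals()); it refits the same model so that it can be run on its own.

```r
# Fit mpg as a linear function of wt
model <- lm(mpg ~ wt, data = mtcars)

# Named vector holding the intercept and the slope for wt
coef(model)

# Fitted values and residuals for the first few cars
head(fitted(model))
head(residuals(model))
```

The formula mpg ~ wt reads as "mpg modeled by wt"; R adds the intercept term automatically.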

Step 3: Get the Summary of the Regression Model

Once you have fit the model, you can get a summary of the model by using the summary() function. The summary includes information on the residuals, coefficients, R-squared value, F-statistic, and p-value.

Here is an example of how to get the summary of the model:

summary(model)

This summary of the linear regression model should look like this:

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5275 -2.3279 -0.4826  1.2975  6.8724

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8245  20.527  < 2e-16 ***
wt           -5.3445     0.5534  -9.659 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.576 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Step 4: Interpret the Linear Regression Results

Now that we have the output, here is how to interpret the linear regression results for our example:

The output shows that the coefficient for wt (weight) is -5.3445, and the intercept is 37.2851. Because wt is measured in units of 1,000 lbs, each additional 1,000 lbs of weight is associated with a drop of about 5.34 mpg.
The p-value for wt (1.29e-10) is far below 0.05, indicating that the relationship between weight and fuel efficiency is statistically significant.
The R-squared value, which measures the proportion of the variation in the response variable explained by the predictor variable, is 0.7528. This means that the car's weight explains 75.28% of the variation in fuel efficiency in this dataset.
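The numbers discussed above do not have to be read off the printed summary; they can also be extracted programmatically. The sketch below assumes the same mpg ~ wt model; model_summary is an illustrative name. It also calls confint(), which is not covered in this article, to show the uncertainty around each coefficient.

```r
model <- lm(mpg ~ wt, data = mtcars)
model_summary <- summary(model)

# Coefficient table: estimate, standard error, t value, p-value
model_summary$coefficients

# R-squared and adjusted R-squared as plain numbers
model_summary$r.squared
model_summary$adj.r.squared

# 95% confidence intervals for the intercept and the slope
coef_intervals <- confint(model, level = 0.95)
coef_intervals
```

If the confidence interval for wt lies entirely below zero, that agrees with the small p-value: the data are not consistent with a slope of zero.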

Step 5: Plot the Regression Line in a Graph

Numbers alone are hard to picture, so it helps to visualize the fit. You can plot the regression line using the ggplot2 package in R. If ggplot2 is not installed yet, run install.packages("ggplot2") once before loading it.

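library(ggplot2)

# Scatterplot of mpg against wt with a least-squares line on top
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # se = FALSE hides the confidence band
  ggtitle("Linear Regression of mpg vs wt")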

The scatterplot shows the relationship between weight (wt) and fuel efficiency (mpg). The straight line is the regression line, the least-squares best fit through the points.

geom_smooth() with the lm method fits the same mpg ~ wt regression behind the scenes, so the line has the same intercept and slope as the model we fitted earlier. The points show the raw data, and the line summarizes the downward trend between the predictor and the response.
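If ggplot2 is not available, roughly the same picture can be drawn with base R graphics. This is a sketch of an alternative, not the approach used in this article; the axis labels and colors are arbitrary choices.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Scatterplot of the raw data
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1,000 lbs)",
     ylab = "Miles per gallon",
     main = "Linear Regression of mpg vs wt")

# abline() accepts a fitted lm object and draws its intercept and slope
abline(model, col = "blue", lwd = 2)
```

Here the line is drawn directly from the fitted model object, which makes the link between the plot and the coefficients explicit.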

Step 6: Using Linear Regression in R to Make Predictions

Now that we have our linear regression model in R, it is time to use it to make predictions using the following syntax:

predictions <- predict(model, newdata = data.frame(wt = c(3, 4)))
predictions

Here is what each part of this code does:

predict is the R function used to make predictions based on a fitted model.
model is the object that stores the fitted linear regression model, created with the lm function in Step 2.
newdata is an argument that specifies the values of the independent variable (in this case, wt) for which you want to make predictions. The values are passed in as a data frame using the data.frame function, and the column name must match the predictor name used in the formula. The values for wt in this example are c(3, 4); because wt is in units of 1,000 lbs, the predictions are for cars weighing 3,000 and 4,000 pounds, respectively.

The output is the predicted value of the dependent variable (in this case, mpg) for each value of the independent variable specified in newdata.

And here is the output for this prediction:

       1        2 
21.25171 15.90724 

These numbers are the predicted values of mpg for two cars with weights of 3,000 pounds and 4,000 pounds, respectively. The predicted values can be interpreted as follows:

For a car weighing 3,000 pounds, the linear regression model predicts a value of 21.25 mpg.
For a car weighing 4,000 pounds, the linear regression model predicts a value of 15.91 mpg.

It's important to note that these are only predictions and may not necessarily match the actual mpg values for these cars. However, the linear regression model provides us with a way to estimate the relationship between wt and mpg and make predictions based on this relationship. This can be useful for making decisions and predictions in real-world applications.
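Under the hood, a prediction from a simple linear regression is just the regression equation, intercept plus slope times weight, evaluated at the new values. The sketch below checks this by hand against predict(); the names new_cars, manual and automatic are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(3, 4))

# Regression equation by hand: mpg = intercept + slope * wt
b <- coef(model)
manual <- b[["(Intercept)"]] + b[["wt"]] * new_cars$wt

# The same predictions from predict()
automatic <- predict(model, newdata = new_cars)

# Both approaches should agree (names are dropped before comparing)
all.equal(unname(automatic), manual)
```

Doing the arithmetic once by hand is a useful check that the model object and the new data frame are the ones you think they are.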

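If needed, you can compare the model's predictions with the actual values. The predictions above are for two hypothetical cars, so they cannot be lined up against the 32 rows of mtcars. Instead, call predict() without newdata to get the fitted value for every car in the dataset, and combine it with the observed mpg in a data frame:

results <- data.frame(actual = mtcars$mpg, predicted = predict(model))
head(results)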

Frequently Asked Questions

What is the lm() function in R?

The lm() function in R is used to fit linear regression models. It stands for 'linear model' and calculates the relationship between a dependent variable (response) and one or more independent variables (predictors). The basic syntax is lm(y ~ x, data = dataset), where y is the response variable and x is the predictor variable. The function returns a model object containing coefficients, residuals, fitted values, and other statistical information needed for analysis and prediction.

How do I interpret the R-squared value in linear regression?

The R-squared value (also called the coefficient of determination) measures how well your linear regression model fits the data. It ranges from 0 to 1, where 0 means the model explains none of the variability and 1 means it explains all of it. For example, an R-squared of 0.75 means that 75% of the variation in the dependent variable is explained by the independent variable(s). Generally, higher R-squared values indicate a better fit, but there is no universal cutoff: what counts as acceptable depends on your field. As a rough rule of thumb, values around 0.50 are often considered reasonable in the social sciences, while the physical sciences may expect 0.90 or more. Keep in mind that R-squared never decreases when you add predictors, so use the adjusted R-squared when comparing models of different sizes.
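To make the definition concrete, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This sketch assumes the mpg ~ wt model from the article; ss_res, ss_tot and r_squared_manual are illustrative names.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Variation left unexplained by the model
ss_res <- sum(residuals(model)^2)

# Total variation of mpg around its mean
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Proportion of variation explained
r_squared_manual <- 1 - ss_res / ss_tot
r_squared_manual

# Should match the value reported by summary()
all.equal(r_squared_manual, summary(model)$r.squared)
```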

What does a negative coefficient mean in linear regression?

A negative coefficient in linear regression indicates an inverse relationship between the predictor and response variables. For example, in our car weight example, the coefficient for weight is -5.34, meaning that for every 1-unit increase in weight (1,000 lbs), the predicted fuel efficiency decreases by 5.34 miles per gallon. Negative coefficients show that as one variable increases, the other decreases. The magnitude of the coefficient tells you how large the change in the response is per unit of the predictor, so it depends on the units used; the strength of the relationship is better judged from R-squared or the correlation.

How do I make predictions with a linear regression model in R?

To make predictions with your linear regression model in R, use the predict() function. First, create your model with lm(), then use predict(model, newdata) where newdata is a data frame containing the predictor values. For example: model <- lm(mpg ~ wt, data = mtcars), then predictions <- predict(model, newdata = data.frame(wt = c(2.5, 3.0, 3.5))). This will return predicted mpg values for cars weighing 2.5, 3.0, and 3.5 thousand pounds. You can also add interval = 'confidence' or interval = 'prediction' to get confidence or prediction intervals.
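The interval options mentioned above can be sketched as follows. Both calls return a matrix with the columns fit, lwr and upr; the names new_cars, conf_int and pred_int are illustrative, and level = 0.95 is simply the default spelled out.

```r
model <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# Uncertainty about the average mpg of cars with these weights
conf_int <- predict(model, newdata = new_cars,
                    interval = "confidence", level = 0.95)

# Uncertainty about the mpg of a single new car with these weights
pred_int <- predict(model, newdata = new_cars,
                    interval = "prediction", level = 0.95)

conf_int
pred_int
```

Prediction intervals are always wider than confidence intervals because they also account for the scatter of individual cars around the regression line.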

What is the p-value in linear regression and how do I interpret it?

The p-value in linear regression tests whether the relationship between your predictor and response variable is statistically significant. It is the probability of seeing an estimate at least this far from zero if the true coefficient were zero. A p-value less than 0.05 (5% significance level) is conventionally treated as statistically significant. In the summary output, you'll see a p-value for each coefficient in the Pr(>|t|) column. For example, Pr(>|t|) = 1.29e-10 (which is 0.000000000129) is highly significant, meaning you can reject the null hypothesis that there's no linear relationship between the variables.
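The p-value in the Pr(>|t|) column can be reproduced from the t value and the residual degrees of freedom using the t distribution. This is a sketch for the wt coefficient of the mpg ~ wt model; t_value, resid_df and p_manual are illustrative names.

```r
model <- lm(mpg ~ wt, data = mtcars)
coef_table <- summary(model)$coefficients

# t value = estimate / standard error
t_value <- coef_table["wt", "t value"]

# Residual degrees of freedom (number of cars minus number of coefficients)
resid_df <- df.residual(model)

# Two-sided p-value: probability of a t statistic at least this extreme
p_manual <- 2 * pt(abs(t_value), df = resid_df, lower.tail = FALSE)
p_manual

# Should match the value in the summary table
all.equal(p_manual, coef_table["wt", "Pr(>|t|)"])
```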

What does the Estimate column mean in the regression output?

The Estimate column in regression output shows the coefficient values for your model. The Intercept estimate is the predicted value of the response variable when all predictors are zero. For predictor variables, the estimate shows how much the response variable changes for a one-unit increase in that predictor, holding other variables constant. For example, if weight has an estimate of -5.34, it means mpg decreases by 5.34 for every 1-unit increase in weight. These estimates form your regression equation: y = Intercept + (Estimate × x).

How do I check if my linear regression assumptions are met?

Linear regression has four main assumptions you should check: (1) Linearity - the relationship between variables is linear; (2) Independence - observations are independent; (3) Homoscedasticity - residuals have constant variance; (4) Normality - residuals are normally distributed. In R, plot(model) produces diagnostic plots for these checks. The first plot (Residuals vs Fitted) checks linearity and homoscedasticity, and the Q-Q plot checks normality. Independence is mostly a matter of how the data were collected; for ordered data such as time series, you can use the Durbin-Watson test, which is not in base R but is available in add-on packages such as lmtest (dwtest()) or car (durbinWatsonTest()). Violations of these assumptions may require data transformation or alternative modeling approaches.
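A minimal sketch of these checks for the mpg ~ wt model is shown below. It sticks to base R; the Shapiro-Wilk test is an extra normality check not mentioned above, and the Durbin-Watson line is left commented out because it assumes the lmtest package is installed.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Show the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)  # restore the previous plotting layout

# Formal test of normality of the residuals (base R)
shapiro_result <- shapiro.test(residuals(model))
shapiro_result

# Durbin-Watson test for autocorrelation (assumes the lmtest package):
# lmtest::dwtest(model)
```

The plots are usually more informative than any single test, especially with only 32 observations.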

Can I perform multiple linear regression in R with the lm() function?

Yes, the lm() function handles both simple and multiple linear regression. For multiple predictors, simply add them to your formula with + signs: lm(y ~ x1 + x2 + x3, data = dataset). For example, to predict mpg using both weight and horsepower: model <- lm(mpg ~ wt + hp, data = mtcars). The summary() output will show a coefficient for each predictor, allowing you to see how each variable is related to the response while the others are held constant. You can also include interaction terms using * (e.g., wt * hp, which expands to wt + hp + wt:hp) to test whether the effect of one variable depends on another.
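The following sketch fits the simple model, the two-predictor model, and the interaction model side by side. The object names are illustrative, and the anova() comparison is an addition not covered in this article.

```r
# Simple regression: weight only
simple_model <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: weight and horsepower
multi_model <- lm(mpg ~ wt + hp, data = mtcars)

# Interaction: wt * hp expands to wt + hp + wt:hp
interaction_model <- lm(mpg ~ wt * hp, data = mtcars)

# One coefficient per term, plus the intercept
coef(multi_model)
coef(interaction_model)

# Does adding hp improve on the weight-only model?
anova(simple_model, multi_model)

# Adjusted R-squared penalizes extra predictors, so it is fairer for comparison
summary(simple_model)$adj.r.squared
summary(multi_model)$adj.r.squared
```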

Conclusion

Conducting linear regression in R is a powerful way to understand the relationship between variables and make predictions. In this article, we have shown how to perform linear regression in R using the lm() function, how to interpret the results, and how to make predictions using the predict() function.