2.1: Linear Regression - Introduction to Core ML Algorithms
Dive into the basics of core Machine Learning algorithms. Learn about Linear Regression, its role in predicting continuous variables, and how to interpret its components like slope and intercept.
What is Linear Regression?
Linear Regression is a way to predict a number based on some input data. Imagine you're trying to guess someone's height based on their age. If you know that people tend to get taller as they get older, you can use Linear Regression to draw a straight line through the data points and make predictions.
It’s like drawing the best-fit straight line on a graph that connects the relationship between the input (like age) and output (like height).
Key Ideas Behind Linear Regression:
- What Does It Do?
  Linear Regression tries to find the relationship between:
  - An input (independent variable): The thing you know, like the size of a house.
  - An output (dependent variable): The thing you want to predict, like the price of the house.
- Equation of a Line:
  Do you remember the equation of a straight line from math class?
  y = mx + c
  Here:
  - y is what we are predicting (e.g., house price).
  - x is the input variable (e.g., house size).
  - m is the slope of the line (it shows how much y changes when x changes).
  - c is the intercept (it shows the starting value of y when x is zero).
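To make the equation concrete, here is a minimal Python sketch (the function name predict_y is just an illustrative choice, not from any library) that plugs numbers into y = mx + c:
# A tiny sketch of the line equation y = m*x + c
def predict_y(x, m, c):
    return m * x + c

# With slope m = 2 and intercept c = 1, an input of x = 3 gives y = 7
print(predict_y(3, m=2, c=1))  # 7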
Simple Example: Predicting Grades Based on Study Hours
Imagine you want to predict a student’s exam score based on how many hours they study.
- If students who study 2 hours get a score of 50, and students who study 4 hours get a score of 70, then the line might look like this:
  Score = 10 × Hours + 30
- The slope (m) is 10 because for every extra hour of study, the score increases by 10.
- The intercept (c) is 30, meaning if you study 0 hours, your score starts at 30 (not a good idea, though!).
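If you want to check those numbers yourself, here is a quick sketch using NumPy's polyfit (assuming NumPy is installed) to recover the slope and intercept from the two points above:
import numpy as np

# Two observations: (hours studied, exam score)
hours = [2, 4]
scores = [50, 70]

# Fit a degree-1 polynomial (a straight line) through the points
slope, intercept = np.polyfit(hours, scores, 1)
print(slope, intercept)  # approximately 10.0 and 30.0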
How Does It Work?
Linear Regression finds the best-fit line by:
- Looking at all the points (e.g., data of hours and scores).
- Figuring out a line that is closest to all those points (using a method called "Least Squares").
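As a rough illustration of what Least Squares computes for a single input, the closed-form formulas for the slope and intercept can be written out directly. The data below is made up for this sketch:
# Sketch of the least-squares formulas for one input variable
# (made-up data: study hours vs. exam scores)
xs = [1, 2, 3, 4, 5]
ys = [42, 48, 61, 70, 79]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Slope: how x and y vary together, divided by how much x varies on its own
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum((x - x_mean) ** 2 for x in xs)
# Intercept: where the best-fit line crosses the y-axis
c = y_mean - m * x_mean

print(m, c)  # 9.6 and 31.2 for this made-up data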
How to Understand Slope and Intercept?
- Slope (m): Think of it as how steep the hill is. A bigger slope means y changes more for every unit change in x.
  Example: If you earn $10 for every hour you work, the slope is 10.
- Intercept (c): Think of it as where the line starts. It shows the baseline value when x is zero.
  Example: If you have $30 in savings before you start working, that's the intercept.
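To tie the two ideas together, here is a tiny sketch based on the made-up pay example above: the intercept is what you have at zero hours, and the slope is how much each extra hour adds.
# Made-up pay example: $10 earned per hour, starting from $30 in savings
slope = 10       # dollars earned per extra hour worked
intercept = 30   # dollars you already have at 0 hours

for hours in [0, 1, 2]:
    total = slope * hours + intercept
    print(hours, total)  # 0 -> 30, 1 -> 40, 2 -> 50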
Advantages of Linear Regression:
- It’s easy to understand and use.
- It’s great for problems with straight-line relationships.
Limitations of Linear Regression:
- It doesn’t work well if the data doesn’t follow a straight-line pattern.
  Example: If studying 5 hours increases scores a lot but studying 10 hours doesn’t help as much, Linear Regression might not work well.
- It’s sensitive to outliers (extremely high or low values).
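To see the outlier problem in action, here is a small sketch (made-up numbers, and scikit-learn assumed to be installed) that fits the same kind of model with and without one extreme value:
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 10x + 30 exactly
X = [[1], [2], [3], [4], [5]]
y_clean = [40, 50, 60, 70, 80]
y_outlier = [40, 50, 60, 70, 200]  # the last point is an extreme outlier

clean_model = LinearRegression().fit(X, y_clean)
outlier_model = LinearRegression().fit(X, y_outlier)

print(clean_model.coef_[0], clean_model.intercept_)      # about 10 and 30
print(outlier_model.coef_[0], outlier_model.intercept_)  # about 34 and -18: one point dragged the whole line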
Step-by-Step Example in Python: Predicting House Prices
Here’s how you can use Linear Regression to predict house prices based on their size.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Data: Size of house (in 1000 sq ft) and price (in $1000)
X = [[1], [2], [3], [4], [5]] # Size of house
y = [100, 150, 200, 250, 300] # Price of house
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Model parameters
slope = model.coef_[0]
intercept = model.intercept_
print(f"Slope (m): {slope}")
print(f"Intercept (c): {intercept}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
- Slope: Here the slope comes out to 50, meaning for every additional 1000 sq ft, the price increases by $50,000.
- Intercept: Here the intercept comes out to 50, meaning a house with size 0 would still have a base price of $50,000 (land value, for example).
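Once the model is trained, you can also use it to predict prices for sizes it hasn't seen. Continuing the example above (the 6 below is a made-up new house of 6000 sq ft):
# Predict the price of a new 6000 sq ft house (size in 1000 sq ft, price in $1000)
new_house = [[6]]
predicted_price = model.predict(new_house)
print(predicted_price[0])  # about 350, i.e. roughly $350,000 for this data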
Summary:
Linear Regression is like drawing the straightest line through your data to make predictions. It’s simple, powerful, and a great starting point for learning machine learning. With real-world examples like predicting grades, house prices, or sales, you can see how useful it can be!