
Cyber Freeze AI

1.5: Mastering Data Manipulation and Visualization with Python

Learn how to manipulate data with NumPy and Pandas, and visualize insights using Matplotlib in Python.


Welcome to the next post in the AI Zero to Mastery series! In this chapter, we’ll dive deeper into NumPy, Pandas, and Matplotlib to learn how to manipulate and visualize data effectively. These tools are the cornerstone of data analysis and are essential for anyone working in AI, data science, or analytics.


Follow Along on Google Colab!

To practice as you read, open the interactive notebook on Google Colab: Try this tutorial on Colab.


1. What Are Libraries in Python?

Definition of Libraries

In Python, a library is a collection of pre-written code that simplifies common tasks. For example:

  • If you need to perform matrix calculations, use NumPy.
  • To analyze structured data, use Pandas.
  • For creating charts and graphs, use Matplotlib.

Think of libraries as pre-packed toolkits filled with ready-made tools for specific tasks.


Why Use Libraries?

  1. Simplify Complex Tasks:
    Writing code to calculate the average of an array? Libraries like NumPy make it a one-liner:

    import numpy as np
    data = [1, 2, 3, 4, 5]
    print(np.mean(data))  # Output: 3.0

    Result

    3.0

  2. Save Time:
    Instead of writing custom code to filter data, use Pandas:

    import pandas as pd
    df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
    print(df[df["Age"] > 25])  # Keeps only rows where Age > 25 (Bob)

  3. Community Support:
    Popular libraries like NumPy and Pandas are well-documented and frequently updated by their communities.


How Libraries Are Similar to Functions

Libraries and functions are both reusable pieces of code, but a library goes further: it bundles many related functions and other tools into a single package you can import.

Functions: Reusable Code for a Single Task

A function is a reusable block of code that performs one task:

def square(num):
    return num ** 2

print(square(5))  # Output: 25

Libraries: A Toolkit of Functions

Libraries are collections of related functions and tools for broader tasks:

  • NumPy includes tools for mathematical operations.
  • Pandas provides tools for tabular data analysis.
  • Matplotlib enables creating professional plots.

For example, NumPy lets you calculate the mean of an array with np.mean(), while Pandas lets you drop rows that contain missing data with df.dropna().
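To make those two calls concrete, here is a minimal sketch that uses both together. The names and scores are made up for illustration, and Bob's score is deliberately left missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Bob's score is missing (NaN)
scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85.0, np.nan, 95.0],
})

# dropna() returns a new DataFrame without the rows that contain missing values
cleaned = scores.dropna()
print(cleaned)

# np.mean() then averages the remaining scores
mean_score = np.mean(cleaned["Score"])
print(mean_score)  # 90.0
```

Note that dropna() does not change the original DataFrame; it hands back a cleaned copy, which is why the result is stored in a new variable here.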


2. NumPy and Pandas: Introduction to Data Manipulation

Why NumPy and Pandas?

Both libraries are essential for handling large datasets:

  • NumPy provides fast, memory-efficient array operations.
  • Pandas extends this with labeled data and advanced analysis tools.
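As a small illustration of what "labeled data" means, the sketch below stores the same values in a NumPy array and in a Pandas Series. The labels are invented for this example.

```python
import numpy as np
import pandas as pd

# A NumPy array is accessed by position only
ages_array = np.array([25, 30, 35])
print(ages_array[1])  # 30

# A Pandas Series attaches a label to each value (these labels are made up)
ages_series = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"])
print(ages_series["Bob"])  # 30
```

Looking values up by label rather than by position tends to make analysis code easier to read and less fragile when rows are reordered.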

Getting Started with NumPy

NumPy Arrays

For numeric data, NumPy arrays are typically faster and more memory-efficient than Python lists, and arithmetic is applied to every element at once:

import numpy as np

# Create an array
data = np.array([1, 2, 3, 4, 5])
print(data)  # Output: [1 2 3 4 5]

# Perform operations
print(data + 5)  # Add 5 to each element: [ 6  7  8  9 10]
print(data * 2)  # Multiply each element by 2: [ 2  4  6  8 10]

Result

[1 2 3 4 5]
[ 6  7  8  9 10]
[ 2  4  6  8 10]
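To get a rough feel for the memory claim, the sketch below compares a list and an array holding the same 1,000 integers. This is only an estimate: the list figure depends on the Python version, and CPython shares some small integer objects, so treat the numbers as indicative rather than exact.

```python
import sys
import numpy as np

n = 1000
numbers_list = list(range(n))
numbers_array = np.arange(n, dtype=np.int64)

# A list stores references to separate int objects, so count both parts
list_bytes = sys.getsizeof(numbers_list) + sum(sys.getsizeof(x) for x in numbers_list)

# An array stores the raw values in one contiguous buffer: 1000 values * 8 bytes
array_bytes = numbers_array.nbytes

print("List (approx.):", list_bytes, "bytes")
print("Array data:", array_bytes, "bytes")
```

The array's data buffer is exactly 8,000 bytes here because each int64 value takes 8 bytes, while the list needs noticeably more.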

Working with 2D Arrays

NumPy also supports multi-dimensional arrays:

# Create a 2D array
matrix = np.array([[1, 2], [3, 4]])
print(matrix)

# Calculate the sum of all elements
print(np.sum(matrix))  # Output: 10

# Transpose the array
print(matrix.T)  # Output: [[1 3] [2 4]]

Result

[[1 2]
 [3 4]]
10
[[1 3]
 [2 4]]
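With 2D arrays you will often want totals per row or per column rather than one grand total. A minimal sketch using the axis parameter of np.sum on the same matrix:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

# axis=0 sums down each column; axis=1 sums across each row
col_sums = np.sum(matrix, axis=0)
row_sums = np.sum(matrix, axis=1)

print(col_sums)  # [4 6]
print(row_sums)  # [3 7]
```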

Introduction to Pandas

Pandas DataFrames

A DataFrame is like a spreadsheet in Python. Each column has a name, and rows can be indexed:

import pandas as pd

# Create a DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Score": [85, 90, 95]
}
df = pd.DataFrame(data)
print(df)

Result

      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     95
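Before working with a DataFrame, it usually helps to check its size and column names, and to look at a single row. The sketch below rebuilds the same DataFrame so it can run on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Score": [85, 90, 95],
})

print(df.shape)          # (3, 3): three rows, three columns
print(list(df.columns))  # ['Name', 'Age', 'Score']

# iloc selects a row by position; the result is a Series keyed by column name
first_row = df.iloc[0]
print(first_row["Name"])  # Alice
```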

Basic DataFrame Operations

# Access a column
print(df["Name"])  # Prints a Series: the three names with their index labels (0, 1, 2)

# Filter rows
filtered = df[df["Score"] > 90]
print(filtered)  # Output: Rows where Score > 90
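Filters can also be combined, and you can pick which columns to keep at the same time. Here is one possible sketch using loc; the thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Score": [85, 90, 95],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
mask = (df["Score"] >= 90) & (df["Age"] < 35)

# loc takes a row selector first, then the columns to keep
result = df.loc[mask, ["Name", "Score"]]
print(result)  # Only Bob matches both conditions
```

The parentheses matter: without them, Python evaluates & before the comparisons and the expression fails.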

Data Manipulation

You can add, remove, or modify columns easily:

# Add a new column
df["Pass"] = df["Score"] >= 90
print(df)

# Drop a column
df = df.drop(columns=["Pass"])
print(df)
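Modifying a column follows the same pattern as adding one: assign to the column name. The sketch below applies a hypothetical 5-point bonus and then renames the column; both the bonus and the new name "FinalScore" are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Score": [85, 90, 95],
})

# Modify an existing column: add 5 bonus points to every score
df["Score"] = df["Score"] + 5

# rename() returns a new DataFrame, so assign the result back
df = df.rename(columns={"Score": "FinalScore"})
print(df)
```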

3. Matplotlib Basics: Data Visualization

Why Matplotlib?

Visualization is a critical step in understanding data. With Matplotlib, you can create:

  • Line charts
  • Bar charts
  • Scatter plots

Creating Basic Plots

import matplotlib.pyplot as plt

# Data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Line chart
plt.plot(x, y, marker="o")
plt.title("Line Chart")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

Result

Line chart example

Bar Chart Example

# Data
categories = ["A", "B", "C"]
values = [10, 20, 15]

# Bar chart
plt.bar(categories, values, color="skyblue")
plt.title("Bar Chart")
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()

Result

Bar chart example
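Scatter plots, the third chart type mentioned earlier, follow the same pattern. A minimal sketch with made-up data (hours studied versus exam score):

```python
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 71, 78, 90]

# Scatter plot: one marker per (hours, score) pair
points = plt.scatter(hours, scores, color="green")
plt.title("Scatter Plot")
plt.xlabel("Hours Studied")
plt.ylabel("Score")
plt.show()
```

A scatter plot is a good first choice when you want to see whether two numeric columns move together.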


4. Real-Life Example: Sales Analysis

Let’s see how these libraries work together to analyze sales data.

Scenario

You have sales data with the following columns:

  • Date
  • Product
  • Quantity
  • Price

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create a dataset
data = {
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Product": ["Laptop", "Tablet", "Smartphone"],
    "Quantity": [2, 5, 3],
    "Price": [1000, 500, 800]
}
df = pd.DataFrame(data)

# Calculate total revenue
df["Revenue"] = df["Quantity"] * df["Price"]
print(df)

# Compute the average revenue across products
print("Average Revenue:", np.mean(df["Revenue"]))

# Plot revenue by product
plt.bar(df["Product"], df["Revenue"], color="orange")
plt.title("Revenue by Product")
plt.xlabel("Product")
plt.ylabel("Revenue")
plt.show()

Result

         Date     Product  Quantity  Price  Revenue
0  2023-01-01      Laptop         2   1000     2000
1  2023-01-02      Tablet         5    500     2500
2  2023-01-03  Smartphone         3    800     2400

Average Revenue: 2300.0
(Bar chart of revenue by product displayed)

Example code result
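A natural follow-up question is which product earned the most. One way to answer it, sketched on the same dataset, is to use idxmax and sort_values:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Product": ["Laptop", "Tablet", "Smartphone"],
    "Quantity": [2, 5, 3],
    "Price": [1000, 500, 800],
})
df["Revenue"] = df["Quantity"] * df["Price"]

# idxmax() returns the index label of the row with the largest Revenue
top_row = df.loc[df["Revenue"].idxmax()]
print(top_row["Product"], top_row["Revenue"])  # Tablet 2500

# Rank all products from highest to lowest revenue
ranked = df.sort_values("Revenue", ascending=False)
print(ranked[["Product", "Revenue"]])
```

Even though the laptop has the highest price, the tablet comes out on top because more units were sold.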

5. Conclusion

With NumPy and Pandas for data manipulation and Matplotlib for visualization, you can handle real-world data efficiently. Practice these libraries to build a strong foundation for AI and data science.


Next in the Series: 1.6 Basic Math for ML

Stay tuned for the next post, where we’ll cover:

  • Linear Algebra Basics: An introduction to vectors and matrices with real-life analogies.
  • Probability and Statistics: Covering mean, median, standard deviation, and their applications in machine learning.

Feedback and Next Steps
Consistency and practice are key to mastery. Share your thoughts and questions in the comments!
