Data Cleaning, Filtering, and Grouping in Pandas


Data cleaning, filtering, and grouping are essential steps in the data analysis process. In Pandas, there are various methods to clean, filter, and group your data, which can help you prepare the data for further analysis. This article will cover how to clean, filter, and group data using Pandas with examples.

Importing Pandas

To start, import the Pandas library:

    import pandas as pd
        

Data Cleaning in Pandas

Data cleaning involves handling missing values, duplicates, and incorrect data types.

Handling Missing Values

There are several methods to handle missing values:

    # Creating a DataFrame with missing values
    data = {
        "Name": ["Alice", "Bob", "Charlie", None],
        "Age": [25, None, 35, 40],
        "City": ["New York", "Los Angeles", "Chicago", "San Francisco"]
    }
    df = pd.DataFrame(data)

    # Dropping rows with missing values
    df_cleaned = df.dropna()
    print(df_cleaned)

    # Filling missing values with a specific value
    df_filled = df.fillna({"Name": "Unknown", "Age": 30})
    print(df_filled)

    # Filling missing values with the mean of the column
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    print(df)
        

Handling Duplicates

You can identify and remove duplicate rows in a DataFrame:

    # Creating a DataFrame with duplicate rows
    data = {
        "Name": ["Alice", "Bob", "Alice", "Charlie"],
        "Age": [25, 30, 25, 35]
    }
    df = pd.DataFrame(data)

    # Checking for duplicate rows
    duplicates = df.duplicated()
    print(duplicates)

    # Removing duplicate rows
    df_no_duplicates = df.drop_duplicates()
    print(df_no_duplicates)
        

Correcting Data Types

You can convert data types using the astype() function:

    # Converting a column to a specific data type
    df["Age"] = df["Age"].astype(int)
    print(df.dtypes)
        

Filtering Data in Pandas

Filtering is the process of selecting rows from a DataFrame that meet specific criteria.

Filtering Rows Based on Conditions

You can filter rows by applying conditions to columns:

    # Filtering rows where Age is greater than 30
    filtered_df = df[df["Age"] > 30]
    print(filtered_df)

    # Filtering rows where Name is 'Alice'
    filtered_df = df[df["Name"] == "Alice"]
    print(filtered_df)
        

Filtering Rows Using Multiple Conditions

You can combine multiple conditions using the & (and) and | (or) operators:

    # Filtering rows where Age is greater than 25 and Name is 'Alice'
    filtered_df = df[(df["Age"] > 25) & (df["Name"] == "Alice")]
    print(filtered_df)

    # Filtering rows where Age is greater than 25 or Name is 'Bob'
    filtered_df = df[(df["Age"] > 25) | (df["Name"] == "Bob")]
    print(filtered_df)
        

Grouping Data in Pandas

Grouping is a way to aggregate data based on certain criteria. Pandas provides the groupby() function to group data and perform aggregation operations such as sum, mean, or count on each group.

Grouping by One Column

You can group data by a single column:

    # Grouping by 'Name' and calculating the mean Age for each group
    grouped = df.groupby("Name")["Age"].mean()
    print(grouped)
        

Grouping by Multiple Columns

You can also group data by multiple columns:

    # Grouping by 'Name' and 'Age' and calculating the count of each group
    grouped = df.groupby(["Name", "Age"]).size()
    print(grouped)
        

Aggregation on Grouped Data

You can perform multiple aggregation functions on grouped data:

    # Grouping by 'Name' and calculating both the mean and sum of Age
    grouped = df.groupby("Name")["Age"].agg(["mean", "sum"])
    print(grouped)
        

Using Grouping with Other Functions

You can apply custom functions to grouped data using apply():

    # Applying a custom function to each group
    def custom_function(group):
        return group["Age"].max() - group["Age"].min()

    grouped = df.groupby("Name").apply(custom_function)
    print(grouped)
        

Conclusion

Data cleaning, filtering, and grouping are essential skills when working with data. Pandas provides powerful functions to clean data by handling missing values, duplicates, and incorrect data types. It also allows you to filter data based on specific conditions and group data for aggregation and analysis. By mastering these techniques, you can efficiently prepare and analyze data in Python.





Advertisement