Basic Data Analysis Operations in Pandas


Pandas is a powerful library in Python used for data analysis and manipulation. It provides a range of functions to perform basic operations on data such as summarizing, filtering, and aggregating. This article explores some of the basic data analysis operations you can perform using Pandas.

Importing Pandas

Before starting with the data analysis, you need to import the Pandas library:

    import pandas as pd
        

Loading Data

To begin with, you need to load your data into a Pandas DataFrame. Data can be loaded from various sources like CSV, Excel, or a SQL database. Here is an example of loading data from a CSV file:

    # Reading a CSV file into a DataFrame
    df = pd.read_csv("data.csv")
    print(df)
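
To try this without a file on disk, the sketch below reads CSV text from an in-memory buffer instead. The sample rows and the column names Name, Age, and City are assumptions made up for illustration; they are not the contents of any real data.csv.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv
csv_text = """Name,Age,City
Alice,34,Paris
Bob,28,London
Carol,,Paris
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type inferred for each column
```

Note that the empty Age field for Carol is read as a missing value, which is why the Age column is stored as floating point rather than integer.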
        

Exploring the Data

Once the data is loaded, you can explore it to understand its structure. Some common methods for exploring the data include:

Viewing the First Few Rows

The head() method is used to view the first few rows of the DataFrame:

    # Displaying the first 5 rows of the DataFrame
    print(df.head())
        

Getting DataFrame Info

The info() method gives information about the DataFrame, such as the column names, non-null counts, and data types:

    # Displaying the DataFrame info (info() prints directly and returns None)
    df.info()
        

Descriptive Statistics

The describe() method provides summary statistics of numeric columns:

    # Getting summary statistics of the DataFrame
    print(df.describe())
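
By default describe() only reports on numeric columns. As a small sketch, assuming a made-up DataFrame with Name, Age, and City columns, passing include="all" also summarizes the text columns:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [34, 28, 45],
    "City": ["Paris", "London", "Paris"],
})

# Default behavior: only the numeric 'Age' column is summarized
numeric_summary = df.describe()

# include="all" adds count, unique, top, and freq for non-numeric columns
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```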
        

Basic Data Selection

In Pandas, you can select data based on column names, row indices, or specific conditions.

Selecting Specific Columns

You can select one or more columns from a DataFrame:

    # Selecting a single column
    age_column = df["Age"]

    # Selecting multiple columns
    age_and_name = df[["Age", "Name"]]
        

Selecting Specific Rows

You can select rows by their integer position using the iloc[] indexer. Positions start at 0, and the end of a slice is excluded:

    # Selecting the first row
    first_row = df.iloc[0]

    # Selecting the rows at positions 1 through 4 (position 5 is excluded)
    rows_range = df.iloc[1:5]
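
Position-based selection with iloc[] is easy to confuse with label-based selection with loc[]. The sketch below, which assumes a made-up DataFrame with a non-default index, shows the difference:

```python
import pandas as pd

# Hypothetical sample data with index labels 10, 20, 30, 40
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Carol", "Dan"], "Age": [34, 28, 45, 31]},
    index=[10, 20, 30, 40],
)

# iloc slices by position: positions 1 and 2 (the end position is excluded)
by_position = df.iloc[1:3]

# loc slices by label: labels 20 through 30 (the end label is included)
by_label = df.loc[20:30]

# iloc also takes a row position and a column position together
single_value = df.iloc[0, 1]  # first row, second column ('Age')

print(by_position)
print(by_label)
print(single_value)
```

With this index, both slices happen to return the same two rows, even though one is written as 1:3 and the other as 20:30.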
        

Selecting Rows by Condition

Filtering rows based on specific conditions can be done using boolean indexing:

    # Selecting rows where Age is greater than 30
    filtered_df = df[df["Age"] > 30]
    print(filtered_df)
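
Conditions can also be combined. A minimal sketch, assuming made-up sample data: use & for "and", | for "or", and ~ for "not", and wrap each comparison in parentheses because these operators bind more tightly than > or ==.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Age": [34, 28, 45, 31],
    "City": ["Paris", "London", "Paris", "Berlin"],
})

# Both conditions must hold: older than 30 AND living in Paris
older_in_paris = df[(df["Age"] > 30) & (df["City"] == "Paris")]

# isin() checks membership in a list of values
london_or_berlin = df[df["City"].isin(["London", "Berlin"])]

# ~ negates a boolean mask
not_paris = df[~(df["City"] == "Paris")]

print(older_in_paris)
print(london_or_berlin)
print(not_paris)
```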
        

Handling Missing Data

In real-world data, it’s common to have missing or null values. Pandas provides several ways to handle these missing values.

Identifying Missing Values

Use the isnull() method to check for missing values. It returns a DataFrame of the same shape containing True wherever a value is missing:

    # Identifying missing values in the DataFrame
    missing_values = df.isnull()
    print(missing_values)
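
On a large DataFrame the full True/False table is hard to read, so a per-column count is often more useful. A small sketch, assuming made-up sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with some missing entries
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Age": [34, np.nan, np.nan],
    "City": ["Paris", "London", "Paris"],
})

# Summing the boolean mask counts the missing values in each column
missing_per_column = df.isnull().sum()

# any(axis=1) flags rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```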
        

Removing Missing Values

You can remove rows that contain missing values using the dropna() method, which returns a new DataFrame and leaves the original unchanged:

    # Dropping rows with missing values
    df_cleaned = df.dropna()
    print(df_cleaned)
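
Dropping every row that has any missing value can discard more data than intended. The sketch below, using made-up sample data, shows two narrower options: the subset parameter restricts the check to chosen columns, and how="all" drops only rows where every value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row is missing everywhere
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, None],
    "Age": [34, np.nan, 45, np.nan],
    "City": ["Paris", "London", None, None],
})

# Default: drop a row if any of its values is missing
drop_any = df.dropna()

# Only consider the 'Age' column when deciding which rows to drop
drop_missing_age = df.dropna(subset=["Age"])

# Drop a row only if all of its values are missing
drop_all_empty = df.dropna(how="all")

print(len(drop_any), len(drop_missing_age), len(drop_all_empty))
```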
        

Filling Missing Values

To fill missing values, use the fillna() method. You can replace missing values with a fixed value or with a statistic computed from the column, such as its mean:

    # Filling missing values with the mean of the column
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    print(df)
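
Different columns usually need different replacement values. As a sketch under the assumption of a made-up DataFrame with a numeric Age column and a text City column, fillna() also accepts a dictionary that maps column names to fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column
df = pd.DataFrame({
    "Age": [30.0, np.nan, 50.0],
    "City": ["Paris", None, "London"],
})

# Fill 'Age' with its median and 'City' with a placeholder label
filled = df.fillna({"Age": df["Age"].median(), "City": "Unknown"})

print(filled)
```

The median is used here only as an illustration; whether the mean, the median, or some other value is appropriate depends on the data.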
        

Grouping and Aggregating Data

Pandas provides powerful functions for grouping and aggregating data.

Grouping Data

You can group data based on one or more columns using the groupby() method, then apply a calculation to each group:

    # Grouping by 'City' and calculating the mean age in each city
    grouped = df.groupby("City")["Age"].mean()
    print(grouped)
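
To make the result concrete, here is a small worked sketch on made-up data. It also shows size() for counting rows per group and as_index=False, which keeps the grouping column as a regular column instead of the index.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "City": ["Paris", "London", "Paris", "London", "Paris"],
    "Age": [30, 40, 50, 20, 40],
})

# Mean age per city: a Series indexed by the city names
mean_age = df.groupby("City")["Age"].mean()

# Number of rows in each group
counts = df.groupby("City").size()

# Same mean, returned as a DataFrame with 'City' as an ordinary column
mean_age_table = df.groupby("City", as_index=False)["Age"].mean()

print(mean_age)
print(counts)
print(mean_age_table)
```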
        

Aggregating Data

You can perform multiple aggregation functions on grouped data:

    # Grouping by 'City' and calculating both mean and sum of age
    aggregated = df.groupby("City")["Age"].agg(["mean", "sum"])
    print(aggregated)
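
When the aggregations span several columns, or the output columns need descriptive names, named aggregation is one option. The sketch below assumes made-up sample data; the output names mean_age, oldest, and people are illustrative choices, not fixed pandas names.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan", "Eve"],
    "City": ["Paris", "London", "Paris", "London", "Paris"],
    "Age": [30, 40, 50, 20, 40],
})

# Each keyword is an output column: (source column, aggregation)
summary = df.groupby("City").agg(
    mean_age=("Age", "mean"),
    oldest=("Age", "max"),
    people=("Name", "count"),
)

print(summary)
```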
        

Sorting Data

Sorting data is a common operation in data analysis. You can sort a DataFrame by one or more columns.

Sorting by a Single Column

Use the sort_values() method to sort the DataFrame by a specific column. It returns a new, sorted DataFrame:

    # Sorting by the 'Age' column in ascending order
    sorted_df = df.sort_values("Age")
    print(sorted_df)
        

Sorting by Multiple Columns

You can also sort by multiple columns by passing a list of column names:

    # Sorting by 'City' and 'Age'
    sorted_df = df.sort_values(["City", "Age"], ascending=[True, False])
    print(sorted_df)
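
The ascending list is matched to the column list position by position. A small worked sketch on made-up data, sorting City from A to Z and, within each city, Age from highest to lowest:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "City": ["Paris", "London", "Paris", "London"],
    "Age": [30, 40, 50, 20],
})

# 'City' ascending, then 'Age' descending within each city
sorted_df = df.sort_values(["City", "Age"], ascending=[True, False])

# Sorting keeps the original index labels; reset_index renumbers from 0
renumbered = sorted_df.reset_index(drop=True)

print(sorted_df)
print(renumbered)
```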
        

Conclusion

Pandas provides a wide range of operations for basic data analysis, from loading and exploring data to cleaning, filtering, grouping, and sorting. These operations form the foundation for more advanced data manipulation and analysis. By mastering these basic operations, you can begin to uncover valuable insights from your data.




