2882 Drop Duplicate Rows

Problem Statement

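DataFrame customers

+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+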

There are some duplicate rows in the DataFrame based on the email column.

Write a solution to remove these duplicate rows and keep only the first occurrence.

The result format is in the following example.

For the whole problem statement, please refer here.

Plans

Use pandas to handle the data.
Find duplicate rows based on the email column.
Keep only the first occurrence of each duplicate.
Provide the cleaned DataFrame.
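
To make the second step concrete, here is a minimal sketch of how the repeated rows could be inspected before anything is dropped. The sample data is made up for illustration and does not come from the problem statement; `is_repeat` and `repeats` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "email": ["a@example.com", "b@example.com", "a@example.com", "d@example.com"],
})

# True for every row whose email already appeared in an earlier row
is_repeat = customers.duplicated(subset="email", keep="first")

# The rows that would be removed when only the first occurrence is kept
repeats = customers[is_repeat]
print(repeats)
```

Here only the row with customer_id 3 is flagged, because its email repeats the one in the first row.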

Solution

import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate rows based on email column
    return customers.drop_duplicates(subset='email', keep='first').reset_index(drop=True)

Explanation

Import Pandas
- We start by importing the pandas library, which provides the DataFrame structure and the methods used in the solution.
Define the Function
- We define a function dropDuplicateEmails that takes a single argument customers, which is a DataFrame containing customer data.
Dropping Duplicate Rows
- We call the drop_duplicates method on the DataFrame customers to remove rows whose email value has already appeared.
- The subset='email' argument restricts the comparison to the email column, so two rows count as duplicates when their emails match, even if the other columns differ.
- The keep='first' argument keeps the first row for each email value and drops the later ones. This is also the default behavior of drop_duplicates.
- The reset_index(drop=True) call renumbers the rows of the result starting from 0 and discards the old index instead of adding it as a new column.
Return the Result
- We return the cleaned DataFrame after dropping duplicate rows based on the email column.
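
As a quick check of the steps above, the sketch below runs the function on a small, made-up DataFrame and compares the index with and without reset_index. The sample rows and the names `sample`, `without_reset`, and `result` are assumptions for illustration, not part of the original problem.

```python
import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the solution above
    return customers.drop_duplicates(subset="email", keep="first").reset_index(drop=True)

# Hypothetical sample data, invented for illustration only
sample = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "email": ["a@example.com", "b@example.com", "a@example.com", "d@example.com"],
})

# Without reset_index, the surviving rows keep their original labels
without_reset = sample.drop_duplicates(subset="email", keep="first")
print(list(without_reset.index))  # [0, 1, 3]

# With reset_index(drop=True), the labels are renumbered from 0
result = dropDuplicateEmails(sample)
print(list(result.index))  # [0, 1, 2]
```

The row with customer_id 3 is dropped because its email repeats the first row's, which leaves a gap in the index unless it is reset.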

