2882 Drop Duplicate Rows

Problem Statement

DataFrame customers

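+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+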

There are some duplicate rows in the DataFrame based on the email column.

Write a solution to remove these duplicate rows and keep only the first occurrence.

The result format is in the following example.

For the whole problem statement, please refer here.

Plans

  • Use pandas to handle the data.

  • Find duplicate rows based on the email column.

  • Keep only the first occurrence of each duplicate.

  • Provide the cleaned DataFrame.
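
The plan can be traced step by step with a boolean mask. The following is a minimal sketch, not the submitted solution: the sample rows are invented for illustration, and `is_repeat` and `cleaned` are hypothetical names. It uses `DataFrame.duplicated` to make the "find" step visible before filtering.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "email": [
        "ann@example.com",
        "bob@example.com",
        "ann@example.com",   # repeats the email of row 0
        "dan@example.com",
    ],
})

# Find: flag rows whose email already appeared in an earlier row.
# With keep="first", the first occurrence itself is not flagged.
is_repeat = customers.duplicated(subset="email", keep="first")

# Keep: select only the rows that are not repeats.
cleaned = customers[~is_repeat]

print(is_repeat.tolist())               # [False, False, True, False]
print(cleaned["customer_id"].tolist())  # [1, 2, 4]
```

Filtering with the negated mask gives the same rows as `drop_duplicates(subset="email", keep="first")`, which does both steps in one call.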

Solution

import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Drop duplicate rows based on email column
    return customers.drop_duplicates(subset='email', keep='first').reset_index(drop=True)
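
To check the function, it can be run on a small DataFrame. The rows below are invented for illustration and are not taken from the problem statement; the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Same function as above, repeated so this snippet is self-contained
    return customers.drop_duplicates(subset='email', keep='first').reset_index(drop=True)

# Hypothetical input: rows 0 and 2 share the same email
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "email": ["ann@example.com", "bob@example.com",
              "ann@example.com", "dan@example.com"],
})

result = dropDuplicateEmails(customers)

print(result["name"].tolist())   # ['Ann', 'Bob', 'Dan']
print(result.index.tolist())     # [0, 1, 2]
print(len(customers))            # 4, the input DataFrame is left unchanged
```

Because drop_duplicates returns a new DataFrame by default, the caller's original customers object is not modified.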

Explanation

  1. Import Pandas

    • We start by importing the Pandas library, which provides data structures and operations for manipulating numerical tables and time series.

  2. Define the Function

    • We define a function dropDuplicateEmails that takes a single argument customers, which is a DataFrame containing customer data.

  3. Dropping Duplicate Rows

    • We use the drop_duplicates method on the DataFrame customers to remove duplicate rows based on the email column.

    • The subset='email' argument specifies that we are looking for duplicates in the email column.

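    • The keep='first' argument keeps the first row for each email value and drops every later row with the same email. This is also the default behavior of drop_duplicates.

    • reset_index(drop=True) renumbers the rows of the result starting from 0. Passing drop=True discards the old index instead of adding it to the DataFrame as a new column.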

  4. Return the Result

    • We return the cleaned DataFrame after dropping duplicate rows based on the email column.
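
The effect of the keep and reset_index arguments described above is easiest to see side by side. This is a small sketch on invented data; the variable names are hypothetical and only keep='first' is what the solution actually uses.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 share an email
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cara"],
    "email": ["ann@example.com", "bob@example.com", "ann@example.com"],
})

# keep="first": the earliest row per email survives
first = customers.drop_duplicates(subset="email", keep="first")
# keep="last": the latest row per email survives
last = customers.drop_duplicates(subset="email", keep="last")
# keep=False: every row whose email is duplicated is dropped
unique_only = customers.drop_duplicates(subset="email", keep=False)

print(first["customer_id"].tolist())        # [1, 2]
print(last["customer_id"].tolist())         # [2, 3]
print(unique_only["customer_id"].tolist())  # [2]

# Dropping rows leaves gaps in the original index labels...
print(last.index.tolist())                          # [1, 2]
# ...and reset_index(drop=True) renumbers them from 0.
print(last.reset_index(drop=True).index.tolist())   # [0, 1]
```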
