- Find duplicate rows: duplicated()
  - Determine which duplicates to mark: keep
  - Specify the columns to find duplicates: subset
- Count duplicate / non-duplicate rows
- Remove duplicate rows: drop_duplicates()
  - keep, subset
  - inplace
- Aggregate based on duplicate elements: groupby()
The following data is used as an example; row #6 is a duplicate of row #3.
import pandas as pd
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# DataFrame.append() was removed in pandas 2.0; use pd.concat() to add the row
df = pd.concat([df, pd.DataFrame([{'name': 'Dave', 'age': 68, 'state': 'TX', 'point': 70}])], ignore_index=True)
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70
Find duplicate rows: duplicated()
The duplicated() method returns a boolean pandas.Series with the duplicate rows marked as True. By default, all columns are used to determine whether a row is a duplicate.
print(df.duplicated())
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# dtype: bool
You can use this pandas.Series to extract the duplicate rows from the original pandas.DataFrame.
print(df[df.duplicated()])
# name age state point
# 6 Dave 68 TX 70
Determine which duplicates to mark: keep
The default value of the argument keep is 'first': the first occurrence is marked False and only the subsequent duplicates are marked True, as in the example above.
If you set keep='last', the last occurrence is marked False instead.
print(df.duplicated(keep='last'))
# 0 False
# 1 False
# 2 False
# 3 True
# 4 False
# 5 False
# 6 False
# dtype: bool
With keep=False, all duplicate rows are marked True.
print(df.duplicated(keep=False))
# 0 False
# 1 False
# 2 False
# 3 True
# 4 False
# 5 False
# 6 True
# dtype: bool
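Combined with boolean indexing, keep=False is convenient for inspecting every occurrence of the duplicated rows at once. A minimal sketch with the sample data:
print(df[df.duplicated(keep=False)])
# name age state point
# 3 Dave 68 TX 70
# 6 Dave 68 TX 70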
Specify the columns to find duplicates: subset
As mentioned above, all columns are used to identify duplicates by default. You can specify which columns to consider with the argument subset.
print(df.duplicated(subset='state'))
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# 5 True
# 6 True
# dtype: bool
It is also possible to specify multiple columns with a list.
print(df.duplicated(subset=['state', 'point']))
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# dtype: bool
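As before, the resulting pandas.Series can be used for boolean indexing. For example, a small sketch extracting the rows whose state value has already appeared earlier in the DataFrame:
print(df[df.duplicated(subset='state')])
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70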
Count duplicate / non-duplicate rows
You can count the number of duplicate rows by counting the True values in the pandas.Series returned by duplicated(), using its sum() method.
print(df.duplicated().sum())
# 1
If you want to count the number of False values (i.e., the number of non-duplicate rows), invert the Series with the negation operator ~ and then count True with sum().
print(~df.duplicated())
# 0 True
# 1 True
# 2 True
# 3 True
# 4 True
# 5 True
# 6 False
# dtype: bool
print((~df.duplicated()).sum())
# 6
You can also count True and False together with value_counts().
print(df.duplicated().value_counts())
# False 6
# True 1
# dtype: int64
Note that the result depends on the argument keep; choose its value according to your purpose.
print(df.duplicated(keep=False).value_counts())
# False 5
# True 2
# dtype: int64
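One caveat: value_counts() only includes keys that actually occur, so indexing its result with [True] raises KeyError on a DataFrame with no duplicates. A small sketch of a safer access pattern using Series.get() with a default:
counts = df.duplicated().value_counts()
print(counts.get(True, 0))
# 1
print(counts.get(False, 0))
# 6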
Remove duplicate rows: drop_duplicates()
You can use duplicated() and the negation operator ~ to remove duplicate rows.
print(df[~df.duplicated()])
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
You can also remove duplicate rows with drop_duplicates().
print(df.drop_duplicates())
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
keep, subset
The arguments keep and subset can be set for drop_duplicates() in the same way as for duplicated().
print(df.drop_duplicates(keep=False))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
print(df.drop_duplicates(subset='state'))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 3 Dave 68 TX 70
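The two arguments can also be combined. A minimal sketch that keeps the last occurrence of each ['state', 'point'] pair:
print(df.drop_duplicates(subset=['state', 'point'], keep='last'))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70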
inplace
By default, a new DataFrame with the duplicate rows removed is returned. With the argument inplace=True, the duplicate rows are removed from the original DataFrame instead.
df.drop_duplicates(subset='state', keep='last', inplace=True)
print(df)
# name age state point
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70
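As a side note, the same result is usually obtained by reassigning the return value instead of using inplace=True. The sketch below also passes ignore_index=True (available in pandas 1.0 and later) to renumber the index from 0; run on the same seven-row df, it produces the same rows with a fresh index:
df = df.drop_duplicates(subset='state', keep='last', ignore_index=True)
print(df)
# name age state point
# 0 Ellen 24 CA 88
# 1 Frank 30 NY 57
# 2 Dave 68 TX 70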
Aggregate based on duplicate elements: groupby()
Use groupby() to aggregate values based on duplicate elements.
In the following example, the mean of the numeric columns age and point is calculated for each group of duplicate values in the state column. Passing numeric_only=True excludes the non-numeric name column (this is required in pandas 2.0 and later, where mean() raises an error on string columns).
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
print(df.groupby('state').mean(numeric_only=True))
# age point
# state
# CA 28.0 83.333333
# NY 27.0 60.500000
# TX 68.0 70.000000
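Selecting the numeric columns explicitly before aggregating gives the same result and behaves the same way across pandas versions; a minimal sketch:
print(df.groupby('state')[['age', 'point']].mean())
# age point
# state
# CA 28.0 83.333333
# NY 27.0 60.500000
# TX 68.0 70.000000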
It is also possible to concatenate the strings in each group, or to collect them into lists.
print(df.groupby('state').agg(
{'name': lambda x: ','.join(x),
'age': 'mean',
'point': 'mean'}))
# name age point
# state
# CA Bob,Charlie,Ellen 28 83.333333
# NY Alice,Frank 27 60.500000
# TX Dave 68 70.000000
print(df.groupby('state').agg(
{'name': list,
'age': 'mean',
'point': 'mean'}))
# name age point
# state
# CA [Bob, Charlie, Ellen] 28 83.333333
# NY [Alice, Frank] 27 60.500000
# TX [Dave] 68 70.000000
In the first example, a lambda expression applies the string method join() to concatenate the names; the second passes the built-in list to collect them instead.
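A lambda is not strictly required here. Since pandas calls the aggregation function with the group's values, the bound method ','.join can be passed directly; a minimal sketch equivalent to the first example:
print(df.groupby('state').agg(
    {'name': ','.join,
     'age': 'mean',
     'point': 'mean'}))
# name age point
# state
# CA Bob,Charlie,Ellen 28 83.333333
# NY Alice,Frank 27 60.500000
# TX Dave 68 70.000000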