- Find duplicate rows: duplicated()
  - Determine which duplicates to mark: keep
  - Specify the columns to find duplicates: subset
- Count duplicate / non-duplicate rows
- Remove duplicate rows: drop_duplicates()
  - keep, subset
  - inplace
- Aggregate based on duplicate elements: groupby()
The following data is used as an example; row #6 is a duplicate of row #3.
import pandas as pd
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# DataFrame.append() was removed in pandas 2.0; use pd.concat() to add the row
df = pd.concat([df, pd.DataFrame([{'name': 'Dave', 'age': 68, 'state': 'TX', 'point': 70}])], ignore_index=True)
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70
Find duplicate rows: duplicated()
The duplicated() method returns a boolean pandas.Series with the duplicate rows marked as True. By default, all columns are used to determine whether a row is a duplicate.
print(df.duplicated())
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# dtype: bool
You can use this pandas.Series to extract the duplicate rows from the original pandas.DataFrame.
print(df[df.duplicated()])
# name age state point
# 6 Dave 68 TX 70
Determine which duplicates to mark: keep
The default value of the argument keep is 'first': the first occurrence is marked False and only the subsequent duplicates are marked True, as in the example above.
If you set keep='last', the last occurrence is marked False instead.
print(df.duplicated(keep='last'))
# 0 False
# 1 False
# 2 False
# 3 True
# 4 False
# 5 False
# 6 False
# dtype: bool
With keep=False, all duplicate rows are marked True.
print(df.duplicated(keep=False))
# 0 False
# 1 False
# 2 False
# 3 True
# 4 False
# 5 False
# 6 True
# dtype: bool
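Combined with boolean indexing, keep=False is convenient for inspecting every occurrence of the duplicated rows at once. A minimal sketch with the sample data:
print(df[df.duplicated(keep=False)])
# name age state point
# 3 Dave 68 TX 70
# 6 Dave 68 TX 70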
Specify the columns to find duplicates: subset
As mentioned above, all columns are used to identify duplicates by default. You can specify which columns to consider with the argument subset.
print(df.duplicated(subset='state'))
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# 5 True
# 6 True
# dtype: bool
It is also possible to specify multiple columns with a list.
print(df.duplicated(subset=['state', 'point']))
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# dtype: bool
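As before, the resulting pandas.Series can be used for boolean indexing. For example, a small sketch extracting the rows whose state value has already appeared earlier in the DataFrame:
print(df[df.duplicated(subset='state')])
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70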
Count duplicate / non-duplicate rows
You can count the number of duplicate rows by counting the True values in the pandas.Series returned by duplicated(), using its sum() method.
print(df.duplicated().sum())
# 1
If you want to count the number of False values (i.e., the number of non-duplicate rows), invert the Series with the negation operator ~ and then count True with sum().
print(~df.duplicated())
# 0 True
# 1 True
# 2 True
# 3 True
# 4 True
# 5 True
# 6 False
# dtype: bool
print((~df.duplicated()).sum())
# 6
You can also count True and False together with value_counts().
print(df.duplicated().value_counts())
# False 6
# True 1
# dtype: int64
Note that the result depends on the argument keep; choose its value according to your purpose.
print(df.duplicated(keep=False).value_counts())
# False 5
# True 2
# dtype: int64
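One caveat: value_counts() only includes keys that actually occur, so indexing its result with [True] raises KeyError on a DataFrame with no duplicates. A small sketch of a safer access pattern using Series.get() with a default:
counts = df.duplicated().value_counts()
print(counts.get(True, 0))
# 1
print(counts.get(False, 0))
# 6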
Remove duplicate rows: drop_duplicates()
You can use duplicated() and the negation operator ~ to remove duplicate rows.
print(df[~df.duplicated()])
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
You can also remove duplicate rows with drop_duplicates().
print(df.drop_duplicates())
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
keep, subset
The arguments keep and subset can be set for drop_duplicates() in the same way as for duplicated().
print(df.drop_duplicates(keep=False))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
print(df.drop_duplicates(subset='state'))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 3 Dave 68 TX 70
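The two arguments can also be combined. A minimal sketch that keeps the last occurrence of each ['state', 'point'] pair:
print(df.drop_duplicates(subset=['state', 'point'], keep='last'))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70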
inplace
By default, a new DataFrame with the duplicate rows removed is returned. With the argument inplace=True, the duplicate rows are removed from the original DataFrame instead.
df.drop_duplicates(subset='state', keep='last', inplace=True)
print(df)
# name age state point
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70
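As a side note, the same result is usually obtained by reassigning the return value instead of using inplace=True. The sketch below also passes ignore_index=True (available in pandas 1.0 and later) to renumber the index from 0; run on the same seven-row df, it produces the same rows with a fresh index:
df = df.drop_duplicates(subset='state', keep='last', ignore_index=True)
print(df)
# name age state point
# 0 Ellen 24 CA 88
# 1 Frank 30 NY 57
# 2 Dave 68 TX 70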
Aggregate based on duplicate elements: groupby()
Use groupby() to aggregate values based on duplicate elements.
In the following example, the mean of the numeric columns age and point is calculated for each group of duplicate values in the state column. Passing numeric_only=True excludes the non-numeric name column (this is required in pandas 2.0 and later, where mean() raises an error on string columns).
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
print(df.groupby('state').mean(numeric_only=True))
# age point
# state
# CA 28.0 83.333333
# NY 27.0 60.500000
# TX 68.0 70.000000
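Selecting the numeric columns explicitly before aggregating gives the same result and behaves the same way across pandas versions; a minimal sketch:
print(df.groupby('state')[['age', 'point']].mean())
# age point
# state
# CA 28.0 83.333333
# NY 27.0 60.500000
# TX 68.0 70.000000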
It is also possible to concatenate the strings in each group, or to collect them into lists.
print(df.groupby('state').agg(
{'name': lambda x: ','.join(x),
'age': 'mean',
'point': 'mean'}))
# name age point
# state
# CA Bob,Charlie,Ellen 28 83.333333
# NY Alice,Frank 27 60.500000
# TX Dave 68 70.000000
print(df.groupby('state').agg(
{'name': list,
'age': 'mean',
'point': 'mean'}))
# name age point
# state
# CA [Bob, Charlie, Ellen] 28 83.333333
# NY [Alice, Frank] 27 60.500000
# TX [Dave] 68 70.000000
In the first example, a lambda expression applies the string method join() to concatenate the names; the second passes the built-in list to collect them instead.
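A lambda is not strictly required here. Since pandas calls the aggregation function with the group's values, the bound method ','.join can be passed directly; a minimal sketch equivalent to the first example:
print(df.groupby('state').agg(
    {'name': ','.join,
     'age': 'mean',
     'point': 'mean'}))
# name age point
# state
# CA Bob,Charlie,Ellen 28 83.333333
# NY Alice,Frank 27 60.500000
# TX Dave 68 70.000000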