Coding Diary.

(데이터분석) 중복된 값 제거하기 본문

Coding/Data analysis

(데이터분석) 중복된 값 제거하기

life-of-nomad 2024. 5. 17. 11:02
728x90
반응형
#import pandas and numpy
import pandas as pd
import numpy as np

#Load small test scores dataframe
test_scores = pd.read_csv('test_scores.csv')

#Make a copy of the dataframe
clean_scores = test_scores.copy()
clean_scores.head()

if_duplicated = clean_scores.duplicated(['Name', 'Age'])
if_duplicated

 

Get duplicated rows

#Access the duplicated rows for duplicates in the Name and Age column
duplicate_rows = clean_scores.loc[clean_scores.duplicated(['Name', 'Age'])
duplicate_rows

#all duplicated rows for Amy Linn
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy

 

Gather information around duplicated rows

#Get the count of duplicated rows
clean_scores.duplicated(['Name', 'Age']).sum()

#Visually inspect the dataframe for any trends in the duplicates
#Are we seeing duplicate rows only for students who are 15 years-old?
clean_scores

 

Determine which duplicated row to remove

#Duplicated rows with some different values
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy

#Load a dataframe where duplicate scores on Test A are wrong
#But all scores (including duplicate ones) on Test B are correct.
multi_test_scores = pd.read_csv('multiple_test_scores.csv')
mulit_test_scores

 

Steps to potentially remediate : 

1) Check with data providers to confirm the data accuracy

2) Remove duplicate data if is incorrect or keep the duplicated data if it is correct.

#Access the duplicated rows for duplicates in the Name and Age column
multi_test_scores[multi_test_scores.duplicated(['Name', 'Age'])]

Steps to potentially remediate:

1) Check with data providers, see example respondse below :

  • Duplicated students'data in "Test A score" is incorrect and incorrect rows should be removed
  • Duplicated students' data in "Test B score" is correct and should be kept

2) Mark the incorrect duplicate values for "Test A score" as NaNs; Or simply data structure by creating a seperate table for the repeated values in Test B score.

 

 

Resolve the duplicated rows

#Remove the values where the duplicates are in the Name and Age columns
#By default, drop_duplicates() keep the first occurrence
remove_dup = clean_scores.drop_duplicates(subset=['Name', 'Age'])
remove_dup

#The following defines keep=last, keeping the last occurrence
clean_scores.drop_duplicates(subset=['Name', 'Age'], keep='last')

remove_dup.duplicated(['Name', 'Age']).sum()

How to drop rows that are neither the first or last occurrence

#Duplicated rows with some diffrent values
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy

row_drop_example = Amy.drop([5])
row_drop_example

 

 

How to convert duplicate values to NaNs

#Access the index of the duplicated rows for duplicates
dupe_index = multi_test_scores[multi_test_scores.duplicated(['Name', 'Age'])].index
dupe_index

#Set duplicated values in Test A Score column to NaNs
mulit_test_scores.loc[dupe_index, 'Test A Score'] = np.nan
#Visually inspect to confirm the operation worked
multi_test_scores

 

728x90
반응형