(데이터분석) 중복된 값 제거하기

250x250

Notice

Recent Posts

Recent Comments

Link

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

영국 척척석사 유학생 일기장👩🏻‍🎓

(데이터분석) 중복된 값 제거하기 본문

코딩공부/Data analysis

(데이터분석) 중복된 값 제거하기

life-of-nomad 2024. 5. 17. 11:02

728x90

#import pandas and numpy
import pandas as pd
import numpy as np

#Load small test scores dataframe
test_scores = pd.read_csv('test_scores.csv')

#Make a copy of the dataframe
clean_scores = test_scores.copy()
clean_scores.head()

if_duplicated = clean_scores.duplicated(['Name', 'Age'])
if_duplicated

Get duplicated rows

#Access the duplicated rows for duplicates in the Name and Age column
duplicate_rows = clean_scores.loc[clean_scores.duplicated(['Name', 'Age'])
duplicate_rows

#all duplicated rows for Amy Linn
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy

Gather information around duplicated rows

#Get the count of duplicated rows
clean_scores.duplicated(['Name', 'Age']).sum()

#Visually inspect the dataframe for any trends in the duplicates
#Are we seeing duplicate rows only for students who are 15 years-old?
clean_scores

Determine which duplicated row to remove

#Duplicated rows with some different values
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy

#Load a dataframe where duplicate scores on Test A are wrong
#But all scores (including duplicate ones) on Test B are correct.
multi_test_scores = pd.read_csv('multiple_test_scores.csv')
mulit_test_scores

Steps to potentially remediate :

1) Check with data providers to confirm the data accuracy

2) Remove duplicate data if is incorrect or keep the duplicated data if it is correct.

#Access the duplicated rows for duplicates in the Name and Age column
multi_test_scores[multi_test_scores.duplicated(['Name', 'Age'])]

Steps to potentially remediate:

1) Check with data providers, see example respondse below :

Duplicated students'data in "Test A score" is incorrect and incorrect rows should be removed
Duplicated students' data in "Test B score" is correct and should be kept

2) Mark the incorrect duplicate values for "Test A score" as NaNs; Or simply data structure by creating a seperate table for the repeated values in Test B score.

Resolve the duplicated rows

#Remove the values where the duplicates are in the Name and Age columns
#By default, drop_duplicates() keep the first occurrence
remove_dup = clean_scores.drop_duplicates(subset=['Name', 'Age'])
remove_dup

#The following defines keep=last, keeping the last occurrence
clean_scores.drop_duplicates(subset=['Name', 'Age'], keep='last')

remove_dup.duplicated(['Name', 'Age']).sum()

How to drop rows that are neither the first or last occurrence

#Duplicated rows with some diffrent values
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy

row_drop_example = Amy.drop([5])
row_drop_example

How to convert duplicate values to NaNs

#Access the index of the duplicated rows for duplicates
dupe_index = multi_test_scores[multi_test_scores.duplicated(['Name', 'Age'])].index
dupe_index

#Set duplicated values in Test A Score column to NaNs
mulit_test_scores.loc[dupe_index, 'Test A Score'] = np.nan

#Visually inspect to confirm the operation worked
multi_test_scores

728x90

저작자표시 (새창열림)

'코딩공부 > Data analysis' 카테고리의 다른 글

(데이터분석) 파이썬으로 데이터 정제하기 (0)	2024.05.19
(데이터분석) 결측 데이터 처리하기 (0)	2024.05.17
(데이터분석) 웹 페이지 스크래핑 연습문제 (0)	2024.05.13
(데이터분석) 파이썬 BeautifulSoup으로 웹 페이지 스크래핑하기 (0)	2024.05.13
(데이터분석) 파이썬의 glob 라이브러리 사용하여 텍스트파일 읽기 (0)	2024.05.10

'코딩공부/Data analysis' Related Articles

영국 척척석사 유학생 일기장👩🏻‍🎓

(데이터분석) 중복된 값 제거하기 본문

(데이터분석) 중복된 값 제거하기

Get duplicated rows

Gather information around duplicated rows

Determine which duplicated row to remove

Resolve the duplicated rows

How to drop rows that are neither the first or last occurrence

How to convert duplicate values to NaNs

'코딩공부 > Data analysis' 카테고리의 다른 글

티스토리툴바