영국 척척석사 유학생 일기장👩🏻🎓
(Data Analysis) Removing Duplicate Values
#import pandas and numpy
import pandas as pd
import numpy as np
#Load small test scores dataframe
test_scores = pd.read_csv('test_scores.csv')
#Make a copy of the dataframe
clean_scores = test_scores.copy()
clean_scores.head()
#Flag rows whose Name and Age both match an earlier row (True = duplicate)
if_duplicated = clean_scores.duplicated(['Name', 'Age'])
if_duplicated
Get duplicated rows
#Access the duplicated rows for duplicates in the Name and Age column
duplicate_rows = clean_scores.loc[clean_scores.duplicated(['Name', 'Age'])]
duplicate_rows
#All rows for Amy Linn, including the first occurrence
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy
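Filtering by name works when the duplicated student is already known. A minimal sketch of a more general approach, assuming a small made-up dataframe (the rows and the Score column are hypothetical, not the contents of test_scores.csv): passing keep=False to duplicated() flags every member of a duplicate group, not only the repeats.

```python
import pandas as pd

# Hypothetical stand-in for test_scores.csv
scores = pd.DataFrame({
    "Name": ["Amy Linn", "Ben Cho", "Amy Linn", "Cara Diaz", "Amy Linn"],
    "Age": [15, 16, 15, 15, 15],
    "Score": [88, 92, 88, 75, 90],
})

# Default keep='first': only the second and later occurrences are flagged
repeats_only = scores[scores.duplicated(["Name", "Age"])]

# keep=False: the first occurrence is flagged as well
all_members = scores[scores.duplicated(["Name", "Age"], keep=False)]

print(repeats_only.index.tolist())  # [2, 4]
print(all_members.index.tolist())   # [0, 2, 4]
```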
Gather information around duplicated rows
#Get the count of duplicated rows
clean_scores.duplicated(['Name', 'Age']).sum()
#Visually inspect the dataframe for any trends in the duplicates
#For example, do duplicate rows appear only for students who are 15 years old?
clean_scores
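Visual inspection gets harder as the table grows. A minimal sketch of counting instead, again on a small made-up dataframe (the rows are hypothetical): it tallies the ages among duplicated rows and the size of each duplicate group.

```python
import pandas as pd

# Hypothetical stand-in for test_scores.csv
scores = pd.DataFrame({
    "Name": ["Amy Linn", "Ben Cho", "Amy Linn", "Cara Diaz", "Amy Linn"],
    "Age": [15, 16, 15, 15, 15],
    "Score": [88, 92, 88, 75, 90],
})

# All rows that belong to a duplicate group (first occurrence included)
dupes = scores[scores.duplicated(["Name", "Age"], keep=False)]

# How many duplicated rows fall into each age?
age_counts = dupes["Age"].value_counts()

# How many rows does each (Name, Age) group contain?
group_sizes = dupes.groupby(["Name", "Age"]).size()

print(age_counts)
print(group_sizes)
```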
Determine which duplicated row to remove
#Duplicated rows with some different values
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy
#Load a dataframe where duplicate scores on Test A are wrong
#But all scores (including duplicate ones) on Test B are correct.
multi_test_scores = pd.read_csv('multiple_test_scores.csv')
multi_test_scores
Steps to potentially remediate:
1) Check with data providers to confirm the data accuracy
2) Remove the duplicate data if it is incorrect, or keep the duplicated data if it is correct.
#Access the duplicated rows for duplicates in the Name and Age column
multi_test_scores[multi_test_scores.duplicated(['Name', 'Age'])]
Steps to potentially remediate:
1) Check with data providers. See the example response below:
- Duplicated students' data in "Test A score" is incorrect, and the incorrect rows should be removed
- Duplicated students' data in "Test B score" is correct and should be kept
2) Mark the incorrect duplicate values for "Test A score" as NaNs, or simplify the data structure by creating a separate table for the repeated values in "Test B score".
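The second option in step 2, a separate table for the repeated Test B values, could look roughly like the sketch below. The dataframe is made up, the column name "Test B Score" is assumed by analogy with "Test A Score", and treating the first Test A value per student as the one to keep is also an assumption.

```python
import pandas as pd

# Hypothetical stand-in for multiple_test_scores.csv
multi = pd.DataFrame({
    "Name": ["Amy Linn", "Amy Linn", "Ben Cho"],
    "Age": [15, 15, 16],
    "Test A Score": [88, 88, 92],
    "Test B Score": [70, 81, 77],
})

key = ["Name", "Age"]

# Test A table: one row per student (assumes the first occurrence is the valid one)
test_a = multi.drop_duplicates(subset=key)[key + ["Test A Score"]].reset_index(drop=True)

# Test B table: every row is kept, because the repeated values are correct
test_b = multi[key + ["Test B Score"]].reset_index(drop=True)

print(test_a)
print(test_b)
```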
Resolve the duplicated rows
#Remove rows that duplicate an earlier row in the Name and Age columns
#By default, drop_duplicates() keeps the first occurrence
remove_dup = clean_scores.drop_duplicates(subset=['Name', 'Age'])
remove_dup
#Passing keep='last' keeps the last occurrence instead
clean_scores.drop_duplicates(subset=['Name', 'Age'], keep='last')
#Confirm that no duplicates remain (should be 0)
remove_dup.duplicated(['Name', 'Age']).sum()
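To see which rows survive under each keep option, here is a small sketch on made-up data (the rows are hypothetical). Besides 'first' and 'last', drop_duplicates() also accepts keep=False, which drops every member of a duplicate group.

```python
import pandas as pd

# Hypothetical stand-in for test_scores.csv
scores = pd.DataFrame({
    "Name": ["Amy Linn", "Ben Cho", "Amy Linn", "Cara Diaz", "Amy Linn"],
    "Age": [15, 16, 15, 15, 15],
    "Score": [88, 92, 88, 75, 90],
})

key = ["Name", "Age"]

keep_first = scores.drop_duplicates(subset=key)               # default keep='first'
keep_last = scores.drop_duplicates(subset=key, keep="last")   # keep the last occurrence
keep_none = scores.drop_duplicates(subset=key, keep=False)    # drop all duplicated rows

print(keep_first.index.tolist())  # [0, 1, 3]
print(keep_last.index.tolist())   # [1, 3, 4]
print(keep_none.index.tolist())   # [1, 3]
```

The original index labels are preserved, so they show exactly which rows were kept.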
How to drop rows that are neither the first nor the last occurrence
#Duplicated rows with some different values
Amy = clean_scores.loc[clean_scores['Name'] == 'Amy Linn']
Amy
#Drop the row with index label 5
row_drop_example = Amy.drop([5])
row_drop_example
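Dropping by a hard-coded index label works for a single known row. As a rough alternative, the sketch below finds "middle" occurrences programmatically on made-up data (the rows are hypothetical): a row is a middle occurrence when it is flagged as a duplicate under both keep='first' and keep='last'.

```python
import pandas as pd

# Hypothetical stand-in for test_scores.csv
scores = pd.DataFrame({
    "Name": ["Amy Linn", "Ben Cho", "Amy Linn", "Cara Diaz", "Amy Linn"],
    "Age": [15, 16, 15, 15, 15],
    "Score": [88, 92, 88, 75, 90],
})

key = ["Name", "Age"]

# Flagged under keep='first' -> not the first occurrence
# Flagged under keep='last'  -> not the last occurrence
is_middle = scores.duplicated(key, keep="first") & scores.duplicated(key, keep="last")

# Keep everything except the middle occurrences
trimmed = scores[~is_middle]

print(trimmed.index.tolist())  # [0, 1, 3, 4]
```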
How to convert duplicate values to NaNs
#Access the index of the duplicated rows for duplicates in the Name and Age columns
dupe_index = multi_test_scores[multi_test_scores.duplicated(['Name', 'Age'])].index
dupe_index
#Set duplicated values in Test A Score column to NaNs
multi_test_scores.loc[dupe_index, 'Test A Score'] = np.nan
#Visually inspect to confirm the operation worked
multi_test_scores
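Instead of inspecting the result by eye, the change can be checked with a count. A minimal sketch on made-up data (the rows and the "Test B Score" column name are assumptions); it also passes the boolean mask straight to .loc, which skips the intermediate index step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for multiple_test_scores.csv (float scores)
multi = pd.DataFrame({
    "Name": ["Amy Linn", "Amy Linn", "Ben Cho"],
    "Age": [15, 15, 16],
    "Test A Score": [88.0, 88.0, 92.0],
    "Test B Score": [70.0, 81.0, 77.0],
})

# True for rows that repeat an earlier Name and Age pair
dupe_mask = multi.duplicated(["Name", "Age"])

# Blank out Test A only; Test B is left untouched
multi.loc[dupe_mask, "Test A Score"] = np.nan

# The number of NaNs should equal the number of duplicated rows
n_missing = multi["Test A Score"].isna().sum()
print(n_missing, dupe_mask.sum())
```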