(Data Analysis) Handling Missing Data
life-of-nomad
2024. 5. 17. 11:30
import pandas as pd
import numpy as np
# Read the data into a DataFrame
df = pd.read_csv('assessment.csv')
# Inspect the data
df.head()
df.describe()
df.info()
df.sample(5, random_state = 70)
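# Find the rows where 'assessment score 2' holds the placeholder '#'
df.loc[df['assessment score 2'].isin(['#'])]
# Replace the placeholder with NaN
df['assessment score 2'] = df['assessment score 2'].replace({'#':np.nan})
df
# Confirm that no '#' placeholders remain, then count the NaNs per column
df.loc[df['assessment score 2'].isin(['#'])]
df.isna().sum()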
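As a side note, the placeholder can also be handled while parsing or in a single conversion step. The sketch below is only an illustration: the inline CSV text is a hypothetical stand-in for assessment.csv (the real file's contents are not shown here), and the names `csv_text`, `raw`, and `parsed` are made up for this example.

```python
import io

import pandas as pd

# Hypothetical stand-in for assessment.csv; the real file may look different.
csv_text = (
    "name,assessment score 1,assessment score 2\n"
    "A,80,75\n"
    "B,65,#\n"
    "C,90,88\n"
)

# Variant A: read as-is, then coerce anything non-numeric (such as '#') to NaN.
raw = pd.read_csv(io.StringIO(csv_text))
raw["assessment score 2"] = pd.to_numeric(raw["assessment score 2"], errors="coerce")

# Variant B: tell read_csv up front that '#' means "missing".
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["#"])

print(raw.isna().sum())
print(parsed.dtypes)
```

Both variants leave the column as float with one NaN. Variant A also turns any other non-numeric string into NaN, which may or may not be what you want.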
Option 1 : drop rows
cleaned_df = df.dropna()
cleaned_df.describe()
cleaned_df.isna().sum()
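Dropping every row that has any NaN can discard a lot of data, so it may help to compare a few `dropna` variants first. This is a minimal sketch on a made-up DataFrame (`scores` and its values are hypothetical, not taken from assessment.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the last-but-one row is entirely missing.
scores = pd.DataFrame({
    "assessment score 1": [80.0, 65.0, np.nan, 90.0],
    "assessment score 2": [75.0, np.nan, np.nan, 88.0],
})

# Drop rows that contain at least one NaN (the default behaviour).
drop_any = scores.dropna()

# Drop rows only when a specific column is missing.
drop_subset = scores.dropna(subset=["assessment score 1"])

# Drop rows only when every column is missing.
drop_all = scores.dropna(how="all")

# Compare how many rows survive each variant.
print(len(scores), len(drop_any), len(drop_subset), len(drop_all))
```

The row counts show how much data each choice costs before you commit to one.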
Option 2 : drop columns
problem_df = pd.read_csv("assessment_problem.csv")
problem_df.head()
problem_df.isna().sum()
problem_df_cleaned = problem_df.drop('assessment score 2', axis=1)
problem_df_cleaned.head()
problem_df_cleaned.isna().sum()
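Dropping a column usually makes sense only when most of its values are missing. One way to decide, sketched below under assumptions, is to look at the share of NaNs per column and drop the columns above a cutoff. The DataFrame `problem` and the 0.5 threshold are hypothetical choices for illustration, not values from assessment_problem.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical data where one column is mostly missing.
problem = pd.DataFrame({
    "assessment score 1": [80.0, 65.0, 70.0, 90.0],
    "assessment score 2": [np.nan, np.nan, np.nan, 88.0],
})

# Share of missing values per column (mean of a boolean mask).
missing_fraction = problem.isna().mean()

# Hypothetical cutoff: drop columns with more than half of the values missing.
max_missing = 0.5
to_drop = missing_fraction[missing_fraction > max_missing].index

trimmed = problem.drop(columns=to_drop)

print(missing_fraction)
print(trimmed.columns.tolist())
```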
Option 3 : impute NaNs
df = pd.read_csv('assessment.csv')
# Replace '#' with NaN
df['assessment score 2'] = df['assessment score 2'].replace({'#':np.nan})
#convert 'assessment score 2' data type from object to float
df['assessment score 2'] = df['assessment score 2'].astype(float)
df.isna().sum()
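# Fill NaNs with each numeric column's mean (numeric_only avoids errors on text columns)
cleaned_df = df.fillna(df.mean(numeric_only=True))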
cleaned_df.isna().sum()
t_df = df.copy()
t_df['assessment score 2'] = t_df['assessment score 2'].fillna(
t_df['assessment score 2'].mean())
t_df.isna().sum()
# A quick check on the stats after imputing the data
cleaned_df.describe()
df.describe()
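What to look for when comparing the two summaries: filling with the mean keeps the column mean the same but shrinks the spread, because the filled values sit exactly at the center. A small sketch with made-up numbers (the Series `s` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing values.
s = pd.Series([60.0, 70.0, np.nan, 80.0, np.nan, 90.0], name="assessment score 2")

# Mean imputation: NaNs are replaced by the mean of the observed values.
filled = s.fillna(s.mean())

# The mean is preserved, while the standard deviation gets smaller.
print("mean before/after:", s.mean(), filled.mean())
print("std  before/after:", s.std(), filled.std())
```

If the reduced variance is a concern, `s.fillna(s.median())` is a common alternative for skewed data, though it also pulls values toward the center.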
Option 4 : create bins
df['assessment score 1'] = pd.cut(df['assessment score 1'], 4)
df['assessment score 2'] = pd.cut(df['assessment score 2'], 4)
df['assessment score 2'].value_counts()
df[df.isnull().any(axis=1)]
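Note that `pd.cut` does not fill anything in: a NaN input stays NaN after binning, and `value_counts()` hides it by default. The sketch below uses a made-up Series, and the bin labels are hypothetical names chosen for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value.
s = pd.Series([10.0, 35.0, np.nan, 60.0, 95.0], name="assessment score 2")

# Four equal-width bins with readable (hypothetical) labels.
binned = pd.cut(s, 4, labels=["low", "mid-low", "mid-high", "high"])

# dropna=False makes the missing bin visible in the counts.
counts = binned.value_counts(dropna=False)

print(binned)
print(counts)
```

If the missing rows should form their own group, one option is to add a category first, for example `binned.cat.add_categories("missing").fillna("missing")`.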