Data Cleaning Techniques using Python

Content

1. Identifying Data Types

list(df.columns[df.dtypes == 'object'])   # text / categorical columns
df.nunique().sort_values()                # distinct-value count per column

Perform operations on numerical data

uniqueCount = df.nunique()
numerical_columns = list(uniqueCount[uniqueCount > 30].keys())   # columns with more than 30 distinct values
df[numerical_columns].corr()

Perform Visualisations on numerical data
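A minimal sketch of such visualisations, assuming matplotlib and seaborn are installed and numerical_columns was built as above:

import matplotlib.pyplot as plt
import seaborn as sns

df[numerical_columns].hist(figsize=(12, 8))            # distribution of each numerical column
plt.tight_layout()
plt.show()

sns.heatmap(df[numerical_columns].corr(), annot=True)  # correlations at a glance
plt.show()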

2. Fixing the rows and columns

2.1 Check Formatting

pd.read_csv() / pd.read_table()            # read_table handles tab-separated files
pd.read_csv('file.csv', sep='|')           # custom delimiter
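For instance, a pipe-delimited export (the file name here is a placeholder) can be loaded and sanity-checked like this:

import pandas as pd

df = pd.read_csv('file.csv', sep='|')   # 'file.csv' is a placeholder path
df.head()                               # eyeball the first rows
df.shape                                # (rows, columns)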

2.2 Fixing rows

df = df.loc[condition]   # keep only rows where the boolean condition is True
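A small sketch of typical row fixes, assuming a hypothetical numeric column 'amount':

import pandas as pd

df = pd.read_csv('file.csv')            # placeholder path
df = df[df['amount'].notna()]           # drop rows missing the key field
df = df[df['amount'] > 0]               # keep only rows satisfying the condition
df = df.reset_index(drop=True)          # re-index after removing rows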

2.3 Fixing columns

df.columns.values[i] = 'Column_name'
df.rename(columns={'old_name': 'new_name'})
df.drop(columns=['col1', 'col2']) / df.drop(['col1', 'col2'], axis=1)
df['avg'] = (df['M1'] + df['M2']) / 2
df['fullname'] = df['firstname'] + ' ' + df['lastname']
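The same column fixes put together on a toy frame (column names are illustrative):

import pandas as pd

df = pd.DataFrame({'firstname': ['Ada'], 'lastname': ['Lovelace'],
                   'M1': [80], 'M2': [90], 'old_name': [0]})
df = df.rename(columns={'old_name': 'new_name'})           # rename
df = df.drop(columns=['new_name'])                         # drop what is not needed
df['avg'] = (df['M1'] + df['M2']) / 2                      # derived numeric column
df['fullname'] = df['firstname'] + ' ' + df['lastname']    # derived text column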

3. Imputing/removing missing values

3.1 Replace Empty values with Nulls

df.replace(r'^\s*$', np.nan, regex=True)    # blank / whitespace-only cells
df.replace(r'^NA$', np.nan, regex=True)     # common text placeholders
df.replace(r'^XX$', np.nan, regex=True)

3.2 Check Nulls

df.columns[df.isna().any()].tolist()
(df.isnull().sum() / len(df) * 100).sort_values(ascending=False)   # percentage of nulls per column

3.3 Removing Rows/Columns

df = df[df['field1'] < Outliers]
df1 = df[~df['target-value'].isnull()].copy()
df.drop(columns=['col1', 'col2']) / df.drop(['col1', 'col2'], axis=1)
df = df.loc[:, df.isnull().mean() < .95]   # drop columns that are at least 95% null

3.4.1 Missing Values Types

3.4.2 Missing Values Treatment

df.fillna(0)
df['field1'].mean()
df['field1'].median()
df['field1'].mode()[0]
df['field'] = df.apply(lambda x: transform(x), axis=1)   # transform() is a user-defined row-level function
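A minimal imputation sketch built from these statistics (the column names and the choice of median/mode are illustrative assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'field1': [1.0, np.nan, 3.0],
                   'category': ['a', None, 'a']})
df['field1'] = df['field1'].fillna(df['field1'].median())         # numeric: median
df['category'] = df['category'].fillna(df['category'].mode()[0])  # categorical: mode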

4. Handling outliers

Q3 = np.percentile(df['field1'], 75)
Q1 = np.percentile(df['field1'], 25)
IQR = Q3 - Q1
Outliers = Q3 + 1.5 * IQR
df = df[df['field1'] < Outliers]
df['field1'].plot(kind='box')
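Sketched end-to-end on toy data; the lower bound is an added assumption (the snippet above only caps the upper side):

import numpy as np
import pandas as pd

df = pd.DataFrame({'field1': [1, 2, 2, 3, 3, 4, 100]})    # one obvious outlier
Q1, Q3 = np.percentile(df['field1'], [25, 75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df = df[(df['field1'] >= lower) & (df['field1'] <= upper)]
df['field1'].plot(kind='box')                             # verify after trimming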

5. Standardising the values

df['field'] = df['field'].apply(lambda x: round(x, 2))
df['field'] = df['field'].apply(lambda x: transform(x))
df['field1'] = pd.to_datetime(df['field1'], format='%d%b%Y:%H:%M:%S.%f')
df['field1'] = df['field1'].str.title()
df['field1'] = df['field1'].str.replace('$', '', regex=False)
df['field1'] = df['field1'].str.strip()

6. Fixing invalid values

df = pd.read_csv('file.csv', encoding='cp1252')
df['field1'] = df['field1'].astype(int)
df[toNumFieldsArray] = df[toNumFieldsArray].apply(pd.to_numeric, errors='coerce', axis=1)
df['field1'].describe()          # check min/max values
df['field1'].plot(kind='box')
df['field'].value_counts()
df['len'] = df['field1'].apply(lambda x: len(str(x)))
df[df['field1'] > df['field2']]  # rows violating an expected relationship

7. Filtering the data

df.drop_duplicates(subset='field1', keep='first', inplace=True)   # or keep='last'
df = df.loc[condition]
df_derived = df.filter(regex='^COMMON_EXP', axis=1)
df['field-group'] = pd.cut(df['field1'],
                           bins=np.linspace(df['field1'].min(), df['field1'].max(), bin_count))
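A concrete illustration of the binning call (the data and bin count are arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame({'field1': [3, 17, 42, 58, 91]})
bin_count = 5
edges = np.linspace(df['field1'].min(), df['field1'].max(), bin_count)
df['field-group'] = pd.cut(df['field1'], bins=edges, include_lowest=True)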

A Backend Engineer working at SquarePanda, currently pursuing Data Science courses.
