Content

  1. Identifying data types
  2. Fixing the rows and columns
  3. Imputing/removing missing values
  4. Handling outliers
  5. Standardising the values
  6. Fixing invalid values
  7. Filtering the data

1. Identifying Data Types

  1. Find Categorical Data
list(df.columns[df.dtypes == 'object'])

However, categorical data can also be stored in numerical form, e.g. days of a month, months (1–12), waist size (24–38). A dtype check alone will miss such columns.
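As a small illustration of that caveat, here is a sketch on made-up data (the column names `city`, `month` and `sales` are hypothetical, not from any real dataset). The `month` column is categorical in meaning but stored as integers, so the dtype-based check does not report it.

```python
import pandas as pd

# Hypothetical toy data: `month` is categorical in meaning but stored as integers.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "month": [1, 2, 2, 12],
    "sales": [250.0, 310.5, 120.0, 475.25],
})

# Same dtype-based check as above: it can only pick up text-like columns.
object_columns = list(df.columns[df.dtypes == "object"])
print(object_columns)

# `month` is reported as a numeric column even though it encodes categories.
month_is_numeric = pd.api.types.is_numeric_dtype(df["month"])
print(month_is_numeric)
```

Whatever the dtype check returns for `city`, `month` will not be in the list, which is why a second criterion is needed.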

  2. Distinguish between Numerical and Categorical Data

df.nunique().sort_values()

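Categorical: as a rule of thumb, a column with 30 or fewer unique values can be treated as categorical. The threshold is a heuristic, so adjust it to the size and nature of the dataset.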
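The unique-count rule can be wrapped in a small helper. This is only a sketch: `split_by_cardinality` and its `threshold` parameter are names made up for illustration, and the data is synthetic.

```python
import pandas as pd

def split_by_cardinality(df, threshold=30):
    """Split column names into (categorical, numerical) by unique-value count."""
    unique_count = df.nunique()
    categorical = list(unique_count[unique_count <= threshold].index)
    numerical = list(unique_count[unique_count > threshold].index)
    return categorical, numerical

# Hypothetical data: 60 rows, `month` cycles through 1..12,
# `reading` has a different value in every row.
df = pd.DataFrame({
    "month": [(i % 12) + 1 for i in range(60)],
    "reading": [i * 0.5 for i in range(60)],
})

categorical_columns, numerical_columns = split_by_cardinality(df)
print(categorical_columns)  # ['month']
print(numerical_columns)    # ['reading']
```

Note that a text column with many distinct values (IDs, names) would also land in the second list, so the unique-count rule is best combined with a dtype check.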

Perform operations on numerical data

Correlations should only be computed on numeric variables.

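# Treat columns with more than 30 unique values as numerical candidates
uniqueCount = df.nunique()
numerical_columns = list(uniqueCount[uniqueCount > 30].keys())
# Keep only numeric dtypes so corr() does not fail on high-cardinality text columns
df[numerical_columns].select_dtypes(include='number').corr()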

Perform Visualisations on numerical data

A scatter plot should always be fed numerical data on both axes.
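One way to enforce this is to check both columns before plotting. The sketch below assumes a hypothetical helper named `scatter_numeric` and made-up columns; the actual drawing uses `DataFrame.plot.scatter`, which needs matplotlib installed.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def scatter_numeric(df, x, y):
    """Draw a scatter plot only if both columns are numeric."""
    bad = [col for col in (x, y) if not is_numeric_dtype(df[col])]
    if bad:
        raise TypeError(f"Non-numeric column(s) for scatter plot: {bad}")
    return df.plot.scatter(x=x, y=y)  # requires matplotlib

# Hypothetical data with one non-numeric column.
df = pd.DataFrame({
    "height": [150, 162, 171, 180],
    "weight": [52.0, 60.5, 68.0, 80.2],
    "gender": ["F", "M", "F", "M"],
})

# Passing a text column is rejected before any plotting happens.
try:
    scatter_numeric(df, "height", "gender")
except TypeError as err:
    print(err)
```

Calling `scatter_numeric(df, "height", "weight")` instead would pass the check and draw the plot.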

2. Fixing the rows and columns

2.1 Check Formatting

  1. Check…

Rohan Dua

A Backend Engineer working at SquarePanda. Currently pursuing Data Science courses.
