Hervé Froc

Data Cleansing: The Unsung Hero of People Analytics.


Data cleansing is a key element in People Analytics and People Data Management. In this article, I’ll briefly explain what data cleansing is and give concrete real-life examples.



Table of Contents
Merge Data Sets
Clean Missing Data
Standardize
Normalize
De-duplicate
Enrich
Validate

What is Data Cleansing?


Data cleansing is the process of identifying incorrect, incomplete, inaccurate or missing parts of data and then modifying, replacing or deleting them. You can also visualize it as an iterative process with the following steps:


Merge Data Sets


Data is usually stored in different systems or files, and often in different formats. Before doing analytics, you often need to merge several datasets together in order to have all the data points necessary for a given analysis.


Below is an example with two tables, Employee and Department: the Employee table has the department code, but it doesn't have the details about the division and the team.
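To make this concrete, here is a minimal sketch in pandas with hypothetical table contents and column names; a left join brings the division and team details from the Department table onto the Employee table:

import pandas as pd

# Hypothetical Employee records: the department code is present,
# but division and team details live in the Department table.
employees = pd.DataFrame({
    "employee_id": ["E001", "E002", "E003"],
    "department_code": ["D10", "D20", "D10"],
})

departments = pd.DataFrame({
    "department_code": ["D10", "D20"],
    "division": ["Sales", "Engineering"],
    "team": ["Field Sales", "Platform"],
})

# A left join keeps every employee, even if a department code has no match.
merged = employees.merge(departments, on="department_code", how="left")
print(merged)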


Clean Missing Data


There are usually two cases of missing data:

  • Missing data in mandatory fields: Mandatory fields like the employee ID are critical, and a missing value would invalidate the integrity of the dataset. In this case we need to drop the record or, if possible, correct it.

  • Missing data in other fields: Missing data in these fields would affect analysis and potentially give wrong insights, so we need to fill in the values as much as possible (see the sketch after this list).
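As a rough illustration of both cases, here is a small pandas sketch with hypothetical field names (employee_id as the mandatory field, gender and department_code as optional ones):

import pandas as pd

df = pd.DataFrame({
    "employee_id": ["E001", None, "E003"],
    "gender": ["F", "M", None],
    "department_code": ["D10", "D20", None],
})

# Mandatory field: drop records whose employee_id is missing
# (or send them back to the source system for correction).
df = df.dropna(subset=["employee_id"])

# Other fields: fill in an explicit placeholder so the gap stays visible in analyses.
df["gender"] = df["gender"].fillna("Unknown")
df["department_code"] = df["department_code"].fillna("UNASSIGNED")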


Standardize Data


The purpose of standardization is to clean disorganized data and give a unified view of it. With data coming from multiple sources, business terms often have different meanings or different definitions, which also leads to confusion among employees inside a company.

Categorical variables like department or team names can have many unique values with the same meaning. This happens especially when data is entered manually and people enter team names using their own definitions.
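As an example, here is a minimal sketch of how such variants could be standardized in pandas, assuming a hypothetical mapping of known spellings to one canonical team name:

import pandas as pd

# Hypothetical variants of the same team name, typical of manual data entry.
df = pd.DataFrame({"team": ["Sales", "sales ", "SALES DEPT", "Slaes"]})

# Basic cleanup first: trim whitespace and unify case.
cleaned = df["team"].str.strip().str.lower()

# Then map known variants (including a common typo) to one canonical label.
canonical = {
    "sales": "Sales",
    "sales dept": "Sales",
    "slaes": "Sales",
}
df["team_standardized"] = cleaned.map(canonical).fillna(df["team"])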


Normalize Data


In statistics, normalization usually refers to adjusting values measured on different scales to a common scale. Data normalization follows a similar concept: the aim is to modify and reorganize our data in order to have a single structure optimized for analytics. (Note that we are not talking about database normalization here; wait for the next article!)


What does this mean concretely? With client data, we usually build or clean hierarchies as well as reorganize the data to match the design of our Data Warehouse.

The Panalyt Data Warehouse is designed specifically for HR analytics and stores data in a structure that optimizes query time. When we integrate client data into our Data Warehouse, we often have to reorganize it. It often happens that client data has a structure that is convenient for a specific analysis but does not fit analysis across all datasets.



This organization is fine if you want to know the score for a specific performance cycle by employee, but it is not usable if you want to visualize all scores across time. In terms of data modeling, the score represents a measure while the date is a dimension; they shouldn't be merged into a single field, or we lose flexibility.
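As an illustration, here is a sketch of that reorganization in pandas, assuming a hypothetical wide layout with one score column per performance cycle; melting it produces one row per employee and cycle, with the score as a measure and the cycle as a dimension:

import pandas as pd

# Hypothetical wide layout: one column per performance cycle.
wide = pd.DataFrame({
    "employee_id": ["E001", "E002"],
    "score_2021_H1": [3.5, 4.0],
    "score_2021_H2": [3.8, 4.2],
})

# Reshape to a long format: one row per employee and cycle.
long = wide.melt(
    id_vars="employee_id",
    var_name="cycle",
    value_name="score",
)
long["cycle"] = long["cycle"].str.replace("score_", "", regex=False)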


De-duplicate Data


Duplicates are a very common occurrence when merging multiple datasets, especially when the relationship between dataset A and dataset B is many-to-many. Before loading data into our DWH, we always check for and remove duplicate records.
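A minimal sketch of this check in pandas, on a hypothetical merge result:

import pandas as pd

# Hypothetical result of a merge that produced duplicate rows.
df = pd.DataFrame({
    "employee_id": ["E001", "E001", "E002"],
    "department_code": ["D10", "D10", "D20"],
})

# Flag duplicates first so they can be reviewed, then drop them.
duplicates = df[df.duplicated(keep=False)]
deduplicated = df.drop_duplicates()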




Enrich Data


Data enrichment is the process of appending information to an existing dataset. It's something we do case by case, when we feel it is necessary and adds value for the customer.

Here's an example of what we did for a client. The client knew the locations where its employees were working but had no clear view of the distribution of its employees by region or by city. So we implemented an RPA service to fetch the addresses of all their stores and merged this information into their original data.
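The sketch below shows the general idea in pandas with hypothetical store and employee data; the address lookup itself (done by the RPA service in the real project) is stubbed as a plain DataFrame:

import pandas as pd

# Store addresses as they might come back from the lookup service (stubbed here).
store_addresses = pd.DataFrame({
    "store_id": ["S01", "S02"],
    "city": ["Tokyo", "Osaka"],
    "region": ["Kanto", "Kansai"],
})

employees = pd.DataFrame({
    "employee_id": ["E001", "E002", "E003"],
    "store_id": ["S01", "S02", "S01"],
})

# Enrich employee records with city and region, then count headcount by region.
enriched = employees.merge(store_addresses, on="store_id", how="left")
headcount_by_region = enriched.groupby("region")["employee_id"].nunique()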



Validate Data


Data validation is an important part of the process: it ensures that cleansed data is consistent and trustworthy. It can be implemented as a series of tests that run the data against a set of validation rules or constraints, such as the rules below.

Example rules:

  • The age of an employee is at least equal to the legal working age.

  • Employee IDs must be unique.
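Here is a minimal sketch of how the two rules above could be checked in pandas before loading the data, with a hypothetical legal working age constant:

import pandas as pd

df = pd.DataFrame({
    "employee_id": ["E001", "E002", "E002"],
    "age": [34, 17, 45],
})

LEGAL_WORKING_AGE = 18  # assumption: depends on the country

violations = []

# Rule 1: every employee is at least the legal working age.
underage = df[df["age"] < LEGAL_WORKING_AGE]
if not underage.empty:
    violations.append(f"{len(underage)} record(s) below the legal working age")

# Rule 2: employee IDs must be unique.
if not df["employee_id"].is_unique:
    violations.append("duplicate employee IDs found")

# In a cleansing pipeline, any violation would block the load into the DWH.
print(violations or "All checks passed")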

Note: Data validation can also be part of a data quality process, but the purpose would be different. In a data cleansing process, we want to ensure we don't put garbage data into our DWH, while a data quality process aims to track the quality of the data. A data quality process will help us identify anomalies and improve our data cleansing.



Last words


Data cleansing is critical to the business; we hear it often, but we don't realize how time-consuming it is. A survey from CrowdFlower showed that data scientists spend 60% of their time cleaning and organizing data.


If you’d like to utilize Panalyt’s data warehouse to do this work for you instead of having to go through the cleansing process yourself, contact us at hello@panalyt.com.