The data cleaning process can be done using different techniques and software. Using spreadsheets such as Excel is one of them.
Spreadsheet tools are quite adequate for small to mid-level data chores. They offer several capabilities to easily sort data, calculate new columns, move and delete columns, and aggregate data.
They also include tasks such as getting rid of extra spaces, selecting and treating all blank cells (missing data), converting numbers stored as text into numbers, removing duplicates, highlighting and correcting errors, and changing text to lower/upper/proper case, deleting all formatting, etc. However, while using Spreadsheets, there are some important risks to be aware of.
Risks of cleaning data in Spreadsheets
Spreadsheets are desktop-class, file-oriented applications that store their entire data contents in volatile RAM while in use and on disk while not in use. It means that the data is stored in RAM between saves and can be lost.
Spreadsheet tools also lack auditing, change control, and meta-data features. Without these features, there is a higher chance of unintended user errors, and hence, caution must be exercised when using spreadsheets.
Unnoticed sorting and paste errors are another risk. Once the data save to disk, undoing the damage and reverting to an earlier version can be very hard, if not impossible. Repeatable processes and automation are also lacking in spreadsheets. Assume you spend hours cleaning a data file, only to repeat nearly all of those hours the next time a refreshed data file arrives.
On data files as small as 50-100MB, Excel on midrange hardware can begin to slow to the point of being unusable. You can be running at redline even if you have the patience to operate in this slow state because crashes and data loss are much more likely!
Furthermore, most people do not bother to check the performance limits of Spreadsheet tools before using them. If you start to notice sluggish performance, it’s good to check the limits to see where you’re at and make sure you’re not going too far.
A single worksheet in Excel can only have 1,048,576 rows and 16,384 columns. Excel, the workbook, and any add-ins that run in the same process share 2 gigabytes (GB) of virtual address space in a 32-bit Excel environment. These limitations do not apply to 64-bit Excel, which can use as much memory as you give it. The address space occupied by a data model can range from 500 to 700 megabytes (MB), but it could be less if other data models and add-ins are loaded.
Meanwhile, Google Spreadsheets have a maximum of 5,000,000 cells and 256 columns per sheet. (This means that if your file has a lot of columns, the rows limit can be as low as 19,231)
Therefore, always remember the following key principles while cleaning data in spreadsheets.
- Always create a backup of the original data in a separate workbook.
- Regularly backup the working file at successive points during collating, cleaning, and analyzing. Save documents with file names that combine date and time (yy-mm-dd-time prefixed allow for files to be sorted by order of creation).
- When integrating or merging different datasets, ensure the data is in a tabular format, with similar data in each column, all rows and columns visible, and no blank rows in the range. Check that there are no subtotals, totals, or other calculated records down the columns. Calculated variables can remain on the right of the data.
- Format the database for readability and easy navigation: Left align text, number right-aligned, the variable title is positioned horizontally, text variables fully visible, column separated by bold lines, header with background colors, numbers comma separated every three digits, etc.
- Start with tasks that do not require columns, such as spell-checking or the Find and Replace function.
- Then, work on tasks that require you to manipulate columns. Follow the general steps for manipulating a column: Add a new column (B) next to the one (A) that needs to be cleaned. Transform the data in column (B). Remove the first column (A), and change the column name from B to A.
- Keep the questionnaire close. As each check is done, a list of issues will be produced. The questionnaires should be consulted to double-check or identify the problems.
- When checking for one type of problem for one site or key informant, verify that the data for the other variables for that case have been entered correctly.
- Look at the values in all the variables and cases for that site, key informant, or enumerator.