Cleaning Web Scraped Data — Aircraft Accidents

Anjani
3 min read · Jul 24, 2019

Last week, I tried my hand at scraping a Wikipedia table using the Beautiful Soup library in Python. You can read about it here and find the code here. Today, I’m attempting to clean this data using the Pandas library.

Before diving into Python, I had seen snippets of code used for various data manipulation and analysis techniques and, frankly, I was scared. But I slowly started my Python journey with the Kaggle micro-course and followed it up with the tedious web scraping activity mentioned above, learning everything as the project went along. Though it looks easy now, I definitely struggled in the beginning. Thankfully, I chose to persevere, and this eventually led me to the Pandas library, which is oh so famous in the Data Science community for data cleaning. People could not praise it enough. And thus began the most fun week I’ve had in a long time!

Pandas

The Pandas library in Python offers data structures and tools for effective data manipulation and analysis. Its biggest advantage is its ability to provide easy access to structured data. The primary data structure in Pandas is the DataFrame, a two-dimensional table with labeled rows and columns. DataFrames are designed to make indexing into structured data easy. You can read up more about Pandas here.
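To make the idea of row and column labels concrete, here is a minimal sketch of a DataFrame and a few ways to index into it. The column names, row labels, and values are hypothetical placeholders and are not taken from the scraped data set.

```python
import pandas as pd

# Hypothetical sample data: names and values are illustrative only.
df = pd.DataFrame(
    {
        "aircraft": ["Type A", "Type B", "Type C"],
        "fatalities": [12, 0, 5],
    },
    index=["row1", "row2", "row3"],  # explicit row labels
)

# Label-based access: a single cell by row label and column label.
cell = df.loc["row2", "fatalities"]

# Position-based access: the first row, returned as a Series.
first_row = df.iloc[0]

# Boolean filtering: keep only rows where fatalities is greater than zero.
nonzero = df[df["fatalities"] > 0]

print(cell)
print(first_row["aircraft"])
print(list(nonzero.index))
```

`loc` works with labels and `iloc` with integer positions, so the same table can be addressed either way depending on which is more convenient.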

Pandas MOOC

For those starting out in Pandas, I highly recommend the Data Analysis with Python course on Coursera, offered by IBM. If you are determined, you can complete the 6-week course in under 2 days like I did. While it can be a bit much, it’s particularly beneficial if you cannot pay the course fee of $40/month. If you complete the course within the 7-day trial period, it’s free and you also get the course certificate and an IBM Digital Badge!

My course completion certificate!

I finished the course at about 2 AM on Monday and couldn’t wait to get started with implementing all those techniques on the data set I created. Amid weekly work commitments, I did manage to learn about more functions and methods in Pandas and attempted to clean the data set to the best of my current abilities.

The Code

My code, though definitely not optimal, does a good job of cleaning the data as I see fit. There are surely better, less redundant ways to write it. I did try to make it as concise as I could, but if you know of a better way to achieve the same results, please let me know. I appreciate any and all critique, as I can only benefit from it.

The code for this is available on my GitHub.

My Notebook from GitHub

Since there are only 549 rows, analyzing data and writing cell-specific code for data cleaning was not too time-consuming. I intend to continue exploring Pandas and as I begin working on larger data sets, I believe my coding skills will improve.
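As a rough illustration of the kind of cleaning a scraped Wikipedia table tends to need, here is a minimal sketch under stated assumptions. It is not the code from the notebook: the column names (`Date`, `Fatalities`), the sample values, and the date format are all hypothetical.

```python
import pandas as pd

# Hypothetical raw rows, mimicking typical scraping residue:
# footnote markers like "[1]", stray whitespace, thousands separators,
# and placeholder text where a value is missing.
raw = pd.DataFrame(
    {
        "Date": ["July 24, 2019", "  March 3, 2001[2]", "unknown"],
        "Fatalities": ["12[1]", "1,024", "?"],
    }
)

clean = raw.copy()

# Strip bracketed footnote markers and surrounding whitespace in every column.
for col in clean.columns:
    clean[col] = clean[col].str.replace(r"\[\d+\]", "", regex=True).str.strip()

# Drop thousands separators, then convert to numbers;
# values that cannot be parsed become NaN instead of raising an error.
clean["Fatalities"] = pd.to_numeric(
    clean["Fatalities"].str.replace(",", "", regex=False), errors="coerce"
)

# Parse dates in an assumed "Month day, year" format; failures become NaT.
clean["Date"] = pd.to_datetime(clean["Date"], format="%B %d, %Y", errors="coerce")

print(clean)
```

Using `errors="coerce"` keeps the unparseable cells visible as missing values, so they can be inspected and fixed one by one, which is manageable for a table of a few hundred rows.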

All in all, my experience with cleaning data has been more fun than people say it is. And they call this the least fun part!

