Web Scraping Wikipedia using Beautiful Soup — Aircraft Accidents

Anjani
4 min read · Jul 17, 2019

I’m an aspiring data analyst, trying my hand at an end-to-end Data Science project. As I enthusiastically began searching for a suitable data set that interested me, I was sadly met with data sets that were too large and ambitious for me, a beginner. Around this time, I stumbled upon the concept of Web Scraping, which enabled me to create my own data set.

Now, I’m a total data science newbie with a good grasp of basic programming, and I decided to learn Python while working on the project. I read up on Python basics and syntax from the Python micro-course on Kaggle and was ready to get into Beautiful Soup for web scraping.

Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML documents; it gives you friendly options to navigate the parsed structure and extract the required information from it.

You can also use Scrapy, a more powerful Python web-scraping framework than Beautiful Soup (BS). I chose BS for this problem because I wanted to scrape a single Wikipedia page, and BS does the intended job effectively while being relatively less complex.
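As a minimal sketch of what that looks like in practice (the HTML string, tag name and class below are just toy values, not from this project):

```python
from bs4 import BeautifulSoup

# A tiny HTML document to parse
html = "<html><body><p class='note'>Hello, soup!</p></body></html>"

# Parse it with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

# Find the first <p> tag with class 'note' and read its text
print(soup.find("p", class_="note").text)  # prints: Hello, soup!
```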

/robots.txt

Web scraping can be done on Wikipedia pages without any legal implications. But if you intend to scrape a website, you must first and foremost check its robots.txt file to avoid any legal issues. You can find this file by appending ‘/robots.txt’ to the website’s domain URL; it contains the website owner’s instructions to all web crawlers and bots. You can read up more about understanding the robots.txt file here.
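If you prefer to check this from code, Python’s standard library ships a robots.txt parser. A quick sketch (the user agent ‘*’ simply means “any crawler”):

```python
from urllib.robotparser import RobotFileParser

# Point the parser at Wikipedia's robots.txt and download it
parser = RobotFileParser()
parser.set_url("https://en.wikipedia.org/robots.txt")
parser.read()

# Ask whether a generic crawler ('*') may fetch the target article
page = ("https://en.wikipedia.org/wiki/"
        "List_of_aircraft_accidents_and_incidents_resulting_in_at_least_50_fatalities")
print(parser.can_fetch("*", page))  # True means the page is allowed
```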

Inspecting the Wiki page

My Wikipedia page of choice for web scraping is the List of aircraft accidents and incidents resulting in at least 50 fatalities. This article lists 549 incidents in a table with the following columns:

1. Total deaths
2. Crew deaths
3. Passenger deaths
4. Ground staff deaths
5. Survivors indicator
6. Type of accident
7. Incident
8. Aircraft
9. Location
10. Phase
11. Airport
12. Distance
13. Date

Right-clicking on the table and selecting ‘Inspect’ opens up the page’s HTML, where we can observe that the table containing the required details has the class attribute ‘wikitable sortable’. It is interesting to note that the text in the first 5 columns sits inside table header (<th>) tags, while the text from the remaining columns sits inside table data (<td>) tags.

My approach is to select the required table using its class attribute, extract the data from the first 5 columns into one data frame and the data from the remaining columns into another, and then concatenate the two into a single data frame that can be exported to an Excel file.
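Before writing the full script, here is a quick sketch to confirm that the table can be picked up by its class and that the <th>/<td> split really holds (the CSS selector and row index are my assumptions, not taken from the original code):

```python
import requests
from bs4 import BeautifulSoup

URL = ("https://en.wikipedia.org/wiki/"
       "List_of_aircraft_accidents_and_incidents_resulting_in_at_least_50_fatalities")

# Fetch the page and parse its HTML
soup = BeautifulSoup(requests.get(URL).text, "html.parser")

# Pick out the incidents table by its class attribute
table = soup.select_one("table.wikitable.sortable")

# Inspect the first data row: the leading columns are <th> cells,
# the remaining ones are <td> cells
row = table.find_all("tr")[1]
print(len(row.find_all("th")), "th cells,", len(row.find_all("td")), "td cells")
```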

The Code

This code is written to the best of my current knowledge of Python. There will surely be better versions that are simpler, shorter and more efficient. As I continue learning, I intend to keep editing this code. If you know of a better way, please let me know below. All suggestions are most welcome.

I’ve included comments before each line to explain the desired action.

The steps, in order, are:

1. Import all required libraries
2. Use the requests and Beautiful Soup libraries to fetch and parse the page content
3. Extract all tables and select the desired table
4. Extract the data from the first 5 columns into separate lists, zip them and convert them to a data frame
5. Extract the data from the remaining columns and follow the same procedure as above
6. Concatenate the two data frames along the columns using pandas
7. Export the final data frame as an Excel document
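Putting those steps together, here is a simplified sketch of how the script might look (the exact version is on my GitHub). The column names, output file name and row-filtering check below are placeholders, and I collect whole rows instead of zipping per-column lists, which gives the same result. Exporting to .xlsx also needs the openpyxl package installed.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# 1. Fetch the Wikipedia page and parse its HTML
URL = ("https://en.wikipedia.org/wiki/"
       "List_of_aircraft_accidents_and_incidents_resulting_in_at_least_50_fatalities")
soup = BeautifulSoup(requests.get(URL).text, "html.parser")

# 2. Select the table of incidents by its class attribute
table = soup.select_one("table.wikitable.sortable")

# 3. Walk the data rows: the first 5 columns sit in <th> cells,
#    the remaining 8 columns sit in <td> cells
th_rows, td_rows = [], []
for row in table.find_all("tr")[1:]:               # skip the header row
    th_cells = [c.get_text(strip=True) for c in row.find_all("th")]
    td_cells = [c.get_text(strip=True) for c in row.find_all("td")]
    if len(th_cells) == 5 and len(td_cells) == 8:  # keep only well-formed rows
        th_rows.append(th_cells)
        td_rows.append(td_cells)

# 4. Build one data frame per group of columns (column names are placeholders)
deaths_df = pd.DataFrame(
    th_rows,
    columns=["Total deaths", "Crew deaths", "Passenger deaths",
             "Ground deaths", "Survivors"],
)
details_df = pd.DataFrame(
    td_rows,
    columns=["Type", "Incident", "Aircraft", "Location",
             "Phase", "Airport", "Distance", "Date"],
)

# 5. Concatenate along the columns and export to an Excel file
accidents_df = pd.concat([deaths_df, details_df], axis=1)
accidents_df.to_excel("aircraft_accidents.xlsx", index=False)
```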

You can find the entire code on my GitHub.

Also, extracting the data from the ‘Location’ column as displayed in the Wiki table was too complex for me. I’d appreciate any suggestions you have on it.

Thus ends my first web scraping project. I cannot wait to begin scraping a website next!
