7 - Scraping the interwebs
Imagine you want to extract some information from a website. Let’s say you need to know the weather at lots of airports, immediately. I’m thinking ‘Airplane!’ set in 2018 with on-board Wi-Fi: “We need to land this plane where there isn’t a storm. And don’t call me Shirley!”
The most immediate data comes from a national agency: the National Weather Service.
If Shirley wanted to check the weather at my local airport (Midway, on the South Side of Chicago), he would need to type its code in and click ‘Get METAR data’, as shown:
But Leslie Nielsen ain’t got time to be clicking around on a website that looks like I built it for a class project! He needs info on ALL 19,299 airports! NOW!
Solution: Web Scraping
Shirley can automate this process of clicking around and gathering the data so he can focus on important things like inflating the pilot.
Shirley could do a bit of sleuthing and realize that there is a pattern to the URLs of the pages he is taken to:
So he can now just enlist a ‘headless browser’ (a browser driven by code, with no window to click around in) to visit the page for each airport code.
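In Python, the fetching step can be sketched with the standard library alone. The URL pattern below is a placeholder (the real one is whatever shows up in your address bar after a manual search), and for a plain static page like this a simple HTTP request does the job of a full headless browser:

```python
from urllib.request import urlopen

# Placeholder pattern -- substitute the real one found by sleuthing the
# address bar on the National Weather Service site.
BASE_URL = "https://weather.example.gov/metar?station={code}"

def metar_url(code):
    """Build the METAR page URL for a four-letter station code."""
    return BASE_URL.format(code=code)

def fetch_page(code):
    """Download the raw HTML for one airport's page."""
    with urlopen(metar_url(code), timeout=10) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    # Loop fetch_page(code) over all 19,299 codes instead of three.
    for code in ["KMDW", "KORD", "KLAX"]:
        print(metar_url(code))
```

That is the whole “clicking around” part automated; what comes back, though, is a wall of HTML, not just the weather.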
However, he is only interested in the weather data (which I have highlighted), not the hundred-odd links on this page.
HyperText Markup Language (HTML) is the standard language in which websites are written. I am actually writing this blog post in HTML to get accustomed to it! If you want to get a real scare, take a look at this very web page in HTML. In Chrome: View -> Developer -> View Source. It’s like the end of ‘Insidious’. How the hell do we make sense of this?
If you are using Python, you can use a package called Beautiful Soup.
Not quite as cool as Andy Warhol, but it allows you to navigate the HTML file and extract the data you need!
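Here is a minimal sketch of that extraction, run on a made-up page snippet. The tags and class names are assumptions for illustration; on the real page you would use View Source to find where the METAR line actually lives:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A tiny stand-in for the page Shirley downloads: one METAR observation
# buried among links. The real page's structure will differ.
html = """
<html><body>
  <a href="/somewhere">one of the hundred-odd links</a>
  <code class="metar">KMDW 121753Z 18010KT 10SM FEW250 24/13 A3001</code>
  <a href="/elsewhere">another link we do not care about</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Navigate straight to the element holding the observation and pull its text,
# ignoring all the surrounding markup and links.
metar = soup.find("code", class_="metar").get_text()
print(metar)  # just the raw METAR string
```

Point it at the fetched page for each of the 19,299 codes and Shirley has his storm map before the autopilot even finishes inflating.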
If you, like Shirley, have a task this week that involves clicking through a bunch of websites and extracting information: come and find me; I would be happy to do some web scraping to save you hours! Leave a comment on this post, or Contact me.