7 - Scraping the interwebs

Why?

Imagine you want to extract some information from a website. Let's say you really need to know the weather at lots of airports immediately. I'm thinking 'Airplane' set in 2018 with on-board Wi-Fi: "We need to land this plane where there isn't a storm. And don't call me Shirley!"

The most immediate data comes from a national agency: the National Weather Service. If Shirley wanted to check what the weather was like at my local airport (Midway, on the South Side of Chicago), he would need to type in its code and click Get METAR data, as shown:

[Screenshot: AWC - ADDS METARs]

But Leslie Nielsen ain’t got time to be clicking around on a website that looks like I built it for a class project! He needs info on ALL 19,299! NOW!

Solution: Web Scraping

Shirley can automate this process of clicking around and gathering the data so he can focus on important things like inflating the pilot.

URLs

Shirley could do a bit of sleuthing and realize that there is a pattern to the URLs of the pages he is taken to:

http://www.aviationweather.gov/metar/data?ids=Kmdw&format=raw&date=0&hours=0

So he can now just enlist a 'headless browser' to go to the page for each airport code.

[Screenshot: the page returned for that URL]

However, he is only interested in the weather data (which I have highlighted), not the hundred-odd links on this page.
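To make the URL pattern concrete, here is a minimal sketch in Python that builds one URL per airport code. The function name `metar_url` and the short list of codes are my own illustrative choices, and the URL is simply the one shown above, so it may no longer resolve today.

```python
from urllib.parse import urlencode

# Base address taken from the URL pattern shown above.
BASE_URL = "http://www.aviationweather.gov/metar/data"


def metar_url(airport_code):
    """Build the page URL for one airport code, following the pattern above."""
    params = {"ids": airport_code, "format": "raw", "date": 0, "hours": 0}
    return f"{BASE_URL}?{urlencode(params)}"


# A few example codes for illustration; the full list would come from elsewhere.
airport_codes = ["KMDW", "KORD", "KJFK"]
urls = [metar_url(code) for code in airport_codes]

for url in urls:
    print(url)
```

Only the `ids` value changes from airport to airport, which is what makes the whole job automatable.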

HTML

HyperText Markup Language (HTML) is the standard language in which websites are written. I am actually writing this blog post in HTML to get accustomed to it! If you want to get a real scare, take a look at this very web page in HTML. In Chrome: View -> Developer -> View Source. It's like the end of 'Insidious'. How the hell do we make sense of this?

Beautiful Soup

If you are using Python, you can use a package called Beautiful Soup.

[Image: Campbell's soup]

Not quite as cool as Andy Warhol, but it allows you to navigate the HTML file and extract the data you need!
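Here is a rough sketch of what that navigation can look like. The HTML below is a made-up stand-in for a real page: I am assuming the weather report sits inside a `<code>` tag, and the report string itself is invented for illustration, so the real site's layout will need its own selector.

```python
from bs4 import BeautifulSoup

# Made-up sample page: the tag layout and the weather string are both
# invented for illustration and are NOT copied from the real site.
SAMPLE_HTML = """
<html><body>
  <a href="/home">Home</a>
  <a href="/help">Help</a>
  <div id="weather"><code>KMDW 012353Z 27010KT 10SM CLR M02/M11 A3012</code></div>
</body></html>
"""


def extract_metar(html):
    """Return the text of the first <code> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("code")
    return tag.get_text(strip=True) if tag is not None else None


print(extract_metar(SAMPLE_HTML))
```

The links on the page are skipped entirely; only the one tag we asked for comes back.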

Requests
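
Requests is a Python package for making HTTP requests, so it can handle the "go to the page" part of the job. A minimal sketch, where `fetch_page` is a hypothetical helper name of my own and the ten-second timeout is an arbitrary choice:

```python
import requests


def fetch_page(url, session=None, timeout=10):
    """Download one page and return its HTML as text."""
    # Reuse a session if one is passed in; otherwise create a fresh one.
    http = session or requests.Session()
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text
```

Calling `fetch_page` with the Midway URL from earlier should hand back the raw HTML of that page, assuming the address still works, ready to be passed to an HTML parser.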

If you, like Shirley, have a task to do this week which involves clicking through a bunch of websites and extracting information, come and find me! I would be happy to do some web scraping to save you hours. Leave a comment on this post, or contact me.

© 2023 L Warner