The Requests library is a vital addition to your data science toolkit. It’s a simple yet powerful HTTP library, which means you can use it to access web pages.
We call it The Farm because you’ll be using it to get the raw ingredients (i.e. raw HTML) for your dishes (i.e. usable data).
Its simplicity is definitely its greatest strength. It’s so easy to use that you could jump right in without reading documentation.
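For example, grabbing the raw HTML of a page takes just a couple of lines (a minimal sketch; the URL here is only a placeholder):

```python
import requests

# Fetch a page and get its raw HTML back
response = requests.get("https://example.com")

print(response.status_code)   # 200 means the request succeeded
print(response.text[:200])    # the first 200 characters of raw HTML
```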
After you have your ingredients, now what? Now you make them into a stew… a beautiful stew.
Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.
Beautiful Soup’s default parser comes from Python’s standard library. It’s flexible and forgiving, but a little slow. The good news is that you can swap out its parser with a faster one if you need the speed.
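For instance, once you’ve installed the lxml package, switching parsers is a one-word change (a quick sketch using a made-up HTML snippet):

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><p>Hello, stew!</p></body></html>"

soup_default = BeautifulSoup(html_doc, "html.parser")  # stdlib parser: forgiving, but slower
soup_fast = BeautifulSoup(html_doc, "lxml")            # lxml parser: requires `pip install lxml`
```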
One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
In addition, BS4 can help you navigate a parsed document and find what you need. This makes it quick and painless to build common applications. For example, if you wanted to find all the links in the web page we pulled down earlier, it’s only a few lines:
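Here’s one way it could look (a minimal sketch; the URL is a placeholder standing in for the page we fetched earlier):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# find_all() returns every <a> tag; we pull out each link's href attribute
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```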
lxml is a high-performance, production-quality HTML and XML parsing library. We call it The Salad because you can rely on it to be good for you, no matter which diet you’re following.
Among all the Python web scraping libraries, we’ve enjoyed using lxml the most. It’s straightforward, fast, and feature-rich.
Despite all that power, it’s quite easy to pick up if you have experience with either XPath or CSS selectors. Its raw speed and power have also helped it become widely adopted in the industry.
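Here’s a taste of how it works (a rough sketch with an inline HTML snippet; the CSS-selector call needs the separate cssselect package):

```python
from lxml import html

# Parse an HTML string into an element tree
tree = html.fromstring("<html><body><a href='/about'>About</a></body></html>")

# XPath: grab every href attribute on the page
print(tree.xpath("//a/@href"))

# CSS selectors work too (requires `pip install cssselect`)
print([a.get("href") for a in tree.cssselect("a")])
```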
Beautiful Soup vs lxml
Historically, the rule of thumb was:
If you need speed, go for lxml.
If you need to handle messy documents, choose Beautiful Soup.
Yet this distinction no longer holds: Beautiful Soup now supports using the lxml parser under the hood (as shown earlier), and lxml can fall back on Beautiful Soup for messy documents via its soupparser module. In other words, you can mix and match to get the best of both worlds.
Sometimes, you do need to go to a restaurant to eat certain dishes. The farm is great, but you can’t find everything there.
Likewise, sometimes the Requests library is not enough to scrape a website. Some sites out there use JavaScript to serve content. For example, they might wait until you scroll down on the page or click a button before loading certain content.
Other sites may require you to click through forms before seeing their content. Or select options from a dropdown. Or perform a tribal rain dance…
For these sites, you’ll need something more powerful. You’ll need Selenium (which can handle everything except tribal rain dancing).
Selenium is a web driver, i.e. a tool that automates browsers. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right?
It also comes with Python bindings for controlling it right from your application. This makes it a breeze to integrate with your chosen parsing library.
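Here’s the basic flow (a hedged sketch: the URL and selector are placeholders, and you’ll need Chrome installed with a matching driver):

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a real Chrome window and load a page
driver = webdriver.Chrome()
driver.get("https://example.com")

# Interact with the page, e.g. click the first link (which may navigate away)
driver.find_element(By.TAG_NAME, "a").click()

# Hand the fully rendered HTML to your parsing library of choice
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title)

driver.quit()
```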
Ok, we covered a lot just now. You’ve got Requests and Selenium for fetching HTML/XML from web pages. Then, you can use Beautiful Soup or lxml to parse it into useful data.
But what if you need more than that? What if you need a complete spider that can crawl through entire websites in a systematic way?
Introducing: Scrapy! Scrapy is technically not even a library… it’s a complete web scraping framework. That means you can use it to manage requests, preserve user sessions, follow redirects, and handle output pipelines.
It also means you can swap out individual modules with other Python web scraping libraries. For instance, if you need to insert Selenium for scraping dynamic web pages, you can do that.
If you need to reuse your crawler, scale it, manage complex data pipelines, or cook up some other sophisticated spider, then Scrapy was made for you.
To spin up a new project, all it takes is one command:

```
scrapy startproject <proj_name>
```
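From there, a basic spider is only a handful of lines. Here’s an illustrative sketch (the spider name, URL, and output field are all made up):

```python
import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield every link on the page as a structured item
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}
```

Drop this into your project and run it with `scrapy crawl links`, and Scrapy handles the requests, scheduling, and output pipeline for you.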