Free Springer books, hurry up!
Springer is letting everyone download ~500 textbooks for free during the Covid-19 outbreak, and here is how I grabbed them with a bit of code. Read on...
For those who haven’t read the news yet, here it is: https://www.springernature.com/gp/librarians/news-events/all-news-articles/industry-news-initiatives/free-access-to-textbooks-for-institutions-affected-by-coronaviru/17855960
I only learned about this through one of my friends’ stories on Facebook, and he was kind enough to also send me a PDF file with all the links I could use to download the books. You can also find it here: https://drive.google.com/file/d/1Of4RomN9qX_ufIFW86N5_LF8ivwfV2vC/view
There are 408 books on the list, and that’s a lot. I have neither the time nor the patience to download them one by one. Hence, it’s time to start my IDE and write some code to make it happen.
Set up the environment
I’m using Miniconda to manage my virtual environments, but you can use whatever you like. Create a new environment; I call mine springer-download.
After activating the environment, I installed the following libraries:
- PyPDF2: library to parse the PDF file that I have from my friend.
- Selenium: library to drive a headless Chrome browser.
- requests: library to download a book given its URL.
Besides, I also need to download chromedriver from here and store it in the same project folder. This file is needed for Selenium to control my Chrome browser.
Break down the code
I’ll try to explain all the functions I have so that you can reuse them in any of your future projects. However, if you just want to check the final code, here is the GitHub repo.
Parse the PDF and extract the books’ URLs
PyPDF2 is my go-to library when I need to extract text from a PDF file. However, if you need to work with scanned PDF files (where the content is actually images), you will need a different library. In my case, the PDF file I got from my friend is a simple text-based PDF, so it’s straightforward:
I looped through all the pages, converted each to text, and combined them into one large text variable. The last step is to use Python’s regular expression library to extract all the books’ webpage URLs from it.
Utility function to download a book, given its URL
requests is the best Python HTTP client library, and here I used its stream feature to download a file.
Utility function to select a Selenium element, given a timeout
Okay, so here’s a fact: when you work with web crawlers and Selenium, it’s best to manage elements with timeouts. Sometimes the page is slow, and if you try to access an element immediately, the code will fail — a bare find_element call raises an exception whenever the element hasn’t rendered yet.
That’s why I have this helper function:
Utility function to start Chrome browser
For this simple project, I used the minimum setup for my Chrome browser:
Function to download a book, given its Springer page
The PDF file I have stores links to the books’ webpages, not their direct download URLs. So we will write another function that does the following jobs:
- Extract the book title from the webpage; we’ll use this as our file name.
- Find the PDF link and use the _download_file function above to download it.
- Find the EPUB link (if available) and download it the same way.
Glue all into the main function
Given all the functions above, it’s quite simple to put the logic together. We will need to:
- Extract all the book-page URLs from the PDF file.
- Fire up a Chrome browser in headless mode.
- Loop through all the pages, and download the books.
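The steps above can be sketched as follows (the helper names `extract_book_urls`, `make_driver`, and `download_book` are my own naming for the pieces described earlier; the PDF file name is an assumption):

```python
def main() -> None:
    # 1. Pull every book-page URL out of the PDF (file name assumed).
    urls = extract_book_urls("springer_free_books.pdf")
    # 2. Fire up a headless Chrome browser.
    driver = make_driver()
    try:
        # 3. Visit each page and download the PDF/EPUB.
        for i, url in enumerate(urls, start=1):
            print(f"[{i}/{len(urls)}] {url}")
            download_book(driver, url)
    finally:
        driver.quit()  # always release the browser, even on errors
```

Call `main()` to kick off the whole run.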
Final thought
It took me about 20 minutes to write this code, and my laptop then ran for another 2 hours to download all the books. Had I downloaded them manually at roughly 40 seconds per book, the 408 books would have cost me about 4.5 hours straight.
There are other ways to do this same task (e.g., use the Springer CMS API or download the Excel file with similar information). However, in this article, I just wanted to show you a generic method that can be applied to other websites as well. I hope you can grab the idea and use it in your own projects. And in case you missed it, here is the GitHub repo again.
Cheers, ❤️