Free Springer books, hurry up!

Springer is letting everyone download ~500 textbooks for free during the Covid-19 outbreak, and here is how I downloaded them with a bit of code. Read on...

Hai Le Gia
3 min read · Apr 26, 2020

For those who haven’t read the news yet, here it is: https://www.springernature.com/gp/librarians/news-events/all-news-articles/industry-news-initiatives/free-access-to-textbooks-for-institutions-affected-by-coronaviru/17855960

I only learned about this through a friend’s story on Facebook, and he was kind enough to also send me the PDF file with all the links I could use to download. You can also find it here: https://drive.google.com/file/d/1Of4RomN9qX_ufIFW86N5_LF8ivwfV2vC/view

There are 408 books on the list; that’s huge, and I have neither the time nor the patience to download them one by one. So it was time to fire up my IDE and write some code to make it happen.

Set up the environment

I’m using Miniconda to manage my virtual environments, but you can use whatever tool you prefer to create a new one. I call mine springer-download.

After activating the environment, I installed the following libraries:

  1. PyPDF2: to parse the PDF file I got from my friend.
  2. Selenium: to drive a headless Chrome browser.
  3. requests: to download a book given its URL.

Besides the libraries, I also needed to download chromedriver from here and store it in the same project folder. This file is required for Selenium to drive my Chrome browser.

Break down the code

I’ll try to explain all the functions I have so that you can reuse them in your future projects. However, if you just want to check the final code, here is the GitHub repo.

Parse the PDF and extract the books’ URLs

PyPDF2 is my go-to library for extracting text from a PDF file. However, if you need to work with scanned PDFs (where the content is actually images), you will need a different library. In my case, the PDF file I got from my friend is a simple text PDF, so extracting the text is straightforward.
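Here is a minimal sketch of that step, using the PyPDF2 1.x API that was current at the time (the function name is mine, and the regex is an assumption about how the links appear in the PDF):

```python
import re

from PyPDF2 import PdfFileReader  # PyPDF2 1.x API; later releases renamed these methods


def _extract_book_urls(pdf_path):
    """Concatenate the text of every page, then regex out the Springer links."""
    reader = PdfFileReader(pdf_path)
    large_text = ""
    for page_number in range(reader.getNumPages()):
        large_text += reader.getPage(page_number).extractText()
    # The exact pattern depends on how the links are written in the PDF;
    # this just grabs anything that looks like a link.springer.com URL.
    return re.findall(r"https?://link\.springer\.com/\S+", large_text)
```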

I looped through all the pages, converted each to text, and combined them into one large text variable. The last step is to use Python’s regular-expression library to extract all the books’ webpage URLs from it.

Utility function to download a book given its URL

requests is the best Python library for HTTP clients, and here I used the stream feature to download a file.
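A sketch of that helper, assuming this signature for the _download_file function referenced later (the chunk size is an arbitrary choice):

```python
import requests


def _download_file(url, output_path):
    """Stream the response to disk so large books never sit fully in memory."""
    with requests.get(url, stream=True, allow_redirects=True) as response:
        response.raise_for_status()
        with open(output_path, "wb") as output_file:
            for chunk in response.iter_content(chunk_size=8192):
                output_file.write(chunk)
```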

Utility function to select a Selenium element, given a timeout

Okay, so here’s a fact: when you work with web crawlers and Selenium, it’s best to access elements with timeouts. Sometimes the web page is slow, and if you try to access an element immediately, the code will fail. For example, this is not a good snippet:
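(A sketch of the anti-pattern; the selector and variable names are made up for illustration.)

```python
# Bad: find_element_* raises NoSuchElementException if the element
# hasn't rendered yet, so a slow page crashes the crawler.
browser.get(book_url)
download_link = browser.find_element_by_css_selector("a.download-link")
download_link.click()
```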

That’s why I have this helper function:
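(A minimal version, built on Selenium’s WebDriverWait; the function name and default timeout are my placeholders.)

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def _get_element(browser, css_selector, timeout=10):
    """Wait up to `timeout` seconds for the element; return None instead of raising."""
    try:
        return WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
    except TimeoutException:
        return None
```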

Utility function to start Chrome browser

For this simple project, I used the minimum setup for my Chrome browser:
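(A sketch in the Selenium 3 style that was current in 2020; the flags shown are a typical minimal set.)

```python
from selenium import webdriver


def _start_browser():
    """Start headless Chrome with the chromedriver binary from the project folder."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    # Selenium 3 style; Selenium 4 replaced executable_path with a Service object.
    return webdriver.Chrome(executable_path="./chromedriver", options=options)
```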

Function to download a book, given its Springer page

The PDF file I have stores links to the books’ webpages, not their direct download URLs. So we will write another function that does the following jobs:

  1. Extract the book title from the webpage; we’ll use this as the file name.
  2. Find the PDF link and use the _download_file function above to download it.
  3. Find the EPUB link (if available) and use the _download_file function above to download it.
The full function is in this gist: https://gist.github.com/hailg/cc1dd4e7296f900a5405d28a783522a5
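For reference, here is a condensed sketch of those three steps, reusing the helpers above. The CSS selectors are assumptions about the 2020 layout of Springer book pages, so check the gist for the real ones:

```python
import os


def download_book(browser, book_url, output_dir="books"):
    """Open a book's Springer page, read the title, then grab PDF and (optionally) EPUB."""
    os.makedirs(output_dir, exist_ok=True)
    browser.get(book_url)
    title_element = _get_element(browser, "div.page-title > h1")  # selector is a guess
    if title_element is None:
        return
    title = title_element.text.strip().replace("/", "-")  # keep the file name filesystem-safe
    pdf_link = _get_element(browser, "a.test-bookpdf-link")  # selector is a guess
    if pdf_link is not None:
        _download_file(pdf_link.get_attribute("href"), os.path.join(output_dir, title + ".pdf"))
    epub_link = _get_element(browser, "a.test-bookepub-link")  # not every book has an EPUB
    if epub_link is not None:
        _download_file(epub_link.get_attribute("href"), os.path.join(output_dir, title + ".epub"))
```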

Glue it all together in the main function

Given all the functions above, it’s quite simple to put the logic together. We will need to:

  1. Extract all the book page URLs from the PDF file.
  2. Fire up a Chrome browser in headless mode.
  3. Loop through all the pages, and download the books.
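Put together, the main function is just a loop over the extracted URLs (the PDF file name below is a placeholder for wherever you saved the list):

```python
def main():
    book_urls = _extract_book_urls("springer_free_books.pdf")  # placeholder file name
    browser = _start_browser()
    try:
        for book_url in book_urls:
            download_book(browser, book_url)
    finally:
        browser.quit()


if __name__ == "__main__":
    main()
```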

Final thought

It took me about 20 minutes to write this code, and my laptop then ran for another 2 hours to download all the books. Had I downloaded them manually, at ~40 seconds per book it would have taken roughly 4.5 hours of nonstop clicking.

There are other ways to do the same task (e.g., using the Springer CMS API or downloading the Excel file with similar information). However, in this article I wanted to show you a generic method that can be applied to other websites as well. I hope you can grab the idea and use it in your own projects. And in case you missed it, here is the GitHub repo again.

Cheers, ❤️
