Week 7-9: First Python-based Tool!

My experience writing a Python tool that scrapes the number of citations of papers.

I started to pick up the basics of Python in the past few weeks – thanks to a 7-day hotel quarantine and some badly timed jet lag. I have been following Al Sweigart’s free-to-read book (and his £13.99 course on Udemy), Automate the Boring Stuff with Python. Last week, I wrote myself a little tool in Python, and I’m proud of it!

Citation Scraper

Have you ever had a list of paper titles and thought “Hmm.. Wouldn’t it be nice if they were sorted by number of citations?” This little gadget is the tool for you! (Yeah I am selling it too much >v<!) Citation counts are not readily available on most databases (apart from Scopus and Web of Science). Fortunately, this information, whilst less reliable, is available on Google Scholar. The tool doesn’t do anything ground-breaking – you feed the program a list of paper titles, and it scrapes the number of citations for each paper and writes them into your spreadsheet.
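To make the idea concrete, here is a minimal sketch of the "extract the count, then sort" step. It is not the actual code of my tool and it does not contact Google Scholar at all. It assumes each search result contains a phrase like "Cited by 123"; the function names (`parse_citation_count`, `sort_by_citations`) and the sample data are made up for illustration.

```python
import re

# Assumption: the result text contains a phrase like "Cited by 123".
CITED_BY = re.compile(r"Cited by (\d+)")


def parse_citation_count(result_text):
    """Return the citation count found in result_text, or 0 if none is found."""
    match = CITED_BY.search(result_text)
    return int(match.group(1)) if match else 0


def sort_by_citations(results):
    """Take {title: result_text} and return (title, count) pairs, most cited first."""
    counts = [(title, parse_citation_count(text)) for title, text in results.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Made-up sample data, standing in for text scraped from a results page.
sample = {
    "Paper A": "... Cited by 12 Related articles",
    "Paper B": "... Cited by 340 Related articles",
    "Paper C": "... Related articles",  # no count shown, treated as 0
}

for title, count in sort_by_citations(sample):
    print(title, count)
```

Treating a missing "Cited by" phrase as 0 is a simplification; a real tool might prefer to flag those papers so a failed lookup is not mistaken for an uncited paper.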

There are existing solutions that achieve this already, such as the Publish or Perish citation tool. I just thought this could be an entry-level task to test myself. “Written” is truly an overstatement – it’s more like copying and adapting code from GitHub and Stack Overflow. But the sense of accomplishment is real.

Sense of accomplishment is real!
Photo by Temo Berishvili on Pexels.com

One barrier I encountered was that, whilst the code appeared to work quite well when I tested each piece independently, it did not perform consistently. One hour it worked, the next hour it stopped working. The code was identical, so I couldn’t understand why it had stopped. I was in hotel quarantine when this problem first appeared, and I was joking to my brother that I must have been blocked by Google – which I later realised was exactly the case!

Turns out, scraping information from other people’s websites may violate their terms and conditions – and could be borderline illegal. Sites like Amazon and Google (and many, many others) have safeguards that automatically block IP addresses when they detect a large number of requests (accesses/searches) within a short amount of time. I had not put any delay between requests in my original code, so it sent thousands of searches within minutes. No wonder I was blocked!
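The missing piece was a pause between requests. Below is a rough sketch of what that could look like, not the code I actually ran. `fetch_politely` is a hypothetical helper, `fetch` stands for whatever function performs a single lookup, and the 5 to 10 second delay range is an arbitrary guess rather than a limit published by any site. A delay does not make scraping permitted either; the site's terms still apply.

```python
import random
import time


def fetch_politely(titles, fetch, min_delay=5.0, max_delay=10.0, sleep=time.sleep):
    """Call fetch(title) for each title, pausing a random time between calls.

    The delay values are illustrative assumptions. `sleep` can be swapped
    out so the loop can be tried without actually waiting.
    """
    results = {}
    for index, title in enumerate(titles):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results[title] = fetch(title)
    return results


# Dry run with a stand-in lookup and a sleep that only records the waits.
waits = []
demo = fetch_politely(
    ["Paper A", "Paper B", "Paper C"],
    fetch=lambda title: len(title),  # placeholder for a real lookup
    sleep=waits.append,
)
print(demo)
print(len(waits), "pauses were requested")
```

The random jitter simply avoids sending requests at perfectly regular intervals. With a few seconds between requests, a thousand titles would take hours rather than minutes, which is the whole point.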

Anyhow, this experience of testing and problem-solving has been fun! I began to understand more about the magic that fuels enthusiasm within the programming/software engineering community. I’m eager to be in a position to contribute to the conversation soon – one day I shall!

To Be Part of the Community!
Photo by Pixabay on Pexels.com