Wikipedia as a Valuable Data Science Tool


A set of TDS articles about Wikipedia

Image by Author

Just a few days ago, I came across this article by Nicola Melluso in the editing queue and was immediately intrigued. Wikipedia is the largest public platform for knowledge in the world, yet I rarely encounter TDS articles that use it for analyses, tutorials, and so on. I sought out and compiled TDS articles and tutorials that build on Wikipedia as a valuable resource and tool for data science projects.

Melluso’s post offers an excellent overview of using Wikipedia to improve NLP tasks like named-entity recognition and topic modeling.

“Wikipedia has been exploited as source of knowledge for more than a decade and has been used repeatedly in a variety of applications: text annotation, categorization, indexing, clustering, searching and automatic taxonomies generation. The structure of Wikipedia has in fact a number of useful features that make it a good candidate for these applications.”

He expands on Wikipedia’s usefulness through his text classification project, accompanied by several visualizations of the results, too.

Back in February 2020, Felipe Hoffa published this piece tracking the global spread and trend of COVID-19 news within Wikipedia. The post is short and to the point, and includes visuals that support his analysis. Felipe found that “Coronavirus” in Mandarin started trending nine days earlier than any other language, that Japanese and Korean were the first languages to catch up to Mandarin, and that Italian, Norwegian, and Persian had the strongest rebounds. Tracking news through Wikipedia is fun on its own, but it is especially fascinating when considering the edits to pages covering a pandemic that took over the world in a matter of days.
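If you want to try this kind of cross-language trend analysis yourself, Wikimedia exposes a public, keyless Pageviews REST API. The sketch below only builds the per-article request URLs (it does not hit the network), and the non-English article titles are illustrative assumptions rather than verified against each wiki:

```python
from urllib.parse import quote

# Wikimedia's public Pageviews REST API (no API key required).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(project, article, start, end):
    """Build a daily per-article pageviews request for one language edition.

    `project` is e.g. "it.wikipedia"; `start`/`end` are YYYYMMDD strings.
    """
    return (f"{BASE}/{project}/all-access/all-agents/"
            f"{quote(article, safe='')}/daily/{start}/{end}")

# Compare when each language edition's COVID-19 page started trending.
for project, title in [("zh.wikipedia", "2019冠状病毒病"),
                       ("it.wikipedia", "COVID-19")]:
    print(pageviews_url(project, title, "20200101", "20200301"))
```

Fetching each URL (for example with `requests.get`) returns daily view counts as JSON, which you can then align across languages to see which edition spiked first.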

In 2018, Will Koehrsen wrote a more general article about working with Wikipedia, as its vast and expansive nature can be intimidating for many. The piece serves as a tutorial for programmatically downloading all of English-language Wikipedia, parsing through the data efficiently, running operations in parallel to get the most from your hardware, and setting up and running benchmarking tests to find efficient solutions. “Having a ton of data is not useful unless we can make sense of it, and so we developed a set of methods for efficiently processing all of the articles for the information we need for our projects,” writes Will.
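The parse-in-parallel pattern can be sketched in a few lines. This is a minimal illustration, not Will’s actual code: real dump files are bz2-compressed, far larger, and carry an XML namespace that this simplified parser ignores, so the sample `<page>` strings below are stand-ins for dump chunks:

```python
import xml.etree.ElementTree as ET
from multiprocessing import Pool

def parse_page(page_xml):
    """Pull (title, word count) out of one <page> element of a dump chunk."""
    root = ET.fromstring(page_xml)
    title = root.findtext("title")
    text = root.findtext("revision/text") or ""
    return title, len(text.split())

# Stand-ins for the many pages inside a real dump partition.
SAMPLE_PAGES = [
    "<page><title>Python</title><revision>"
    "<text>A programming language.</text></revision></page>",
    "<page><title>Wikipedia</title><revision>"
    "<text>A free online encyclopedia.</text></revision></page>",
]

if __name__ == "__main__":
    # Fan the parsing out across CPU cores; with real partitioned dump
    # files you would map a per-file parser over the file list instead.
    with Pool() as pool:
        for title, n_words in pool.map(parse_page, SAMPLE_PAGES):
            print(f"{title}: {n_words} words")
```

Because the English dump ships as dozens of independent partition files, mapping one worker per file is an easy way to keep every core busy, which is exactly the kind of setup Will benchmarks.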

Tanu N Prabhu authored this post in 2020, simply outlining how to use Wikipedia’s API, along with a GitHub repository. The article serves as a basic and succinct tutorial for accessing and parsing the many pieces of information available.
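As a taste of what such a tutorial covers, here is a hedged sketch against the MediaWiki `action=query` API. It builds the request URL and parses the nested response shape locally; the `sample` response below is an illustrative, truncated stand-in rather than a live API result:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def extract_request(title):
    """Build a MediaWiki API request for a page's plain-text introduction."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": 1,       # intro section only
        "explaintext": 1,   # plain text instead of HTML
        "titles": title,
    }
    return f"{API}?{urlencode(params)}"

def first_extract(response_json):
    """Pull the extract out of the API's pages-keyed-by-id structure."""
    pages = response_json["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

print(extract_request("Data science"))

# Truncated, illustrative shape of the JSON the API returns:
sample = {"query": {"pages": {"35458904": {
    "title": "Data science",
    "extract": "Data science is an interdisciplinary field ...",
}}}}
print(first_extract(sample))
```

In practice you would fetch `extract_request(...)` with an HTTP client and feed the decoded JSON into `first_extract`; the awkward pages-keyed-by-page-id layout is why a small helper like this is worth writing once.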
