
Here’s what Wikipedia looks like as a map


There is a lot of data housed in the servers that power Wikipedia: about 40GB worth, to be exact. With 1.5 million articles, the site has reached a point where it’s almost impossible to grasp its entire scope anymore. That is, unless you present all that data in a captivating visual representation. Such was the project of one Olivier Beauchesne.

Beauchesne decided to spend a few days of his life combing through the entire unfiltered repository of Wikipedia articles, and figured out how to present the information in a way that had never been done before: on a map. He took the roughly 400,000 articles that were geocoded and then broke them up into different categories to see exactly how different regions of the world were represented in the online encyclopedia. For example, if you look at the geocoded information on articles about football and basketball, they almost exclusively point to the United States. Topics that focus on things like Hinduism or Islam, however, are more concentrated around India and the Middle East.


The “Spanish” map

Articles about airports, airplanes, and passengers, on the other hand, light up virtually every corner of the planet, with North America receiving only a slightly larger, if expected, emphasis.

The various topics were not chosen at random, but rather generated by an algorithm called Latent Dirichlet Allocation (LDA), a topic model that sorts large amounts of text into topics by grouping words that tend to appear together in the same documents. The geocoded information was then mapped using a variety of tools, mostly Unix-based.
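
For readers curious about what LDA does under the hood, here is a rough idea in code. It is not Beauchesne’s code, and the post does not say which LDA implementation he used. This is a minimal sketch of collapsed Gibbs sampling, one common way to fit LDA, run on a handful of made-up toy "articles". The names `lda_gibbs` and `top_words`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model with collapsed Gibbs sampling (illustrative only)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1

                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


def top_words(topic_word, topic, n=3):
    """Return up to n words with the highest counts in the given topic."""
    counts = {w: c for w, c in topic_word[topic].items() if c > 0}
    return sorted(counts, key=counts.get, reverse=True)[:n]


# Made-up toy "articles", already split into words.
docs = [
    ["football", "basketball", "league", "team", "football"],
    ["basketball", "team", "league", "season"],
    ["airport", "airplane", "passenger", "runway"],
    ["airplane", "passenger", "airport", "airport"],
]

doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k in range(2):
    print("topic", k, top_words(topic_word, k))
```

Each row of `doc_topic` counts how many words of one article ended up in each topic. On a real corpus, counts like these are what would let each geocoded article be tagged with its dominant topic before being placed on a map.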

When you look at each individual topic, it comes across as rather underwhelming and not all that surprising. When you step back and look at the bigger picture, though, it really illuminates the fact that Wikipedia is a truly global collaboration. It would be difficult to find any other online platform where the involvement and activity among users is so widespread and universal. A full slideshow of the different maps can be seen on Beauchesne’s blog.