Wikipedia Category Taxonomy

This project aims to construct the taxonomic tree implied by the parent/child relationships between Wikipedia categories. There are two components to the output of this project:

How: To build the taxonomy, I parsed the English Wikipedia XML dump, looking for pages whose title begins with "Category:". I found that, despite what is displayed on Wikipedia's website, each category page in the XML dump lists only its parent categories. I used this child-to-parent relationship to build the taxonomic structure.
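The approach described above can be sketched roughly as follows. This is not the program offered for download, only a minimal illustration under stated assumptions: the class name CategoryTaxonomySketch and the method parse are hypothetical, the dump is assumed to hold each page as a <page> element with a <title> and a <text> child, and parent categories are assumed to appear in the page text as [[Category:Name]] links. It uses the standard JAXP SAX API, which picks up whichever parser is available on the classpath.

```java
import java.io.InputStream;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects child -> parent category links from a dump.
class CategoryTaxonomySketch {
    // Assumed link syntax: [[Category:Name]] or [[Category:Name|sort key]].
    private static final Pattern PARENT_LINK =
        Pattern.compile("\\[\\[Category:([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    /** Maps each category name to the parent categories named on its page. */
    static Map<String, Set<String>> parse(InputStream dump) {
        Map<String, Set<String>> parents = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();
            private String title;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                buf.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buf.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("title")) {
                    title = buf.toString().trim();
                } else if (qName.equals("text") && title != null
                        && title.startsWith("Category:")) {
                    // Only category pages contribute nodes to the taxonomy.
                    Set<String> found = parents.computeIfAbsent(
                        title.substring("Category:".length()), k -> new LinkedHashSet<>());
                    Matcher m = PARENT_LINK.matcher(buf);
                    while (m.find()) {
                        found.add(m.group(1).trim());
                    }
                } else if (qName.equals("page")) {
                    title = null; // do not carry a title over to the next page
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(dump, handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse dump", e);
        }
        return parents;
    }
}
```

Calling CategoryTaxonomySketch.parse on an input stream opened over a dump file would return the child-to-parent map. Walking that map upward from any category yields its ancestors, and inverting it gives the subcategory lists that the website displays.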

Stats:

Downloads: Currently you can download an SQL file containing the taxonomic structure of Wikipedia, as well as the Java program I wrote to collect this information. The program is more proof of concept than production grade; enhancements and improvements are welcome.

Running Instructions: Running instructions can be found here in the javadoc. You will need to supply an XML parser (I use Xerces) on the classpath to run the code. The programs expect genuine Wikipedia XML dumps as input.

News:

License: Remember, my friends, the license on Wikipedia content is found here. It is the Creative Commons Attribution-ShareAlike license. It may not jibe with your purposes.