Before I dig into the software libraries I want to talk to you about, I need to make a distinction between two kinds of code:
- CRAPL quality code -- This is the code that a researcher builds ad hoc in order to get something done. Little thought is spent on generalizability or portability. With this code, it's usually better and faster to fix a problem by adding a step to the workflow than by fixing the original problem. So you usually end up with quite a mess. While good for documenting the process of data science work, this code is not useful to others.
- Library quality code -- This code has been designed to generalize to new problems. It's intended to be used as a utility by others who are doing similar work -- but not the exact same work. It's usually well documented and well tested. With this code, it is sacrilegious to add more code on top of broken code to fix a problem. This is just one of the many disciplines that must be applied to the practice of writing software in order to produce good, library quality code.
I've been analyzing Wikipedia data for nearly a decade (!!!) -- and I can tell you that it was never easy. The English Wikipedia XML dumps that I have done most of my work with are on the order of 10 terabytes uncompressed. The database, web API and XML dumps all use different field names to refer to the same thing. In each one, the absence of a field -- or the NULLing of a field -- can mean different things. Worse, the MediaWiki software has been changing over time, so in order to do historical analyses, you need to take that into account. In the process of working out these details and getting my work done, I've produced reams of CRAPL quality code. See https://github.com/halfak/Activity-sessions-research for an example. In this case, I have a Makefile that, if executed, would replicate my research project. But if you look inside that Makefile, you'll see things like this:
    # datasets/originals/enwiki_edit_action.tsv: sql/edit_action.sql
    #     cat sql/edit_action.sql | \
    #     mysql $(dbstore) enwiki > \
    #     datasets/originals/enwiki_edit_action.tsv
That's a commented-out Makefile rule that calls my local database, with my local configuration hardcoded, and runs some SQL against it. This is great if you want to know which SQL produced which datafile, but not very useful if you want to replicate the work. And why is it commented out!? Well, the database query takes a long time to run and I didn't want to accidentally overwrite the data file as I was finishing off the research paper. Gross, right? It isn't all that useful if you want to perform a similar analysis, either.
But in producing this CRAPL code, nice, generalizable parts occur to me, so I write them up for others' benefit. I've gone through a few iterations of this and learned from my mistakes.
Back in 2011, I released the first version of wikimedia-utilities, a set of utilities that made the work I was doing at the Wikimedia Foundation easier. The killer feature of this library was the XML processing strategy: it turned processing Wikipedia's terabyte-scale XML dumps from a ~2000 line script into a ~100 line script. But the code wasn't very pythonic, it lacked proper tests, and it did not integrate well into the python packaging environment.
In 2013, I decided to make a clean break and start working on mediawiki-utilities, a super-set of utilities from wikimedia-utilities that were intentionally generalized to run on any MediaWiki instance. I had learned some lessons about being pythonic, implementing proper tests and integrating with python's packaging environment.
But as I worked on new projects and realized how they could generalize, I ended up expanding mediawiki-utilities into a monolith of loosely related parts. And it got worse. Since I focused on those parts as I needed them, certain modules were ignored. Since I did most of my work with the databases directly, it was rare that I spent time on the 'database' module of mediawiki-utilities. I ended up with a monolith that was inconsistently developed!
So, in thinking about monoliths and how to solve the problems they impose, I was inspired by the Unix philosophy of combining "small, sharp tools" to solve larger problems. I realized that the primary modules of mediawiki-utilities could be split off into their own projects and combined in interesting ways -- and that this would enable a more distributed management strategy. So I've been hard at work bringing this vision to light.
First, the core utilities:
pip install mwxml
(docs, source) -- This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: complexity and performance of streaming XML parsing.
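To give a sense of what that looks like in practice, here's a minimal sketch of streaming over a dump (the dump file path is a placeholder -- point it at a real dump):

    import mwxml

    # Stream pages and revisions out of an XML dump without loading the
    # whole thing into memory.  "enwiki-dump.xml" is a placeholder path.
    dump = mwxml.Dump.from_file(open("enwiki-dump.xml"))

    for page in dump:
        for revision in page:
            print(page.title, revision.id, revision.timestamp)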
pip install mwapi
(docs, source) -- This library provides a set of basic utilities for interacting with MediaWiki’s “action” API – usually available at /w/api.php. The most salient feature of this library is the mwapi.Session class that provides a connection session that sustains a logged-in user status and provides convenience functions for calling the MediaWiki API. See get() and post().
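A quick sketch of how a session gets used (the user agent string is just an example -- use one that identifies your own tool):

    import mwapi

    # Open a session against English Wikipedia's action API.
    session = mwapi.Session("https://en.wikipedia.org",
                            user_agent="my-research-tool <me@example.com>")

    # A simple, unauthenticated GET request.
    response = session.get(action="query", meta="siteinfo")
    print(response["query"]["general"]["sitename"])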
pip install mwdb
(source) -- This library provides a set of utilities for connecting to and querying a MediaWiki database.
pip install mwparserfromhell
(docs, source) -- This library provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode.
- Note that I am primarily a user of this library -- not a major contributor -- but it obviously belongs in this list.
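For completeness, a tiny example of the kind of thing it makes easy (the wikitext is made up for illustration):

    import mwparserfromhell

    wikitext = "{{Infobox person|name=Ada Lovelace}} She was a [[mathematician]]."

    # Parse the wikicode and pull out templates and links.
    wikicode = mwparserfromhell.parse(wikitext)
    for template in wikicode.filter_templates():
        print(template.name, "name =", template.get("name").value)
    for link in wikicode.filter_wikilinks():
        print(link.title)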
Now the peripheral libraries that make use of these core utilities:
pip install mwoauth
(docs, source) -- This library provides a simple means of performing an OAuth handshake with a MediaWiki installation that has the OAuth Extension installed.
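Here's roughly what the handshake looks like from a script. The consumer key and secret below are placeholders -- you get real ones by registering an OAuth consumer on the target wiki:

    from mwoauth import ConsumerToken, Handshaker

    # Placeholders -- register an OAuth consumer on the wiki to get real ones.
    consumer_token = ConsumerToken("<consumer key>", "<consumer secret>")
    handshaker = Handshaker("https://en.wikipedia.org/w/index.php", consumer_token)

    # Step 1: send the user off to authorize the consumer.
    redirect, request_token = handshaker.initiate()
    print("Point your browser at:", redirect)

    # Step 2: complete the handshake with the query string MediaWiki
    # sends back, then identify the logged-in user.
    response_qs = input("Paste the response query string here: ")
    access_token = handshaker.complete(request_token, response_qs)
    print(handshaker.identify(access_token)["username"])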
pip install mwreverts
(docs, source) -- This library provides a set of utilities for detecting reverts (see mwreverts.Detector and mwreverts.detect()) and identifying the reverted status of edits to a MediaWiki wiki.
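A small sketch of identity revert detection with the detector -- the checksums and metadata here are made up for illustration:

    import mwreverts

    # Feed (checksum, metadata) pairs to the detector in chronological order.
    # A repeated checksum within the radius indicates an identity revert.
    detector = mwreverts.Detector(radius=3)

    history = [("aaa", {"rev_id": 1}),
               ("bbb", {"rev_id": 2}),
               ("aaa", {"rev_id": 3})]  # identical to rev 1 -- a revert

    for checksum, metadata in history:
        revert = detector.process(checksum, metadata)
        if revert is not None:
            print(revert.reverting, "reverted", revert.reverteds)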
pip install mwsessions
(docs, source) -- This library provides a set of utilities for grouping MediaWiki user actions into sessions. mwsessions.Sessionizer and mwsessions.sessionize() can be used by python scripts to group activities into sessions, or the command line utilities can be used to operate directly on data files. Such methods have been used to measure editor labor hours.
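A sketch of the sessionizer in action. The timestamps are plain unix epoch seconds, and the method names beyond Sessionizer itself reflect my reading of the library, so double-check against the docs:

    import mwsessions

    # Group one user's events into sessions with a one-hour inactivity cutoff.
    sessionizer = mwsessions.Sessionizer(cutoff=60 * 60)

    events = [("EpochFail", 1420070400, "edit"),
              ("EpochFail", 1420070700, "edit"),
              ("EpochFail", 1420099200, "edit")]  # > 1 hour later -- new session

    for user, timestamp, action in events:
        for session in sessionizer.process(user, timestamp, action):
            print("completed session:", session)

    # Flush whatever sessions are still open at the end of the event stream.
    for session in sessionizer.get_active_sessions():
        print("trailing session:", session)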
pip install mwpersistence
(source) -- This library provides a set of utilities for measuring content persistence and tracking authorship in MediaWiki revisions.
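Treat the class and attribute names in this sketch as assumptions to verify against the source; the idea is to track token persistence across a made-up three-revision history, using a diff algorithm from the deltas package:

    import mwpersistence
    from deltas import segment_matcher

    # Track how long each token sticks around across a tiny revision history.
    # DiffState and the attribute names below are assumptions -- check the source.
    state = mwpersistence.DiffState(segment_matcher, revert_radius=3)

    revisions = [(1, "Apples are red."),
                 (2, "Apples are red. Bananas are yellow."),
                 (3, "Apples are tasty. Bananas are yellow.")]

    for rev_id, text in revisions:
        current_tokens, added, removed = state.update(text, revision=rev_id)

    # Each surviving token carries the list of revisions it has persisted in.
    for token in current_tokens:
        print(repr(token.text), "persisted through", len(token.revisions), "revisions")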
And I have a bunch more that are just on the horizon. They represent a sampling of my active research projects.
- mwmetrics -- User behavioral statistics extraction for MediaWiki editors
- mwrefs -- Extract citations, references and scholarly identifiers from MediaWiki
- mwevents -- Generalized event extraction and processing framework for MediaWiki
- mwtalkpage -- A talk page discussion parser for MediaWiki
It's my goal that researchers who haven't been working with wiki datasets will have a much easier time building off of my work to do their own. I think that a good set of libraries can make a huge difference in this regard.
I'll be making a more substantial announcement soon. In the meantime, I'm cleaning up and extending documentation and putting together some examples that demonstrate how a researcher can compose these small, sharp libraries to perform powerful analyses of Wikipedia and of users in other MediaWiki wikis. Until then, please use these utilities, let me know about bugs, and send me your pull requests!
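To give a flavor of that composition now, here's a hedged sketch that chains mwxml and mwreverts to flag identity reverts while streaming a dump (the dump path is a placeholder):

    import mwxml
    import mwreverts

    # Stream revisions out of a dump with mwxml and look for identity
    # reverts with mwreverts.  "enwiki-dump.xml" is a placeholder path.
    dump = mwxml.Dump.from_file(open("enwiki-dump.xml"))

    for page in dump:
        detector = mwreverts.Detector(radius=15)
        for revision in page:
            revert = detector.process(revision.sha1, {"rev_id": revision.id})
            if revert is not None:
                print(page.title, "rev", revert.reverting["rev_id"],
                      "reverted", len(revert.reverteds), "revision(s)")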