First, the umbrella project: Measuring value-added
Basically, I see the value of Wikipedia as a simple combination of two hidden variables: quality and importance. If we focused on making our unimportant content really high quality, that wouldn't be very valuable. Conversely, if we focus on increasing the quality of the most important content first, we increase the value of Wikipedia most quickly.
Value = Quality × Importance
But I want to look at value-adding activities, so I need to measure progress towards quality. I think a nice term for that is productivity.
Value-added = Productivity × Importance
So in order to take measurements of value-adding activity in Wikipedia, I need to bring together good measures of productivity and importance.
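To make the formula above concrete, here's a minimal sketch of how the two measures might combine. The specific scaling choices (log of view rate for importance, persistence-weighted words for productivity) are illustrative assumptions, not settled decisions:

```python
import math

def value_added(words_added, persistence_quality, view_rate):
    """Value-added = Productivity x Importance (illustrative scaling)."""
    productivity = words_added * persistence_quality  # words weighted by how well they survive
    importance = math.log(view_rate + 1)              # log damps the heavy tail of view counts
    return productivity * importance

# A contribution of 120 words that mostly persists, to a moderately viewed article
score = value_added(words_added=120, persistence_quality=0.8, view_rate=5000)
```

The interesting open question is exactly the one raised above: how to scale each factor so the product is intuitively meaningful.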
[Figure: Density of log(view rate) for articles assessed by Wikipedians]
I'm going to side-step a big debate here, purely because I don't feel like re-hashing it in text: it's not entirely clear what importance is. But we do have some good ways to measure it. The two dominant strategies for determining the importance of a Wikipedia article's topic are (1) view rate counts and (2) link structure.
With view rate counts, the assumption is made that the most important content in Wikipedia is viewed most often. This works pretty well as far as assumptions go, but it has some notable weaknesses. For example, the article on Breaking Bad (TV show) has about an order of magnitude more views than the article on Chemistry. For an encyclopedia of knowledge, it doesn't feel right that we'd consider a TV show to be more important than a core academic discipline.
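Here's a toy illustration of that order-of-magnitude gap, and of why working with log(view rate) softens it. The daily view counts below are made-up numbers for illustration, not real measurements:

```python
import math

# Hypothetical daily view counts (illustrative, not real data)
views = {"Breaking Bad": 30_000, "Chemistry": 3_000}

# On raw counts, Breaking Bad looks 10x as important as Chemistry.
raw_ratio = views["Breaking Bad"] / views["Chemistry"]

# On a log scale, the gap shrinks to a modest difference.
log_importance = {title: math.log10(v) for title, v in views.items()}
log_ratio = log_importance["Breaking Bad"] / log_importance["Chemistry"]
```

The log transform doesn't make the icky feeling go away, but it keeps one viral TV show from dominating the importance measure entirely.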
Link structure provides another opportunity. Google's founders famously used the link structure of the internet to build a ranking strategy for the most important websites. See PageRank. This also seems to work pretty well, but it's less clear what the relationship is between the link graph properties and the nebulous notion of importance. At least with page view rates, you can plainly imagine the impact that a highly viewed article has.
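For readers who haven't seen it, the core of PageRank is a simple power iteration over the link graph. Here's a compact sketch on a toy graph (the graph itself is invented for illustration):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for out in outs:
                    new[out] += share
            else:  # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

# Toy graph: many pages link to Chemistry, few to Breaking Bad
toy_links = {
    "Biology": ["Chemistry"],
    "Physics": ["Chemistry"],
    "Film": ["Chemistry", "Breaking Bad"],
    "Chemistry": ["Physics"],
    "Breaking Bad": [],
}
ranks = pagerank(toy_links)
```

Pages with many incoming links from well-linked pages accumulate rank, which is exactly the property that rescues Chemistry in the next paragraph.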
Fun story though: Chemistry has 10 times as many incoming links as Breaking Bad. It could be that this measurement strategy can help us deal with the icky feeling we academics get when thinking that a TV show is more important than centuries of difficult work building knowledge.
Luckily, there is a vast literature on measuring the quality of contributions in Wikipedia, some of which I have published myself! There are a lot of strategies, but the most robust (and most difficult to compute) is tracking the persistence of content between revisions. The assumption goes: the more subsequent edits a contribution survives, the higher quality it probably was. We can quite easily weight "words added" by "persistence quality" to get a nice productivity measure. It's not perfect, but it works. The trick is figuring out the right way to scale and weight the measures so that they are intuitively meaningful.
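A minimal version of that weighting might look like the sketch below. The 10-revision survival horizon is an assumption I've picked for illustration; the real measure would need the scaling work described above:

```python
def persistence_quality(revisions_survived, horizon=10):
    """Fraction of a survival horizon a contribution lasted, capped at 1.

    horizon=10 is an illustrative assumption, not a calibrated parameter.
    """
    return min(revisions_survived / horizon, 1.0)

def productivity(contributions):
    """Sum of words added, each weighted by its persistence quality.

    contributions: list of (words_added, revisions_survived) pairs.
    """
    return sum(words * persistence_quality(survived)
               for words, survived in contributions)

# 100 words that survived 10+ edits count fully; 100 immediately
# reverted words count for nothing.
score = productivity([(100, 10), (100, 0)])
```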
The real trick here was making the computation tractable. It turns out that tracking changes between revisions is extremely computationally intensive. It would take me 60 days or so to track content persistence across the entire ~600m revisions of Wikipedia on a single core of the fastest processor on the market. So the trick is to figure out how to distribute the processing across multiple processors. We've been using Hadoop streaming. See my past post about it: Fitting Hadoop streaming into my workflow. It's been surprisingly difficult to work around memory issues in Hadoop streaming that don't happen when just using unix pipes on the command line. I might make a post about that later, but honestly, it makes me feel tired just to think about those types of problems.
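To give a sense of the shape of the job: Hadoop streaming mappers and reducers are just programs reading tab-separated lines from stdin and writing key/value lines to stdout. A mapper like the sketch below keys each revision by page id, so the shuffle phase delivers all of a page's revisions, in order, to a single reducer that can then diff them. The field layout (page_id, rev_id, text) is my assumption for illustration, not the actual schema:

```python
import sys

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Key each revision by page_id so the shuffle groups a page's
    revisions onto one reducer, sorted by revision id."""
    for line in stdin:
        page_id, rev_id, text = line.rstrip("\n").split("\t", 2)
        # Zero-pad rev_id so the shuffle's string sort matches numeric order
        stdout.write(f"{page_id}\t{rev_id.zfill(12)}\t{text}\n")

if __name__ == "__main__":
    mapper()
```

Since persistence tracking needs every revision of a page in memory (or at least a sliding window of them), pages with very long histories are exactly where the streaming reducers hit the memory trouble mentioned above.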
Bringing it together
I'm almost there. I've still got to work out some thresholding bits for the productivity measures, but I've already finished the hard computational work. My next update (or paper) will be about the who, where, and when of value-adding in Wikipedia. Until then, stay tuned.