Socio-technologist

Join my reddit AMA about Wikipedia and ethical, transparent AI

2017-05-24T13:32:00.003-07:00

Hey folks, I'm doing an experimental Reddit AMA ("ask me anything") in r/IAmA on June 1st at 21:00 UTC. For those who don't know, I create artificial intelligences that support the volunteers who edit Wikipedia. I've been studying the ways that crowds of volunteers build massive, high quality information resources like Wikipedia for over ten years. This AMA will allow me to channel that for new audiences in a different (for us) way. I'll be talking about the work I'm doing with the ethics and transparency of the design of AI, how we think about artificial intelligence on Wikipedia, and ways we’re working to counteract vandalism. I'd love to have your feedback, comments, and questions—preferably when the AMA begins, but also through the ORES talkpage on MediaWiki.

If you'd like to know more about what I do, see my WMF staff user page, this Wired piece about my work or my paper, "The Rise and Decline of an Open Collaboration System: How Wikipedia’s reaction to popularity is causing its decline".

Best practices for AI in the social spaces: Integrated refutations

2016-10-20T11:23:00.004-07:00

So, I was listening to an NPR show titled "Digging into Facebook's File on You". At some point, there was some casual discussion of laws that some countries in the European Union have re. users' ability to review and correct mistakes in data that is stored about them. This made me realize that ORES needs a good mechanism for you to review how the system classifies you and your stuff.

As soon as I realized this, I wrote up a ticket describing my rough thoughts on what a "refutation" system would look like. From https://phabricator.wikimedia.org/T148700:

Build a user-friendly UI for reviewing how ORES has classified you or any query-able slice of activity.
Include UI & API to refute classifications endpoints (via OAuth).
Include UI to curate "refutations" (suppress & meta-review).
Notes:

Refutations need freeform text. Freeform text *must* have a suppression system.

We should include refutations in ORES query results.

We can borrow from MediaWiki's global user rights (ORES is global) for curation privs.
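To make the notes above a bit more concrete, here is a minimal sketch of what a score with attached refutations might look like. Nothing here is a real ORES API: the class names, field names, and result shape are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Refutation:
    """A human's disagreement with one classification (hypothetical shape)."""
    user: str
    claimed_label: bool        # what the human says the label should be
    note: str = ""             # freeform text, which is why suppression is needed
    suppressed: bool = False   # set by a curator during meta-review


@dataclass
class Score:
    """A model prediction with any refutations attached (hypothetical shape)."""
    rev_id: int
    model: str
    prediction: bool
    refutations: List[Refutation] = field(default_factory=list)

    def to_result(self) -> dict:
        """Render a query result that includes only unsuppressed refutations."""
        visible = [r for r in self.refutations if not r.suppressed]
        return {
            "rev_id": self.rev_id,
            "model": self.model,
            "prediction": self.prediction,
            "refutations": [
                {"user": r.user, "claimed_label": r.claimed_label, "note": r.note}
                for r in visible
            ],
        }
```

The point of the sketch is that a tool consuming the result sees the human disagreement right next to the machine's prediction, while suppressed freeform text never leaves the system.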

On one hand, I'm excited by this idea because I think it will be a very interesting exploration into what kinds of feedback people want to give to ORES' classifications. I think it will provide a good means for humans to disagree with the machine in ways that are effective. It's important that we don't hide false-positive/negative reports somewhere that no one will ever see. It would be better if such reports were available as part of a query result -- so that an ORES decision can be flagged and directly challenged by a human.

On the other hand, I'm a little bit embarrassed that this wasn't part of the plan all along. I guess in a way, I expected the wiki to fill this infrastructure -- of critiquing ORES. But critiques on the wiki are hidden from the API that tools use to consume scores, which relegates them to a less effectual class of data. I feel a little bit stupid that it never occurred to me that humans should be able to affect ORES' output.

Now to find the resources to construct this Meta-ORES system...

The end of "lowering barriers" as a metaphor for transition difficulty

2016-06-29T08:12:00.002-07:00

I remember sitting on the carpet at CSCW 2014 talking to Gabriel Mugar about boundaries and how people become aware of them. I've been working on some thoughts re. "barriers" being the wrong word because one imagines a passive wall of some height/permeability that can be removed or opened -- hence "lowered".

In the case of active socialization as some newcomer is welcomed to some established community, the "wall" is more than "lowered"; an active aid of conveyance across some threshold is provided. Here, Latour's door opener doesn't just make it easy to enter a building but also communicates that you are welcome. If we were to apply the barrier metaphor of a "wall" that might be lowered, we might imagine a simple hole in the wall being the lowest barrier. But when thinking about active socialization, a hole in the wall with a "welcomer" is more open. Does the "welcomer" lower the "wall" further? No. Suddenly the idea of "lowering a barrier" gets in the way of thinking clearly about the difficulty people experience when transitioning from "outside" to "inside" of a community.

Brief discussion: The Effects and Antecedents of Conflict in FLOSS dev.

2016-03-13T11:30:00.000-07:00

Hey folks. For this week's blogging, I'll be reading and thinking about:

A Filippova, H Cho, The Effects and Antecedents of Conflict in Free and Open Source Software Development. CSCW, 2016

This isn't a review or really a summary -- just some thoughts I had that I want to share.

Transformational leadership

See https://en.wikipedia.org/wiki/Transformational_leadership

Filippova et al. highlight a "transformational leadership" style as one of the factors mitigating the negative effects of conflict in FOSS teams.

I don't really like the word "transformational" since, to me, it seems to fail to highlight the key meaning of the term -- collaboration. A transformational leader doesn't own the process of change, but rather works with others to put together a vision and enact a change with other committed individuals. It involves convincing others of a directional change or new effort rather than enforcing it through force or punishment/reward. I think the key insight that transformational leaders have is that they do not own the process of change. At most they can be a key contributor to it. There will need to be a conversation and some sort of consensus before real change can happen. In my practice as a leader of a small group of volunteers developing a FOSS project, I can't imagine operating any other way. I must lead by example -- by making a convincing case for what we should do and allowing myself to be convinced by others. The direction we set will be owned by all. I need my team to feel this ownership and shared identity. This is tricky when there are deadlines and a disagreement threatens to delay substantial work, but in my experience, deadlines aren't that dead and delays can be real opportunities to step back and ask why the disagreement exists in the first place.

In the end, if I can't convince you (gentle reader, my assumed collaborator) that my idea is good and worth pursuing then maybe it's not actually good or worth pursuing. In a way, I see the transformational leadership style as a lot like code review. If I want to reprogram our team structure/plans, then I should be able to get the changes reviewed and supported by others. In the process of review, we can increase our shared notion of norms/goals and make sure that the implemented changes are actually good!

Different types of conflict

So, I've been studying conflict patterns in Wikipedia for a while, but I've never really dug into the literature about different types of conflict. Of course, it's obvious that there *are* different types. I've written about this in the past around reverts in Wikipedia, but it's much more useful to apply past thought on the subject than to inject my own naive point of view. Luckily, Filippova et al. provide a nice summary of Task conflict, Affective conflict, Process conflict, and Normative conflict.

Task conflict: Conflict about what needs to be done. E.g. do we engineer ORES to use celery workers or do we just plan to have a large pool of independent uwsgi threads? While this can be good in that it brings a diverse perspective of possible implementations, it also might turn into a religious battle over the Right Way To Do Things(TM).
Affective conflict: AKA drama. Conflict due to bad blood -- relationships between people might cause conflict regardless of any task disagreement. E.g. do we attribute the failure of a team member to complete a task to the complexity of the task or to their general incompetence?
Process conflict: Disagreements over how to do tasks. E.g. do we require code-review and non-self-merges for *everything* or are there some reasonable exceptions? On the Revision Scoring project, I generally make process declarations that go unchallenged and then we iterate whenever the process seems to be not working. So far, I wouldn't really say that this has escalated to "conflict" yet, but I could see how it might.
Normative conflict: Disagreements about "group function". E.g. do we generally pursue a caution-first strategy or an open-first strategy (everything is open until a problem arises vs. everything is closed until we know we can safely open it)? This is a discussion that I'd like to come back to in a future blog post as I'm very opinionated about how the norms of Wikipedia (bold inclusionism & openness) should be extended to the software development community around MediaWiki.

Revscoring 1.0 and some demos

2016-03-06T11:15:00.000-08:00

The "Revision Scoring" logo

Hey folks,

I've mostly been traveling recently, but I have got some hacking in. There are a few things I want to share.

I just released revscoring 1.0 -- revscoring is the library that powers ORES. I developed it to make building and deploying machine learning models as services easier.
I wrote up an ipython notebook that demonstrates how to build a machine learning classifier to detect vandalism in Wikipedia.
I just finished writing up an ipython notebook that digs into how the feature extraction system in revscoring works.

Next week, I hope to be showing you some more about the disparate impact of damage detectors on anons. Stay tuned.

Notes on writing a Wikipedia Vandalism detection paper

2016-01-24T11:48:00.001-08:00

Hey folks. I've been reviewing Wikipedia vandalism detection papers -- which have been an active genre since ~2008. I'll be writing a more substantial summary of the field at some point, but for now, I just want to share some notes on what I (a Wikipedia vandalism detection practitioner) want to see in future work in this area.

Two thresholds: auto-revert & patrol

There are two things that I want to do with a vandalism prediction model. Auto-revert at an extremely high level of confidence (SMALL% false-positive -- e.g. 1% or 0.1%) and patrol everything that might be vandalism (LARGE% recall -- e.g. 95% or 97.5%). These two modes correspond to auto-revert bots (like ClueBot NG) and recent changes patrolling performed by Wikipedia editors. These two thresholds represent basic values to optimize for that represent a real reduction in the amount of time and energy that Wikipedians need to spend patrolling for vandalism.

Truth space of a classifier model.
[[:commons:File:Precisionrecall.svg]]

Optimizing recall for anti-vandal bots

Anti-vandal bots are great in that they operate for free (short of development and maintenance), but they must behave nicely around humans. A bot that reverts edits is potentially hazardous and so Wikipedians & ClueBot NG maintainers have settled on a 0.1% false-positive rate and claim that they are able to detect 40% of all vandalism. They also claim that, at an older false-positive rate threshold of 1%, the bot was able to catch 55% of all vandalism.

So, vandalism prediction model scholars. Please tell me what recall you get at 1% and 0.1% false-positive rates. As this proportion goes up, humans will need to spend less time and energy reverting vandalism.
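For anyone unsure what I mean by "recall at a false-positive rate", here's a minimal sketch in plain Python. It assumes you have a vandalism score and a true label for each edit in a test set; the function name and inputs are my own invention for illustration, not part of any library.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Return the best recall achievable with false-positive rate <= max_fpr.

    scores: vandalism score per edit (higher = more suspicious)
    labels: True if the edit really was vandalism
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one vandalism edit and one good edit")

    # Walk down the ranking, lowering the threshold one distinct score at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Include every edit tied at this score before evaluating the threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / negatives > max_fpr:
            break  # the false-positive rate only grows from here
        best_recall = max(best_recall, tp / positives)
    return best_recall
```

Calling `recall_at_fpr(scores, labels, 0.01)` and `recall_at_fpr(scores, labels, 0.001)` on a held-out test set gives the two numbers I'm asking for.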

Optimizing review-proportion for patrollers

We may never reach the day where anti-vandal bots are able to attain 100% recall. In the meantime, we need to use human judgement to catch everything else. But we can optimize how we make use of this resource (human time and attention) by minimizing how many edits humans will need to review in order to catch some large percentage of the vandalism -- e.g. 95% or 97.5%.

So, vandalism prediction model scholars. Please tell me what proportion of all edits your model must flag as vandalism in order to get 95 and 97.5% recall. As this proportion goes down, humans will need to spend less time and energy reviewing.
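The review proportion can be sketched the same way. This assumes patrollers review edits in order of descending score and ignores ties between scores; the function name is hypothetical.

```python
import math


def review_proportion_at_recall(scores, labels, target_recall):
    """Smallest fraction of all edits that must be reviewed, taking edits in
    order of descending score, to catch target_recall of the vandalism."""
    positives = sum(1 for y in labels if y)
    if positives == 0:
        raise ValueError("need at least one vandalism edit")

    # Number of vandalism edits we must catch to reach the target.
    needed = math.ceil(target_recall * positives)
    if needed == 0:
        return 0.0

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    caught = 0
    for reviewed, (_, is_vandalism) in enumerate(ranked, start=1):
        if is_vandalism:
            caught += 1
        if caught >= needed:
            return reviewed / len(ranked)
    return 1.0
```

A model that returns 0.10 at a target of 0.95 means patrollers can skip 90% of edits and still catch 95% of the vandalism.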

Realtime is the killer use-case

This is more of a rant than a request for measurements. A lot of papers explore how much more fitness that they can get using post-hoc measures of activity around an edit. It's no surprise that you can more easily tell whether or not an edit was vandalism once you can include "was it reverted?" and "did the reverting editor call it vandalism?" in your model. There's lots of discussion around how these post-hoc models could be used to clean up a print version of Wikipedia, but I'm pretty sure we're never going to do that (at least not really). If we ever did try to reduce views of vandalized articles, we'd probably want to do that in realtime. ;)

Disparate impact of damage-detection on anonymous Wikipedia editors

2015-12-06T10:28:00.003-08:00

Today, I'm writing briefly about a problem that I expect to be studying and trying to fix over the course of the next few weeks.

The problem: The damage detection models that ORES supports seem to be overly skeptical of edits by anonymous editors and newcomers.

I've been looking at this problem for a while, but I was recently inspired by the framing of disparate impact. Thanks to Jacob Thebault-Spieker for suggesting I look at the problem this way.

In United States anti-discrimination law, the theory of disparate impact holds that practices in employment, housing, or other areas may be considered discriminatory and illegal if they have a disproportionate "adverse impact" on persons in a protected class. via Wikipedia's Disparate Impact (CC-BY-SA 4.0)

So, let's talk about some terms and how I'd like to apply them to Wikipedia.

Disproportionate adverse impact. The damage detection models that ORES supports are intended to focus attention on potentially damaging edits. Still, human judgement is not perfect and there's a lot of fun research that suggests that "recommendations" like this can affect people's judgement. So by encouraging Wikipedia's patrollers to look at a particular edit, we are likely also making them more likely to find flaws in that edit than if it was not highlighted by ORES. Having an edit rejected can demotivate the editor, but it may be even more concerning that the rejection of content from certain types of editors may lead to coverage biases as the editors most likely to contribute to a particular topic may be discouraged or prevented from editing Wikipedia.

Protected class. In US law, it seems that this term is generally reserved for race, gender, and ability. In the case of Wikipedia, we don't know these demographics. They could be involved and I think they likely are, but I think that anonymous editors and newcomers should also be considered a protected class in Wikipedia. Generally, anonymous editors and newcomers are excluded from discussions and therefore subject to the will of experienced editors. I think that this has been having a substantial, negative impact on the quality and coverage of Wikipedia. To state it simply, I think that there is a collection of systemic problems around anonymous editors and newcomers that prevent them from contributing to the dominant store of human knowledge.

So, I think I have a moral obligation to consider the effect that these algorithms have in contributing to these issues and rectifying them. The first and easiest thing I can do is remove the features user.age and user.is_anon from the prediction models. So I did some testing. Here are fitness measures (see AUC) for all of the edit quality models, both with the current features and with the user features removed.

wiki	model	current AUC	no-user AUC	diff
dewiki	reverted	0.900	0.792	-0.108
enwiki	reverted	0.835	0.795	-0.040
enwiki	damaging	0.901	0.818	-0.083
enwiki	goodfaith	0.896	0.841	-0.055
eswiki	reverted	0.880	0.849	-0.031
fawiki	reverted	0.913	0.835	-0.078
fawiki	damaging	0.951	0.920	-0.031
fawiki	goodfaith	0.961	0.897	-0.064
frwiki	reverted	0.929	0.846	-0.083
hewiki	reverted	0.874	0.800	-0.074
idwiki	reverted	0.935	0.903	-0.032
itwiki	reverted	0.905	0.850	-0.055
nlwiki	reverted	0.933	0.831	-0.102
ptwiki	reverted	0.894	0.812	-0.082
ptwiki	damaging	0.913	0.848	-0.065
ptwiki	goodfaith	0.923	0.863	-0.060
trwiki	reverted	0.885	0.809	-0.076
trwiki	damaging	0.892	0.798	-0.094
trwiki	goodfaith	0.899	0.795	-0.104
viwiki	reverted	0.905	0.841	-0.064

So to summarize what this table tells us: We'll lose between 0.03 and 0.11 AUC per model which brings us from beating the state of the art to not. That makes the quantitative glands in my brain squirt some anti-dopamine out. It makes me want to run the other way. It's really cool to be able to say "we're beating the state of the art". But on the other hand, it's kind of lame to know "we're doing it at the expense of users who are most sensitive and necessary." So, I've convinced myself. We should deploy these models that look less fit by the numbers, but also reduce the disparate impact on anons and new editors. After all, the actual practical application of the model may very well actually be better despite what the numbers say.
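If you want to check the size of the loss yourself, here's a small sketch that recomputes the per-model AUC drop from the table above. The numbers are copied straight from the table; the variable and function names are just for illustration.

```python
# (wiki, model, current AUC, no-user AUC), copied from the table above.
ROWS = [
    ("dewiki", "reverted", 0.900, 0.792),
    ("enwiki", "reverted", 0.835, 0.795),
    ("enwiki", "damaging", 0.901, 0.818),
    ("enwiki", "goodfaith", 0.896, 0.841),
    ("eswiki", "reverted", 0.880, 0.849),
    ("fawiki", "reverted", 0.913, 0.835),
    ("fawiki", "damaging", 0.951, 0.920),
    ("fawiki", "goodfaith", 0.961, 0.897),
    ("frwiki", "reverted", 0.929, 0.846),
    ("hewiki", "reverted", 0.874, 0.800),
    ("idwiki", "reverted", 0.935, 0.903),
    ("itwiki", "reverted", 0.905, 0.850),
    ("nlwiki", "reverted", 0.933, 0.831),
    ("ptwiki", "reverted", 0.894, 0.812),
    ("ptwiki", "damaging", 0.913, 0.848),
    ("ptwiki", "goodfaith", 0.923, 0.863),
    ("trwiki", "reverted", 0.885, 0.809),
    ("trwiki", "damaging", 0.892, 0.798),
    ("trwiki", "goodfaith", 0.899, 0.795),
    ("viwiki", "reverted", 0.905, 0.841),
]


def auc_drops(rows):
    """Return the AUC lost per model when user features are removed."""
    return [round(current - no_user, 3) for _, _, current, no_user in rows]


drops = auc_drops(ROWS)
summary = {
    "smallest_drop": min(drops),
    "largest_drop": max(drops),
    "mean_drop": round(sum(drops) / len(drops), 3),
}
print(summary)
```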

But before I do anything, I need to convince my users. They should have a say in this. At the very least, they should know what is happening. So, next week, I'll start a conversation laying out this argument and advocating for the switch.

One final note. This problem may be a blessing in disguise. By reducing the fitness of our models, we have a new incentive to re-double our efforts toward finding alternative sources of signal to increase the fitness of our models.

Vision of ORES & ORES vision

2015-11-22T10:25:00.004-08:00

So I've been working on a blog post for blog.wikimedia.org about ORES. I talked about ORES a few weeks ago in ORES: Hacking social structures by building infrastructure, so check that out for reference. Because the WMF blog is relatively high profile, the Comms team at the WMF doesn't want to just lift my personal bloggings about it -- which makes sense. I usually spend 1-2 hours on this, so you get typos and unfinished thoughts.

In this post, I want to talk to you about something that I think is really important when communicating what ORES is to a lay audience.

Visualizing ORES

The WMF Comms team is pushing me to make the topic of machine triage much more approachable to a broad audience. So, I have been experimenting with visual metaphors that would make the kinds of things that ORES enables easier to understand. I like to make simple diagrams like the one below for the presentations that I give.

The flow of edits from The Internet to Wikipedia are highlighted by ORES quality prediction models as "good", "needs review" and "damaging".

ORES vision

But it occurs to me that a metaphor might be more appropriate. With the right metaphor, I can communicate a lot of important things through implications. With that in mind, I really like using Xray specs as a metaphor for what ORES does. It hits a lot of important points about what using ORES means -- both what makes it powerful and useful and also why we should be cautious when using it.

A clipping from an old magazine showing fancy sci-fi specs.

ORES shows you things that you couldn't see easily beforehand. Like a pair of Xray specs, ORES lets you peer into the firehose of edits coming into Wikipedia and see potentially damaging edits stand out in sharp contrast against the background of probably good edits. But just like a pair of sci-fi specs, ORES alters your perception. It implicitly makes subjective statements about what is important (separating the good from the bad) and it might bias you towards looking at the potentially bad with more scrutiny. While this may be the point, it can also be problematic. Profiling an editor's work by a small set of statistics is inherently imperfect and the imperfections in the prediction can inevitably lead to biases. So I think it is important to realize that, when using ORES, your perception is altered in ways that aren't simply more truthful.

So, I hope that the use of this metaphor will help ORES users calibrate the level of caution they employ as we continue this socio-technical conversation about how we should use subjective, profiling algorithms as part of the construction of Wikipedia.

Measuring value-adding in Wikipedia

2015-11-01T09:34:00.003-08:00

So I've been working on this project on and off. I've been trying to bring robust measures of edit quality/productivity to Wikipedians. In this blog post, I'm going to summarize where I am with the project.

First, the umbrella project: Measuring value-added

Basically, I see the value of Wikipedia as a simple combination of two hidden variables: quality and importance. If we focused on making our unimportant content really high quality, that wouldn't be very valuable. Conversely if we were to focus on increasing the quality of the most important content first, that would increase the value of Wikipedia most quickly.

Value = Quality × Importance

But I want to look at value-adding activities, so I need to measure progress towards quality. I think a nice term for that is productivity.

Value-added = Productivity × Importance

So in order to take measurements of value-adding activity in Wikipedia, I need to bring together good measures of productivity and importance.
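Here's a tiny worked example of the two formulas above. The productivity and importance numbers are made up purely for illustration, and I'm not assuming any particular scale for either measure.

```python
def value_added(productivity, importance):
    """Value-added = Productivity x Importance."""
    return productivity * importance


# Hypothetical edits: (description, productivity, importance)
edits = [
    ("big rewrite of an obscure article", 100.0, 0.01),
    ("small fix to a core article", 5.0, 0.9),
]

# Rank the edits by the value they add rather than by raw productivity.
ranked = sorted(edits, key=lambda e: value_added(e[1], e[2]), reverse=True)
for description, productivity, importance in ranked:
    print(description, value_added(productivity, importance))
```

With these made-up numbers, the small fix to the important article comes out ahead of the big rewrite, which is exactly the kind of difference that raw productivity alone would hide.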

Measuring importance

See https://meta.wikimedia.org/wiki/Research:Measuring_article_importance

Density of log(view rate) for articles assessed by Wikipedians for importance.

I'm going to side-step a big debate purely because I don't feel like re-hashing it in text. It's not clear what importance is. But we have some good ways to measure it. The two dominant strategies for determining the importance of a Wikipedia article's topic are (1) view rate counts and (2) link structure.

With view rate counts, the assumption is made that the most important content in Wikipedia is viewed most often. This works pretty well as far as assumptions go, but it has some notable weaknesses. For example, the article on Breaking Bad (TV show) has about an order of magnitude more views than the article on Chemistry. For an encyclopedia of knowledge, it doesn't feel right that we'd consider a TV show to be more important than a core academic discipline.

Link structure provides another opportunity. Google's founders famously used the link structure of the internet to build a ranking strategy for the most important websites. See PageRank. This also seems to work pretty well, but it's less clear what the relationship is between the link graph properties and the nebulous notion of importance. At least with page view rates, you can plainly imagine the impact that a highly viewed article has.

Fun story though: Chemistry has 10 times as many incoming links as Breaking Bad. It could be that this measurement strategy can help us deal with the icky feeling us academics get when thinking that a TV show is more important than centuries of difficult work building knowledge.

Measuring productivity

See: https://meta.wikimedia.org/wiki/Research:Measuring_edit_productivity

Luckily, there is a vast literature for measuring the quality of contributions in Wikipedia. Much of which I have published! There are a lot of strategies, but the most robust (and difficult to compute) is tracking the persistence of content between revisions. The assumption goes: the more subsequent edits a contribution survives, the higher quality it probably was. We can quite easily weight "words added" by "persistence quality" to get a nice productivity measure. It's not perfect, but it works. The trick is figuring out the right way to scale and weight the measures so that they are intuitively meaningful.

The real trick here was making the computation tractable. It turns out that tracking changes between revisions is extremely computationally intensive. It would take me 60 days or so to track content persistence across the entire ~600m revisions of Wikipedia on a single core of the fastest processor on the market. So the trick is to figure out how to distribute the processing across multiple processors. We've been using Hadoop streaming. See my past post about it: Fitting Hadoop streaming into my workflow. It's been surprisingly difficult to work with memory issues in Hadoop streaming that don't happen when just using unix pipes on the command line. I might make a post about that later, but honestly, it just makes me feel tired to think about those types of problems.
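To give a feel for the idea, here's a heavily simplified sketch of weighting "words added" by persistence. It treats each revision as a bag of words, which is an assumption made to keep the example short; real content persistence tracking follows individual tokens through diffs. The function name and the `window` parameter are hypothetical.

```python
from collections import Counter


def persistent_words_added(revisions, index, window=3):
    """Count words added in revisions[index], weighted by the share of the
    next `window` revisions in which each added word still appears.

    revisions: list of revisions, each a list of word tokens.
    """
    before = Counter(revisions[index - 1]) if index > 0 else Counter()
    added = Counter(revisions[index]) - before   # multiset of newly added words
    later = [Counter(r) for r in revisions[index + 1:index + 1 + window]]
    if not later:
        return 0.0   # nothing has had a chance to persist yet

    total = 0.0
    for word, count in added.items():
        for rev in later:
            # How many of the added copies are still present in this revision?
            surviving = min(count, max(0, rev[word] - before[word]))
            total += surviving / len(later)
    return total
```

A word that survives every revision in the window counts as one full word added; a word that is removed right away counts for nothing.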

Bringing it together

I'm almost there. I've still got to work out some thresholding bits for productivity measures, but I've already finished the hard computational work. My next update (or paper) will be about the who, where and when of value-adding in Wikipedia. Until then, stay tuned.

My story of SQL as a data analysis equalizer

2015-10-26T07:01:00.004-07:00

So, the title of this post is a little bit extreme, but I chose it specifically because I think it will help you (gentle reader) in thinking in the direction that I'd like you to think.

Back when I was a young & bright eyed computer scientist who was slowly realizing that social dynamics were a fascinating and complex environment within which to practice technology development, I was invited to intern at the Wikimedia Foundation with a group of 7 other researchers. It turns out that we all had very different backgrounds going in. Fabian, Yusuke, and I had a background in computer science. Jonathan had expertise in technology design methods (but not really programming). Melanie's expertise was in rhetoric and language. Stuart was trained in sociology and philosophy of science (but he'd done a bit of casual programming to build some bots for Wikipedia). I think this diverse set of backgrounds enabled us to think very well about the problem that we had to face, but that's a subject for another blog entry. Today I want to talk to you about the technology that we ended up rallying around and taking massive advantage of: the Structured Query Language (SQL) and a Relational Database Management System (RDBMS).

Up until my time at the Wikimedia Foundation, I had to do my research of Wikipedia the hard way. First, I downloaded Wikipedia's 10 terabyte XML dump (compressed to ~100 gigabytes). Then I wrote a Python script that used a streaming p7zip decompressor and a unix pipe to read the XML with a streaming processor. This workflow was complex. It tapped many of the skills I had learned in my training as a computer scientist. And yet, it was still incredibly slow to perform basic analyses.

It didn't take me long to start using this XML processing strategy to produce intermediate datasets that could be loaded into a postgres RDBMS for high-speed querying. This was invaluable to making quick progress. Still I learned some lessons about including *all the metadata I reasonably could* in this dataset since going back to the XML was another headache and week-long processing job. As a side-note, I've since learned that many other computer scientists working with this dataset went through a similar process and have since polished and published their code that implements these workflows. TL;DR: This was a really difficult analysis workflow even for people with a solid background in technology. I don't think it's unreasonable to say that a social scientist or rhetoric scholar would have found it intractable to do alone.
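For a sense of what that kind of script looks like, here's a minimal sketch using only Python's standard library. It is not my original script. It assumes the dump uses `page`, `title` and `revision` elements and simply counts revisions per page; the function names are made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{uri}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(stream):
    """Stream through a dump and yield (page title, revision count) pairs."""
    title, revisions = None, 0
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()   # drop the revision text we no longer need
        elif name == "page":
            yield title, revisions
            title, revisions = None, 0
            elem.clear()   # drop the finished page's children
```

To run it the way I describe, you would decompress the dump to standard output, pipe that into a script, and pass `sys.stdin.buffer` to `count_revisions`. Even this toy version has to read the whole dump from start to finish, which is why basic analyses were so slow.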

When I was helping organize the work we'd be doing at the Wikimedia Foundation, I'd heard that there was a proposal in the works to get us researchers a replica of Wikipedia's databases to query directly. Honestly, I was amazed that people weren't doing this already. I put my full support behind it and thanks to others who saw the value, it was made reality. Suddenly I didn't need to worry about processing new XML dumps to update my analyses, I could just run a query against a database that was already indexed and up to date at all times. This was a breakthrough for me and I found myself doing explorations of the dataset that I had never considered before because the speed of querying and the relevancy of the data made them possible. Stuart and I had a great time writing SQL queries for both our own curiosity and to explore what we ought to be exploring.

For my coworkers who had no substantial background in programming, they saw yet another language that the techies were playing around with. So, they took advantage of us to help them answer their questions by asking us to produce excel-sized datasets that they could explore. But as these things go when people get busy, these requests would often remain unanswered for days at a time. I've got to hand it to Jonathan. Rather than twiddling his thumbs while he waited for a query request to be resolved, he decided to pick up SQL as a tool. I think he set an example. It's one thing to have a techy say to you, "Try SQL! It's easy and powerful." and a totally different thing for someone without such a background to agree. By the end of the summer internship, I don't think we had anyone (our managers included) who was not writing a bit of SQL here and there. All would agree that they were substantially empowered by it.

Since then, Jonathan has made it part of his agenda to bring SQL and basic data analysis techniques to larger audiences. He's been a primary advocate (and volunteer product manager) of Quarry, our new open querying tool for Wikimedia data. That service has taken off like wildfire -- threatening to take down our MariaDB servers. Check it out: https://quarry.wmflabs.org/. Specifically, I'd like to point you to the list of recent queries: https://quarry.wmflabs.org/query/runs/all. Here, you can learn SQL techniques by watching others use them!
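To show how little code a question takes once the data sits in a database, here's a self-contained sketch using Python's built-in sqlite3 module. The table is a toy with a handful of made-up rows, and its columns are only loosely modeled on MediaWiki's revision table, so treat the schema as an assumption rather than the real replica.

```python
import sqlite3

# A toy, in-memory stand-in for a database replica (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_page INTEGER,
        rev_user_text TEXT,
        rev_timestamp TEXT      -- YYYYMMDDHHMMSS
    );
    INSERT INTO revision VALUES
        (1, 10, 'Alice', '20110601120000'),
        (2, 10, 'Bob',   '20110602090000'),
        (3, 11, 'Alice', '20110702150000'),
        (4, 12, 'Alice', '20110703080000');
""")


def edits_per_user_month(conn):
    """Count edits per user per month with a single GROUP BY query."""
    return conn.execute("""
        SELECT rev_user_text,
               SUBSTR(rev_timestamp, 1, 6) AS month,
               COUNT(*) AS edits
        FROM revision
        GROUP BY rev_user_text, month
        ORDER BY rev_user_text, month
    """).fetchall()


for row in edits_per_user_month(conn):
    print(row)
```

The whole analysis is the one SELECT statement. Compare that with writing, debugging and waiting on a streaming XML processor to answer the same question.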

ORES: Hacking social structures by building infrastructure

2015-10-04T14:46:00.002-07:00

So, I just crossed a major milestone on a system I'm building with my shoe-string team of mostly volunteers and I wanted to tell you about it. We call it ORES.

The ORES logo

The Objective Revision Evaluation Service is one part response to a feminist critique of power structures and one part really cool machine learning and distributed systems project. It's a machine learning service that is designed to take a very complex design space (advanced quality control tools for Wikipedia) and allow for a more diverse set of standpoints to be expressed. I hypothesize that systems like these will make Wikipedia more fair and welcoming while also making it more efficient and productive.

Wikipedia's power structures

So... I'm not going to be able to go into depth here but there's some bits I think I can say plainly. If you want a bit more, see my recent talk about it. TL;DR: The technological infrastructure of Wikipedia was built through the lens of a limited standpoint and it was not adapted to reflect a more complete account of the world once additional standpoints entered the popular discussion. Basically, Wikipedia's quality control tools were designed for what Wikipedia editors needed in 2007 and they haven't changed in a meaningful way since.

Hacking the tools

I had some ideas on what kind of changes to the available tools would be important. In 2013, I started work in earnest on Snuggle, a successor system. Snuggle implements a socialization support system that helps experienced Wikipedia editors find promising newcomers who need some mentorship. Regretfully, the project wasn't terribly successful. The system works great and I have a few users, but not as many the system would need to do its job at scale. In reflecting on this, I can see many reasons why, but I think the most critical one was that I couldn't sufficiently innovate a design that fit into the social dynamics of Wikipedia It was too big of a job. It requires the application of many different perspectives and a conversation of iterations. I was a PhD student -- one of the good ones because Snuggle gets regular maintenance -- but this work required a community.

When I was considering where I went wrong and what I should do next, I was inspired by the sudden reach that Snuggle gained when the HostBot developer wanted to use my "promising newcomer" prediction model to invite fewer vandals to a new Q&A space. My system just went from 2-3 users interacting with ~10 newcomers per week to 1 bot interacting with ~2000 newcomers per week. Maybe I got the infrastructure bit right. Wikipedia editors do need the means to find promising newcomers to support after all!

Hacking the infrastructure

So, lately I've been thinking about infrastructure rather than direct applications of experimental technology. Snuggle and HostBot helped me know to ask the question, "What would happen if Wikipedia editors could find good new editors that needed help?" without imagining any one application. The question requires a much more system-theoretic way of reasoning about Wikipedia, technology and social structures. Snuggle seemed to be interesting as an infrastructural support for Wikipedia. What other infrastructural support would be important and what changes might that enable across the system itself?

OK. Back to quality control tools -- the ones that haven't changed in the past 7 years despite the well known problems. Why didn't they change? Wikipedia's always had a large crowd of volunteer tool developers who are looking for ways to make Wikipedia work better. I haven't measured it directly, but I'd expect that this tech community is as big and functional as it ever was. There were loads of non-technological responses to the harsh environment for newcomers (including the Teahouse and various WMF initiatives). AFAICT, the tool I built in 2013 was the *only* substantial technological response.

Why is there not a conversation of innovation happening around quality control tools? If you want to build a quality control tool for Wikipedia that works efficiently, you need a machine learning model that calls your attention to edits that are likely to be vandalism. Such an algorithm can reduce the workload of reviewing new edits in Wikipedia by 93%, but standing one up is excessively difficult. To do it well, you'll need an advanced understanding of computer science and some substantial engineering experience in order to get the thing to work in real time.

The "activation energy" threshold to building a new quality control tool is primarily due to the difficulty of building a machine learning model.

So, what would happen if Wikipedia editors could quickly find the good, the bad, and the newcomers in need of support? I'm a computer scientist. I can build up infrastructure for that and cut the peak off of that mountain -- or maybe cut it down entirely. That's what ORES is.

What ORES is

ORES is a web service that provides access to a scalable computing cluster full of state-of-the-art machine learning algorithms for detecting damage, differentiating good-faith edits from bad and measuring article quality. All that is necessary to use this service is to request a URL containing the revision you want scored and the models you would like to apply to it. For example, if you wanted to know if my first registered edit on Wikipedia was damaging, you could request the following URL.

https://ores.wmflabs.org/scores/enwiki/damaging/190057686/

Luckily, ORES does not think this is damaging in the slightest.

{
  "190057686": {
    "prediction": false,
    "probability": {
      "false": 0.9999998999999902,
      "true": 1.0000000994736041e-07
    }
  }
}
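To make the request/response shape concrete, here is a minimal Python sketch. It only assumes the URL pattern and the JSON layout shown above; the function names `score_url` and `damaging_probability` are made up for illustration, and the host may not stay at this address. The sample response is pasted in so the sketch runs without touching the network.

```python
import json

# Host taken from the example URL above; treat it as an assumption that may change.
ORES_HOST = "https://ores.wmflabs.org"


def score_url(wiki, model, rev_id, host=ORES_HOST):
    """Build a scoring URL of the form <host>/scores/<wiki>/<model>/<rev_id>/."""
    return "{0}/scores/{1}/{2}/{3}/".format(host, wiki, model, rev_id)


def damaging_probability(response_text, rev_id):
    """Return (prediction, probability of the "true" class) for one revision."""
    doc = json.loads(response_text)
    score = doc[str(rev_id)]  # responses are keyed by the revision ID as a string
    return score["prediction"], score["probability"]["true"]


# The response shown above, copied verbatim.
SAMPLE_RESPONSE = """
{"190057686": {"prediction": false,
               "probability": {"false": 0.9999998999999902,
                               "true": 1.0000000994736041e-07}}}
"""

url = score_url("enwiki", "damaging", 190057686)
prediction, p_damaging = damaging_probability(SAMPLE_RESPONSE, 190057686)
```

In a real tool you would fetch `url` with whatever HTTP client you like and pass the response body to `damaging_probability()`.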

We use a distributed architecture to make scaling up the system to meet demand easy. The system is built in python. It uses celery to distribute processing load and redis for a cache. It is based on the revscoring library I wrote to generalize machine learning models of edits in Wikipedia. This same library will allow you to download one of our model files and use it on your own machine -- or just use our API.

Our latest models are substantially more fit than the state-of-the-art (.84 AUC to our .90-.95 AUC) and our system has been substantially battle tested. Last month, Huggle, one of the dominant yet unchanged quality control tools, started making use of our service. We've seen tool devs and other projects leap at the ability to use our service.

Ra•un (for projects that support ORES)
Real-Time Recent Changes
en:Wikipedia:WikiProject X via en:User:Reports bot
en:Wiki Education Foundation – Student quality/productivity measurements
en:User:SuggestBot
crosswatch – cross-wiki watchlist
(in development) ORES extension to MediaWiki
WikiEd recent student activity dashboard
WikiEd article finder

MediaWiki Utilities -- Unix style

2015-09-13T11:12:00.000-07:00

Today, I want to talk about a specific type of research output that, I feel, adds a substantial amount of value beyond a single research project. I'm talking about open sourced research code, but not of the type that you normally see -- the type that's actually intended to be used by other people to serve new purposes.

Before I dig into the software libraries I want to talk to you about, I must make a distinction:

CRAPL quality code -- This is the code that a researcher builds ad-hoc in order to get something done. There's little thought spent on generalizability or portability. With this code, it's usually better and faster to fix a problem by adding a step in the workflow rather than fixing the original problem. So, you end up with quite a mess usually. While good for documenting the process of data science work, this is not useful to others.
Library quality code -- This code has been designed to generalize to new problems. It's intended to be used as a utility by others who are doing similar work -- but not the exact same work. It's usually well documented and well tested. With this code, it is sacrilegious to add more code on top of broken code to fix a problem. This is just one of the many disciplines that must be applied to the practice of writing software to have good, library quality code.

I've been analyzing Wikipedia data for nearly a decade (!!!) -- and I can tell you that it was never easy. The English Wikipedia XML dumps that I have done most of my work with are on the order of 10 terabytes uncompressed. The database, web API and XML dumps all use different field names to refer to the same thing. In each one, the absence of a field -- or NULLing of the field -- can mean different things. Worse, the MediaWiki software has been changing over time, so in order to do historical analyses, you need to take that into account. In the process of working out these details and getting my work done, I've produced reams of CRAPL quality code. See https://github.com/halfak/Activity-sessions-research for an example. In this case, I have a Makefile that, if executed, would replicate my research project. But if you look inside that Makefile, you'll see things like this:

# datasets/originals/enwiki_edit_action.tsv: sql/edit_action.sql
# cat sql/edit_action.sql | \
# mysql $(dbstore) enwiki > \
# datasets/originals/enwiki_edit_action.tsv

That's a commented out Makefile rule that calls my local database with my local configuration hardcoded and runs some SQL against it. This is great if you want to know what SQL produced which datafile, but not very useful if you want to replicate the work. And why is it commented out!? Well, the database query takes a long time to run and I didn't want to accidentally overwrite the data file as I was finishing off the research paper. Gross, right? This isn't all that useful if you wanted to perform a similar analysis.

But in producing this CRAPL code, there are some nice, generalizable parts that occur to me so I write them up for others' benefit. I've gone through a few iterations of this and learned from my mistakes.

Back in 2011, I released the first version of wikimedia-utilities, a set of utilities that made the work I was doing at the Wikimedia Foundation easier. The killer feature of this library was the XML processing strategy. It changed the work of processing Wikipedia's terabyte scale XML dumps from a ~2000 line script to a ~100 line script. But the code wasn't very pythonic, it lacked proper tests and did not integrate well into the python packaging environment.

In 2013, I decided to make a clean break and start working on mediawiki-utilities, a super-set of utilities from wikimedia-utilities that were intentionally generalized to run on any MediaWiki instance. I had learned some lessons about being pythonic, implementing proper tests and integrating with python's packaging environment.

But as I had been working on new projects and realizing how they could generalize, I ended up expanding mediawiki-utilities to a monolith of loosely related parts. And it gets worse. Since I focused on those parts as I needed them, there were certain modules that were ignored. Since I did most of my work with the databases directly, it was rare that I spent time on the 'database' module of mediawiki-utilities. I ended up with a monolith that was inconsistently developed!

So, in thinking about monoliths and how to solve problems that they impose, I was inspired by the Unix philosophy of combining "small, sharp tools" to solve larger problems. I realized that the primary modules of mediawiki-utilities could be split off into their own projects and combined in interesting ways -- and that this would enable a more distributed strategy to management. So I've been hard at work to bring this vision into the light.

First, the core utilities:

pip install mwxml • docs • source

This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: complexity and performance of streaming XML parsing.
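To give a feel for the streaming concern, here is a rough sketch that uses only Python's standard library rather than mwxml itself. It walks a toy dump one page at a time and frees each page once it has been counted. The sample XML is a heavily simplified stand-in (real dumps carry a namespace and many more fields), and `revisions_per_page` is a hypothetical helper, not part of any of these libraries.

```python
import io
import xml.etree.ElementTree as ET

# A toy stand-in for a dump file; real dumps are far richer than this.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Foo</title>
    <revision><id>1</id><text>first</text></revision>
    <revision><id>2</id><text>second, longer</text></revision>
  </page>
  <page>
    <title>Bar</title>
    <revision><id>3</id><text>only</text></revision>
  </page>
</mediawiki>"""


def revisions_per_page(stream):
    """Yield (title, revision_count) for each page as soon as it is parsed."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            count = len(elem.findall("revision"))
            yield title, count
            # Drop the page's text and revisions so memory does not
            # grow with the size of the dump.
            elem.clear()


counts = list(revisions_per_page(io.BytesIO(SAMPLE_DUMP)))
```

Even this small version hints at the complexity: you have to track where you are in the document and clean up after yourself, which is the kind of bookkeeping a dedicated library can hide.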

pip install mwapi • docs • source

This library provides a set of basic utilities for interacting with MediaWiki’s “action” API – usually available at /w/api.php. The most salient feature of this library is the mwapi.Session class that provides a connection session that sustains a logged-in user status and provides convenience functions for calling the MediaWiki API. See get() and post().

pip install mwdb • source

This library provides a set of utilities for connecting to and querying a MediaWiki database.

pip install mwparserfromhell • docs • source

This library provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode.
Note that I am primarily a user of this library -- not a major contributor -- but it obviously belongs in this list.

Now the peripheral libraries that make use of these core utilities:

pip install mwoauth • docs • source

This library provides a simple means of performing an OAuth handshake with a MediaWiki installation with the OAuth Extension installed.

pip install mwreverts • docs • source

This library provides a set of utilities for detecting reverts (see mwreverts.Detector and mwreverts.detect()) and identifying the reverted status of edits to a MediaWiki wiki.
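One common way to detect reverts is to look for a revision whose content is identical to an earlier one; everything in between was undone. The sketch below illustrates that idea with the standard library only. It is not the mwreverts API: `detect_reverts`, its tuple output and the `radius` default are all assumptions made for the example.

```python
import hashlib
from collections import deque


def detect_reverts(revisions, radius=15):
    """Yield (reverting_id, reverted_ids, reverted_to_id) for identity reverts.

    `revisions` is an iterable of (rev_id, text) pairs in chronological order.
    `radius` is how many recent revisions to remember (an arbitrary choice here).
    """
    window = deque(maxlen=radius)  # recent (rev_id, checksum) pairs, oldest first
    for rev_id, text in revisions:
        checksum = hashlib.sha1(text.encode("utf-8")).hexdigest()
        recent = list(window)
        # Search from newest to oldest for a revision with identical content.
        for i in range(len(recent) - 1, -1, -1):
            if recent[i][1] == checksum:
                reverted = [r for r, _ in recent[i + 1:]]
                if reverted:  # an identical neighbour is a null edit, not a revert
                    yield rev_id, reverted, recent[i][0]
                break
        window.append((rev_id, checksum))


history = [(1, "a"), (2, "a b"), (3, "vandal"), (4, "a b"), (5, "a b c")]
found = list(detect_reverts(history))
```

In this toy history, revision 4 restores the content of revision 2, so revision 3 is the one that got reverted.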

pip install mwsessions • docs • source

This library provides a set of utilities for grouping MediaWiki user actions into sessions. mwsessions.Sessionizer and mwsessions.sessionize() can be used by python scripts to group activities into sessions or the command line utilities can be used to operate directly on data files. Such methods have been used to measure editor labor hours.
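The basic idea of sessionizing can be sketched in a few lines: walk a user's actions in time order and start a new session whenever the gap since their last action exceeds some cutoff. The function below is a standalone illustration, not the mwsessions implementation, and the one-hour `cutoff` default is an assumption picked for the example.

```python
from collections import defaultdict


def sessionize(events, cutoff=3600):
    """Group (user, timestamp) events into per-user sessions.

    `events` must be sorted by timestamp (in seconds). A gap larger than
    `cutoff` since the user's previous event starts a new session.
    """
    sessions = defaultdict(list)  # user -> list of sessions (lists of timestamps)
    for user, timestamp in events:
        user_sessions = sessions[user]
        if user_sessions and timestamp - user_sessions[-1][-1] <= cutoff:
            user_sessions[-1].append(timestamp)  # continue the current session
        else:
            user_sessions.append([timestamp])  # start a new session
    return dict(sessions)


events = [("a", 0), ("b", 10), ("b", 20), ("a", 600), ("a", 10000)]
grouped = sessionize(events)
```

Here user "a" ends up with two sessions, because the jump from 600 to 10000 seconds is well past the cutoff.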

pip install mwpersistence • source

This library provides a set of utilities for measuring content persistence and tracking authorship in MediaWiki revisions.

And I have a bunch more that are just on the horizon. They represent a sampling of my active research projects.

mwmetrics -- User behavioral statistics extraction for MediaWiki editors
mwrefs -- Extract citations, references and scholarly identifiers from MediaWiki
mwevents -- Generalized event extraction and processing framework for MediaWiki
mwtalkpage -- A talk page discussion parser for MediaWiki

It's my goal that researchers who haven't been working with wiki datasets will have a much easier time building off of my work to do their own. I think that a good set of libraries can make a huge difference in this regard. That's my goal.

I'll be making a more substantial announcement soon. In the meantime, I'm cleaning up and extending documentation and putting together some examples that demonstrate how a researcher can compose these small, sharp libraries together to perform powerful analyses of Wikipedia and users in other MediaWiki wikis. Until then, please use these utilities, let me know about bugs and send me your pull requests!

Pre-blog: The life of a traveling technologist

2015-08-27T11:31:00.003-07:00

So, I didn't have much time this week and I'm doing this Iron Blogging thing. If you got here looking for a cool discussion of the life of a traveling technologist, I regret to inform you that this will only be a meta-discussion. Once I've completed the proper discussion, I'll put the link right below this paragraph.

What?

So my life is pretty weird in a lot of ways. I travel a lot. I generally don't see my teammates at the Wikimedia Foundation for months at a time. Worse, I have a group of friends who are extremely geographically distributed that my geographically local friends don't know about. We only see each other during conferences and other academic events.

Another sort of interesting aspect of my professional life is that I straddle the line between industry and academia. When it comes to the meat of knowledge & knowledge production, there's no conflict. But the timescales are amazingly different.

But through dealing with this, I've worked out some hacks. Some have to do with communication channels and making it feel like you are present even when you are not in the office. Others are my folding bike and the amazing experience I get visiting European cities.

So, I conclude with a promise of future bloggings with photos and insights. I just don't have the time right now!

Some of my geographically distributed social network who are unknown to my local friends.

My cardboard cutout at the WMF office.

My folding bike -- waiting for my flight.

VisualEditor and barriers to entry in Wikipedia

2015-08-06T08:46:00.002-07:00

In this blog entry, I'm going to briefly cover a presentation that I gave recently as part of the Wikimedia Research Showcase.

The Newcomer Gauntlet. A theoretical diagram of barriers to entry in Wikipedia is depicted, with the hypothesized effect of VisualEditor's reduction in the technical literacy barrier highlighted in green.

TL;DR: We ran an experiment where we gave a WYSIWYG editor to newly registered editors in Wikipedia and monitored the effect it had on their productivity. We found that it didn't affect productivity either way. I think that this is because the barriers to entry in Wikipedia primarily consist of social/motivational issues, so reducing the technical literacy barriers that VE targets did not have a meaningful effect.

The talk is embedded below. My talk is first. There's a second talk by some students looking at building a knowledge graph with Wikipedia and some google tech too.

Fitting hadoop streaming into my python workflow

2014-11-30T12:27:00.001-08:00

I got something working in Hadoop (by reading the documentation[1]) that is going to change the way I work with Wikipedia text data. In order to give you a sense for how excited I am, I'm going to need to talk to you about the way that I approach my work with large datasets.

My python workflow

So, I try to do as much work as I can with input/output streams. I'm a bit of a unix geek, so that means I'm thinking about standard-in and standard-out and writing my analytics code as though I were supplementing UNIX core utils. This has nice computational practicalities and it allows you to stream operations together.

Computational practicalities

The worst bottleneck you're likely to encounter in the course of doing data analysis on large files is memory. Most statistical environments specifically support in-memory analysis of datasets to the exclusion of streaming. This means that, in order to get any work done, you need to first copy all of your data from disk to memory. If your dataset is larger than the available memory, you'll likely end up crashing the machine you are working with -- or at least render it unusable until your process finally crashes.

On the other hand, python, perl and unix utilities (like cut, shuf, sort, sed, etc.) afford some powerful capability for working with datasets as streams, and therefore, dramatically reducing the memory footprint of a dataset. As the diagram above makes apparent, if you can fit your data operation into a streaming job, then it doesn't matter how big your dataset is and you'll end up needing very little memory.
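As a small illustration of the streaming style, the sketch below computes the mean of one column while only ever holding a single line in memory. The function name and the two-column, tab-separated layout are assumptions for the example; any iterable of lines works, including an open file or `sys.stdin`.

```python
import io


def streaming_mean(lines, column=1, sep="\t"):
    """Mean of a numeric column, reading one line at a time."""
    count, total = 0, 0.0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        total += float(fields[column])
        count += 1
    return total / count if count else float("nan")


# A tiny in-memory stand-in for a (potentially enormous) file.
sample = io.StringIO("a\t1\nb\t3\n")
mean = streaming_mean(sample)
```

The memory used is the same whether the input has two rows or two billion, which is the whole point.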

Streaming operations together

I like Doug McIlroy's summary of the unix philosophy:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

By following these principles, you can build "operators" rather than stand-alone programs. This allows you to write less code (which is a Good Thing™) and get more done. Recently, I wrote a paper that required me to wrangle 12 different datasets from 7 different websites -- and being able to integrate my code with unix core utilities was a godsend. Here's a line from my project's Makefile that I think captures the amazingness.

(echo -e "user_id\tintertime\ttype"; \
 bzcat datasets/originals/aol_search.tsv.bz2 | \
 tail -n+2 | \
 sort -k1,2 | \
 ./intertimes | \
 sed -r "s/(.*)/\1\tapp view/" | \
 shuf -n 100000) > \
datasets/aol_search_intertime.sample.tsv

These lines achieve the following steps of a streaming job:

Print out new headers: user_id, intertime, type
Decompress the aol_search.tsv.bz2 dataset
Trim off the original header line
Sort the dataset by the first two columns (user_id, timestamp)
Compute the time between events per user (this is my python)
Add a column of "app view" corresponding to the "type" header
Randomly sample 100000 observations
Write it all out to a new data file.

In order to perform this complex operation over 28 million rows, I wrote ~30 lines of python code and put it in the middle of a few unix utilities that do most of the heavy lifting for me. The processes took about 2 minutes and used a couple hundred MB of memory. The random sample that we finally arrive at contains only 100k rows and will happily load into memory in about a second.
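For a sense of what the python step in the middle of that pipeline could look like, here is a hedged sketch of an intertime calculator. It is not the original `./intertimes` script. It assumes tab-separated `user_id` and `timestamp` columns, integer timestamps in seconds, and input already sorted by both columns, as the `sort` step arranges.

```python
import io
from itertools import groupby


def intertimes(lines):
    """Yield "user_id<TAB>seconds" for consecutive events by the same user.

    Expects tab-separated lines (user_id, timestamp) sorted by both columns.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # groupby only works because each user's rows arrive together.
    for user_id, user_rows in groupby(rows, key=lambda row: row[0]):
        last = None
        for row in user_rows:
            timestamp = int(row[1])
            if last is not None:
                yield "{0}\t{1}".format(user_id, timestamp - last)
            last = timestamp


sample = io.StringIO("1\t100\n1\t160\n2\t50\n2\t55\n2\t65\n")
gaps = list(intertimes(sample))
```

Dropped into a pipeline, a script like this would read from `sys.stdin` and print each yielded line, leaving sorting, sampling and compression to the unix utilities around it.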

Bringing it together with Hadoop streaming

If you've been living under a rock for the last 5 years or so, Hadoop is a framework for performing map/reduce operations that has become the industry standard. Through Hadoop's streaming interface, I can make use of UNIXy streaming utilities to get my python work done faster. Since I'm already thinking about my work as processing data one-row-at-a-time, map/reduce is not that much of a leap.

However, there's one part of my streaming scripts that is very important to me, but I hadn't told you about it yet. When I do behavioral analysis in Wikipedia, I often need to choose a dimension and process user actions in order over time. A good example of actions in Wikipedia is an "edit". There are two really interesting dimensions to process edits: page and user.

The diagram above visualizes these two dimensions. If we look at my edits over time, we are looking at the "user" dimension. If we look at the article Anarchism's edits over time, we are looking at the "page" dimension. In order to process data based on these dimensions, I'll usually sort and partition the data. Notably, the designers of Wikipedia's XML dumps had this in mind when they put together their format. They partition based on page and sort based on timestamp. Yet, hadoop's native map/reduce strategy doesn't afford the ability to insist that I'll see a page's revisions or a user's edits together and in order.

Enter secondary sort. Using org.apache.hadoop.mapred.lib.KeyFieldBasedComparator and org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, I can tell hadoop to sort and partition data before the "reduce" phase of the process. The best part is, these guys are parameterized in the exact same way as UNIX sort!

So, let's say you want to process a user's activities over time. Set the partitioner to divide the data based on the user_id column and tell hadoop to sort based on the revision timestamp. Here's what the call looks like for a dataset I'm working with right now. (first column == user_id, second column == timestamp)

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.output.fileoutputformat.compress=false \
-D mapreduce.input.fileinputformat.split.minsize=300000000 \
-D mapreduce.task.timeout=6000000 \
-D stream.num.map.output.key.fields=2 \
-D mapreduce.partition.keypartitioner.options='-k1,1n' \
-D mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator" \
-D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n' \
-files intertime.py \
-archives 'hdfs:///user/halfak/hadoopin/virtualenv.zip#virtualenv' \
-input /user/halfak/enwiki_20141025.*.tsv \
-output "/user/halfak/hadoopin/enwiki_intertime/$(date +%Y%m%d%H%M%S)" \
-mapper cat \
-reducer intertime.py \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
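To see what the partitioner and comparator options buy you, here is a toy simulation in plain Python. It is only a model of the guarantee, not of Hadoop's internals: the modulo "partitioner" stands in for Hadoop's hashing, and both function names are invented for the example.

```python
from itertools import groupby


def partition_and_sort(rows, num_reducers=2):
    """Mimic partitioning on column 1 and sorting on columns 1 and 2 (numeric)."""
    partitions = [[] for _ in range(num_reducers)]
    for user_id, timestamp in rows:
        # Every row for a given user lands in the same partition.
        partitions[user_id % num_reducers].append((user_id, timestamp))
    # Within a partition, rows are ordered by user, then by time.
    return [sorted(partition) for partition in partitions]


def reduce_partition(partition):
    """What a reducer can now rely on: one contiguous, time-ordered run per user."""
    for user_id, user_rows in groupby(partition, key=lambda row: row[0]):
        yield user_id, [timestamp for _, timestamp in user_rows]


rows = [(2, 30), (1, 20), (2, 10), (1, 5), (3, 7)]
parts = partition_and_sort(rows)
per_user = [list(reduce_partition(part)) for part in parts]
```

Because each reducer sees a user's rows together and in order, a streaming reducer never has to buffer more than one user's state at a time.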

So, I'm stoked. This will change the way I do my work.

1. Which leaves a bit to be desired[2]

2. Finding which error code to google is worse

Wikipedia's socio-technology as compared to biology

2014-10-17T07:26:00.001-07:00

I recently gave a presentation at the Wikimedia Foundation's Research Showcase where I compared Wikipedia to a paramecium. I'll be turning that presentation into a blog post later, but in the meantime, I wanted to share a slide from that presentation that I thought was fun.

From left to right: a small village with about 150 people, the MediaWiki software and extensions, the crowd from Wikimania'14 and Wikipedia.

Here, I equate organelles to specialized sub-systems of the greater paramecium. In the case of Wikipedia, specialized sub-systems tend to take the form of software.


Treading water & touching the lake bottom -- Perspective for young researchers

2014-09-28T09:06:00.001-07:00

Last week, I went to the University of Michigan to visit Cliff Lampe and to speak at the ICOS lecture series. In order to welcome me to Ann Arbor, Cliff set up a dinner with some of his bright-eyed young grad students. At one point in the conversation, I off-handedly gave some advice on making it through grad school with one’s psyche intact. It got me thinking about the sense of fear I experience when pursuing unknowns.

Getting started on a new project is a disorienting experience. This disorientation can be terrifying and disheartening. However, as a research project gets more mature, the notions of what questions are interesting, how they can be answered, and why those answers matter begin to form fixed points of reference. In my experience, these conceptual footholds bring an intense calming and sense of a productive purpose to a project. They can mean the difference between feeling like one is standing before a soul crushing abyss, and instead, feeling like one is standing on solid ground. I've experienced something like this before grad school and I think it serves as an appropriate metaphor.

New projects are like being thrown into the middle of a lake at night.

For those of you who did not grow up in Minnesota (a.k.a. “The Land of 10,000 Lakes”) swimming in a lake is an interesting experience about which I’d like to share some insights. I’ve sketched an attractive graphic.

So, when you are floating in a lake and you can’t reach the bottom with your feet, you don’t have a good sense for how deep the water is. In the dark water, your imagination of the distance between you and the lake's bottom beneath you can be your enemy. Even if the bottom of the lake is just inches away, it could just as well be a hundred feet away -- and filled with eels, sharks and monsters. This could be seen as a childish way of thinking, but even today I find myself slipping into the spiral of imagined terrors when I swim in large, deep bodies of water.

As you approach the shore, the fear doesn’t progressively subside. Instead, the act of swimming feels a bit like fleeing and that enhances the thoughts of imagined terrors just inches from your feet. Faster swimming leads to more terror enhancement in a sneaky spiral that makes you want to swim for your life.

However, once your feet come into contact with the sandy bottom, the relief is immediate. This knowledge of where the bottom is gives you perspective. You can now measure progress toward the shore. You can stand and stop treading water. There's no room for monsters beneath you anymore.

OK, back to research projects. I see a lot of young grad students leading a project for the first time struggling in their deep water stage (which is totally normal) and assuming that it’s because they are somehow less competent than others (they aren't) or that their work won't lead to anything useful. This happens all too often when students are making fine progress. They may be getting close to the shore of their research project, but they don't have a sense of any reference points yet. The unknowns still feel infinite and intractable. The direction they're going could be leading them to deeper water.

So far as I can tell, everyone experiences the deep water stage of a research project, no matter how senior. Seasoned researchers have made peace with, or even celebrate, this disorienting part of research in the same way that veterans of the lake have made peace with their imagined deep-water monsters. The trick to staying sane is to recognize that sneaky spiral of fear when it happens.