Monday, October 26, 2015

My story of SQL as a data analysis equalizer

So, the title of this post is a little extreme, but I chose it deliberately because I think it will help you (gentle reader) think in the direction I'd like you to think.

Back when I was a young & bright-eyed computer scientist who was slowly realizing that social dynamics were a fascinating and complex environment within which to practice technology development, I was invited to intern at the Wikimedia Foundation with a group of 7 other researchers.  It turns out that we all had very different backgrounds going in.  Fabian, Yusuke, and I had a background in computer science.  Jonathan had expertise in technology design methods (but not really programming).  Melanie's expertise was in rhetoric and language.  Stuart was trained in sociology and philosophy of science (but he'd done a bit of casual programming to build some bots for Wikipedia).  I think this diverse set of backgrounds enabled us to think very well about the problem that we had to face, but that's a subject for another blog entry.  Today I want to talk to you about the technology that we ended up rallying around and taking massive advantage of: the Structured Query Language (SQL) and a Relational Database Management System (RDBMS).

Up until my time at the Wikimedia Foundation, I had to do my research of Wikipedia the hard way.  First, I downloaded Wikipedia's 10-terabyte XML dump (compressed to ~100 gigabytes).  Then I wrote a Python script that used a streaming p7zip decompressor and a Unix pipe to read the XML with a streaming processor.  This workflow was complex.  It tapped many of the skills I had learned in my training as a computer scientist.  And yet, it was still incredibly slow to perform basic analyses.
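For the curious, the shape of that streaming workflow can be sketched in a few lines of Python.  This is a minimal sketch, not my actual script: a tiny inline sample stands in for the dump, and the p7zip subprocess appears only in a comment.

```python
import io
import xml.etree.ElementTree as ET

# In the real workflow, the stream came from a p7zip subprocess, e.g.:
#   p7z = subprocess.Popen(["7z", "e", "-so", "dump.xml.7z"], stdout=subprocess.PIPE)
#   stream = p7z.stdout
# Here, a tiny inline sample stands in for the multi-terabyte dump.
sample = io.BytesIO(b"""
<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><text>First draft</text></revision>
    <revision><id>2</id><text>Second draft</text></revision>
  </page>
</mediawiki>
""")

def count_revisions(stream):
    """Stream over the XML, clearing elements as we go to keep memory flat."""
    revisions = 0
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "revision":
            revisions += 1
        elem.clear()  # discard the subtree so the full dump never sits in RAM
    return revisions

print(count_revisions(sample))  # → 2
```

The key trick is `iterparse` plus `clear()`: the document is never fully materialized, so memory stays constant no matter how big the dump is.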

It didn't take me long to start using this XML processing strategy to produce intermediate datasets that could be loaded into a postgres RDBMS for high-speed querying.  This was invaluable for making quick progress.  Still, I learned some lessons about including *all the metadata I reasonably could* in these datasets, since going back to the XML meant another headache and a week-long processing job.  As a side-note, I've since learned that many other computer scientists working with this dataset went through a similar process and have since polished and published their code that implements these workflows.  TL;DR: This was a really difficult analysis workflow even for people with a solid background in technology.  I don't think it's unreasonable to say that a social scientist or rhetoric scholar would have found it intractable to do alone.

When I was helping organize the work we'd be doing at the Wikimedia Foundation, I'd heard that there was a proposal in the works to get us researchers a replica of Wikipedia's databases to query directly.  Honestly, I was amazed that people weren't doing this already.  I put my full support behind it and, thanks to others who saw the value, it was made reality.  Suddenly I didn't need to worry about processing new XML dumps to update my analyses; I could just run a query against a database that was already indexed and up to date at all times.  This was a breakthrough for me, and I found myself doing explorations of the dataset that I had never considered before because the speed of querying and the relevancy of the data made them possible.  Stuart and I had a great time writing SQL queries both for our own curiosity and to explore what we ought to be exploring.
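To give a flavor of what this felt like, here's a toy sketch: SQLite standing in for the replica databases, with invented data and a simplified revision table whose column names loosely mirror MediaWiki's schema.

```python
import sqlite3

# A toy stand-in for the replica databases: the column names loosely mirror
# MediaWiki's revision table, but the data here is invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revision (rev_id INTEGER, rev_user_text TEXT, rev_timestamp TEXT)")
db.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [(1, "Alice", "20150101000000"),
     (2, "Bob",   "20150102000000"),
     (3, "Alice", "20150103000000")],
)

# The kind of question that suddenly takes seconds instead of a week of
# XML processing: who are the most active editors?
query = """
    SELECT rev_user_text, COUNT(*) AS edits
    FROM revision
    GROUP BY rev_user_text
    ORDER BY edits DESC
"""
for user, edits in db.execute(query):
    print(user, edits)  # → Alice 2, then Bob 1
```

The same `SELECT ... GROUP BY` shape works against the real replicas; only the connection and the scale change.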

For my coworkers who had no substantial background in programming, SQL was yet another language that the techies were playing around with.  So, they took advantage of us to help them answer their questions by asking us to produce Excel-sized datasets that they could explore.  But as these things go when people get busy, these requests would often remain unanswered for days at a time.  I've got to hand it to Jonathan.  Rather than twiddling his thumbs while he waited for a query request to be resolved, he decided to pick up SQL as a tool.  I think he set an example.  It's one thing to have a techy say to you, "Try SQL!  It's easy and powerful." and a totally different thing for someone without such a background to agree.  By the end of the summer internship, I don't think we had anyone (our managers included) who wasn't writing a bit of SQL here and there.  All would agree that they were substantially empowered by it.

Since then, Jonathan has made it part of his agenda to bring SQL and basic data analysis techniques to larger audiences.  He's been a primary advocate (and volunteer product manager) of Quarry, our new open querying tool for Wikimedia data.  That service has taken off like wildfire -- threatening to take down our MariaDB servers.  Check it out.  Specifically, I'd like to point you to the list of recent queries, where you can learn SQL techniques by watching others use them!

Sunday, October 4, 2015

ORES: Hacking social structures by building infrastructure

So, I just crossed a major milestone on a system I'm building with my shoe-string team of mostly volunteers and I wanted to tell you about it.  We call it ORES.
The ORES logo

The Objective Revision Evaluation Service is one part response to a feminist critique of power structures and one part really cool machine learning and distributed systems project.   It's a machine learning service that is designed to take a very complex design space (advanced quality control tools for Wikipedia) and allow for a more diverse set of standpoints to be expressed.  I hypothesize that systems like these will make Wikipedia more fair and welcoming while also making it more efficient and productive.

Wikipedia's power structures

So...  I'm not going to be able to go into depth here, but there are some bits I think I can say plainly.  If you want a bit more, see my recent talk about it.  TL;DR: The technological infrastructure of Wikipedia was built through the lens of a limited standpoint, and it was not adapted to reflect a more complete account of the world once additional standpoints entered the popular discussion.  Basically, Wikipedia's quality control tools were designed for what Wikipedia editors needed in 2007, and they haven't changed in a meaningful way since.

Hacking the tools

I had some ideas about what kinds of changes to the available tools would be important.  In 2013, I started work in earnest on Snuggle, a successor system.  Snuggle implements a socialization support system that helps experienced Wikipedia editors find promising newcomers who need some mentorship.  Regretfully, the project wasn't terribly successful.  The system works great and I have a few users, but not as many as the system would need to do its job at scale.  In reflecting on this, I can see many reasons why, but I think the most critical one was that I couldn't sufficiently innovate a design that fit into the social dynamics of Wikipedia.  It was too big of a job.  It requires the application of many different perspectives and a conversation of iterations.  I was a PhD student -- one of the good ones, because Snuggle gets regular maintenance -- but this work required a community.

When I was considering where I went wrong and what I should do next, I was inspired by the sudden reach that Snuggle gained when the HostBot developer wanted to use my "promising newcomer" prediction model to invite fewer vandals to a new Q&A space.  My system went from 2-3 users interacting with ~10 newcomers per week to 1 bot interacting with ~2000 newcomers per week.  Maybe I got the infrastructure bit right.  Wikipedia editors do need the means to find promising newcomers to support after all!

Hacking the infrastructure

So, lately I've been thinking about infrastructure rather than direct applications of experimental technology.  Snuggle and HostBot helped me learn to ask the question, "What would happen if Wikipedia editors could find good new editors that needed help?" without imagining any one application.  The question requires a much more systems-theoretic way of reasoning about Wikipedia, technology, and social structures.  Snuggle seemed to be interesting as an infrastructural support for Wikipedia.  What other infrastructural supports would be important, and what changes might they enable across the system itself?

OK.  Back to quality control tools -- the ones that haven't changed in the past 7 years despite the well-known problems.  Why didn't they change?  Wikipedia's always had a large crowd of volunteer tool developers who are looking for ways to make Wikipedia work better.  I haven't measured it directly, but I'd expect that this tech community is as big and functional as it ever was.  There were loads of non-technological responses to the harsh environment for newcomers (including the Teahouse and various WMF initiatives).  AFAICT, the tool I built in 2013 was the *only* substantial technological response.

Why is there not a conversation of innovation happening around quality control tools?  If you want to build a quality control tool for Wikipedia that works efficiently, you need a machine learning model that calls your attention to edits that are likely to be vandalism.  Such a model can reduce the workload of reviewing new edits in Wikipedia by 93%, but standing one up is excessively difficult.  To do it well, you'll need an advanced understanding of computer science and some substantial engineering experience in order to get the thing to work in real time.
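To make the 93% figure concrete, here's the workload arithmetic with invented round numbers (the real flagged fraction comes from a fitted model and a chosen threshold):

```python
# Toy illustration of the workload arithmetic (numbers invented for clarity).
edits_per_day = 100_000          # hypothetical volume of new edits
flagged_fraction = 0.07          # suppose the model flags ~7% of edits for review

to_review = int(edits_per_day * flagged_fraction)
reduction = 1 - flagged_fraction

print(to_review)                  # → 7000 edits actually need human eyes
print(f"{reduction:.0%}")         # → 93% of the review workload removed
```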
The "activation energy" threshold to building a new quality
control tool is primarily due to the difficulty of building a
machine learning model.

So, what would happen if Wikipedia editors could quickly find the good, the bad, and the newcomers in need of support?  I'm a computer scientist.  I can build up infrastructure for that and cut the peak off of that mountain -- or maybe cut it down entirely.  That's what ORES is.

What ORES is

ORES is a web service that provides access to a scalable computing cluster full of state-of-the-art machine learning algorithms for detecting damage, differentiating good-faith edits from bad-faith ones, and measuring article quality.  All that is necessary to use this service is to request a URL containing the revision you want scored and the models you would like to apply to it.  For example, if you wanted to know whether my first registered edit on Wikipedia was damaging, you could request the following URL.

Luckily, ORES does not think this is damaging in the slightest.

"190057686": {
"prediction": false,
"probability": {
"false": 0.9999998999999902,
"true": 1.0000000994736041e-07
We use a distributed architecture to make scaling up the system to meet demand easy.  The system is built in python.  It uses celery to distribute processing load and redis for a cache.  It is based on revscoring library I wrote to generalize machine learning models of edits in Wikipedia.   This same library will allow you to download one of our model files and use it on your own machine -- or just use our API.
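Consuming a score takes nothing but the standard library.  Here's a sketch of a client handling the example response above (the payload is the same JSON, pasted inline rather than fetched over the network):

```python
import json

# The example response from above, as a client would receive it.
payload = """
{
  "190057686": {
    "prediction": false,
    "probability": {
      "false": 0.9999998999999902,
      "true": 1.0000000994736041e-07
    }
  }
}
"""

scores = json.loads(payload)
score = scores["190057686"]
if not score["prediction"]:
    print("not damaging", score["probability"]["false"])
```

A quality control tool built on top of this only needs to compare `score["probability"]["true"]` against its own review threshold -- no model training required.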

Our latest models are substantially more fit than the state of the art (.84 AUC for theirs vs. .90-.95 AUC for ours), and our system has been substantially battle-tested.  Last month, Huggle, one of the dominant yet unchanged quality control tools, started making use of our service.  We've seen tool devs and other projects leap at the ability to use our service.
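For readers unfamiliar with AUC: it's the probability that the model ranks a randomly chosen damaging edit above a randomly chosen good one.  A tiny pure-Python sketch with toy scores (not our model's actual output) makes the definition concrete:

```python
def auc(positive_scores, negative_scores):
    """Probability a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Toy scores: a decent model assigns higher "damaging" probability to vandalism.
vandalism_scores = [0.9, 0.8, 0.6]
good_edit_scores = [0.7, 0.2, 0.1]

print(auc(vandalism_scores, good_edit_scores))  # → 0.8888888888888888 (8 of 9 pairs)
```

An AUC of .5 is coin-flipping and 1.0 is perfect ranking, which is why the gap between .84 and .90-.95 matters: it's a large share of the remaining headroom.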