How Long Before 1st Accepted Answer on StackOverflow?

I had started work on a new project recently. After committing about 200 lines of code, I decided to check HN to reward myself. It was then I found Guillermo Winkler’s blog post about programming languages on StackOverflow. It was quite an interesting analysis. I read it and went back to work. But I couldn’t quite concentrate. My inner statistician was whinging. It thought Guillermo hadn’t really answered the question, and it wasn’t satisfied.

While it’s good to know what the average wait time is, I suspected the average on its own wasn’t entirely useful. And so I began my own analysis.

Data Gathering

The first step was data gathering. Guillermo had used Stack Overflow’s data explorer to acquire data. And amongst the chief complaints on the HN thread was that the languages seemed a little too arbitrary. I decided I’d fix that as well. So I first went to the tags page and picked out the top 10 programming language tags on Stack Overflow. They were: c#, java, php, javascript, c++, python, objective-c, c, ruby, vb.net. It fits pretty well with the TIOBE index of programming languages.

And so I queried SO’s Data Explorer, using the code Guillermo had originally written:

Due to the limits of the Data Explorer, the top 50000 rows returned only contained C and C++. So I had to manually pull the other languages and then concatenate them. I ended up with a file with over 3.9 million rows in it. Fun.

Distributions!

The next thing to look at was obviously the distribution of ResponseTimes. I suspected that the distribution of response times would be slightly off normal, skewed right: most of the mass bunched up on the left, with a long tail trailing off to the right. Of course, I couldn’t have been prepared for this:

It was immediately apparent I was quite right: means mean nothing. Let’s look at the skewness and kurtosis of the data.

> skewness(df2$minRT)
[1] 12.14729
> kurtosis(df2$minRT)
[1] 178.759
> median(df2$minRT)
[1] 15
> mean(df2$minRT)
[1] 4055.231
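
For readers without R to hand, here is a rough Python sketch of the same four summary statistics. It runs on made-up log-normal draws, not the Stack Overflow data, and it uses the plain moment definitions (kurtosis not "excess", so a normal distribution gives about 3). Which convention the R functions above follow is not something this sketch assumes.

```python
import random
import statistics


def skewness(xs):
    """Third standardized moment (population form)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def kurtosis(xs):
    """Fourth standardized moment (not excess kurtosis)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2


random.seed(0)
# Hypothetical stand-in for response times in minutes: heavy right tail.
sample = [random.lognormvariate(3.0, 2.0) for _ in range(10_000)]

print("skewness:", skewness(sample))
print("kurtosis:", kurtosis(sample))
print("median:  ", statistics.median(sample))
print("mean:    ", statistics.mean(sample))
```

With a tail this heavy the mean lands far above the median, which is the same pattern as the numbers above.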

Let’s see what happens if we log the X axis:
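
To see why logging the axis helps, here is a small Python sketch that buckets values by power of ten, which is what a log-scaled histogram does. The function name and the sample minutes are invented for illustration.

```python
import math
from collections import Counter


def log10_histogram(minutes):
    """Count values per decade: bucket k holds values in [10**k, 10**(k+1))."""
    counts = Counter()
    for m in minutes:
        if m <= 0:
            continue  # log is undefined at 0, so such values are skipped here
        counts[math.floor(math.log10(m))] += 1
    return dict(sorted(counts.items()))


# Made-up response times in minutes, spanning several orders of magnitude.
times = [1, 3, 9, 15, 15, 42, 120, 900, 4000, 60000]

for k, c in log10_histogram(times).items():
    print(f"10^{k} to 10^{k + 1} min: {'#' * c}")
```

On a linear axis the single 60000-minute question would squash everything else into one bar. On the log axis each decade gets its own bucket.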

Modeling

Next, of course, was to aggregate the data. This is the gist of it:

If you notice, this basically counts the number of unanswered questions at a given time. This is because I want a survival chart. I am modeling the action of a question being answered as a ‘death’. Here’s the result:
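
The counting idea can be sketched in a few lines of Python. This is an illustration with invented numbers and a hypothetical function name, not the aggregation code used for the chart. Because every question in this data set has an accepted answer, the share still unanswered at time t is just one minus the empirical CDF.

```python
from bisect import bisect_right


def still_unanswered(response_times, checkpoints):
    """For each checkpoint t, count questions whose answer arrived after t."""
    ordered = sorted(response_times)
    n = len(ordered)
    return [(t, n - bisect_right(ordered, t)) for t in checkpoints]


# Made-up minutes until the accepted answer.
times = [2, 5, 5, 15, 40, 300, 4000]

for t, alive in still_unanswered(times, [0, 5, 15, 60, 1000, 10000]):
    print(f"t={t:>5} min  unanswered={alive}  share={alive / len(times):.2f}")
```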

Issues and Future Work

The problem with this analysis is that it only looks at the subset of questions where an accepted answer exists. “Death” is then having an accepted answer. However, I suspect this excludes a lot of the questions asked on Stack Overflow. Perhaps questions closed without an answer can also be considered “death”?

Future work can include time of day and various other kinds of regressions. Questions like “Does the length of a question matter?” and “Does the reading grade of the question title matter?” can also be worked on and answered. In fact, Guillermo himself raised a very interesting question that incorporates both: “Do difficult questions matter?”

Conclusion

I need to sleep, so here’s a quick conclusion. I haven’t written the code to figure out the hazard function or calculate the Kaplan-Meier estimator. But I think survival analysis of the questions answered (or indeed, just a CDF) would yield better results than mere 1st-moment analysis. I have also yet to do a log-rank test to compare the different languages. If anyone wants to do them, please feel free to just fork my gists. 😀
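
For reference, a Kaplan-Meier estimator is short enough to sketch in plain Python. This is a minimal illustration on invented data, not the analysis behind this post. It assumes questions without an accepted answer are treated as right-censored (observed flag `False`), which is one possible way to keep them in the data rather than dropping them.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time where at least one event occurred.

    durations: time until the accepted answer, or until observation stopped.
    observed:  True if an answer was accepted, False if censored.
    """
    tally = {}  # time -> (number of events, number censored)
    for t, seen in zip(durations, observed):
        d, c = tally.get(t, (0, 0))
        tally[t] = (d + 1, c) if seen else (d, c + 1)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(tally):
        deaths, censored = tally[t]
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Everyone who died or was censored at t leaves the risk set.
        at_risk -= deaths + censored
    return curve


# Made-up example: minutes, and whether an answer was actually accepted.
durations = [5, 10, 10, 20, 30, 30]
observed = [True, True, False, True, False, True]

for t, s in kaplan_meier(durations, observed):
    print(f"t={t:>3}  S(t)={s:.4f}")
```

Censored questions still count in the risk set up to the moment they drop out. That is what separates this from simply discarding them.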

I think from the chart, it is quite obvious that there are different survival patterns for different languages. And that different languages have different rates of questions being solved. Oh well, 3 hours well wasted. Most of it data collecting ಠ_ಠ.
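
Whether two curves differ by more than noise is what the log-rank test checks. Below is a hedged Python sketch of the standard two-group version, run on invented data. The function name and inputs are illustrative, and the quadratic loop is written for clarity, so it would need rewriting before being pointed at millions of rows.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Risk sets: everyone whose time is at least t.
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)
        n_b = sum(1 for u, _, g in pooled if u >= t and g == 1)
        d_a = sum(1 for u, e, g in pooled if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in pooled if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # nothing to compare
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 d.o.f.
    return chi2, p_value


# Made-up groups: "fast" answers versus "slow" answers, all observed.
fast = [1, 2, 3, 4, 5]
slow = [6, 7, 8, 9, 10]
chi2, p = logrank_test(fast, [True] * 5, slow, [True] * 5)
print(f"chi2={chi2:.2f}  p={p:.4f}")
```

Comparing all ten languages at once would need the k-group form of the test or pairwise tests with a multiple-comparison correction. This sketch covers neither.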

Errata

~~In my sleepiness last night I had logged the wrong axis on the survival chart. I’m a dumbass and I will fix this when I get home~~. Fixed 8th November 2012
I hadn’t yet written the Kaplan-Meier estimator. I thought… hey, if you can get away with flashy charts and fancy-looking code, why bother with the real analysis? Just kidding. I’ll probably get them done when I get home.