A cautionary note on selection
In follow-up to my experiment on anchoring, I had subjects guess the number of dots in a series of images. Workers were “randomly” (I’ll explain the scare quotes later) assigned to one of 7 such tasks (labeled A-G), ordered from fewest to most dots.
Here is A:
and here is G:
After the initial task, workers could keep working, making their way through the other 6 tasks, which were also “randomly” assigned. Most did all 7, but a sizable number did just 1. As data began to come in, I checked that approximately equal numbers of workers had done each of the 7 pictures. They had, but on AMT this tells us nothing about selection: when a user “returns” a HIT rather than completing it, that HIT goes back into the pool of uncompleted HITs, making it available to a future worker.
When I looked at the distribution of “first HITs”, i.e., the first picture completed by a subject (remember that workers could complete more than one picture), a striking pattern jumped out:
  A    B    C    D    E    F    G
101  100  141  170  154  158  169
In general, the more dots a picture had, the more likely it was to be a worker’s first completed task. A chi-square test confirmed that the distribution of first pictures departs significantly from uniform.
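The goodness-of-fit test can be reproduced in a few lines of pure Python from the counts above; the only outside number is the 5% critical value for a chi-square with 6 degrees of freedom, which is about 12.59.

```python
# Chi-square goodness-of-fit test for the "first HIT" counts above,
# against the null hypothesis of uniform assignment across A-G.
counts = [101, 100, 141, 170, 154, 158, 169]  # first HITs for A..G
expected = sum(counts) / len(counts)          # uniform expectation

chi_sq = sum((obs - expected) ** 2 / expected for obs in counts)
df = len(counts) - 1

# Critical value for df = 6 at alpha = 0.05 is about 12.59.
print(f"chi-square = {chi_sq:.1f} with {df} df")
print("reject uniform at 5%:", chi_sq > 12.59)
```

The statistic comes out near 38, far past the critical value, so the apparent trend is not just noise.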
What’s going on?
The over-representation of high-dot pictures among the first pictures completed by a worker suggests that a disproportionate number of high-dot pictures are being returned. It is of course possible that workers find evaluating many-dot pictures more onerous, but this seems unlikely, especially since there was no penalty for bad guesses. My hunch is that workers find the high-dot tasks less pleasant because they take longer to load—not because they are intrinsically more difficult. Picture G’s file size is 35K, while picture A’s is only about 10K. This difference is not much for someone with a fast connection, but many workers presumably have slow connections and grew tired of waiting for a G image to load. I plan to test this hypothesis by using image files with approximately identical sizes and seeing whether the pattern persists.
If workers are returning HITs for essentially arbitrary reasons, then AMT’s “back to the pool” sampling is as good as random. However, if users are non-randomly non-complying, you can make biased inferences. In addition to the problem of non-random attrition, because of how AMT allocates workers to tasks, the probability that a subject is assigned to a given group changes over time (e.g., late arrivals to the experiment are more likely to be assigned to the group people found distasteful). In future posts, I hope to discuss some of the ways this problem can be circumvented.
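The drift in assignment probabilities can be illustrated with a toy simulation of the return-to-pool mechanism. Everything here is invented for illustration: only two pictures (A and G) are modeled, the return probabilities are made up, and each simulated worker takes a single HIT.

```python
import random

random.seed(42)

# Hypothetical return probabilities (not estimated from the data):
# half the workers who draw G return it; A is always completed.
RETURN_PROB = {"A": 0.0, "G": 0.5}

# Start with an equal number of uncompleted HITs for each picture.
pool = ["A"] * 1000 + ["G"] * 1000

completions = 0
while completions < 1000:
    hit = pool.pop(random.randrange(len(pool)))  # random assignment
    if random.random() < RETURN_PROB[hit]:
        pool.append(hit)   # returned HIT re-enters the pool
    else:
        completions += 1   # worker completes the HIT

share_g = pool.count("G") / len(pool)
print(f"after 1000 completions, {share_g:.0%} of the remaining pool is G")
```

Because A’s leave the pool faster than G’s, a worker arriving after the first thousand completions faces a pool that is mostly G—exactly the late-arrival bias described above.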