Inter-Rater Reliability

Establishing inter-rater reliability ensures that multiple analysts produce consistent results when analyzing the same data. We recommend that two analysts each code the same portion of the recorded videos and use that overlapping data to calculate an inter-rater reliability score. We used the method of Lacy and Riffe [1] to calculate the minimum sample size needed to establish reliability. To measure agreement, we use Gwet's AC1 statistic [2] rather than Cohen's Kappa, because Kappa is known to behave poorly when certain coded events are rare (e.g. codes such as discord or positive/negative emotion): it can be low even when observed agreement is high [3]. The AC1 statistic is an alternative to Cohen's Kappa that corrects for this issue while still accounting for chance agreement [2].
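
To make the calculation concrete, the sketch below shows one way to compute AC1 for the simplest case of two analysts assigning a single nominal code per video segment, following the definition in [2]: observed agreement is chance-corrected using each category's average marginal proportion. This is our own illustration, not code from [2]; the function name gwet_ac1 and the example data are hypothetical.

```python
from collections import Counter

def gwet_ac1(rater1, rater2):
    """Gwet's AC1 coefficient for two raters assigning nominal codes.

    rater1, rater2: equal-length sequences of category labels assigned
    to the same items by each analyst.
    """
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("Both raters must code the same non-empty set of items.")
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    q = len(categories)
    if q < 2:
        return 1.0  # only one category ever used: agreement is trivially perfect

    # Observed agreement: proportion of items both raters coded identically.
    p_a = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement: based on each category's average marginal proportion.
    counts = Counter(rater1) + Counter(rater2)
    pi = {c: counts[c] / (2 * n) for c in categories}
    p_e = sum(p * (1 - p) for p in pi.values()) / (q - 1)

    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two analysts coding 10 segments where "discord" is rare.
r1 = ["neutral"] * 8 + ["discord", "neutral"]
r2 = ["neutral"] * 8 + ["discord", "discord"]
print(round(gwet_ac1(r1, r2), 3))
```

In this toy example the raters agree on 9 of 10 segments; AC1 is roughly 0.87, whereas Cohen's Kappa on the same data is roughly 0.62, illustrating how Kappa is depressed by the skewed prevalence of the rare code.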


References

  1. Stephen Lacy and Daniel Riffe. 1996. Sampling error and selecting intercoder reliability samples for nominal content categories. Journalism & Mass Communication Quarterly 73, 4: 963–973.
  2. Kilem Li Gwet. 2008. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61, 1: 29–48.
  3. Anthony J. Viera and Joanne M. Garrett. 2005. Understanding interobserver agreement: the kappa statistic. Family Medicine 37, 5: 360–363.