The reliability of clinical measurements is critical to medical research and clinical practice. Newly proposed methods are assessed in terms of their reliability, which includes their repeatability and their intra- and interobserver reproducibility. In general, new methods that provide repeatable and reproducible results are compared with established methods used clinically. This paper describes common statistical methods for assessing reliability and agreement between methods, including the intraclass correlation coefficient, coefficient of variation, Bland-Altman plot, limits of agreement, percent agreement, and the kappa statistic. These methods are more appropriate for estimating reliability than hypothesis testing or simple correlation methods. However, some reliability indices, especially unscaled ones, do not define the acceptable level of error in the actual scale and units of measurement. The Bland-Altman plot is more useful for method comparison studies because it assesses the relationship between the differences and the magnitude of paired measurements, the bias (as the mean difference), and the degree of agreement (as the limits of agreement) between two methods or conditions (e.g., observers). Caution is needed when handling heteroscedasticity of the differences between two measurements, when using the means of repeated measurements per method in method comparison studies, and when comparing reliability across studies. Additionally, independence of the measuring processes, the combined use of different estimation approaches, clear descriptions of the calculations used to produce indices, and clinical acceptability should be emphasized in reliability and method comparison studies.
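As a minimal sketch of the bias and limits-of-agreement calculations summarized above, the following Python snippet computes the mean difference and 95% limits of agreement for paired measurements from two methods; the arrays `method_a` and `method_b` are hypothetical illustrative data, not values from the cited study.

```python
import numpy as np

def bland_altman(a, b):
    """Bias (mean difference) and 95% limits of agreement for paired measurements."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diff = a - b                      # differences between the two methods
    mean = (a + b) / 2.0              # pairwise means (x-axis of the Bland-Altman plot)
    bias = diff.mean()                # systematic bias between methods
    sd = diff.std(ddof=1)             # standard deviation of the differences
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement
    return mean, diff, bias, loa

# Hypothetical paired readings from two methods measuring the same quantity
method_a = [10.2, 11.5, 9.8, 12.1, 10.9, 11.3]
method_b = [10.0, 11.9, 9.5, 12.4, 10.6, 11.1]
_, _, bias, (lower, upper) = bland_altman(method_a, method_b)
print(f"bias = {bias:.2f}, 95% LoA = [{lower:.2f}, {upper:.2f}]")
```

In practice, the differences are plotted against the pairwise means to check whether their spread depends on the magnitude of measurement (heteroscedasticity) before the limits of agreement are interpreted.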
Undergraduate medical students should learn oral presentation skills, which are central to physician-physician communication. The purpose of this study was to compare checklist scores with global ratings for the evaluation of oral case presentations and to investigate inter-rater agreement in the scoring of checklists.
The study group included twenty-one teams of undergraduate medical students who completed a 2-week clerkship in the Department of Laboratory Medicine of Mokdong Hospital, School of Medicine, Ewha Womans University, from January 2005 to October 2006. Three faculty raters independently evaluated oral case presentations using checklists composed of 5 items. Consensus global ratings were determined after discussion. Inter-rater agreement was measured using the intraclass correlation coefficient (ICC); ICC values closer to 1.0 indicate higher inter-rater agreement.
The mean of the consensus global ratings was significantly higher than the mean of the checklist scores by the three faculty raters (12.6 ± 1.7 vs. 11.1 ± 2.0).
These results suggest that checklist scoring by faculty raters could be a useful tool for evaluating oral case presentations, provided that the checklist is revised to be less ambiguous and more objective and that the faculty raters are trained in evaluating oral case presentations.
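The abstract does not specify which ICC form was used; a minimal sketch of one common choice for this design, ICC(2,1) (two-way random effects, absolute agreement, single rater), is shown below. The `scores` matrix is hypothetical checklist data (student teams in rows, raters in columns), not the study's data.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects x n_raters) array with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)        # per-subject means
    col_means = x.mean(axis=0)        # per-rater means

    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-subjects mean square
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-raters mean square
    sse = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                         # residual mean square

    # Shrout & Fleiss formula for ICC(2,1)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical checklist totals: 6 student teams scored by 3 raters
scores = [[11, 12, 10],
          [13, 13, 12],
          [ 9, 10,  9],
          [12, 11, 12],
          [10, 10, 11],
          [14, 13, 13]]
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
```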