Clinical reasoning is situation-dependent and case-specific; therefore, assessments incorporating different patient presentations are warranted. The present study aimed to determine the reliability of a multi-station case-based viva assessment of clinical reasoning in an Australian pre-registration osteopathy program using generalizability theory. Students (from years 4 and 5) and examiners were recruited from the osteopathy program at Southern Cross University, Lismore, Australia. The study took place on a single day in the student teaching clinic. Examiners were trained before the examination. Students were allocated to 1 of 3 rounds, each consisting of 5 10-minute stations in an objective structured clinical examination (OSCE)-style format. Generalizability analysis was used to explore the reliability of the examination. Fifteen students and 5 faculty members participated in the study. The examination produced a generalizability coefficient of 0.53, and 18 stations would have been required to achieve a generalizability coefficient of 0.80. The internal-consistency estimates and the psychometric findings related to the marking rubric and overall scores were acceptable; however, further work is required on examiner training and on ensuring consistent case difficulty to improve the reliability of the examination.
The assessment of clinical reasoning is a challenge for health profession educators who grapple with evolving understandings of the complex nature of reasoning in practice. It appears that clinical reasoning is more closely aligned with clinical knowledge and knowledge organization than with problem-solving capacity. The literature confirms that the context-specific nature of clinical reasoning requires multiple assessments in different contexts by different assessors to optimize the validity and reliability of the overall assessment. The emerging clinical reasoning literature in osteopathy includes descriptions of the reasoning processes of practicing osteopaths and educators, and the challenges associated with clinical reasoning in osteopathy. The assessment of clinical reasoning in osteopathy has also been explored [5,6]. Orrock et al. developed a viva examination to assess clinical reasoning in Australasian osteopathy students. That study provided initial evidence supporting the validity of the scores derived from the examination. The authors stated that further work on the marking rubric was required, as was evaluation of the reliability of the assessment. The present study aimed to evaluate modifications to the rubric and the reliability of a clinical reasoning viva assessment in a pre-professional osteopathy program leading to registration as an osteopath in Australia.
This was a cross-sectional study designed to evaluate the reliability of a 5-station objective structured clinical examination (OSCE)-format examination.
Materials and/or subjects
Students enrolled in the fourth and fifth years of the osteopathy program at Southern Cross University (SCU), Lismore, Australia were invited to participate in the study. Participation was not a requirement for any academic subject in their program; however, students were provided with feedback about their performance. Examiners were recruited from the academic and clinical education staff of the SCU osteopathy program. Training was provided to the examiners in the form of a training manual, training video, and an examiner training session that lasted for 1.5 hours immediately before the viva examination.
Students were allocated to 1 of 3 circuits and cycled through 5 stations in an OSCE-type format. Each station lasted for 10 minutes, during which time the examiner worked through a 3-stage clinical history with the student. The study took place on a single day in the student teaching clinic on April 12, 2016. The examination process was as follows, and the content of question items is presented in Table 1: first, the student entered the room; second, the examiner presented stage 1 of the case to the student to read; third, the examiner asked Q1 and Q2 from the rubric; fourth, the examiner presented stage 2 of the case to the student to read; fifth, the examiner asked Q3, Q4, Q5, Q6, and Q7 from the rubric; sixth, the examiner presented stage 3 of the case to the student to read; and seventh, the examiner asked Q8, Q9, Q10, Q11, and Q12 from the rubric.
Each of the clinical histories was taken from the examination developed by Orrock et al., and each student was marked by the examiner using a modified rubric, as suggested by those authors (Appendix 1). Modifications to the rubric were guided by the correlations between multiple items observed in the study of Orrock et al. Each examiner assessed students based on only a single clinical history scenario, and examiners were not required to total the rubric scores during the examination. Question 12 did not contribute to the students’ total score for the examination.
Descriptive statistics and reliability estimations (ordinal Cronbach alpha and McDonald omega) were generated for the examination in R ver. 3.3.0 (The R Foundation for Statistical Computing, Vienna, Austria; https://www.r-project.org/) using the ‘userfriendlyscience’ package ver. 0.4-1 (http://userfriendlyscience.com). Generalizability analysis was used to evaluate the reliability of the examination using G_String IV (The Program for Educational Research and Development, Hamilton, ON, Canada; http://fhsperd.mcmaster.ca/g_string/). The generalizability (G) study had a fully crossed design with 3 facets: all ‘students’ participating in the exam were examined by all ‘examiners’ on all ‘items’ on the rubric (student×examiner×item). Examiners were treated as a random facet and items were treated as a fixed facet. Because each examiner assessed only 1 case, this design did not allow variance due to the case to be separated from variance due to the examiner. The examination was designed to assess the students’ clinical reasoning ability; therefore, the dependability coefficient based on absolute error (Φ) was chosen as the reliability coefficient. A decision study was performed by varying the number of examiners/stations to investigate the number of stations required for a high-stakes assessment.
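The relationship between the variance components and the dependability coefficient can be illustrated with a short sketch. The variance components below are hypothetical placeholders (chosen only so the output matches the Φ of 0.53 reported in the Results), not the values estimated by G_String IV:

```python
def phi_coefficient(var_student, var_examiner, var_residual, n_stations):
    """Dependability (Phi) coefficient for a crossed student x examiner design.

    Absolute error pools all non-student variance, and shrinks as scores
    are averaged over more examiners/stations.
    """
    absolute_error = (var_examiner + var_residual) / n_stations
    return var_student / (var_student + absolute_error)

# Hypothetical variance components, for illustration only
phi = phi_coefficient(var_student=0.18, var_examiner=0.28,
                      var_residual=0.52, n_stations=5)
print(round(phi, 2))  # → 0.53
```

Under this formulation, increasing the number of stations reduces only the error term, which is what the decision study exploits.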
The study was approved by the Southern Cross University Human Research Ethics Committee (ECN-15-237).
Fifteen students and 5 examiners were recruited for the examination. All examiners participated in the examiner training program. The mean student score was 34.3±7.2 out of 55 (Table 1). The Cronbach alpha value was 0.88 (95% confidence interval, 0.84 to 0.92) for the modified rubric, and removing any item from the marking rubric did not improve this value. The McDonald omega (hierarchical) was 0.71, supporting the calculation of a total score for the examination. The G-coefficient (Φ) was 0.53; that is, just over half of the variation in the results was due to differences in student performance on the examination. A generalizability coefficient of 0.80 would have been achieved with 18 examiners/stations. The variance components are presented in Table 2. Residual and systematic error accounted for the largest share of variance, at over 37%. The raw data file is available in Supplement 1.
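The decision-study projection can be checked by back-solving the error-to-student variance ratio from the observed coefficient. This sketch assumes the standard single-facet D-study relationship Φ = 1 / (1 + ratio/n), where the ratio is the total absolute-error variance relative to student variance:

```python
import math

phi_observed, n_observed = 0.53, 5   # dependability observed with 5 stations
phi_target = 0.80                    # conventional threshold for high-stakes use

# Back-solve the error-to-student variance ratio from the observed Phi
ratio = n_observed * (1 / phi_observed - 1)

# Smallest station count giving Phi >= target
n_needed = math.ceil(ratio * phi_target / (1 - phi_target))
print(n_needed)  # → 18
```

The back-solved projection agrees with the 18 examiners/stations reported above.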
High-stakes assessments need to be standardized to ensure reliability, and high-stakes viva assessments have previously reported acceptable reliability. The present study evaluated the reliability of a clinical reasoning viva examination in an Australian pre-professional osteopathy program. The reliability estimations supported both the internal structure of the modified rubric and the calculation of a total score. The Φ-coefficient for the 5 examiners was 0.53, suggesting that 53% of the variance in the students’ total scores was attributable to real differences in student performance on the examination. To achieve an acceptable coefficient for high-stakes decision-making (≥0.80), 18 examiners/stations would have been required. This result suggests that the proposed format of the examination may not be reliable without further review and re-evaluation.
The greatest variance was attributable to residual and systematic error. The examiner and student×examiner facets each contributed approximately 20% of the variance, suggesting that the examiners were a substantial contributor to a student’s score. Examiner variance was approximately double that of student variance, suggesting that the mean scores given by the examiners on 1 case were more variable than the mean student scores across all 5 cases. That is, little variation was found in student performance across the examination, as supported by the small percentage of variance attributable to the student facet. However, the study design did not allow the influence of case difficulty/specificity to be partitioned out from the examiner facet, meaning that there may have been variability in the difficulty of each case, which was subsequently reflected in the variance attributed to the examiner facet. Previous work using the same cases reviewed the difficulty of each case and found them to be comparable, suggesting that the influence of the examiners may account for this result. Students were also scored differently by different examiners, as indicated by the student×examiner interaction. This could have been due to actual differences in student performance across stations, or to prior knowledge of student performance. The latter is possible since the students and examiners were recruited from the same teaching program, and this may account for the examiner training not being as successful as anticipated.
The small variance attributable to the items facet supports the Cronbach alpha and McDonald omega reliability estimations, although it also demonstrates some variability in item difficulty across the marking rubric. That said, the items on the rubric made only a minor contribution to score variance, providing support for its use in the assessment. Further support for the rubric itself is provided by the small variance components for the student×item and examiner×item interaction terms.
The results of the present study suggest that further examiner training is required to improve the reliability of the examination. A number of the examiners reported difficulty completing the full suite of questions in the time allocated, and also felt that more substantial model answers would improve their grading decisions. It would also be valuable to have the examiners conduct the same examination with different cases, in order to ascertain whether case specificity or examiner stringency contributed to the residual error and the variance due to the examiner facet. Having 2 examiners for each case may also improve the reliability, although the potential benefit would need to be weighed against the extra cost. The present study had some limitations. The small number of students means that the findings may not be representative of the performance of the entire student body. There is also a possibility of self-selection bias on the part of both the students and the examiners: students may have chosen to participate as preparation for upcoming examinations and to receive feedback. Examiner familiarity with the students is another limitation, which could be addressed by including examiners from outside the SCU teaching program. Further research into the examination is warranted following additional examiner training and a review of the cases prior to its implementation as a high-stakes assessment of clinical reasoning in osteopathy.
Conflict of interest
No potential conflict of interest relevant to this article was reported.