An Artificial Neural Network-based
for Clinical Performance Assessment
Adrian Casillas,1, 2
Stephen Clyman,4 Brian Clauser,4 Yihua Fan,4
R. Stevens1, 3
1Department
of Microbiology and Immunology, School of Medicine, University of California,
Los Angeles; 2Department of Medicine, Division of Clinical
Immunology and Allergy, School of Medicine, University of California,
Los Angeles; 3Graduate School of Education and Information
Science, CRESST, University of California, Los Angeles; 4National
Board of Medical Examiners, Philadelphia; USA
Abstract
We have used unsupervised
artificial neural networks (ANNs) to explore alternative models of student
performance and identify areas where such models may complement existing
assessment models. One hundred random student performances were selected
from a larger database of computer-based clinical scenario (CCS) performances
on a case of bacterial meningitis. Classifications resulting from this
neural network modeling were consistent with the National Board of Medical
Examiners (NBME) model in that highly rated performances (ratings of 7
or 8) were clustered on the neural network output grid. Very low performance
ratings shared few common features and were classified at isolated nodes.
Several performance clusters with very disparate NBME ratings (ranging
from 1 to 8) were electronically recreated as search path maps to visualize
the strategies used at different nodes. The neural network clustering
appeared to be sensitive to quantitative and qualitative test selections,
some reflecting broader behavioral classification. These particular performance
clusters did not appear to be coincidental since reproducibility across
3 separately trained networks could be achieved. In generating this performance
model from a constructive data analysis approach, rather than through
more traditional task analysis, we have validated the existing NBME CCS
scoring model and provided further evidence for the utility of ANNs in
educational training and assessment settings. These results also suggest
that the performance model being created by the NBME scoring criteria
is quite complete and that likely additions to this model would emphasize
the more behavioral aspects of clinical performance.
Introduction
The rapid evolution of information
technologies is beginning to provide new opportunities for studying the
complex behaviors of individuals engaged in complex tasks. Medical diagnosis
and management is an example of a very complex task, and multiple data
and performance models can be built from the ensuing patient/physician
interaction.1, 2 These models differentially address such issues
as cost, risk, and the quality of life.3 - 5 Currently, no
tasks other than the clinical encounter itself adequately capture the
information needed to simultaneously address all the aspects of these
different models. Several tasks do exist, however, that are attempting
to capture the critical features of each model, and these tasks form the
basis of current and proposed medical licensing examinations.
The NBME has introduced the
CCS examination, formerly known as CBX, in order to provide a simulated
patient experience requiring examinees to continually monitor the patient
and make appropriate management decisions.6 - 9 In assessing
these tests, an expert rating system is employed where actions are matched
with criteria and compared with broader performance standards. Although
competence is based on a rating, in reality, the decision of competence
is derived from multiple variables to ensure that the analytical scoring
model being employed is optimized for the purposes of the examination.10
It must be recognized, however, that many aspects of learning, knowledge,
and behavior affect performance, not all of which can be addressed in
any particular examination. It is not always clear what problem-solving
features may remain unrecognized in return for, perhaps, a level of efficiency
when constructing or scoring a problem performance.
While the NBME's CCS
computer simulations have been refined for the purposes of licensure and
certification, other more subtle models of student performance may exist
within the data. This suggests the possibility for more exploratory and
analytical techniques that can be readily applied to "discover"
alternative classifications within complex data sets. Most performance
assessments or intelligent tutoring systems begin with knowledge skills
and cognitive task analysis and create suitable tasks and scoring criteria
based on this analysis.10 The very broad nature of the CCS
problem space allows alternative approaches toward developing performance
models.
We have used a constructive
data modeling approach that draws on the pattern recognition capabilities
of ANNs to build performance models from existing complex data sets. ANNs
are nonparametric techniques that build rich models of complex phenomena
through a training and pattern recognition process and are capable of
categorizing behavior based on actual performance sequences. Neural networks
have had practical utility in solving classification problems with ill-defined
categories, where the patterns are often deeply hidden within the data
or where there are poorly defined models of behavior.11 In
this study, we use neural network analysis to explore the completeness
of the NBME CCS scoring model, and determine the validity of using ANN
analysis to generate performance classifications from complex data sets
in the absence of defined scoring criteria.
Methods
The NBME CCS Data Set
The NBME developed the CCS patient management simulations primarily for
assessing clinical knowledge and competence of 3rd-year and 4th-year medical
students.6, 8 In these simulations, examinees manage a patient
in a realistic fashion by requesting various diagnostic tests, therapeutic
options, and other clinically relevant items. The series of requested
actions is recorded in simulated time, and the resulting transaction list
defines the strategy for each particular student. These sequential actions
served as the input to train our neural networks.12, 13
Unsupervised Neural Network
Analysis of Performance Data
We utilized a self-organizing neural network map made up of a matrix of
competitive nodes referred to as a Kohonen layer.14, 15 The
network received data as a set of distinct inputs that were digital
representations of the paths between the students' test item selections.
The number of inputs is based on the total number of unique test associations
represented in the performances used for training.
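The input encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each unique "From-To" test association maps to one binary input (1 if the performance used that transition, 0 otherwise), and the item names and helper functions are hypothetical.

```python
def transitions(actions):
    """Return the ordered (from, to) pairs in one performance."""
    return list(zip(actions, actions[1:]))


def build_vocabulary(performances):
    """Assign an input index to every unique transition in the training set."""
    vocab = {}
    for actions in performances:
        for pair in transitions(actions):
            vocab.setdefault(pair, len(vocab))
    return vocab


def encode(actions, vocab):
    """Binary input vector (an assumption): 1.0 where a transition was used."""
    vec = [0.0] * len(vocab)
    for pair in transitions(actions):
        if pair in vocab:
            vec[vocab[pair]] = 1.0
    return vec


# Hypothetical transaction lists; item names are illustrative only.
performances = [
    ["history", "physical", "cbc", "lumbar_puncture", "csf_culture"],
    ["history", "cbc", "urinalysis"],
]
vocab = build_vocabulary(performances)
inputs = [encode(p, vocab) for p in performances]
print(len(vocab), inputs)
```

Under this assumed encoding, the length of each input vector equals the number of unique test associations in the training set, which matches the description of how the number of inputs was determined.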
The neural network training
process is iterative: the value of each output node is adjusted based
on the magnitude and direction of the input vectors of each performance
presented during training. Each pass of the entire training data
through the network constitutes 1 epoch (1 iteration). Each epoch during
training results in adjustments of the magnitude and direction of the
output vector until the completion of training. The duration of training
is empirically derived; in our experience, 1,000 to 10,000 epochs
were sufficient for achieving consistent classification. Our networks were
trained with a data set consisting of 100 performances from a single case.
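A generic self-organizing map training loop illustrates the iterative adjustment described above. This is a textbook-style sketch, not the Ward Systems implementation used in the study: the linear decay schedules, the Gaussian neighborhood, the parameter names, and the small grid and epoch count in the example are all assumptions chosen to keep it short.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to input x."""
    best, best_d = 0, float("inf")
    for node, w in enumerate(weights):
        d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if d < best_d:
            best, best_d = node, d
    return best


def train_som(data, rows=10, cols=10, epochs=100, lr0=0.5, radius0=5.0, seed=0):
    """Train a rows x cols Kohonen layer; one epoch is one pass over data."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # assumed linear decay
        radius = max(radius0 * (1.0 - frac), 0.5)    # assumed shrinking radius
        for x in data:
            wr, wc = divmod(best_matching_unit(weights, x), cols)
            for node, w in enumerate(weights):
                r, c = divmod(node, cols)
                d2 = (r - wr) ** 2 + (c - wc) ** 2
                step = lr * math.exp(-d2 / (2.0 * radius * radius))
                # Move this node's weights toward the input vector.
                for i in range(dim):
                    w[i] += step * (x[i] - w[i])
    return weights


# Toy data: three binary input vectors on a 3 x 3 grid.
data = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
weights = train_som(data, rows=3, cols=3, epochs=50, radius0=1.5)
print([best_matching_unit(weights, x) for x in data])
```

The study's networks used a 10 x 10 output grid and 1,000 to 10,000 epochs; the smaller values here only keep the example fast.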
Following the training process,
the same data used to train the network was presented to the network for
classification. The Kohonen self-organizing neural network associates
each input pattern with a representative output pattern. The winning nodes
for the entire set of performance data are summed to produce the topographic
representations seen throughout the results. The Kohonen self-organizing
neural network was constructed with software libraries from Ward Systems
Group (Rockville, MD).
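Summing winning nodes into a topographic map can be sketched as below. The sketch assumes the winning node is the one whose weight vector is nearest the input and that nodes are numbered row by row from zero; the numbering convention behind the node labels reported in this paper is not stated, and the function names are hypothetical.

```python
from collections import Counter


def winning_node(weights, x):
    """Index of the node whose weight vector is nearest to x."""
    return min(
        range(len(weights)),
        key=lambda n: sum((w - v) ** 2 for w, v in zip(weights[n], x)),
    )


def hit_map(weights, inputs, rows, cols):
    """Count how many performances win at each node of a rows x cols grid."""
    counts = Counter(winning_node(weights, x) for x in inputs)
    return [[counts.get(r * cols + c, 0) for c in range(cols)]
            for r in range(rows)]


# Toy 2 x 2 layer with hand-set weights and four illustrative inputs.
weights = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
inputs = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [0.0, 0.9]]
grid = hit_map(weights, inputs, rows=2, cols=2)
print(grid)
```

Each cell of the resulting grid holds the number of performances classified at that node, which is the quantity plotted in the topographic representations.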
Search Path Map Analysis
We electronically reconstructed the students' problem-solving strategies
using software that produces visual representations of students'
search through a defined problem space by querying the transaction database
of the performances and displaying the results in a graphical form.16,
17 Search path maps are displayed as boxes that correspond to actions
a student can select while managing a case. The tests can be grouped into
a variety of formats to display the use of certain tests or concepts or
to show details of the sequence of student selections in a particular
test group. The different laboratory tests available were clustered into
separate areas of the problem space, as shown in the Results. Individual
student performances were overlaid on this template by a series of lines
connecting the sequence of items chosen, with the lines going from the
upper left-hand corner of a test selection to the lower center of the
subsequent test. Thus, tracking a student's strategy as a series
of "From-To" pairs was possible. Where multiple student performances
were displayed, the thickness of the lines between test item pairs was
proportional to the number of students who made that test selection.13,
18
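The "From-To" bookkeeping behind a multi-student search path map can be sketched as follows. It assumes each pair is counted once per student, so that the count equals the number of students who made that selection, and that line width scales linearly with the count; the item names, function names, and `max_width` parameter are hypothetical.

```python
from collections import Counter


def edge_counts(performances):
    """Number of students who made each From-To test selection."""
    counts = Counter()
    for actions in performances:
        # A set ensures a student repeating a pair is counted only once.
        counts.update(set(zip(actions, actions[1:])))
    return counts


def line_widths(counts, max_width=6.0):
    """Line thickness proportional to the number of students per pair."""
    top = max(counts.values())
    return {pair: max_width * n / top for pair, n in counts.items()}


# Hypothetical performances; item names are illustrative only.
performances = [
    ["history", "cbc", "lumbar_puncture"],
    ["history", "cbc", "urinalysis"],
    ["cbc", "lumbar_puncture"],
]
counts = edge_counts(performances)
widths = line_widths(counts)
print(counts, widths)
```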
Results
ANN Classification of
Case Performance Data
In order to derive ANN performance classifications, a training set of
100 randomly selected performances of case 139 (meningitis) was used to
train an unsupervised network. The training data was then analyzed by
the same network to identify performance clusters (Figure 1). We next
selectively queried the NBME database for specific performance ratings
that could be identified at specific node outputs to isolate the nodes
corresponding to high or low ratings. The values ranged from 1 (worst)
to 8 (best). We found a contiguous 2-node cluster (node 43/44) composed
of highly rated performances (rating of
7 or 8) (Figure 1A). Our analysis of low ratings (less than 4) revealed very
limited clustering (Figure 1B).
Figure 1:
Output nodes for performances of problem 139 segregated by (A) rating
greater than 4 and (B) rating less than 4 (failing). The major nodes are
shown with arrows pointing to their location on the grid, and the number
of performances at each node is designated. Search path maps selected
by rating for problem 139. Highly rated performances (rating
= 7 or 8) are shown (C) beside those failing performances (rating below
3) (D). Note the lack of "CSF" domain usage in (D).
To understand the actual
strategies accounting for the classifications, the performance strategy
was overlaid on the template of the functional problem. In this display,
each number-coded box represented a specific test, and the tests (for example, History
and Physical items, Blood Cell Tests, Cerebrospinal Fluid [CSF] tests)
were grouped on the template into their respective categories. With connecting
lines between the requested test items, we were able to reconstruct
the examinee strategies as search path maps (Figure 1C, D).
Our initial observation was
that the degree of thoroughness within the area of CSF tests for this
meningitis case was associated with the overall rating. Performance ratings
of 7 and 8 (i.e., at nodes 43 and 44) were uniformly associated with the
selection of many CSF-associated items (Figure 1C), while ratings below
3 (i.e., node 89) were associated with minimal or no test ordering in
the CSF domain (Figure 1D). As expected, the lowest ratings showed no
usage in the CSF test domain, suggesting a failure to even recognize the
problem as one of meningitis.
Clustering of Performances
Across Networks
There is often significant variability in the performance of a series
of neural networks trained with the same data set due to initial training
conditions and the properties of the data itself. We were interested to
know how well the network-assigned classifications were retained when
additional neural networks were trained with the same data. Two additional
neural networks were trained with the same architecture (i.e., 10,000
epochs with a 10 x 10 output). We selected major nodes from the ANN output
of our original network (Network #1) in order to compare these performance
clusters as they were classified on 2 independently trained neural networks
(#2 and #3). We expected clusters
in 1 network to be represented in a topologically ordered manner, meaning
that 1 cluster may be represented by physically close (adjacent) nodes
in different neural networks if the data represent similar patterns.14,
15 Five major output nodes generated by Network #1 (nodes 10, 21, 25,
29, 43/44, and 89) were compared to the corresponding outputs generated
for the 2nd and 3rd networks (Table 1).
Table 1:
Comparison of performance clustering between specific nodes of Network
#1 with 2 independently trained networks (#2 and #3)
For node 10 of Network
#1, 10 of 11 performances (91%) clustered on Network #2 while 9 of 11 (82%)
similarly clustered on Network #3. Between Networks #2 and #3, there was
73% preservation of clustering. Node 21 of Network #1 showed only 50%
cluster correlation with either Network #2 or #3. However, the latter
2 networks were identical in terms of their classification for the performances,
since the same performances were clustered within each of 2 nodes. In
node 25, there was a 63% correlation of Network #1 with Networks #2 and
#3, but the latter 2 neural networks, again, were identical in terms of
the classifications generated with the respective performances. Node 43/44,
where the highest NBME ratings were found on Network #1, showed that 8
of 12 performances (67%) clustered on Network #2; however, between Network
#1 and #3, 10 of the 12 performances (83%) were similarly classified at
the adjacent nodes 25 and 45 of Network #3. Furthermore, between Networks
#2 and #3 there was 83% preservation of clustering. As expected, the node
representative of the poorest NBME ratings, node 89, correlated 100% with
the classifications generated by Networks #2 and #3. Node 29 was unusual
in that the best correlation with either of the 2 independent networks
was 56%. These clusters were not retained between Networks #2 and #3,
since only 33% of the performances clustered among these 2 networks.
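One way to compute a preservation percentage of the kind reported above is sketched here. The exact procedure is not given in the text, so this assumes the simplest definition: take the performances classified at one node of the first network, find the single node of the second network that holds most of them, and report that share. The text also counts adjacent nodes as one cluster, which this sketch omits. The performance IDs and node numbers are made up for illustration.

```python
from collections import Counter


def cluster_preservation(members, assignment_b):
    """Largest share of `members` that land on one node in network B.

    members: performance IDs classified at a single node of network A.
    assignment_b: dict mapping performance ID -> winning node in network B.
    """
    nodes = Counter(assignment_b[m] for m in members)
    node, count = nodes.most_common(1)[0]
    return node, count / len(members)


# Made-up example: 11 performances, 10 of which share a node in network B.
members = list(range(11))
assignment_b = {i: 37 for i in range(10)}
assignment_b[10] = 5
node, fraction = cluster_preservation(members, assignment_b)
print(node, round(fraction * 100))
```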
Figure 2:
Search path maps of node 29 performances. Node 29 was segregated into
poorly rated (A) and highly rated (B) performances. The use of superfluous
tests is shown in the captions with arrows indicating each group of tests.
Total neural network outputs of all nodes for 2 performances with a primary
peak at node 29. A performance with a low rating of 2 (C) is shown in
contrast to a performance with a rating of 8 (D). Note the lack of secondary
nodal output in (C) compared to the significant secondary peak at node
43 in (D).
Inconsistent Aspects of
the NBME and ANN Performance Assessment Models
There were other nodes that could not be explained as classifications
extending directly from the NBME expert-rater model. One example
of this type of performance is the group clustered at
node 29, which failed to give consistent clustering across networks.
The search path map analysis
for performances at this node (Figure 2A, B) indicated the use of excessive
test item selections throughout several domains. Regardless of the search
strategy employed within the CSF domain, the consistent feature of these
performances was the overuse of a number of tests including serum lipid
profiles, urine tests, culture and sensitivity, and additional blood chemistries
whether the rating was low (Figure 2A) or high (Figure 2B).
The fact that there was a
cluster of performances at node 29 with disparate ratings also prompted
us to investigate the possibility that there may be additional information
within the network output useful for characterizing the quality of these
performances. Up to this stage, the network output we had focused on was
the highest nodal output rather than that of the entire output space (i.e.,
all 100 nodes on the 10 x 10 grid). In the next series of experiments,
instead of viewing the outcome of a performance only by its ANN assigned
winning node, all of the outputs generated at the final step of the performance
were visualized. We observed that the final step outputs for poorly rated
performances (Figure 2C) show only a single area of significant output
(at node 29) with a poorly defined secondary output area. In contrast,
a significant secondary peak at node 43 (Figure 2D) characterized
the excellent performances at node 29. Visualizing the
entire final output makes it apparent that the winning output at node 29
grouped the 2 performances by their
excessive test usage, as was shown above. The development of a significant
secondary peak in a defined area associated with an area of high ranking
(i.e., node 43/44) served to differentiate the superior strategies from
the weaker strategies at the same node.
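Looking beyond the winning node can be sketched as a search for the strongest output outside the winner's immediate neighborhood. This is an illustration under stated assumptions: a higher output value means a stronger response, nodes are numbered row by row from zero on a 10-column grid, and the activation values, function name, and `exclude_radius` parameter are hypothetical.

```python
def secondary_peak(outputs, cols=10, exclude_radius=1):
    """Return (winner, secondary node, secondary/winner output ratio).

    Nodes within `exclude_radius` grid steps of the winner are skipped so
    the winner's own shoulder is not reported as a second peak.
    """
    winner = max(range(len(outputs)), key=outputs.__getitem__)
    wr, wc = divmod(winner, cols)
    best, best_val = None, float("-inf")
    for node, val in enumerate(outputs):
        r, c = divmod(node, cols)
        if max(abs(r - wr), abs(c - wc)) <= exclude_radius:
            continue
        if val > best_val:
            best, best_val = node, val
    return winner, best, best_val / outputs[winner]


# Made-up activations for one performance on a 10 x 10 grid.
outputs = [0.0] * 100
outputs[29] = 1.0   # primary peak
outputs[28] = 0.8   # neighbor of the winner, ignored
outputs[43] = 0.6   # distant secondary peak
winner, second, ratio = secondary_peak(outputs)
print(winner, second, ratio)
```

A ratio near zero would correspond to a single area of output, while a large ratio at a node associated with high ratings would correspond to the secondary peak described for the excellent performances.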
Discussion
The idea of mental models can be useful when discussing procedures that occur through the acquisition
and use of knowledge, and it is this process that we have addressed in
our study. These models are represented by the recurrent performance patterns
or strategies that relate external events to what is already known. These
models are dynamic, continually being updated and modified with experience
in conjunction with the nature of the problem being encountered at a particular
time. Our approach to generating a neural network model of patient management
was deliberately retrospective. The data had been previously
collected, but the nature of the case and the coding of the actions were
unknown during the training and preliminary classification of student
performance. In essence, we worked from the data to the model, a different
approach from traditional scoring and task analysis.
We were able to find consistencies
between our neural network model and the expert-rater model established
by the NBME. The best agreement between the 2 models was observed among
those performances with high ratings at node 43/44 (Figure 1A). It was
evident that among the training performances at this node cluster, a high
rating by expert raters was predictable when the output was classified
to node 43/44. There was no single node that predicted a low rating. In
fact, the lack of performance clustering was often predictive of a poor
performance (Figure 1B). This indicates that incomplete or poorly formed
strategies are more likely to reflect elements of inefficiency or error.
It is also consistent between the CCS expert and
neural network models, since a poor rating reflects the lack of a cohesive
and productive strategy. In fact, the uncued manner of testing employed
in the CCS examinations is likely to lead to more random approaches
when a mental model is inadequate or incomplete.
Further validation of our
claim for the consistency of the NBME model and the neural network model
was shown through the training of multiple networks where we noted retention
of performance clustering in 2 additional networks (Table 1). These had
been trained under the same conditions as the 1st network, and we found
that for 4 of the 5 major nodal outputs analyzed, more than 80% of the
performances at 1 node were again grouped together when
classified by a different network. This finding could be verified with
at least 2 of the 3 networks analyzed.
We were able to observe areas
of inconsistency between the 2 models represented by the NBME and the neural
network. This was most apparent at node 29 where expert-assigned ratings
ranging from 2 (poor) to 8 (excellent) were classified together. When
the search path maps for these performances were generated, we observed
that the network classification was based on an excessive number of tests
used (Figure 2). This classification by the neural network brought features
of performances together where strategies reflected exhaustive, unproductive
searches (Figure 2A) as well as those that show the development of a productive
approach within, perhaps, a less focused strategy (Figure 2B).
We felt that a transition in
progress was emerging at this node, and we were able to observe a meaningful
subclassification within the strategies at node 29. These transitions
were seen when the neural network output for each node in the 10 x 10
output grid was visualized (Figure 2). In this case, the development of
a significant secondary output peak at node 43/44 indicated that these
particular performances contained similarities to excellent performances
(as in Figure 2B), but due to an overwhelming number of tests ordered,
the final output was at node 29. This is an important realization when
considering the true dynamic nature of learning and comprehension.
This example also serves
to illustrate the nature of unsupervised neural networks in classifying
patterns that may be hidden within data.11 ANNs can aid the
assessment process by capturing the full repertoire of performances in
a population even within a specific case in order to provide information
about problem-solving behavior. An ANN model, as a resource in the assessment
of complex, real-world clinical problem solving, is compatible and complementary
with currently employed models for assessment and recognizes a behavioral
dimension that has been difficult to objectify. It is also likely that
the model studied here will be useful with other complex performance data
sets where behaviors are likely to be hidden within the data.
References
1. Elstein AS, Kleinmuntz B, Rabinowitz M,
McAuley R, Murakami J, Heckerling PS, et al. Diagnostic reasoning of
high- and low-domain-knowledge clinicians: a reanalysis. [Published
erratum appears in Med Decis Making. Jul.-Sep. 1993;13(3):267.]
Med Decis Making. 1993;13:21-29.
2. Groen GJ, Patel VL. The relationship between
comprehension and reasoning in medical expertise. In: Chi MTH, Glaser
R, Farr MJ, eds. The Nature of Expertise. Hillsdale: L. Erlbaum
Associates; 1988:287-310.
3. Elstein AS. Beyond multiple-choice questions
and essays: the need for a new way to assess clinical competence. Acad
Med. 1993;68:244-249.
4. Fehrsen GS, Henbest RJ. In search of excellence. Expanding
the patient-centred clinical method: a 3-stage assessment. Fam
Pract. 1993;10:49-54.
5. Lantz MS, Chaves JF. What should biomedical
sciences education in dental schools achieve? J Dent Educ. 1997;61:426-433.
6. Clauser BE, Subhiyah RG, Nungester RJ,
Ripkey DR, Clyman SG, McKinley D. Scoring a performance-based assessment
by modeling the judgements of experts. JEM. 1995;32(no. 4):397-415.
7. Clauser BE, Subhiyah RG, Piemme TE, Greenberg
L, Clyman SG, Ripkey D, et al. Using clinician ratings to model score
weights for a computer-based clinical-simulation examination. Acad
Med. 1993;68:S64-S66.
8. Clauser BE, Swanson DB, Clyman SG. The
generalizability of scores from a performance assessment of physicians'
patient management skills. Acad Med. 1996;71:S109-S111.
9. Dunbar SB, Koretz DM, Hoover HD. Quality
control in the development and use of performance assessments. Appl
Meas in Educ. 1991;4:289-303.
10. Mislevy RJ, Gitomer DH. The role of probability-based
inference in an intelligent tutoring system. In: Anonymous User Modeling
and User-Adapted Interaction. Netherlands: Kluwer Academic Publishers;
1996:253-282.
11. Rumelhart DE, McClelland JL. Parallel
Distributed Processing: Explorations in the Microstructure of Cognition.
Cambridge: MIT Press; 1986.
12. Stevens RH, Lopo AC. Artificial neural
network comparison of expert and novice. Proc Annu Symp Comput Appl
Med Care. 1994;64-68.
13. Stevens RH, Lopo AC, Wang P. Artificial
neural networks can distinguish novice and expert strategies during
complex problem solving. J Am Med Inform Assoc. 1996;3:131-138.
14. Kohonen T. Self Organization and Associative
Memory. Berlin: Springer; 1989.
15. Lawrence J. Introduction to Neural
Networks. Nevada City, CA: California Scientific Software Press;
1993.
16. Stevens RH. Search path mapping: a versatile
approach for visualizing problem-solving behavior. Acad Med.
1991;66(suppl 9):s72-s75.
17. Stevens RH, Kwak AR, McCoy JM. Evaluating
preclinical medical students by using computer-based problem-solving
examinations. Acad Med. 1989;64:685-687.
18. Stevens RH, McCoy JM, Kwak AR. Solving
the problem of how medical students solve problems. MD Computing.
1991;8(1):13-20.