June 2 and 3, 2009
Note: Abstracts and contact information follow the conference schedule.
Tuesday, 2 June
8:30 – 9:00 Welcomes: Larry Rudner, GMAC, and Dave Weiss, University of Minnesota
9:00 – 10:15 Realities of CAT: Dave Weiss, University of Minnesota, Chair
Effect of Early Misfit in Computerized Adaptive Testing on the Recovery of Theta. Rick Guyer and David J. Weiss, University of Minnesota
Quantifying the Impact of Compromised Items in CAT. Fanmin Guo, Graduate Management Admission Council
Guess What? Score Differences With Rapid Replies Versus Omissions on a Computerized Adaptive Test. Eileen Talento-Miller and Fanmin Guo, Graduate Management Admission Council
Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss, University of Minnesota
10:30 – 12:00 CAT for Classification: Dave Weiss, University of Minnesota, Chair
Computerized Classification Testing in More Than Two Categories by Using Stochastic Curtailment. Theo J.H.M. Eggen, CITO and University of Twente, The Netherlands; and Jasper T. Wouda, CITO, The Netherlands
Utilizing the Generalized Likelihood Ratio as a Termination Criterion. Nathan A. Thompson, Assessment Systems Corporation
Testing Using Decision Theory.
"Black Box" Adaptive Testing by Mutual Information and Multiple Imputations. Anne Thissen-Roe, Kronos
Comparison of Computerized Adaptive Testing Approaches: Real-Data Simulations. Monica M. Rudick, Wern How Yam, and Leonard J. Simms, University at Buffalo, State University of New York
12:30 – 2:00 Posters: CAT Research and Applications Around the World (Concurrent)
A Comparison of Three Methods of Item Selection for Computerized Adaptive Testing. Denise Reis Costa and Camila Akemi Karino, CESPE/University of Brasilia; Fernando A. S. Moura, Federal University of Rio de Janeiro; and Dalton F. Andrade, Federal University of Santa Catarina, Brazil
Adequacy of an Item Pool for Proficiency in English Language From the University of Brasilia for Implementation of a CAT Procedure. Camila Akemi Karino, Denise Reis Costa, and Jacob Arie Laros, CESPE/University of Brasilia, Brazil
Development of an Item Model Taxonomy for Automatic Item Generation in Computerized Adaptive Testing. Hollis Lai, Mark J. Gierl, and Cecilia Alves, University of Alberta, Canada
An Approach to Implementing Adaptive Testing Using Item Response Theory in a Paper-Pencil Mode. V. Natarajan, MeritTrac Services Pvt. Ltd., India
Assessing the Equivalence of Internet-Based vs. Paper-and-Pencil Psychometric Tests. Naomi Gafni, Keren Roded, and Michal Baumer, National Institute for Testing and Evaluation, Israel
Features of a CAT System and Its Application to J-CAT. Shingo Imai and Y. Akagi
Measurement of Cognitive Ability Based on a Person’s Zone of Nearest Development. Marina Chelyshkova and Victor Zvonnikov, State University of Management, Russia
Figural Matrix Items in a Computerized Adaptive Testing System: Constrained Item Selection Using a Stochastically Curtailed SPRT. Jasper T. Wouda and Theo J. H. M. Eggen, CITO, The Netherlands
Enhanced Effective Response Time to Detect the Extent and Track the Trend of Item Pre-knowledge on a Large-Scale Computer Adaptive Assessment. Jie Li and Xiang Bo Wang, ACT, Inc.
Computerized Adaptive Testing for
Criterion-Related Validity of an Innovative CAT-Based Personality Measure. Robert J. Schneider, PDRI; Richard A. McLellan and Tracy M. Kantrowitz, PreVisor, Inc.; and Janis S. Houston and Walter C. Borman, PDRI, U.S.A.
1:00 – 1:40 CAT in
Adaptive Testing in
Years of Applying CAT for Admission to Higher Education in
2:00 – 3:15 Concurrent Sessions
Item Selection: Larry Rudner, GMAC, Chair
Item Selection and Hypothesis Testing for the Adaptive Measurement of Change. Matthew Finkelman, Tufts University School of Dental Medicine; David J. Weiss, University of Minnesota; and Gyenam Kim-Kang, Korea Nazarene University
A Gradual Maximum Information Ratio Approach to Item Selection in Computerized Adaptive Testing. Kyung (Chris) T. Han, Graduate Management Admission Council
Item Selection With Biased-Coin Up-and-Down Designs. Yanyan Sheng, Southern Illinois University at Carbondale
A Burdened CAT: Incorporating Response Burden with Maximum Fisher’s Information for Item Selection. Richard J. Swartz, The University of
Real-Time Analysis: Fanmin Guo, GMAC, Chair
Adaptive Item Calibration: A Simple Process for Estimating Item Parameters Within a Computerized Adaptive Test. G. Gage Kingsbury, Northwest Evaluation Association
On the Fly Item Calibration in Low Stakes CAT Procedures. Sharon Klinkenberg and Marthe Straatemeier, Department of Psychology, University of Amsterdam; Gunter Maris, CITO; and Han van der Maas, Department of Psychology, University of Amsterdam
An Automatic Online Calibration Design in Adaptive Testing. Guido Makransky, University of Twente / Master Management International A/S, and Cees A. W. Glas, University of Twente
Cheating Effects on the Conditional Sympson and Hetter Online Procedure with Freeze Control for Testlet-Based Items. Ya-Hui Su, University of California, Berkeley
3:25 – 5:30
Department of Defense
The Nine Lives of CAT-ASVAB: Innovations and Revelations. Mary Pommerich, Daniel O. Segall, and Kathleen E. Moreno, Defense Manpower Data Center
National Institutes of Health
The CAT-DI Project: Development of a Comprehensive CAT-Based Instrument for Measuring Depression. Robert D. Gibbons, University of Illinois at Chicago
Development of a CAT to Measure Dimensions of Personality Disorder: The CAT-PD Project. Leonard J. Simms, University at Buffalo, State University of New York
The MEDPRO Project: An SBIR Project for a Comprehensive IRT and CAT Software System
IRT Software: David Thissen, The University of North Carolina at Chapel Hill
CAT Software: Nathan Thompson, Assessment Systems Corporation
Wednesday, 3 June
8:15 – 9:25 Concurrent Sessions
Item Exposure: Larry Rudner, GMAC, Chair
Reviewing Test Overlap Rate and Item Exposure Rate as Indicators of Test Security in CATs. Juan Ramón Barrada, Universidad Autónoma de Barcelona; and Julio Olea, Vicente Ponsoda, and Francisco J. Abad, Universidad Autónoma de Madrid.
Optimizing Item Exposure Control and Test Termination Algorithm Pairings for Polytomous Computerized Adaptive Tests With Restricted Item Banks. Michael Chajewski and Charles Lewis, Fordham University
Limiting Item Exposure for Key-Difficulty Ranges in a High-Stakes CAT. Xin Li, Kirk A. Becker, and Jerry L. Gorham, Pearson VUE
Multidimensional CAT: Nate Thompson, Assessment Systems Corporation, Chair
Comparison of Adaptive Bayesian Estimation and Weighted Bayesian Estimation in Multidimensional Computerized Adaptive Testing. Po-Hsi Chen
Multidimensional Adaptive Testing: The Application of Kullback-Leibler Information. Chun Wang and Hua-Hua Chang, University of Illinois at Urbana-Champaign
Multidimensional Adaptive Personality Assessment: A Real-Data Confirmation. Alan D. Mead, Avi Fleischer, and Jessica D. Sergent, Illinois Institute of Technology
9:35 – 10:45 Item and Pool Development: Larry Rudner, GMAC, Chair
Adaptive Computer-Based Tasks Under an Assessment Engineering Paradigm. Richard M. Luecht, University of North Carolina at Greensboro
Developing Item Variants: An Empirical Study. Anne Wendt, National Council of State Boards of Nursing; Shu-chuan Kao and Jerry Gorham, Pearson VUE; and Ada Woo, National Council of State Boards of Nursing
Evaluation of a Hybrid Simulation Procedure for the Development of Computerized Adaptive Tests. Steven W. Nydick and David J. Weiss, University of Minnesota
11:00 - 11:55 Diagnostic Testing: Larry Rudner, GMAC, Chair
Computerized Adaptive Testing for Cognitive Diagnosis.
Obtaining Reliable Diagnostic Information through Constrained CAT. Hua-Hua Chang, Jeff Douglas, and Chun Wang
Applying the DINA model to GMAT Focus Data. Alan Huebner, Xiang Bo Wang, and Sung Lee, ACT, Inc.
11:55 - 12:30 Wrap-Up and Future Directions: Larry Rudner and Dave Weiss
Monica M. Rudick, Wern How Yam, and Leonard J. Simms
University at Buffalo, State University of New York
A variety of approaches have been implemented to create CAT personality assessments. Recent research has focused on IRT-based CAT personality measures, although IRT is computationally complex and requires assumptions that do not always hold for personality data. As a result, non-IRT CAT approaches, such as the countdown method, have also been applied successfully to personality measures. Within the countdown method, there is debate over whether classification or full-scores-on-elevated-scales (FSES) scoring is preferable, and it is unclear how the order of item administration affects item savings and the validity of scores. Both IRT and non-IRT methods offer notable advantages for CAT assessments, including time and item savings and ease of administration, but the two have yet to be directly compared. The purpose of the present study was to compare non-IRT and IRT-based approaches using real-data CAT simulations in a large, diverse sample (N = 8,690) that completed the Schedule for Nonadaptive and Adaptive Personality (SNAP). The report focuses on the three longest SNAP scales: Disinhibition (DIS), Negative Temperament (NT), and Positive Temperament (PT). Simulation analyses compared item savings, item and test information, test validity, and fidelity across the IRT and non-IRT CAT methods. Within the countdown-method simulations, we also examined whether item presentation order affected the results. Results will have implications for test developers wishing to apply CAT technology to personality measures.
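The countdown logic mentioned above can be illustrated with a short sketch of its classification variant: a scale stops being administered as soon as the classification decision is fixed, either because the cutoff has been reached or because it can no longer be reached. This is a generic illustration under assumed conventions (0/1 keyed responses, a fixed cutoff), not the SNAP implementation.

```python
def countdown_classify(responses, cutoff):
    """Classify a scale as 'elevated' or 'not elevated' without
    administering every item, stopping once the outcome is fixed.

    responses: 0/1 keyed responses in administration order (assumption:
        1 = response scored in the keyed direction).
    cutoff: minimum number of keyed responses for an elevated scale.
    Returns (classification, items_administered).
    """
    responses = list(responses)
    endorsed = 0
    for i, r in enumerate(responses, start=1):
        endorsed += r
        remaining = len(responses) - i
        if endorsed >= cutoff:                # cutoff reached: elevated
            return "elevated", i
        if endorsed + remaining < cutoff:     # cutoff unreachable: stop early
            return "not elevated", i
    return "not elevated", len(responses)
```

The item savings reported for countdown CATs come from these early exits: every item after the decision point is skipped.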
For further information: email@example.com
Marina Chelyshkova and Victor Zvonnikov, State University of Management, Russia
At the present moment the majority
schools and universities of
For further information: firstname.lastname@example.org
Poh Hua Tay and Raymond Fong, Ministry of Education, Singapore
Figural matrix items such as Raven’s Standard Progressive Matrices (SPM) are widely used for assessing the general intelligence of pupils. Administering such tests on a large scale via paper-and-pencil (P&P) requires substantial manpower. A computer-based test (CBT) offers logistical ease during the data collection stage and administrative ease during the data entry stage; this is especially true of CAT, which also reduces administration time. Unlike P&P and CBT, a CAT adaptively selects the most appropriate items for each pupil based on his or her responses to previous items. This permits each pupil to be evaluated on a smaller subset of the total item pool, to have a better test experience because items are matched to ability, and allows the test developer to control the error of measurement to a desired degree of precision.
In this study, an item bank of 195 figural matrix items similar to the SPM’s was created. The psychometric properties of these items were established by trialing them on a sample of 6,821 Primary 2 pupils (equivalent to Grade 2 pupils, about 8 years of age) of varying academic abilities from 20 coeducational schools in Singapore. IRT was used to calibrate all the figural matrix items. From this item bank, a P&P prototype, two CAT prototypes (one starting with an easy item, the other with an average item), and a CBT prototype were generated and administered, via the FastTEST Pro v2.3 platform, to four groups of Primary 2 pupils in Singapore. These groups consisted of a total of 948 Primary 2 pupils of varying academic abilities, selected from 12 coeducational schools. The SPM was also administered to all of them via P&P. This project was designed to study the comparability of pupil abilities estimated from the different prototypes (P&P, CATs, CBT) and the SPM.
For further information: email@example.com
Jie Li and Xiang Bo Wang, ACT, Inc.
In addition to being highly efficient and accurate in terms of scoring, diagnosis, and reporting, CAT is also known for its global ease and reach of test delivery (Wainer et al., 2000; Meijer & Nering, 1999; Parshall, Spray, Kalohn, & Davey, 2002). However, the latter advantage also introduces a tenacious problem: because of its high frequency of test administration, CAT can expose items to a large number of examinees, which is likely to increase pre-knowledge of items and to jeopardize score validity. Of great concern and interest to the entire educational testing industry is the possibility of validly detecting and tracking the extent to which CAT items are exposed. The purpose of this research was (1) to establish population item response times and associated trends for all items of a large-scale international CAT assessment, and (2) to investigate the feasibility of applying “effective response time” (ERT; Meijer & Sotaridona, 2006) to detect the extent and track the trend of item pre-knowledge on suspected compromised items on this assessment. The study was based on both operational and simulated data from a large item pool of a large-scale international CAT assessment. This item pool was selected because (1) it had a substantial number of new items that were pretested several years ago, when little or no item pre-knowledge could be assumed, and (2) these pretest items had a long history of operational use in subsequent years, when item pre-knowledge could have accumulated. ERT indices for both items and examinees, as described by Meijer and Sotaridona (2006), were computed against a large collection of new items at their pretest time, after they passed stringent pretest item quality reviews. The ERT indices from this round were used as null-hypothesis benchmarks, since no serious item pre-knowledge could be assumed.
In addition, simulations were conducted to project the values of these ERT indices if examinees’ response times were reduced by one-half and one-fourth, respectively. Examinees’ ability estimates on the operational items of this pool were used for ERT modeling. ERT indices were also computed when the new items were first used operationally, and the results were compared with their pretest counterparts.
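The basic flagging idea behind response-time screens of this kind can be sketched in simplified form: flag an item response whose log response time falls far below what the pretest benchmark for that item predicts, after adjusting for the examinee's overall speed. This is only a toy illustration of the concept, not the published Meijer and Sotaridona ERT model; the threshold and the speed adjustment are assumptions.

```python
import math

def flag_preknowledge(log_times, item_means, z_crit=-2.0):
    """Toy screen for suspiciously fast responses (simplified sketch,
    not the published ERT formulation).

    log_times: examinee's observed log response times per item.
    item_means: benchmark mean log response times for the same items,
        estimated from pretest data assumed free of pre-knowledge.
    Returns indices of items answered much faster than the benchmark
    predicts after adjusting for the examinee's overall speed.
    """
    # Examinee speed effect: average deviation from the item benchmarks.
    resid = [t - m for t, m in zip(log_times, item_means)]
    speed = sum(resid) / len(resid)
    # Spread of the speed-adjusted residuals.
    centered = [r - speed for r in resid]
    sd = math.sqrt(sum(c * c for c in centered) / len(centered)) or 1.0
    return [i for i, c in enumerate(centered) if c / sd < z_crit]
```

An examinee who knows one item in advance answers it in a fraction of the benchmark time, which stands out against his or her otherwise typical response times.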
For further information: Jie.Li@Act.org
Patricia Rickard, CASAS; James B. Olsen, Alpine Testing Solutions; Debalina Ganguli, CASAS; and Richard Ackermann, Team Code, Inc.
This paper presents and demonstrates innovations in computerized adaptive testing of adult workplace literacy and numeracy skills developed by CASAS and customized for the Singapore Employability Skills System (ESS). The Singapore Workforce Development Agency (WDA) plays a pivotal role in the implementation of the ESS “to enhance the employability and competitiveness of employees and job seekers, thereby building a workforce that meets the changing needs of Singapore’s economy.” CASAS has designed and developed CATs for mathematics, reading, and listening, and computer-delivered tests for writing and speaking, suitable for adults. The CATs are administered in secure proctored locations using local area networks and an electronic access key (dongle). This paper presents an overview of the project, demonstrations of sample test items from the test battery, presentation of the test delivery and administration system, review of test score results and psychometric analyses, and plans for future enhancements and extensions. The Singapore CATs use the following psychometric procedures: selection of the initial item at a random proficiency value near the center of the proficiency distribution of the selected item bank, Rasch-model calibration and proficiency estimation, and a stopping rule based on a minimum standard error or administration of a specified maximum number of items. Results for the mathematics and reading CATs are presented showing scale score population distributions, stopping rule exit criteria, item exposure distributions, and ability estimate and standard error curves across the item administration sequence. The paper presents summary recommendations for enhancements and extensions to the CAT tests and for additional CAT research and validity investigations.
The CAT results are based on examinee samples of approximately 12,000 for the reading tests and 9,000 for the numeracy tests.
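A CAT loop with the kind of stopping rule described above (stop at a minimum standard error or after a maximum number of items) can be sketched as follows. Everything here is illustrative: the `respond` callback, the item-bank representation as a list of Rasch difficulties, and the one-step ability update are assumptions, not the operational ESS system.

```python
import math

def rasch_cat(respond, bank, min_se=0.3, max_items=20, theta0=0.0):
    """Minimal Rasch-model CAT loop (illustrative sketch).

    respond: callback taking an item index, returning 0/1 (assumption).
    bank: list of Rasch item difficulties.
    Stops when the standard error falls to min_se or max_items is reached.
    Returns (theta_estimate, items_administered).
    """
    theta, used, info_sum = theta0, set(), 0.0
    for _ in range(max_items):
        # Select the unused item whose difficulty is closest to theta.
        j = min((i for i in range(len(bank)) if i not in used),
                key=lambda i: abs(bank[i] - theta))
        used.add(j)
        x = respond(j)
        p = 1.0 / (1.0 + math.exp(-(theta - bank[j])))
        info_sum += p * (1.0 - p)                # Rasch item information
        theta += (x - p) / max(info_sum, 0.1)    # one Newton-type step
        if 1.0 / math.sqrt(info_sum) <= min_se:  # SE-based stopping rule
            break
    return theta, len(used)
```

The standard error under the Rasch model is the reciprocal square root of accumulated information, so the minimum-SE criterion is equivalent to requiring a minimum amount of test information.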
For further information: firstname.lastname@example.org
Francisco J. Abad and David Aguado, Universidad Autónoma de Madrid
Juan Ramón Barrada, Universidad Autónoma de Barcelona
Julio Olea, Vicente Ponsoda, and Francisco J. Abad, Universidad Autónoma de Madrid
eCAT is a CAT developed and applied
For further information: email@example.com
Yanyan Sheng, Southern Illinois University at Carbondale
A basic ingredient in computerized adaptive testing (CAT) is the item selection procedure that sequentially selects and administers items based on a person's responses to the previously administered items. For decades, maximum information (MI; Lord, 1977; Thissen & Mislevy, 2000) has been widely used as the conventional algorithm for item selection in CAT. However, this criterion, based on Fisher’s information, targets only the middle difficulty level, where a person has about a 0.5 probability of answering items correctly, and hence is not applicable in situations where a different percentile is desired. In addition, MI relies heavily on an estimation procedure that is accurate in all testing situations; studies have shown, however, that such a procedure is not readily available.
The biased-coin up-and-down design (BCD; Durham & Flournoy, 1994) has been widely used in bioassay for sequential dosage level selection because it can target any arbitrary percentile in addition to being efficient (Bortet & Giovagnoli, 2005). As the problem in bioassay shares many similarities with CAT, it is reasonable to believe that an item selection algorithm based on the BCD, which does not rely on an accurate trait estimate at every step of a CAT administration, provides an efficient and more flexible alternative to the conventional method. The development of this selection algorithm is essential as schools, professional organizations, and private companies seek to make CAT flexible enough to be implemented in wider testing applications.
The purpose of this study was to illustrate the use of the BCD in CAT and to evaluate its utility by comparing it with the conventional MI algorithm. For ease of comparison, this study focused on the 1-parameter item response function. To investigate the utility of the BCD in CAT, two Monte Carlo simulation studies were conducted in which either a fixed- or a random-stopping rule was employed. With the fixed-stopping rule, the number of items administered was manipulated (k = 5, 10, 30, 100) and the item pool was fixed to have 100 different difficulty levels, whereas with the random-stopping rule, the number of different difficulty levels in the item pool was manipulated (n = 10, 30, 50, 100). In either case, CAT responses were simulated for persons whose actual trait levels were 0 (average), -1 (1 standard deviation below average), and -2 (2 standard deviations below average), and the target difficulty level was at the 20th, 50th, or 80th percentile. Each adaptive testing simulation began trait estimation with an initial value of 0 and proceeded with the maximum likelihood method. The results suggested that item selection with the BCD is more flexible in targeting an arbitrary percentile of the difficulty levels. With respect to the accuracy of trait estimation, MI performed slightly better with the fixed-stopping rule, whereas the BCD was considerably better with the random-stopping rule for tests with a small number of difficulty levels or for persons whose trait levels were not at the extremes.
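The up-and-down logic can be sketched as a single transition rule over an item pool sorted into discrete difficulty levels. This is a generic formulation of such designs under stated assumptions (target probability at or above .5, a coin bias of (1 − p)/p), not necessarily the exact rule studied in the paper.

```python
import random

def bcd_next_level(level, correct, target_p, n_levels, rng=random):
    """One step of a biased-coin up-and-down item selection rule (generic
    sketch), targeting the difficulty level at which the probability of a
    correct answer is target_p (assumed >= 0.5).

    level: index of the current difficulty level (0 = easiest).
    correct: whether the last response was correct.
    Returns the difficulty level for the next item.
    """
    if not correct:
        return max(level - 1, 0)               # incorrect: step down
    bias = (1.0 - target_p) / target_p         # biased coin, bias <= 1
    if rng.random() < bias:
        return min(level + 1, n_levels - 1)    # correct: step up w.p. bias
    return level                               # otherwise stay put
```

The bias is chosen so that, at equilibrium, the chance of stepping up (P(correct) times the bias) equals the chance of stepping down (P(incorrect)); solving that balance equation shows the walk settles exactly where P(correct) = target_p, which is how the design targets an arbitrary percentile without a trait estimate at each step.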
For further information: firstname.lastname@example.org
Sharon Klinkenberg, Marthe Straatemeier, and Han van der Maas, University of Amsterdam
We present a new model for computerized adaptive progress monitoring. This model is used in the Math Garden, a web-based monitoring system that includes a challenging web environment for children to practice arithmetic skills. The Math Garden is a CAT web application that tracks both accuracy and response time. Using a new model (Maris, in preparation) based on the Elo (1978) rating system and an explicit scoring rule, estimates of ability level and item difficulty are updated on every trial. Items are sampled with a mean success probability of .75, making the tasks challenging yet not too difficult. By integrating response time into the scoring rule, we try to compensate for the loss of information associated with the high success rates (van der Maas & Wagenmakers, 2005). In a period of eight months, our sample of 1,053 children completed over 850,000 arithmetic problems, about 25% of them outside school hours. Results show good validity and reliability, high pupil satisfaction as measured by playing frequency, and good diagnostic properties. The ability scores correlated highly with the Dutch norm-referenced general math ability scale of the CITO pupil monitoring system. Test-retest reliability analyses also showed high correlations. In view of the satisfactory validity and reliability of the person ability estimates, our method opens the door to on-the-fly item calibration in low-stakes testing.
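The per-trial update can be sketched in generic Elo terms: after each response, the person's ability and the item's difficulty move in opposite directions in proportion to the gap between the observed and expected result. This is a simplified illustration; the actual Math Garden scoring rule also incorporates response time, which is omitted here, and the K factor value is an assumption.

```python
import math

def elo_update(theta, beta, correct, k=0.4):
    """One Elo-style update of person ability (theta) and item difficulty
    (beta) after a single trial (generic sketch; response time omitted)."""
    expected = 1.0 / (1.0 + math.exp(-(theta - beta)))  # Rasch expectancy
    theta_new = theta + k * (correct - expected)  # person moves toward result
    beta_new = beta - k * (correct - expected)    # item moves the other way
    return theta_new, beta_new

def pick_item(theta, betas, target_p=0.75):
    """Select the item whose predicted success probability is closest to
    target_p, the .75 mean success probability mentioned in the abstract.
    An item has P(correct) = p when beta = theta - logit(p)."""
    want = theta - math.log(target_p / (1.0 - target_p))
    return min(range(len(betas)), key=lambda i: abs(betas[i] - want))
```

Because every trial updates both the person and the item, new items calibrate themselves through normal play, which is the sense in which the system supports on-the-fly calibration.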
For further information: S.Klinkenberg@uva.nl
Ya-Hui Su, University of California, Berkeley
In CAT, if a group of examinees purposefully memorizes items and distributes them to other prospective examinees, the fairness and accuracy of the CAT are undermined. Steffen and Mills (1999) investigated this effect and found that the more items were compromised and the more effective the cheating, the more severe the overestimation for the recipients, especially those with low ability levels. Su, Chen, and Wang (2004) pointed out that overestimation for the recipients was more severe when the sources had diverse ability levels, because more items were compromised. Su and Wang (2007) proposed an item exposure control procedure, the conditional Sympson and Hetter (Sympson & Hetter, 1985) online procedure with freeze control (denoted SHCOF). Results showed it to be superior to many conventional procedures in terms of measurement and operational efficiency. To assess the cheating effect, Su and Wang (2008) used the SHCOF procedure in a CAT and found that it could obtain precise person estimates in real time, without requiring simulations to generate item exposure parameters, in a unidimensional context. Little research has investigated cheating effects within a testlet context. Hence, it is of great value to ascertain whether SHCOF is also less affected by cheating between examinees under a testlet context, compared to a popular procedure such as the conditional multinomial method (SLC; Stocking & Lewis, 1998). The goal of this study was to use simulations to investigate how these two item exposure control procedures perform under various cheating conditions. It was hypothesized that SHCOF would be less affected by cheating than SLC.
Four independent variables were manipulated: (1) ability level of the sources, (2) ability distribution of the recipients, (3) cheating condition (no cheating, inefficient cheating, efficient cheating, and perfect cheating), and (4) item exposure control procedure (SHCOF and SLC). The root mean squared error (RMSE) was computed to describe the cheating effects; the more serious the cheating effect, the larger the RMSE. Under the no-cheating condition, there was no significant difference in RMSE between SHCOF and SLC. SLC showed more serious inflation of RMSE than SHCOF under the perfect-cheating condition, and as cheating became more severe, overestimation for the recipients under SLC grew accordingly. In addition, the more diverse the ability of the sources, the larger the RMSE and the mean positive bias. More importantly, SHCOF had smaller RMSE than SLC, because only SHCOF can simultaneously monitor item exposure and test overlap rates online. SHCOF can obtain precise person estimates without requiring simulations to generate item exposure parameters before use in an operational CAT. If test items are memorized by sources and shared with recipients, CAT becomes unfair because the ability levels of the recipients will be overestimated. In this study, SHCOF was found to be less affected by cheating than SLC; hence, the SHCOF procedure can be safely implemented in operational CAT.
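For readers unfamiliar with the base procedure that SHCOF conditions and extends, a plain Sympson-Hetter exposure filter can be sketched as follows. This is a generic illustration of the core idea only; the freeze-control and online-conditioning features of SHCOF are not shown, and the data representation is an assumption.

```python
import random

def sympson_hetter_filter(ranked_items, k_probs, rng=random):
    """Generic Sympson-Hetter exposure control step (simplest form).

    ranked_items: candidate item indices ordered from most to least
        informative for the current examinee.
    k_probs: exposure-control parameters P(administer | selected) per item,
        calibrated so no item's exposure rate exceeds the target.
    The top candidate is administered only with probability k_probs[item];
    on failure the next candidate is tried, damping exposure of the most
    popular items.
    """
    for item in ranked_items:
        if rng.random() < k_probs[item]:
            return item
    return ranked_items[-1]   # fall back to the last candidate
```

In the classic procedure the `k_probs` parameters are tuned through repeated pre-operational simulations; the advantage claimed for SHCOF above is precisely that it avoids this simulation step by adjusting exposure online.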
For further information: email@example.com
The CAT-DI Project: Development of a Comprehensive
CAT-Based Instrument for Measuring Depression
Robert D. Gibbons, University of Illinois at Chicago
The combination of IRT and CAT has proven invaluable in educational measurement. More recently, enormous reductions in patient and physician burden have been demonstrated using IRT-based CAT for mental health measurement problems (Gibbons et al., 2008). CAT administration of a 626-item mood and anxiety spectrum disorder inventory revealed that an average of 24 items per examinee was required to provide impairment estimates with a correlation of 0.93 with the original complete scale. Furthermore, the CAT-based scores showed twice the effect size of the total scale score in differentiating patients with bipolar disorder on the mood disorder subscale, despite an 83% reduction in the average number of items administered. These preliminary findings led to further interest and funding by the National Institute of Mental Health to develop a CAT-based instrument for the screening of major depressive disorder (the CAT Depression Inventory, CAT-DI) that can be used for routine screening of depression in general medical practice settings as well as specialty mental health clinics. A recent supplement to the parent CAT-DI grant extends our work on CAT for mental health measurement to CAT for diagnostic assessment of depression and other psychiatric disorders. The CAT Major Depressive Disorder (CAT-MDD) project will explore four different statistical/psychometric models for estimating the probability of an underlying discrete major depressive disorder based on adaptively administered, self-reported symptom ratings. The ultimate objective of this program of research is to reduce patient and physician burden in screening for and diagnosing depression in general practice settings.
Potential benefits include reduction in health care costs produced by high rates of service utilization among patients with an undiagnosed depressive illness, increased detection of depressive disorders, and increased access to quality mental health care for patients in need of such services.
For further information: firstname.lastname@example.org
Development of a CAT to Measure Dimensions
of Personality Disorder:
The CAT-PD Project
Leonard J. Simms,
This presentation describes the CAT-PD project, a funded, multi-year study designed to develop an integrative and comprehensive model and measure of personality disorder trait dimensions. Our general aims are to (1) identify a comprehensive and integrative set of dimensions relevant to personality pathology, and (2) develop an efficient CAT method, the CAT-PD, to measure these dimensions. To accomplish these goals, we plan a five-phase project to develop and validate the model and measure. The presentation describes the project generally, the results of Phase I (which focused on content domains and initial item bank development), and our plans for IRT/CAT with these item banks. In particular, I will focus on how the item banks will be used, the IRT models we are considering for item bank calibration, the CAT algorithms we plan to test, and our methods for deciding on a final set of procedures for the completed CAT-PD measure. Finally, I will discuss the CAT and IRT challenges we anticipate facing in the future.
For further information: email@example.com
Alan D. Mead, Avi Fleischer, and Jessica D. Sergent, Illinois Institute of Technology
Although CAT was developed in the context of ability tests (Weiss, 1982), studies have since demonstrated the effectiveness of CAT for measuring attitudes and personality. For example, Koch, Dodd, and Fitzpatrick (1990) applied the rating scale model to a Likert-scale attitudinal questionnaire. The rating scale model (an extension of the one-parameter logistic model for polytomous data) was found to fit the data very well and, although item pool issues were noted, to measure effectively. Other studies have found similar results for personality assessments, suggesting that perhaps half the items of an assessment are needed to achieve comparable reliabilities (Waller & Reise, 1989; Reise & Henson, 2000). However, one issue that has not been extensively treated in the prior literature is the multidimensional nature of most personality assessments: prior research has generally applied unidimensional CAT to individual scales. Segall (1996) presented a multidimensional CAT (MCAT) methodology in which correlations between the factors could be leveraged to administer and score items even more efficiently. Mead, Segall, Williams, and Levine (1997) described a Monte Carlo simulation of the adaptive administration of the 16PF Questionnaire (Cattell, Cattell, & Cattell, 1993; Conn & Rieke, 1994) using Segall’s MCAT method. As in Segall’s simulation, the MCAT method was effective in allowing additional reductions in assessment length beyond those typically achieved with unidimensional CAT. For example, overall assessment length could easily be cut in half with small decrements in scale reliabilities.
The purpose of the current study was to extend the results of the Monte Carlo simulation (Mead et al., 1997) to real data. This study is important for two reasons. First, it is always important to show that simulated results generalize to actual use. Even more importantly, recent research on personality (research that specifically included the 16PF; Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001) has suggested that traditional IRT models do not fit personality data well and might not be the most appropriate models (Stark, Chernyshenko, Drasgow, & Williams, 2006). If the IRT model is a poor fit to 16PF data, the Monte Carlo results will not hold for real data. On the other hand, if the real-data results replicate the simulation results, then we might assume that traditional IRT models fit 16PF data sufficiently well. We obtained archival data from the administration of the 16PF Questionnaire to approximately 5,000 individuals, and the two-parameter logistic model was fit to the items using BILOG-MG 3.0. Segall’s (1996) software was adapted to read the actual responses of the individuals for a real-data simulation. Results generally supported the use of MCAT with 16PF items. Correlations between actual 16PF scores and MCAT trait estimates were high (averaging .91 to .82) for MCAT tests shortened by up to 40–50%, while shorter MCAT tests had moderate correlations (averaging .72 to .58). The presentation will also discuss results for pool usage (about a third of the pool had exposure rates greater than 90%), efficiency for individuals with extreme scores, and practical considerations for adaptive personality assessment.
For further information: firstname.lastname@example.org