DOI: 10.3724/SP.J.1041.2017.01234

Acta Psychologica Sinica (心理学报) 2017/49:9 PP.1234-1247

Reporting overall scores and domain scores of bi-factor models

In large-scale assessments, most tests have a multidimensional structure, and there is increasing interest in reporting overall scores and domain scores simultaneously. Domain scores complement overall scores by providing a finer-grained diagnosis of examinees' strengths and weaknesses. However, because each dimension typically contains only a small number of items, insufficient reliability is the primary impediment to generating and reporting domain scores. A number of methods have recently been developed to improve the reliability and optimality of overall scores and domain scores. For overall scores, commonly used procedures include simple or weighted averaging of the scores from different content areas, and using the maximum information method to compute the weights of composite scores under the multidimensional item response theory (MIRT) framework. Subscoring methods have also been proposed within the classical test theory (CTT) and item response theory (IRT) frameworks, such as Kelley's (1927) regressed score method, the MIRT method, and the higher-order IRT method. The bi-factor model has become increasingly popular in educational measurement, and reporting overall scores and domain scores based on it has become an important topic. The purpose of this study was to investigate several methods for generating overall scores and domain scores based on the bi-factor model, and to compare them with the MIRT method under different conditions.
Study 1 used a mixed design, with the simulation conditions as between factors and the scoring method as the within factor. There were three between factors: (1) three sample sizes (500, 1000, 2000); (2) three test lengths (18, 36, and 60 items); and (3) five correlations between dimensions (0.0, 0.3, 0.5, 0.7, 0.9). The methods for generating overall scores and domain scores were: (1) original scores from the bi-factor model (Bifactor-M1); (2) summed original scores from the bi-factor model (Bifactor-M2); (3) weighted sums of original scores from the bi-factor model, with weights based on all the items (Bifactor-M3); and (4) weighted sums of original scores from the bi-factor model, with weights based on the items of each dimension (Bifactor-M4). The overall scores from Bifactor-M3 and Bifactor-M4 were the same. Because many studies have found that MIRT-based methods provide the best estimates of overall scores and subscores, the MIRT method was also applied and compared with the methods based on the bi-factor model. Under each condition, 30 replications were generated using SimuMIRT (Yao, 2015). BMIRT (Yao, 2015) was used to estimate the domain ability parameters with an MCMC method, and the overall ability was then generated by the maximum information method. Finally, the results were evaluated on four criteria: root mean square error (RMSE), reliability, correlation between the estimated scores and the true values, and correlation between the estimated domain scores. Study 2 was a real data example. A total of 4,815 responses to the science test of the National College Entrance Examination were collected. The test contained 66 items covering three subjects: Physics (17 items), Chemistry (30 items), and Biology (19 items). The four proposed methods and the MIRT method were applied to estimate overall scores and domain scores. For the real data, the overall ability and domain ability estimates from the MIRT model were used as "true" values to compare the relative performance of the different methods. The evaluation criteria were similar to those in the simulation study.
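The Study 1 design and two of its evaluation criteria can be illustrated with a minimal Python sketch. It is not the SimuMIRT/BMIRT workflow used in the study: the function names (`design_conditions`, `rmse`, `pearson`) and the example ability values are hypothetical, and only the factor levels and the number of replications are taken from the description above. Reliability is omitted because its computation is not specified here.

```python
import itertools
import math

# Factor levels and replication count as stated for Study 1.
SAMPLE_SIZES = (500, 1000, 2000)
TEST_LENGTHS = (18, 36, 60)
CORRELATIONS = (0.0, 0.3, 0.5, 0.7, 0.9)
REPLICATIONS = 30


def design_conditions():
    """Return every (sample size, test length, correlation) combination."""
    return list(itertools.product(SAMPLE_SIZES, TEST_LENGTHS, CORRELATIONS))


def rmse(estimates, true_values):
    """Root mean square error between estimated and true scores."""
    if len(estimates) != len(true_values) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation, e.g. between estimated scores and true values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    conditions = design_conditions()
    print(len(conditions), "conditions,",
          len(conditions) * REPLICATIONS, "simulated data sets")

    # Hypothetical true and estimated abilities for four examinees.
    true_theta = [-1.0, 0.0, 0.5, 1.2]
    est_theta = [-0.8, 0.1, 0.4, 1.0]
    print("RMSE:", rmse(est_theta, true_theta))
    print("Correlation:", pearson(est_theta, true_theta))
```

Crossing the three factors gives 3 × 3 × 5 = 45 conditions, so 30 replications per condition correspond to 1,350 simulated data sets in total.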
The simulation results showed that, for overall scores: (1) Bifactor-M1 and Bifactor-M2 had larger RMSE than the other methods; when the correlation between dimensions was low, the RMSE of Bifactor-M1 was the largest, and as the correlation increased, the RMSE of Bifactor-M2 became the largest. (2) Bifactor-M3 and the MIRT method had the smallest RMSE. (3) As the correlation between dimensions increased, the RMSE of Bifactor-M3 and the MIRT method decreased. (4) As the test length and the correlation between dimensions increased, Bifactor-M3 tended to report more reliable overall scores (reliability higher than 0.8). For domain scores: (1) Bifactor-M1 had the largest RMSE. (2) When the test was short, the RMSE of Bifactor-M2 was smaller than that of the MIRT method; when the test was long, the RMSE of Bifactor-M2 increased with the correlation between dimensions and exceeded that of the MIRT method when the correlation was 0.9. (3) The RMSE of Bifactor-M3 and Bifactor-M4 decreased as the correlation between dimensions increased. (4) The RMSE of Bifactor-M4 was equal to or smaller than that of the MIRT method. (5) As the test length and the correlation between dimensions increased, Bifactor-M3 and Bifactor-M4 tended to report more reliable overall scores. Finally, domain scores from Bifactor-M4 recovered the correlations between the true values better than the other methods. For the real data example, the results showed that: (1) the bi-factor model fitted the data better than the UIRT and MIRT models; and (2) the overall scores from Bifactor-M3 and the domain scores from Bifactor-M4 were similar to those from the MIRT method.
In conclusion, the overall scores and domain scores from Bifactor-M4 generally performed better than those from the other proposed methods. First, the scores from Bifactor-M4 had smaller RMSE and higher reliability. Second, the correlations between domain scores from Bifactor-M4 were similar to the true values. This method is therefore highly recommended in practice, especially in the following situations: (1) When test designers have a specific definition of the core competencies, the bi-factor model can provide estimates of the core competencies, overall scores, and domain scores simultaneously. (2) When a test has a multidimensional structure and the correlations between dimensions are high, the bi-factor model is suggested for calibrating the data. (3) When a study, in addition to reporting overall scores and domain scores, also focuses on the relationships among the general construct, the domain-specific constructs, and a criterion, the bi-factor model is recommended.

Key words: bi-factor model, multidimensional item response theory, overall scores, domain scores

ReleaseDate:2017-10-20 02:10:46

Ackerman, R. A., Donnellan, M. B., & Robins, R. W. (2012). An item response theory analysis of the Narcissistic Personality Inventory. Journal of Personality Assessment, 94(2), 141-155.

Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software]. Chicago, IL: Scientific Software.

Cai, L., Yang, J. S., & Hansen, M. (2011). Generalized full-information item bifactor analysis. Psychological Methods, 16(3), 221-248.

Chen, F. F., Hayes, A., Carver, C. S., Laurenceau, J. P., & Zhang, Z. G. (2012). Modeling general and specific variance in multifaceted constructs: A comparison of the bifactor model to other approaches. Journal of Personality, 80(1), 219-251.

Chen, F. P. (2015). The estimation of subscores with the use of higher-order item response models (Unpublished master's thesis). Zhejiang Normal University. [陈飞鹏. (2015). 高阶项目反应模型估计子分数(硕士学位论文). 浙江师范大学.]

Cheng, Y. Y., Wang, W. C., & Ho, Y. H. (2009). Multidimensional Rasch analysis of a psychological test with multiple subtests: A statistical solution for the bandwidth-fidelity dilemma. Educational and Psychological Measurement, 69(3), 369-388.

de la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities: A higher-order IRT model approach. Applied Psychological Measurement, 33(8), 620-639.

de la Torre, J., Song, H., & Hong, Y. (2011). A comparison of four methods of IRT subscoring. Applied Psychological Measurement, 35(4), 296-316.

DeMars, C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13(4), 354-378.

Fukuhara, H., & Kamata, A. (2011). A bifactor multidimensional item response theory model for differential item functioning analysis on testlet-based items. Applied Psychological Measurement, 35(8), 604-622.

Gavett, B. E., Crane, P. K., & Dams-O'Connor, K. (2013). Bi-factor analyses of the Brief Test of Adult Cognition by Telephone. NeuroRehabilitation, 32(2), 253-265.

Gibbons, R. D., Weiss, D. J., Kupfer, D. J., Frank, E., Fagiolini, A., Grochocinski, V. J., … Immekus, J. C. (2008). Using computerized adaptive testing to reduce the burden of mental health assessment. Psychiatric Services, 59(4), 361-368.

Gu, H. L., Wen, Z. L., & Fang, J. (2014). Bi-factor models: A new measurement perspective of multidimensional constructs. Journal of Psychological Science, 37(4), 973-979. [顾红磊, 温忠麟, 方杰. (2014). 双因子模型:多维构念测量的新视角. 心理科学, 37(4), 973-979.]

Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33(2), 204-229.

Holzinger, K. J., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1), 41-54.

Huang, H. Y. (2015). A multilevel higher order item response theory model for measuring latent growth in longitudinal data. Applied Psychological Measurement, 39(5), 362-372.

Li, Y., & Lissitz, R. W. (2012). Exploring the full-information bifactor model in vertical scaling with construct shift. Applied Psychological Measurement, 36(1), 3-20.

Liu, Y., & Liu, H. Y. (2012). When should we use testlet model? A comparison study of Bayesian testlet random-effects model and standard 2-PL Bayesian model. Acta Psychologica Sinica, 44(2), 263-275. [刘玥, 刘红云. (2012). 贝叶斯题组随机效应模型的必要性及影响因素. 心理学报, 44(2), 263-275.]

Liu, Y., & Liu, H. Y. (2013). Comparison of MIRT linking methods for different common item designs. Acta Psychologica Sinica, 45(4), 466-480. [刘玥, 刘红云. (2013). 不同锚测验设计下多维IRT等值方法的比较. 心理学报, 45(4), 466-480.]

Reckase, M. D. (2009). Multidimensional item response theory models. New York: Springer.

Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667-696.

Reise, S. P., Bonifay, W. E., & Haviland, M. G. (2013). Scoring and modeling psychological measures in the presence of multidimensionality. Journal of Personality Assessment, 95(2), 129-140.

Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92(6), 544-559.

Rodriguez, A., Reise, S. P., & Haviland, M. G. (2016). Evaluating bifactor models: Calculating and interpreting statistical indices. Psychological Methods, 21(2), 137-150.

Wang, W. C., Chen, P. H., & Cheng, Y. Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9(1), 116-136.

Willoughby, M. T., Blanton, Z. E., & the Family Life Project Investigators. (2015). Replication and external validation of a bi-factor parameterization of attention deficit/hyperactivity symptomatology. Journal of Clinical Child & Adolescent Psychology, 44(1), 68-79.

Wu, E. J. C., & Bentler, P. M. (2011). EQSIRT: A user-friendly IRT program. Encino, CA: Multivariate Software, Inc.

Yao, L. (2013). The BMIRT toolkit. Monterey.

Yao, L. H. (2010). Reporting valid and reliable overall scores and domain scores. Journal of Educational Measurement, 47(3), 339-360.

Yao, L. H. (2011). Multidimensional linking for domain scores and overall scores for nonequivalent groups. Applied Psychological Measurement, 35(1), 48-66.

Yao, L. H., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31(2), 83-105.

Yao, L. H., & Boughton, K. (2009). Multidimensional linking for tests with mixed item types. Journal of Educational Measurement, 46(2), 177-197.

Zhan, P. D., Chen, P., & Bian, Y. F. (2016). Using confirmatory compensatory multidimensional IRT models to do cognitive diagnosis. Acta Psychologica Sinica, 48(10), 1347-1356. [詹沛达, 陈平, 边玉芳. (2016). 使用验证性补偿多维IRT模型进行认知诊断评估. 心理学报, 48(10), 1347-1356.]