Arifin Ahkam Muhammad(1*)
(1) Institut Parahikma Indonesia
(*) Corresponding Author
DOI : 10.24256/ideas.v6i2.520


This paper begins with the test specifications of two tests: the First Certificate in English (FCE) and the Business Language Testing Service (BULATS). It then evaluates their usefulness in terms of reliability, (construct) validity, backwash, and practicality (Bachman & Palmer, 1996; see Kunnan, 2004 for a slightly different perspective). The test specifications are examined first because a test should be evaluated against its purpose and the construct it is trying to measure (Luoma, 2004). The evaluation itself starts with test (score) reliability, since a test cannot be considered valid if it is not reliable (Brown, 1996; but see Moss, 1994 for the argument that a test can be valid without reliability). Throughout this paper, the term “test(ing)” is used more or less synonymously with “assess(ment)” and “measure(ment)”, because, as Bachman and Palmer (2010, p. 20) point out, in the field of language testing these terms have been very broadly defined “as the process of collecting information” to make decisions. (See Bachman, 1990; Cohen & Swerdlik, 2010; Douglas, 2010 for the distinctions, e.g., a test is a tool for assessment.)


evaluation, validity, reliability, practicality, First Certificate in English, Business Language Testing Service


Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14(1), 115–129.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford University Press.

Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89–110.

Brown, J. D. (1990). Short-cut estimators of criterion-referenced test consistency. Language Testing, 7(1), 77-97.

Brown, J. D. (1995). The elements of language curriculum: A systematic approach to program design. Boston: Heinle & Heinle.

Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.

Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New York, NY: McGraw Hill College.

Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University Press.

Carr, N. T. (2011). Designing and analyzing language tests. Oxford: Oxford University Press.

Chambers, L., & Ingham, K. (2011). The BULATS online speaking test. Research Notes, 43(1), 21–25.

Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In Bachman, L. and Cohen, A., (eds), Interfaces between second language acquisition and language testing research. Cambridge: Cambridge University Press, 32–70.

Cohen, R. J., & Swerdlik, M. E. (2010). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). New York: McGraw-Hill.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

Davies, A. (1990). Principles of language testing. Oxford: Blackwell.

Davies, A. (2003). Three heresies of language testing research. Language Testing, 20(4), 355–368.

Davies, A., Brown, A., Elder, C. and Hill, K. (1999). Dictionary of Language Testing. Cambridge: Cambridge University Press.

Douglas, D. (2010). Understanding language testing. New York: Routledge.

Ennis, R. H. (1999). Test reliability: a practical exemplification of ordinary language philosophy. In R. Curren (ed.), Philosophy of education. Urbana, IL: The Philosophy of Education Society, 242–48.

Fulcher, G. (2003). Testing second language speaking. London, UK: Pearson-Longman.

Galaczi, E. D. (2008). Peer–peer interaction in a speaking test: The case of the First Certificate in English examination. Language Assessment Quarterly, 5(2), 89-119.

Green, A. (2014). Exploring language assessment and testing: language in action. New York: Routledge.

Hackett, E. (2002). Revising the BULATS standard test. Research Notes, 8(1), 7–10.

Hughes, A. (2003). Testing for language teachers. Cambridge: Cambridge University Press.

Jones, N. (2000). BULATS: A case study comparing computer-based and paper-and-pencil tests. Research Notes, 3, 10–13.

Kunnan, A. J. (2004). Test fairness. In M. Milanovic and C. Weir (Eds.), European language testing in a global context. Cambridge: Cambridge University Press.

Lado, R. (1961). Language testing. London: Longman.

Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.

McNamara, T. (1996). Measuring Second Language Performance. London: Longman.

McNamara, T. F. & Roever, C. (2006). Language testing: The social dimension. Oxford: Blackwell.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement. New York: Macmillan, pp. 13–103.

Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5–12.

Newman, I., Newman, C., Brown, R., & McNeely, S. (2006). Conceptual statistics for beginners (3rd ed.). Lanham, MD: University Press of America.

Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores. System, 30(2), 143-154.

O'Sullivan, B., Weir, C. J., & Saville, N. (2002). Using observation checklists to validate speaking-test tasks. Language Testing, 19(1), 33–56.

Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19(1), 405–450.

Thorndike, R. L., & Hagen, E. P. (1977). Measurement and evaluation in psychology and education. New York: Wiley.

UCLES (University of Cambridge Local Examinations Syndicate). (2015). Cambridge English First: Handbook for Teachers. Cambridge: University of Cambridge Local Examinations Syndicate.

UCLES (University of Cambridge Local Examinations Syndicate). (2011). BULATS Business Language Testing Service: Information for Candidates. Cambridge: University of Cambridge Local Examinations Syndicate.

UCLES (University of Cambridge Local Examinations Syndicate). (2013). Principles of good practice: Quality management and validation in language assessment: Validity, reliability, impact, practicality. Cambridge: University of Cambridge Local Examinations Syndicate.

Wall, D. (1997). Impact and washback in language testing. In C. Clapham and D. Corson (Eds.), Encyclopedia of language and education, 291–302.

Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.

Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Houndmills, Hampshire: Palgrave Macmillan.

Wiliam, D. (1993). Validity, dependability and reliability in national curriculum assessment. The Curriculum Journal, 4(3), 335-350.


Copyright (c) 2018 Arifin Ahkam Muhammad

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.