ANALYSIS OF A RESEARCH INSTRUMENT TO MAP ENGLISH TEACHERS' PROFICIENCY

Teachers' English proficiency can be measured by designing a research instrument in the form of a test, and the test devised must fulfill the requirements of a good test. This article discusses item analysis, centering on multiple choice questions used to measure the proficiency of Indonesian high school teachers involved in English instruction. The first test set, which is syllabus-oriented, was tried out on 20 subjects, and the second set, which is general English oriented, on 28 subjects. The analysis indicates that the item difficulty indices range from .20 to 1 for the first set and from .07 to .89 for the second set. With regard to item discrimination, the D values range from -0.33 to 1.0 for the first set and from -0.11 to .78 for the second set. Overall, the test has an 'average' level of difficulty and is 'good' at discriminating between high and low achieving test takers; before it is used in the actual research, the test is revised to eliminate the 'bad' items.


INTRODUCTION
Teachers' subject matter mastery and teaching competence affect the attainment of instructional objectives. Their skills and knowledge have been highlighted as a key component associated with clear objectives for student learning and accomplished teaching (OECD, 2005, cited in Caena, 2011). Teacher quality is in fact the key to enhancing students' achievement (Barber & Mourshed, 2007; Chetty, 2011; Rasmussen & Holm, 2012; Harjanto et al., 2017). It is, therefore, crucial that research on teacher competence be conducted.
With the increasing importance of English as a language of global communication, the quality of English instruction in schools has drawn research interest, particularly in countries where English is not the lingua franca. A number of studies on teachers' English proficiency have been conducted. Author (20xx) urged that to set advanced competencies in the English curriculum, Indonesian teachers' English proficiency first had to be improved. Tsang (2011) investigated to what extent 20 primary school English teachers in Hong Kong were aware of English metalanguage and found the need for regular or systematic use of metalanguage among school teachers. Sharif (2013) was concerned that teachers' limited English proficiency distorted students' understanding of the content taught. Othman and Nordin (2013) studied the correlation between the Malaysian University English Test (MUET) and the academic performance of English teacher education students. Earlier, Lee (2004) criticized the use of the high-stakes MUET as a driver to improve English proficiency and suspected that the very traditional approach to teaching reading, with its focus on discrete skills, may have resulted from teachers' preoccupation with getting their students to pass the MUET.
More recently, Nair and Arshad (2018) examined the discursive construction of Malaysian English language teachers in relation to the Malaysian Education Blueprint action plan from 2013 to 2015 and argued for ways to help teachers achieve the desired proficiency and make changes to existing classroom practices that are aligned with the government agenda.
The competence of Indonesian teachers of English has also been the focus of a number of studies. A study examining the English proficiency of teachers in West Java (Lengkanawati, 2005) used a TOEFL-equivalent test and found that the majority of the teachers did not demonstrate a satisfactory proficiency level. Subject matter competence "may seem to be taken for granted by many people other than the English teachers themselves. They tend to put a lot of pressure on themselves to excel in the subject matter. Actually this competence is already guaranteed by the requirement that a teacher has to have an S1 or D-IV degree qualification, and as such, it is understandable that other people view subject matter competence as something given by their formal education" (p. 55). However, the guarantee of subject matter competence through teachers' formal education is still very much debatable, as graduate competence standards have yet to be established and enforced in English teacher education.

Assessing English teachers' competence remains a salient issue. Soepriyatna (2012) investigated and assessed the competence of high school teachers of English in Indonesia, setting out three dimensions of the English language competence domain (language skills, linguistic, and sociocultural), two dimensions of the content knowledge domain (text types and grammar points), and seven dimensions of the teaching skills domain (objectives, material development, learning management, teaching techniques, learning styles, learning strategies, and qualities of an engaging teacher). He developed performance tasks to assess the twelve competence dimensions. The language proficiency covered in the first two domains is addressed in performance indicator statements such as "uses vocabulary correctly and appropriately" and "maintains grammatical accuracy." Soepriyatna did not, however, address how those indicators can be determined reliably. A test specifically constructed to assess the English proficiency of high school teachers has yet to be developed in Indonesia. The Ministry of Education has been administering an annual Teacher Competency Test for all teachers as part of the certification process. The online test comprises subject area and pedagogy items and therefore does not specifically address language proficiency. Furthermore, there have been concerns that the test was not adequately constructed (Prasetyo, 2017; Putra, 2017).
An essential requirement for a test to be employed to convey teachers' proficiency is that it should be a good research instrument: the test devised ought to be valid and reliable. One extensively used step toward fulfilling this requirement is analyzing the test items, which Gronlund (1982: 101) simply puts as "studying the students' responses to each item." Plakans and Gebril (2015) assert that item analysis is a checking procedure to see that test questions are at the right level of difficulty. It is also a procedure to check that test questions distinguish test takers appropriately.
Test item analysis based on classical measurement theory serves to measure the item difficulty index, the item discrimination index, and distractor effectiveness (Hughes, 1989). Classical test theory places fewer demands on the number of test takers whose answers are analyzed. The theory is consequently more practical, since no formal training is needed prior to the analysis. The item analysis can easily be performed manually, for instance with the aid of a calculator, or by using a simple computer program. The weakness of this theory is the interdependency between test takers and item difficulty level.
Item response theory appears as a response to this weakness of classical measurement theory. Based on item response theory, also called Rasch analysis (Hughes, 1989: 163), test item difficulty is ideally constant, taking no notice of whichever group is being tested.
This theory performs item analysis by calculating the difficulty index only (commonly termed a one-parameter logistic model); the item difficulty index and the item discrimination index (a two-parameter logistic model); or the difficulty index, discriminating power, and a guessing element (a three-parameter logistic model). The more elements to be analysed, the more test takers must be engaged for their answers to be analysed. In conclusion, classical test theory is more practical than item response theory, as it can be conducted more easily and does not require a large number of test takers. This article presents the results of test item analysis. The analysis is delimited to item difficulty and item discrimination, and is carried out to contribute to revealing the reliability of an instrument to measure high school teachers' English proficiency.
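As an illustration (not part of the original study), the three logistic models just mentioned can be sketched in a few lines; the function name and parameter values below are invented for the example.

```python
import math

def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct answer under the three-parameter logistic (3PL)
    model: b is item difficulty, a is discrimination, c is the guessing element.
    Setting c=0 gives the 2PL model; additionally setting a=1 gives the
    one-parameter (Rasch) model."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Rasch model: when ability (theta) equals difficulty (b), the probability is 0.5
print(irt_probability(theta=0.0, b=0.0))  # 0.5

# 3PL: a very low-ability test taker still scores near the guessing element c
print(round(irt_probability(-10, 0, a=1.5, c=0.25), 3))
```

The sketch shows why more parameters demand more test takers: each added parameter (a, then c) must be estimated per item from the response data.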
Difficulty level is most often paired with other terms having the same meaning, such as difficulty index, index of item difficulty, or facility value as used by Hughes (1989), Brown (2004), and Brown and Abeywickrama (2010), or Item Facility as used by Brown (1996). They all refer to the same construct.
The difficulty index is a score indicating whether a test item is difficult or easy. The level of item difficulty can be explained as the percentage of test takers who answer a test item correctly. Gronlund (1982) points out that it is the percentage of those answering the item correctly. Brown (1996: 64-65) similarly asserts that it is "a statistical index used to examine the percentage of students who correctly answer a given item." The difficulty index, symbolized as the P value, is therefore obtained by measuring the proportion of students who are able to answer the item correctly. It functions as an indicator for test makers to gauge the quality of their test by determining whether it is difficult or easy, and item difficulty analysis also reveals students' ability on the problem being analyzed.
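To make the computation concrete, the P value, together with difficulty bands approximating those reported later in this article (easy .75 and above, average roughly .30 to .74, difficult below .30), can be sketched as follows; the data and function names are illustrative, not taken from the study.

```python
def difficulty_index(item_scores):
    """P value: the proportion of test takers who answered the item correctly.
    item_scores holds 1 for a correct answer and 0 for an incorrect one."""
    return sum(item_scores) / len(item_scores)

def difficulty_label(p):
    """Bands approximating those used in this article's results."""
    if p >= 0.75:
        return "easy"
    if p >= 0.30:
        return "average"
    return "difficult"

# Illustrative data: one item answered by 10 test takers, 7 of them correctly
item_scores = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
p = difficulty_index(item_scores)
print(p, difficulty_label(p))  # 0.7 average
```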
With regard to a good P value, the majority of test analysts would argue for the level of 'sufficient' or 'medium' (a P value of 0.50) for a good test. Meanwhile, Hughes (1989: 162) claims, "There can be no strict rule about what range of facility values are to be regarded as satisfactory. It depends on what the purpose of the test is … The best advice … is to consider the level of difficulty of the complete test." Some literature labels the index of item discriminating power with the letter 'D', while some others use the two letters 'DI'. This D value or DI value reveals the discriminating power of a test item. To be more specific, it indicates "the degree to which an item separates the students who performed well from those who performed poorly" (Brown, 1996: 68); it therefore allows the test developer to contrast the performance of high achievers and low achievers. An item discrimination index of 1.00 is considered "very good as it indicates the maximum contrast between the upper group and lower groups of students; that is, all the high-scoring students answered correctly and all the low-scoring students answered incorrectly" (Brown, 1996: 68).
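The upper/lower group contrast behind the D value can likewise be sketched. Splitting the test takers at the median (rather than, say, the top and bottom thirds) is an assumption of this example, and all names and data are illustrative.

```python
def discrimination_index(item_scores, total_scores, group_fraction=0.5):
    """D value = p_upper - p_lower: the proportion answering the item correctly
    in the high-scoring group minus that in the low-scoring group."""
    n = max(1, int(len(total_scores) * group_fraction))
    # Rank test takers by their total test score, highest first
    order = sorted(range(len(total_scores)),
                   key=lambda i: total_scores[i], reverse=True)
    upper, lower = order[:n], order[-n:]
    p_upper = sum(item_scores[i] for i in upper) / n
    p_lower = sum(item_scores[i] for i in lower) / n
    return p_upper - p_lower

# One item answered by 6 test takers, with their total test scores
totals = [48, 45, 40, 30, 25, 20]
item = [1, 1, 1, 0, 0, 1]  # 1 = correct on this item
print(round(discrimination_index(item, totals), 2))  # 0.67
```

A D of 1.00 arises only when every upper-group member answers correctly and every lower-group member answers incorrectly, matching Brown's maximum-contrast description above.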
In light of the need for better quality English instruction in Indonesia, our research team identified the research gap of mapping the content knowledge competence of English language teachers in Indonesian high schools and assessing their English proficiency. This study is part of a larger research project funded in 2018 by the Indonesian Ministry of Research, Technology and Higher Education to conduct a mapping of high school teachers of English.
This article presents the construction of a test to assess their English proficiency as a preliminary step before assessing their English language teaching competences.

METHOD
As previously mentioned in the background, the test constructed by the research team will be used as a research instrument to map the English proficiency of high school teachers in Indonesia.

Design
This study, which centers on item analysis, is quantitative in nature. The statistics computed are the difficulty and discriminating power values.
In order for the test to be an accurate measure of what it is supposed to measure, and, more importantly, in order that the test does not result in "a harmful backwash effect" (Hughes, 1989: 22-23), and to serve as an effective strategy to determine the content of Multiple Choice questions (Plakans & Gebril, 2015), a test specification was prepared. A test specification provides "the construct framework for operationalizing the test design through subsequent item development" (Kopriva, 2008: 65). Despite the counterargument that Multiple Choice questions do not adequately simulate how language is used in real life, Multiple Choice questions occasionally provide better coverage of content than today's performance-based assessment (Plakans & Gebril, 2015). Furthermore, in spite of its drawbacks, the Multiple Choice format offers efficiency of administration, particularly when a large number of test takers is involved. These reasons led the research team to include the Multiple Choice type.

Subjects
There were 20 and 28 subjects involved in the first and second tests respectively. Some subjects were pre-service teachers/fresh graduates of the English Department of the Teacher Training Faculty who had not yet entered the teaching field. Other subjects were completing their last semester at the English Department of the Teacher Training Faculty, finishing their thesis writing. The tryout subjects excluded the teachers who would be engaged in the subsequent research.

Instrument
The test was developed to cover three main categories: syllabus-oriented, general English (grammar and reading comprehension), and essay. Three test types were utilized: Multiple Choice, Cloze test, and Writing. Altogether 65 items were developed. This paper presents only the analysis of the 50 Multiple Choice items (the other test types, a Cloze test amounting to 15 items and a Writing test, are not analysed). Among the seven Multiple Choice formats (Haladyna, Downing, & Rodriguez, 2002), the one used in this study was Conventional MC. The first test set, which consists of 30 items, is presented in Table 1.
The test specification guiding the construction of the 30 items in the first test set is taken from the currently used 2013 English Curriculum for high school in Indonesia.
The second test set, which is general English, consists of 20 items covering 10 Grammar and 10 Reading Comprehension items, as presented in Table 2 and Table 3 respectively. The analysis of the second test set, as seen in Figure 3, indicates that the item difficulty indices (P values) range from .79 to .89 for easy items, which amount to 15%, and from .32 to .68 for average items, which amount to 75%. The P values range from .07 to .29 for difficult items, which reach 10%, the smallest percentage of the total. The average items thus occupy the highest percentage rank. Calculating the average difficulty level for the general English oriented test (the second test set), the writer finds it to be .55, revealing an average level of difficulty. When all 50 items are combined and analysed for their P and D values, it is found, as seen in Figure 5, that 13 items (26%) belong to the easy category (ranging from .75 to 1), 32 items (64%) to the average category (ranging from .32 to .7), and 5 items (10%) to the difficult category (ranging from .07 to .29). Having combined the detailed calculations of the two test sets, covering the syllabus oriented and general English tests, the writer finds that the average P value equals .60 and the D value equals .41. This finding makes it evident that the devised test has reached an average level of item difficulty and is classified as good at discriminating between high and low achieving test takers. This finding is congruent with Sim and Rasiah (2006), who state that MCQ items demonstrating a good discrimination index tend to be average in item difficulty. They further claim that items in the moderately difficult to very difficult range are more likely to show negative discrimination.
Nevertheless, as nine and four bad items appear in the first and second test sets respectively, the test devised for inclusion in the actual research should be reassessed. The bad items can simply be eliminated, or improved by developing more items. Following Boopathiraj and Chellamani's (2013) suggestion, the items kept for the actual research instrument should be arranged so that items of higher, moderate, and lower indices of difficulty are organized in a balanced composition.
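The revision step just described can be sketched as flagging items by D value, using bands approximating those reported in this study ('very good' around .83 and up, 'good' .5 to .67, 'sufficient' about .33, anything lower 'bad'); the exact cut-offs and the sample values below are illustrative assumptions, not the study's data.

```python
def d_label(d):
    """Classify an item's discriminating power, with cut-offs approximating
    the bands reported in this article."""
    if d >= 0.83:
        return "very good"
    if d >= 0.50:
        return "good"
    if d >= 0.33:
        return "sufficient"
    return "bad"

# Illustrative D values for six items; 'bad' items are flagged for
# elimination or rewriting before the actual research
d_values = [1.0, 0.83, 0.5, 0.33, 0.2, -0.33]
labels = [d_label(d) for d in d_values]
flagged = [d for d, lab in zip(d_values, labels) if lab == "bad"]
print(labels)   # ['very good', 'very good', 'good', 'sufficient', 'bad', 'bad']
print(flagged)  # [0.2, -0.33]
```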

CONCLUSION AND SUGGESTIONS
This article is a report on test item analysis centering on Multiple Choice questions used to measure the proficiency of Indonesian high school teachers involved in English instruction. Restricted to the analyses of item difficulty and item discrimination, the study has found that for the whole test (covering the syllabus oriented and general English oriented items) the average P value equals .60 and the D value equals .41. The result of the item analysis of the devised test can hopefully become a section in a good item bank for decision makers dealing with teacher professional development. Another suggestion is for test developers to consider the needs of test takers by developing a test which explores the possibility of co-certification, as exemplified by Newbold (2011).

Figure 1. Item Difficulty of Syllabus-Oriented Items

Meanwhile, as displayed in Figure 2 below, the indices of discriminating power range from -0.33 to 1.0. With D values of .83 to 1, seven items (23.3%) are 'very good' at discriminating between high achieving and low achieving test takers. With D values of .5 to .67, nine items (30%) are 'good' at discriminating between the high and low achieving test takers. Five items (16.7%) have a D value of .33, indicating they are 'sufficient' in discriminating between the two groups. Nine items (30%) are 'bad' ones; they cannot distinguish between the two groups well, and one of them has a negative value (-0.33). The average index of discriminating power for the syllabus oriented test (the first test set) is .43, indicating 'good' discriminating power.

Figure 3. Item Difficulty of General English-Oriented Items

Figure 4. Discriminating Power of General English Items

Figure 5. Item Difficulty of All Items

Figure 6. Discriminating Power of All Items

IJEE (Indonesian Journal of English Education), 6 (1), 2019, 62-64. http://journal.uinjkt.ac.id/index.php/ijee | DOI: http://doi.org/10.15408/ijee.v6i1.11888 | P-ISSN: 2356-1777, E-ISSN: 2443-0390 | This is an open access article under a CC-BY-SA license.

For all 50 items combined, the average P value equals .60 and the D value equals .41. It is evident that the devised test has reached an average level of item difficulty and is classified as good at discriminating between high and low achieving test takers. The complete test should, however, be improved for the actual research, since some items, slightly above one quarter, are indicated as 'bad' at discriminating between test takers.

Implement social function, text structure, and language feature … involving giving and asking information related to future intention based on the appropriate context (Focus on be going to, would like to).

Distinguish social function, text structure, and language feature … involving recount texts based on the appropriate context (Focus on e.g. transitional words like first, then, after that, before, when, at last).
8. … the movie ends, we head out for a late night snack. (Before / Then / After that / When)

6. Distinguish social function, text structure, and language feature … involving narrative texts based on the appropriate context (Focus on e.g. simple past tense, past continuous).

9. Once upon a time, there was a little boy, who was poor, dirty, and smelly, … into a little village. (comes / is coming / coming / was coming)

10. Kancil … quick-witted, so that every time his life was threatened, he managed to escape. (was / were / is / be)

7. Implement social function, text structure, and language feature … involving giving and asking information related to giving opinion based on appropriate context (Focus on e.g. I think, I suppose).

13. Giving opinion: In my opinion, she's pretty. / Can you give me your opinion? / He is thinking about her everyday. / He should go.

Implement social function, text structure, and language feature … involving giving and asking information related to cause-effect based on appropriate context (Focus on e.g. because of, due to).

21. His defeat was … the lottery issue. (due to / because / since / thanked to)

22. The crash occurred … the erratic nature of the other driver. (due / because / because of / thanked to)