A MODEL OF AN ONLINE READING COMPREHENSION SUMMATIVE TEST FOR COLLEGE STUDENTS

Some universities, including STKIP PGRI Jombang, face a compelling need for a test that can replace the existing paper-and-pencil reading comprehension test, which is conventional, impractical, and time consuming. To fulfill this need, a model of an online reading comprehension summative test was developed, involving a number of essential micro skills of reading. The design of the study was Educational Research and Development (R&D), involving 100 subjects in the try-out stage. The instruments used were interview guides and questionnaires. Based on the try-out analysis, the reliability was .779, and thirty-one items were categorized as valid. For ease of scoring and a balanced number of the indicators of interest, only 25 items were included in the model test. Based on the students’ questionnaire, more than 80% of the subjects responded positively. The final product of this research was an online reading comprehension test kit that includes the blueprint, the test (in the form of a paper version and screenshots of the online version), the answer key, and the instructions for accessing the online test.


INTRODUCTION
Reading is one of our inevitable daily needs. Sulistyo (2011, p. 20) states that on one occasion, we read for information; on another, for enjoyment. This implies that reading comprehension plays a critical role in our daily lives. Reading teachers who are concerned with students' competence to read for information or knowledge have a compelling need to find appropriate ways to teach their students and to assess their reading comprehension with greater attention, as the ability to read is an important asset one must have on any occasion, let alone in the digital era. Reading (critically) is believing; it is the window through which an abundance of information is accessed.
A test is a subset of assessment (Brown, 2004, p. 4). Brown (2004, p. 4) further states that tests are prepared administrative procedures that occur at identifiable times in a curriculum when learners muster all their faculties to offer peak performance, knowing that their responses are being measured and evaluated. In this way, learners are required to demonstrate their optimum competences, elicited through tests in the form of manifest language behaviors.
To develop a good test, several criteria need to be not only known but also fulfilled satisfactorily, as a test is a data collection instrument that should function properly if accurate information about the learners is to be obtained and the so-called GIGO (garbage in, garbage out) effect is to be avoided. The first is validity. Gronlund and Linn (1990, p. 47) state that validity refers to the appropriateness of the interpretations made from test scores and other evaluation results, with regard to a particular use. It means that the result of the test should be meaningful, appropriate, informative, and useful. The second is reliability. Brown (2004, p. 20) states that a reliable test is consistent and dependable in terms of the scores yielded by the testing procedures. If we give the same test to the same students on two different occasions, the test should yield similar results. The third is practicality. Djiwandono (1996) states that practicality has to do with the test administration, the scoring, the interpretation of the test results, and even the financial factors of the test administration. Practicality may be concerned with economy in terms of resources, time, and energy. In line with the idea of Djiwandono (1996), Gronlund and Linn (1990) emphasize that there are some considerations that can be used to judge the practicality of a test. The first is the ease of test administration. For this purpose, the directions should be simple and clear, the subtests should be relatively few, and the timing of the test should not be too long. The second consideration is the time required for administration; it deals with the time allocated to do the test. Another consideration is the ease of scoring, which includes clarity in the directions for scoring and simplicity in the scoring key. The next consideration is the cost of testing, which is important in selecting a test. The last is economy. Gronlund and Linn (1990, p. 103) explain that testing should be relatively inexpensive and cost should not be a major consideration.
IJEE (Indonesian Journal of English Education), 4 (2), 2017, 170-187 | http://journal.uinjkt.ac.id/index.php/ijee | DOI: http://dx.doi.org/10.15408/ijee.v4i2.8344 | P-ISSN: 2356-1777, E-ISSN: 2443-0390 | This is an open access article under CC-BY-SA license
One of the types of tests that a teacher almost certainly needs to make is an achievement test. There are two types of achievement test: formative and summative (Brown, 2004, p. 48). A formative test aims at measuring the extent to which students have mastered the learning outcomes of a rather limited segment of instruction, such as a unit or a textbook chapter (Gronlund & Waugh, 2009, p. 7). A summative test, also known as summative assessment, aims to measure, or summarize, what students have grasped, and typically occurs at the end of a course or unit of instruction (Brown, 2004, p. 6).
The test most commonly and continually carried out by classroom teachers is a summative test, used to determine the students' mastery of the course. Because it is crucial to know what the students have grasped, the summative test in reading deserves greater attention.
Nowadays, considerable attention is paid to the nature of a test as part of the tripartite functions of assessment: assessment of learning, assessment for learning, and assessment as learning. Earl, Katz, and WNCP team (2006, p. 55) state that assessment of learning refers to strategies designed to confirm what students know, demonstrate whether or not they have met curriculum outcomes or the goals of their individualized programs, or to certify proficiency and make decisions about students' future programs or placements. It is designed to provide evidence of achievement to parents, other educators, the students themselves, and sometimes to outside groups (e.g., employers, other educational institutions). It means that assessment is a crucial tool to show the students' mastery of the lesson based on the curriculum applied and, further, to decide what fits them in the future. Assessment of learning is, in other words, on the students' side. On the other hand, Earl, Katz, and WNCP team (2006, p. 29) [...] Earl, Katz, and WNCP team (2006, p. 41) have stated that assessment as learning focuses on students and emphasizes assessment as a process of metacognition (knowledge of one's own thought processes) for students. It means that, in the process of learning with their own understanding, students can do self-assessment to make sense of the information and use it for new learning under the guidance and direction of the teacher. Assessment as learning, in other words, involves both the teachers' and the students' sides. Supporting the ideas above, Sulistyo (2015, p. 5) states that assessment implies an ongoing monitoring process on students' learning, applied as soon as the teaching-learning process begins and continuing up to the end of each class session. It informs teachers about their teaching effectiveness, students' learning progress, and even feedback on the level of implementation of a curriculum. As such, assessment is inseparably aligned to instruction. He further states that, if carefully planned and accurately implemented, assessment can provide teachers with a source of useful information to reflect on their teaching practices. It means that teaching cannot be separated from testing; they are linked to each other. Test results provide an important basis for teachers to better design their teaching so that the teaching delivery can boost the students' performance in learning.
In recent days, reading from computer screens has become more and more common in daily life as the amount of reading material available online is rapidly increasing. This phenomenon is also seen in the field of language assessment, for example in computer-based tests (CBTs), computer-adaptive tests (CATs), and the TOEFL. As stated by Sulistyo (2009), for instance, advances in computing technology boosted the introduction of the new version of TOEFL, the iBT, in 2005, which was a significant shift from the older computer-based TOEFL (CBT for short) as well as the paper-and-pencil based TOEFL (PBT, henceforth). This iBT version, as its name indicates, makes functional use of information and communication technology (ICT) [...] Hricko & Howell, 2006, p. 4) said, "The availability of assessment software to address these tasks is leading to assessment services becoming one of the fastest growing software niches, both in the corporate and in the educational markets." Regardless of the rapid growth of demand in this area, the development and implementation of this new mode of testing is currently in its initial stages. Therefore, sufficient empirical data, which would allow researchers to look into the soundness of computerized language tests with regard to construct validity and fairness, are yet to be available.
STKIP PGRI Jombang is a private university operating in Jombang, East Java. In this university, the use of the Internet is also increasing rapidly but is not yet put to its best use. Online assessment is in fact very helpful to both students and lecturers as a medium in the assessment process. As Palloff and Pratt (2009, p. 3) put it, "The convenience of working online has proven to be very attractive to students and instructors alike." Further, Lynch (1997) (as cited in Millsap, 2000, p. 4) found that subjects responded more honestly on computer-administered tests than on paper and that the test-retest reliability was comparable for both groups. This means that online assessment currently offers more convenience than the traditional one.
In this university, a substantial problem has emerged in the Reading Comprehension 2 class. The test for the course is held as a face-to-face interview in order to make the students explore more, to minimize cheating, and to simplify the test. This face-to-face test is time consuming: with a total of forty students, it takes six hundred minutes (10 hours) to assess the students' reading comprehension. A more efficient yet accurate and reliable test is therefore needed. The choice is an ICT-based test. By using an online test, the teacher can manage the time on the computer and score students' reading performance more quickly. In addition, online assessment is cost effective, as lecturers do not need to copy the paper test for all the students. As stated by Dowsing, Long, & Craven (2000) and Weisburgh (2003) (as cited in Hricko & Howell, 2006, p. 11), "it has been proposed that one of the main advantages of using assessment software over manually assessing performance is primarily the savings in cost and time". In addition, the benefits of computer-administered testing include rapid updates, random item selection, test item banks, and automatic data collection and scoring (Millsap, 2000, p. 6). Practicality will also improve since scoring will no longer be carried out manually by the lecturer, as it is in paper-and-pencil tests. As Weisburgh (2003) (cited in Hricko & Howell, 2006, p. 11) said, "Scoring and evaluating tests used to take a lot of manual effort, whereas software can dramatically reduce, or even eliminate, the manual effort, and results can be instantaneous". Given all the facts elaborated above, this online test is very likely to be lower in cost. Another weakness of the existing reading comprehension test is that the questions are in the form of oral questions, which implies impracticality of administration. Furthermore, these questions do not completely represent the indicators in the syllabus, as the questions are only about the content, the generic structure and features of the text, and text building. The test only covers one type of text, while the students must know all genres. This fact may lead to invalidity, i.e. inaccurate and erroneous test results, because of the teacher's subjectivity or tiredness. An online test addresses these problems: Krug (1989) reported that in an estimated ten percent of hand-scored objective tests, errors of one point or more in the final score were made, whereas computerized test administration ensures accurate test scores (as cited in Millsap, 2000, p. 16).
Studies on the use of technology in testing have been conducted. A study by Sawaki (2001) aimed to examine the comparability of conventional and computerized tests of reading in a second language. The study used a survey design with a large sample as the subjects of the research. The general trends found in this study indicated that comprehension of computer-presented texts is, at best, as good as that of printed texts (Sawaki, 2001, p. 49). The second study, by Noyes and Garland (2008), investigated whether computer-based and paper-based tasks are equivalent. A survey design was used, reviewing literature and research. The study indicated that in some cases paper and computerized tests were equivalent, but in others they were not, for example with regard to the form of the test. In addition, achieving equivalence in computer-based and paper-based tasks poses a difficult problem. It is probably influenced by the test takers' confidence in using the computer and other psychological factors.
[...] users adding to input the user of the test in the database, publishing to bring the test online so it can be accessed by the students enrolled in the course, and lastly result exporting to take the data easily for later use. Data in this case refers to the students' names, scores, duration, timing, and others in Excel format for later use in the item analysis stage. The computer program utilized was Chamilo version 1.9.10.2.

The design of the needs assessment was qualitative. The instrument was an interview with one Reading Comprehension 2 lecturer. It was about how the lecturer previously conducted the test, the form of the test, the reason for choosing a certain form of test, the material included in the test, and the prospective availability of an online reading test for the students. After the information was gathered, the collection and preparation of appropriate passages in various genres for the body of the test started. Three test experts and three ICT experts were invited to review and conceptually validate the products. The instrument used was a questionnaire. The test expert review focused on the items, the instructions (wording), and the construction of the language test. The analysis was carried out qualitatively since the data obtained were in the form of descriptions. The ICT expert review focused on the ease of the instructions, the loading of the questions, the ease of the navigation menu, the readability of the font, and the user interface in general.
The subjects of the try-out were 100 students of STKIP PGRI Jombang who had finished their Reading Comprehension 2 course. The subjects were chosen by simple random sampling. Latief (2012, p. 183) states that simple random sampling is the best technique for assuring the representativeness of the sample from the accessible population. It fits the needs of the study since all students have an equal chance of being selected for the sample. The try-out was carried out in two sessions to keep the subjects from getting tired.
A questionnaire was also administered to the subjects. It was about the ease of the instructions, the ease of the questions, the time allotment, the match between the test and the material given in class, the ease of the texts, the length of the texts, the number of items, and the level of difficulty of the items.
After the informal try-out was conducted, the test results were analyzed using software called ITEMAN 3.00. The reliability is shown by the alpha coefficient, which ranges from 1.00 for perfect reliability to 0.00 for a completely unreliable test (Ary et al., 2002, p. 261). The item validity is indicated by the point-biserial correlation coefficient, symbolized r-pbis. It is a statistic used to estimate the degree of relationship between a naturally occurring dichotomous nominal scale and an interval or ratio scale (Brown, 2001, p. 13). If the coefficient is > .2, the item is categorized as good.
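The two statistics described above can be illustrated with a minimal Python sketch. It is not the ITEMAN implementation: the function names and the small response matrix are hypothetical, population variances are assumed, and the item is left inside the total score when the point-biserial is computed, which may differ from the settings used in the study.

```python
from statistics import mean, pstdev, pvariance


def cronbach_alpha(responses):
    """Alpha coefficient for a list of per-student lists of 0/1 item scores.

    Assumes at least two items and some spread in the total scores.
    """
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_variances) / pvariance(totals))


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between one 0/1 item and the total scores."""
    p = sum(item_scores) / len(item_scores)  # proportion answering correctly
    sd = pstdev(total_scores)
    if p in (0.0, 1.0) or sd == 0:
        return 0.0  # the correlation is undefined; treated as 0 in this sketch
    right = [t for s, t in zip(item_scores, total_scores) if s == 1]
    wrong = [t for s, t in zip(item_scores, total_scores) if s == 0]
    return (mean(right) - mean(wrong)) / sd * (p * (1 - p)) ** 0.5


# Hypothetical data: 5 students (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
alpha = cronbach_alpha(responses)
r_pbis = point_biserial([row[0] for row in responses], totals)
print(round(alpha, 3), round(r_pbis, 3), r_pbis > 0.2)  # > .2 is the cut-off used in this study
```

With real try-out data, each row would hold one student's scored answers to all items.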
Item difficulty is shown by the proportion-correct score (easy: > .7; moderate: between .3 and .7; difficult: < .3) (Brown, 2001), and item discrimination is presented in p-bis coefficients. The categorization of the item discrimination is shown in Table 1. The effectiveness of the distractors is important to know, as Brown (2004, p. 60) notes that the efficiency of distractors is the extent to which (a) the distractors "lure" a sufficient number of test takers, especially lower-ability ones, and (b) those responses are somewhat evenly distributed across all distractors. The efficiency of a distractor can be seen from the positive or negative p-bis value reported for each item. If a distractor has a positive value, it should be reviewed or changed.
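As an illustration of the difficulty categories and the distractor check described above, the following Python sketch applies the cut-offs quoted from Brown (2001) and flags any distractor whose selection correlates positively with the total score. The function names and sample data are hypothetical, and the correlation is a plain Pearson coefficient rather than the exact ITEMAN output.

```python
from statistics import mean, pstdev


def difficulty_category(p):
    """Classify a proportion-correct value with the cut-offs quoted in the text."""
    if p > 0.7:
        return "easy"
    if p < 0.3:
        return "difficult"
    return "moderate"


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no spread."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def flag_distractors(choices, key, totals):
    """Return the distractors chosen more often by higher-scoring students."""
    flagged = []
    for option in sorted(set(choices) - {key}):
        chose_option = [1 if c == option else 0 for c in choices]
        if pearson(chose_option, totals) > 0:
            flagged.append(option)
    return flagged


# Hypothetical item: the option each of six students chose, and their total scores.
choices = ["A", "B", "B", "A", "C", "D"]
totals = [9, 8, 7, 3, 2, 1]
p_correct = sum(c == "A" for c in choices) / len(choices)
print(difficulty_category(p_correct), flag_distractors(choices, "A", totals))
```

In this made-up item, option B attracts two high scorers, so it is the kind of distractor that would be reviewed or changed.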

FINDINGS AND DISCUSSIONS
Findings
The results of the development were obtained after the research was carried out at STKIP PGRI Jombang.

The Result of Needs Assessment
It was found that the previous test was impractical and time consuming, and that it covered less material than should have been tested. Another finding from the interview was that online testing has recently become a trend, so a model of an online Reading Comprehension 2 summative test needs to be made available.

The Test Characteristics
Based on the syllabus of the Reading Comprehension 2 course, the course intends to measure the following micro reading skills: identifying topics, identifying main ideas, identifying specific and detailed information (explicit and implicit), understanding the organization of ideas [...] Crawley and Mountain (1995, p. 104-105) as follows: literal and inferential. The critical level is not included since the students are at the intermediate level and the critical level is beyond the scope of their competences. In the test, the literal level accounts for 40% of the 100 items since it is easier, and the inferential level accounts for 60%. The larger share is given to the inferential level, which deals with inferring implicit information from the text and is more difficult but fits the students' level. So, based on these percentages, there are 40 items at the literal level and 60 items at the inferential level.
In the present study, the passage themes mostly deal with education, literature, science, life, and entertainment. The passages range from 212 to 495 words since the average students are still at the low-intermediate level. Although the longest passage has 495 words, it is at the 8th grade level, which means it is still standard in terms of level.
The readability of the texts used was calculated using the Flesch-Kincaid formula. The result can be seen in Table 2.
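For readers unfamiliar with the measure, the Flesch-Kincaid grade level is computed as 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The Python sketch below shows the calculation; the function names are hypothetical, the syllable counter is a rough vowel-group heuristic assumed here for illustration, and the tool actually used in the study is not stated, so its values may differ.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups and drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return (0.39 * total_words / total_sentences
            + 11.8 * total_syllables / total_words
            - 15.59)


def fk_grade_from_text(text):
    """Estimate the grade level of a non-empty English passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)


# A passage with 100 words, 5 sentences, and 150 syllables:
print(round(fk_grade(100, 5, 150), 2))
```

A result near 8 corresponds to the 8th grade level mentioned for the longest passage.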

The Result of Expert Review
There were two groups of experts in the validation stage: the test experts and the ICT experts. The test experts carried out validation twice: the first was the blueprint review and the second was the review of the online test, or the product itself.

Blueprint Review
Based on the feedback from the three experts, the inputs concerned the level of skills, the numbering of the items, the grammar, the order of the item indicators, the titles for the texts, and the number of sub-competences, which should be rationally balanced.

Test Review
The inputs concerned the running of the try-out, which should be divided into two sessions to reduce the tiredness of the subjects, which could influence the result; the readability; the order of questions based on paragraph; and the language mistakes. The last was about the sources, quality of options and [...]
Suggestions from the three ICT experts were about the type of passage format, the attractiveness of the test, the use of auto-save, and the interface.
In order to know how well the items discriminate between low- and high-ability students, the analysis of item [...] 4.
There are 22 items categorized as very good items, 17 items as good items, 20 items as fair items and 41 items as poor items.
Regarding item validity, the ITEMAN results are shown in Table 5.
From the results shown in Table 5, it can be seen that 31 items were categorized as valid and 69 items as not valid. These 69 items were dropped from the product and only the 31 valid items were used.
The last analysis was the effectiveness of the distractors. Based on the ITEMAN analysis, there are 32 items which have suggested answer keys. These 32 items were dropped from the product; they were items number 2, 7, 19, 20, 21, 22, 24, 25, 28, 31, 39, 42, 43, 47, 51, 54, 56, 57, 58, 60, 61, 67, 71, 73, 77, 79, 81, 86, 91, 93, 98, and 99. The 31 good items were run through ITEMAN 3.00 to be re-analyzed. The reliability, shown by the alpha coefficient, is 0.779, which can be categorized as good, so the items can be used in the test. The next thing is item difficulty, as shown in Table 6.
Based on the result, 25 items are categorized as very good and 6 items as good, which means that they discriminate between the students well.
Regarding item validity, all 31 items are categorized as valid. For ease of scoring and a balanced number of the indicators of interest, only 25 items were used.

The Result of Students' Questionnaire Analysis
To gain information about how the online test worked from the subjects' point of view, questionnaires with 10 multiple-choice items and 2 essay questions were distributed to the 100 subjects. The subjects' answers are presented in Table 8.

Discussion
The result of the needs assessment revealed the problems in the previous test, which is considered impractical. This online test is practical since it is easy to administer, score, and interpret. The previous test is time consuming while this online test is time effective. The previous test covers only one genre while this online test covers all of the genres. Additional benefits of this online test are that it is cost effective and up to date. The results of the needs assessment indicate that the online test fulfills the criteria of a good test elaborated above by Djiwandono (1996) and Gronlund & Linn (1990). This online test also has the advantages elaborated in the previous study by Noyes and Garland (2008), for example the richness of the interface, accessibility at home, fewer errors in administration, online scoring which is more accurate and less prone to human error, and cost saving. Singh, Rylander & Mims (2012) also support the increased use of the Internet. They said that as preferences for online learning increase, mostly due to the convenience and flexibility it offers students, universities find themselves increasing the number of online format courses to meet the growing demand (p. 96). Coiro (2014, p. 12) added that there are many opportunities when students do learning activities online, such as questioning, wondering, and thinking more deeply about things with puzzle games and creating digital products; it also offers time for students to practice questioning, locating, evaluating, and synthesizing information collaboratively with a partner or in a small group (Coiro, p. 16). Based on those facts, it is argued that the test in the present study can overcome the technical problems in the previously developed tests.
When the item analysis was run, most of the items, 69 in total, were found to be invalid and had to be dropped from the test. This means that only 31 items could be retained and used for the test. The reliability shown by the alpha coefficient is .779, which can be categorized as good. The coefficient demonstrates that the scores generated from the test are consistent and reliable across measurements in showing the students' real performance. The result indicates that this online test has one more quality of a good test, reliability, as explained above by Brown (2004).
The questionnaires show that most students responded positively toward the online test. Most of them responded that the test instructions (the instructions to operate the test and to answer the questions) are generally easily understood, which means the instructions are clear and cause no bias. They also responded positively that the questions are easily understood. The time is sufficient, which means that the texts, the questions, and the time allocation are proportional to their level. The result is in line with what was stated above by Gronlund and Linn (1990) about practicality and by Zandvliet and Farragher (1997), as cited in Noyes and Garland (2008, p. 1369), about the advantages of computer testing. The materials used in the test are suitable, which means the test does not cover material that was never taught in the classroom. A number of the subjects (44%) stated that the passages are difficult, possibly because some students are actually at a lower proficiency level while this test is designed for intermediate ones, as stated in the syllabus. This fact could also be a reason behind the non-optimal alpha coefficient. The last point concerns the subjects' opinions. Although a few subjects said that the online test makes the eyes tired, most said that the online test is good, interesting, effective, fun, and practical, and that it minimizes the chance of cheating. They also thought that they did not need to open the page too often, and that the test fits the ICT era. This means that the availability of this online test overcomes the problems arising from the previous test.

CONCLUSION AND SUGGESTION
The conclusions comprise the strengths and the weaknesses of the product of this research. Regarding the strengths, first, the product of this research can be a model of an online reading summative test at STKIP PGRI Jombang. Second, based on the try-out stage, some items of the proposed test are shown to be valid and reliable. The product of this research is packaged as one kit. It covers the blueprint, the test in printed form and screenshots of the online version, the answer key, and the instructions for accessing the online test.
The product also has weaknesses. The final product consists of only 25 items due to the elimination of the non-valid items. The reading levels are not in the precise percentages that this study proposed. The product has not undergone a construct validation process to reveal the psychological quality of the students. In addition, this study is still at the stage of automating the paper-based format into the computer-based one. Some suggestions are presented after completing the whole research process. This online test can be a model for other reading courses and for other courses in general in conducting tests, since it has been validated. This product can offer insight into the effectiveness of an online reading test in enhancing students' reading motivation, with improvements such as randomized settings. As this research had limited subjects (only 100), it is suggested that future researchers involve more subjects to gain more reliable and valid results. Further, as this test still leaves a chance, although low, for students to cheat, the researcher will be working on an online test with randomized options. This attempt is hoped not only to reduce cheating but also to increase the students' independence and self-esteem.

Table 1. Item Discrimination Categorization

Table 3. The Results of Item Difficulty Analysis

Table 4. The Results of Item Discrimination Analysis

Table 5. The Result of Item Validity Analysis

Table 6. The Result of Item Difficulty Analysis

Table 7. The Result of Item Discrimination Analysis

Table 8. The Result of Item Validity Analysis