Chapters 8-9 Discussion
Read Moore C. 8-9; Post YOUR thought-provoking and discussion worthy summaries, questions, riveting points from chapter reading to the Discussion Board AND YOUR RESPONSES TO AT LEAST TWO OF YOUR CLASSMATES’ POSTS no later than SUNDAY BY MIDNIGHT AT THE END OF THE WEEK.
ASSESSMENT AND EVALUATION IN EDUCATION
Article · July 2020
Tomas de Aquino Caluyua Yambi
By: Tomás de Aquino Caluyua Yambi
Unleashing the potential of continuous improvement in teaching/learning
requires an appreciation of the difference in spirit between assessment and evaluation.
Assessment is frequently confused and confounded with evaluation. The purpose of
an evaluation is to judge the quality of a performance or work product against a
standard. The fundamental nature of assessment is that a mentor values helping a
mentee and is willing to expend the effort to provide quality feedback that will
enhance the mentee’s future performance. While both processes involve collecting
data about a performance or work product, what is done with these data in each
process is substantially different and invokes a very different mindset. This paper first
looks at what assessment is and the various aspects it involves. Attention then
turns to evaluation and its components. The paper next looks at testing as a tool
used by both assessment and evaluation, and lastly presents some differences between
assessment and evaluation.
2. Epistemology of Assessment and Evaluation
Assessment and Evaluation are two different concepts with a number of
differences between them starting from the objectives and focus. Before we go into
details about these differences that set assessment and evaluation apart, let us first pay
attention to the two words themselves. According to the Webster Dictionary (2017),
assessment means appraisal. Then, according to the same dictionary, evaluation is
estimation or determining the value of something. Both processes are used very often
in the field of education to test the quality of teaching and learning.
This lets educational institutions find out what more can be done to
improve the education they offer.
3. What is Assessment
As stated above, and according to Brown (1990), assessment refers to a related
series of measures used to determine a complex attribute of an individual or group of
individuals. This involves gathering and interpreting information about students’ level
of attainment of learning goals.
Assessments are also used to identify individual student weaknesses and
strengths so that educators can provide specialized academic support, educational
programming, or social services. In addition, assessments are developed by a wide
array of groups and individuals, including teachers, district administrators,
universities, private companies, state departments of education, and groups that
include a combination of these individuals and institutions.
In classroom assessment, since teachers themselves develop, administer and analyze
the questions, they are more likely to apply the results of the assessment to their own
teaching. Therefore, it provides feedback on the effectiveness of instruction and gives
students a measure of their progress. As Brown (1990) maintains, two major functions
can be pointed out for classroom assessment: One is to show whether or not the
learning has been successful, and the other one is to clarify the expectations of the
teachers from the students (Brown, 1990).
Assessment is a process that includes four basic components:
1) Measuring improvement over time.
2) Motivating students to study.
3) Evaluating the teaching methods.
4) Ranking the students’ capabilities in relation to the whole group evaluation.
3.1. Why Assessment is Important
First and foremost, assessment is important because it drives student learning
(Brown 1990). Whether we like it or not, most students tend to focus their energies on
the best or most expeditious way to pass their ‘tests.’ Based on this knowledge, we
can use our assessment strategies to manipulate the kinds of learning that take place.
For example, assessment strategies that focus predominantly on recall of knowledge
will likely promote superficial learning. On the other hand, if we choose assessment
strategies that demand critical thinking or creative problem solving, we are likely to
realize a higher level of student performance or achievement. In addition, good
assessment can help students become more effective self-directed learners (Darling-
Hammond 2006). As indicated above, motivating and directing learning is only one
purpose of assessment. Well-designed assessment strategies also play a critical role in
educational decision-making and are a vital component of ongoing quality
improvement processes at the lesson, course and/or curriculum level.
3.2. Types and Approaches to Assessment
Numerous terms are used to describe different types of learner assessment. Although
somewhat arbitrary, it is useful to view these various terms as representing dichotomous
poles (McAlpine, 2002).
Formative  <----------->  Summative
Informal   <----------->  Formal
Continuous <----------->  Final
Process    <----------->  Product
Divergent  <----------->  Convergent
3.3. Formative vs. Summative Assessment
Formative assessment is designed to assist the learning process by providing
feedback to the learner, which can be used to identify strengths and weaknesses and
hence improve future performance. Formative assessment is most appropriate where
the results are to be used internally by those involved in the learning process
(students, teachers, curriculum developers). Summative assessment is used primarily
to make decisions for grading or determine readiness for progression. Typically
summative assessment occurs at the end of an educational activity and is designed to
judge the learner’s overall performance. In addition to providing the basis for grade
assignment, summative assessment is used to communicate students’ abilities to
external stakeholders, e.g., administrators and employers (Darling-Hammond, 2006).
3.4. Informal vs. Formal Assessment
With informal assessment, the judgments are integrated with other tasks, e.g.,
lecturer feedback on the answer to a question or preceptor feedback provided while
performing a bedside procedure. Informal assessment is most often used to provide
formative feedback. As such, it tends to be less threatening and thus less stressful to
the student. However, informal feedback is prone to high subjectivity or bias. Formal
assessment occurs when students are aware that the task that they are doing is for
assessment purposes, e.g., a written examination. Most formal assessments also are
summative in nature and thus tend to have greater motivation impact and are
associated with increased stress. Given their role in decision-making, formal
assessments should be held to higher standards of reliability and validity than
informal assessments (McAlpine 2002).
3.5. Continuous vs. Final Assessment
Continuous assessment occurs throughout a learning experience (intermittent
is probably a more realistic term). Continuous assessment is most appropriate when
student and/or instructor knowledge of progress or achievement is needed to
determine the subsequent progression or sequence of activities (McAlpine 2002).
Continuous assessment provides both students and teachers with the information
needed to improve teaching and learning in process. Obviously, continuous
assessment involves increased effort for both teacher and student. Final (or terminal)
assessment is that which takes place only at the end of a learning activity. It is most
appropriate when learning can only be assessed as a complete whole rather than as
constituent parts. Typically, final assessment is used for summative decision-making.
Obviously, due to its timing, final assessment cannot be used for formative purposes.
3.6. Process vs. Product Assessment
Process assessment focuses on the steps or procedures underlying a particular
ability or task, e.g., the cognitive steps in performing a mathematical operation or the
procedure involved in analyzing a blood sample. Because it provides more detailed
information, process assessment is most useful when a student is learning a new skill
and for providing formative feedback to assist in improving performance (McAlpine
2002). Product assessment focuses on evaluating the result or outcome of a process.
Using the above examples, we would focus on the answer to the math computation or
the accuracy of the blood test results. Product assessment is most appropriate for
documenting proficiency or competency in a given skill, i.e., for summative purposes.
In general, product assessments are easier to create than process assessments,
requiring only a specification of the attributes of the final product (McAlpine 2002).
3.7. Divergent vs. Convergent Assessment
Divergent assessments are those for which a range of answers or solutions
might be considered correct. Examples include essay tests. Divergent assessments
tend to be more authentic and most appropriate in evaluating higher cognitive skills.
However, these types of assessment are often time consuming to evaluate and the
resulting judgments often exhibit poor reliability. A convergent assessment has only
one correct response (per item). Objective test items are the best example and
demonstrate the value of this approach in assessing knowledge. Obviously,
convergent assessments are easier to evaluate or score than divergent assessments.
Unfortunately, this “ease of use” often leads to widespread application of this
approach even when contrary to good assessment practices. Specifically, the
familiarity and ease with which convergent assessment tools can be applied leads to
two common evaluation fallacies: the Fallacy of False Quantification (the tendency to
focus on what’s easiest to measure) and the Law of the Instrument Fallacy (molding
the evaluation problem to fit the tool) (McAlpine 2002).
3.8. Approaches to Assessment
In approaches to assessment, two central tendencies emerge which are relevant
to language as subject. One places emphasis on the assessment of learning where
reliable, objective measures are a high priority. The focus here is on making
summative judgements which in practice is likely to involve more formal
examinations and tests with marks schemes to ensure that the process is sound
(McAlpine 2002). An alternative approach is to change the emphasis from assessment
of learning to assessment for learning, implying a more formative approach where
there is much more emphasis on feedback to improve performance. The approach
here might be through course work and portfolio assessment in which diverse
information can be gathered which reflects the true broad nature of the subject.
4. Between Assessment and Evaluation
After collecting data from students, there is a need to assign
numbers or other symbols to a certain characteristic of the objects of
interest, according to specified rules, in order to reflect quantities of properties.
This is called measurement, and it can be applied to students’ achievement,
personality traits or attitudes. Measurement, then, is the process of determining a
quantitative or qualitative attribute of an individual or group of individuals that is of
academic relevance. A test serves as the vehicle used to observe an attribute,
whether as a written test, an observation, an oral question or another assessment
intended to measure the respondents’ knowledge or other abilities. If the test is
the vehicle, then the test score is the indication of what was observed through the test,
and it too can be quantitative or qualitative in nature.
A good test should possess not only validity and reliability but also
objectivity, objective basedness, comprehensiveness, discriminating power,
practicability, comparability and utility (Shohamy 1993). Objectivity: a
test is said to be objective if it is free from personal biases in interpreting its scope as
well as in scoring the responses. Objectivity can be increased by using more objective-type
test items and by scoring the answers against the model answers provided. Objective
basedness means that a test should be based on pre-determined objectives, and the test
setter should have a definite idea about the objective behind each item (Shohamy 1993).
Comprehensiveness means that the test should cover the whole syllabus, that due importance
should be given to all the relevant learning materials, and that the test should cover all the
anticipated objectives. Validity is the degree to which a test measures what it is intended
to measure. Reliability refers to the degree of consistency with which a test measures
what it is intended to measure. A test may be reliable but need not be valid,
because it may yield consistent scores that nonetheless do not represent what
we want to measure (Shohamy 2001). The discriminating power
of a test is its power to discriminate between the upper and lower groups who took
the test. The test should have questions of different difficulty levels. The practicability
of a test depends on ease and economy of administration, scoring and interpretation.
Comparability: a test possesses comparability when scores resulting from its
use can be interpreted in terms of a common base that has a natural or accepted
meaning. Lastly, a test has utility if it provides the test conditions that
would facilitate realization of the purpose for which it is meant.
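The ideas of item difficulty and discriminating power can be illustrated with a small calculation. The following is only a sketch under stated assumptions: items are scored 0 or 1, difficulty is taken as the proportion of students answering an item correctly, and discrimination is taken as the difference between the proportions correct in the upper and lower scoring groups. The function name, the `group_fraction` parameter and the sample data are hypothetical and do not come from the sources cited above.

```python
def item_statistics(responses, group_fraction=0.5):
    """Return (difficulty, discrimination) for each item.

    responses: one list of 0/1 item scores per student (hypothetical data).
    group_fraction: share of students placed in each of the upper and lower
    groups (an assumption; other cut-offs are also used in practice).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    # Rank students by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(n_students * group_fraction))
    upper, lower = ranked[:k], ranked[-k:]
    stats = []
    for i in range(n_items):
        # Difficulty: proportion of all students who answered item i correctly.
        difficulty = sum(r[i] for r in responses) / n_students
        # Discrimination: proportion correct in upper group minus lower group.
        discrimination = (sum(r[i] for r in upper) - sum(r[i] for r in lower)) / k
        stats.append((difficulty, discrimination))
    return stats


# Four hypothetical students answering three items.
sample = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
for number, (p, d) in enumerate(item_statistics(sample), start=1):
    print(f"item {number}: difficulty={p:.2f}, discrimination={d:.2f}")
```

Under these assumptions, an item answered correctly by the upper group and missed by the lower group receives a high discrimination value, while an item that everyone answers correctly (or that everyone misses) receives a value of zero.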
Educators believe that every measurement device should possess certain
qualities. Perhaps the two most common technical concepts in measurement are
reliability and validity (Weir 2005). Any kind of assessment, whether traditional or
“authentic,” must be developed in a way that gives the assessor accurate information
about the performance of the individual (Weir 2005). At one extreme, we wouldn’t
have an individual paint a picture if we wanted to assess writing skills. A test with high
validity must also be reliable, because its scores will be consistent. A valid
test is also a reliable test, but a reliable test may not be a valid one (Shohamy 2001).
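One way to picture the claim that a reliable test need not be valid is with correlations. The sketch below is an illustration only, and it assumes two things that the text does not state: that reliability is estimated by correlating scores from two administrations of the same test, and that validity is checked by correlating test scores with an external criterion such as independent ratings of the skill. All scores are invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores of five students on two sittings of the same test.
first_sitting = [10, 12, 14, 16, 18]
second_sitting = [11, 12, 15, 16, 19]
# Hypothetical independent ratings of the skill the test claims to measure.
criterion_ratings = [3, 5, 2, 5, 3]

reliability_estimate = pearson(first_sitting, second_sitting)
validity_estimate = pearson(first_sitting, criterion_ratings)
print(f"consistency across sittings: {reliability_estimate:.2f}")
print(f"agreement with criterion:    {validity_estimate:.2f}")
```

In this invented data set the two sittings agree closely, so the test looks reliable, yet the scores show almost no relationship to the criterion, so the test does not look valid for that purpose.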
5. What is Evaluation
Evaluation is determining the value of something. So, more specifically, in the
field of education, evaluation means measuring or observing the process to judge it or
to determine it for its value by comparing it to others or some kind of a standard
(Weir & Roberts, 1994). The focus of evaluation is on grades. It is a final
process intended to establish the quality of the process, and that quality
is mostly determined by grades. Such an evaluation can take the form of a
paper that is graded and that tests the knowledge of each student.
With the grades, officials try to measure the quality of the
programme. Furthermore, evaluation is comparing a student’s achievement with other
students or with a set of standards (Howard & Donaghue 2015). It refers to
consideration of evidence in the light of value standards and in terms of the particular
situations and the goals, which the group or individuals are striving to attain.
Evaluation designates a more comprehensive concept of measurement than is implied
in conventional tests and examinations. The emphasis of evaluation is on
broad personality change and the major objectives in the educational program
(Howard & Donaghue 2015).
Evaluation can, and should, however, be used as an ongoing management and
learning tool to improve learning, including five basic components according to
1) Articulating the purpose of the educational system.
2) Identifying and collecting relevant information.
3) Having ideas that are valuable and useful to learners in their lives and
4) Analyzing and interpreting information for learners.
5) Classroom management or classroom decision making.
Well-run classes and effective programs are those that can demonstrate the
achievement of results. Results are derived from good management. Good
management is based on good decision making. Good decision making depends on
good information. Good information requires good data and careful analysis of the
data. These are all critical elements of evaluation.
5.1. Functions of evaluations
Evaluation refers to a periodic process of gathering data and then analyzing or
ordering it in such a way that the resulting information can be used to determine how
effective your teaching or program is, and the extent to which it is achieving its stated
objectives and anticipated results (Howard & Donaghue 2015). Teachers can and
should conduct internal evaluations to get information about their programs, to know
who passes and who fails so that they can make sound decisions about their practices.
Internal evaluation should be conducted on an ongoing basis and applied
conscientiously by teachers at every level of an institution in all program areas. In
addition, all of the program’s participants (managers, staff, and beneficiaries) should
be involved in the evaluation process in appropriate ways. This collaboration helps
ensure that the evaluation is fully participatory and builds commitment on the part of
all involved to use the results to make critical program improvements (Howard &
Donaghue 2015).
Although most evaluations are done internally, conducted by local
stakeholders, there is still a need for larger-scale, external evaluations conducted
periodically by individuals from outside the program or institution. Most often these
external evaluations are required for funding and accreditation purposes or to answer
questions about the program’s long-term impact by looking at changes in
demographic indicators such as graduation rate, changes in the economy and other levels.
In addition, occasionally a teacher may be observed by an external stakeholder with
the purpose of assessing programmatic or operating problems that have been identified
but that cannot be fully diagnosed or resolved through the findings of internal
evaluation (Weir & Roberts, 1994).
5.2. Principles of Evaluation
Here are some principles to consider for your own classroom, summarised
from Weir & Roberts (1994), Howard & Donaghue (2015) and Kellaghan & Stufflebeam
(2003):
• Effective evaluation is a continuous, on-going process. Much more than
  determining the outcome of learning, it is rather a way of gauging learning
  over time. Learning and evaluation are never completed; they are always
  evolving and developing.
• A variety of evaluative tools is necessary to provide the most accurate
  assessment of students’ learning and progress. Dependence on one type of tool
  to the exclusion of others deprives students of valuable learning opportunities
  and robs you of measures that help both students and the overall program
• Evaluation must be a collaborative activity between teachers and students.
  Students must be able to assume an active role in evaluation so they can begin
  to develop individual responsibilities for development and self-monitoring.
• Evaluation needs to be authentic. It must be based on the natural activities and
  processes students do both in the classroom and in their everyday lives. For
  example, relying solely on formalized testing procedures might send a signal
  to children that learning is simply a search for “right answers.”
6. Assessment vs. evaluation
Depending on the area of study, authority or reference consulted, assessment
and evaluation may be treated as synonyms or as distinctly different concepts. In
education, assessment is widely recognized as an ongoing process aimed at
understanding and improving student learning. Assessment is concerned with
converting expectations to results. It can be a process by which information is
collected through the use of tests, interviews, questionnaires, observation, etc. For
example, when you have your students write on a given topic, you are collecting
information; this is what we mean here by assessment (Kizlik 2010; Richards and
Schmidt 2002; Weir & Roberts, 1994).
Evaluation on the other hand, is recognized as a more scientific process aimed
at determining what can be known about performance capabilities and how these are
best measured. Evaluation is concerned with issues of validity, accuracy, reliability,
analysis, and reporting. It can therefore be seen as the systematic gathering of
information for purposes of decision-making, using both quantitative methods (tests)
and qualitative methods (observations, ratings and value judgments) with the purpose of
judging the gathered information. In other words, when teachers receive a written
assignment from students, some kind of correction and/or response and possibly a
mark will be given. This is evaluation. However, assessment and
evaluation are similar in that they both involve specifying criteria and collecting
data/information. In most academic environments, they are different in purpose,
setting criteria, control of the process, and response. For example, an instructor can
use the results of a midterm exam for both assessment and evaluation purposes. The
results can be used to review with the students course material related to common
mistakes on the exam (i.e. to improve student learning as in assessment) or to decide
what measurement or grade to give each student (i.e. to judge student achievement in
the course as in evaluation) (Howard & Donaghue 2015).
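The midterm example can be made concrete with a short sketch showing the same set of results being used in the two ways described. It is illustrative only: the student names, the item scores, and both function names are hypothetical, and items are assumed to be scored 0 or 1.

```python
def most_missed_items(scores, top_n=2):
    """Assessment use: find the items most students got wrong, for class review.

    scores: mapping of student name to a list of 0/1 item scores.
    Returns zero-based item indices, most-missed first.
    """
    n_items = len(next(iter(scores.values())))
    misses = [sum(1 - s[i] for s in scores.values()) for i in range(n_items)]
    order = sorted(range(n_items), key=lambda i: misses[i], reverse=True)
    return order[:top_n]


def percent_marks(scores):
    """Evaluation use: reduce each student's exam to a single percentage mark."""
    return {name: 100 * sum(s) / len(s) for name, s in scores.items()}


# Hypothetical midterm results for three students on four items.
midterm = {
    "Ana": [1, 0, 1, 0],
    "Ben": [1, 0, 0, 1],
    "Cy": [1, 1, 1, 0],
}
print("items to review with the class:", most_missed_items(midterm))
print("marks to record:", percent_marks(midterm))
```

The first function feeds back into teaching (which material to revisit), while the second produces the judgment that is recorded for each student. The data are identical; only what is done with them differs.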
7. Key Differences Between Assessment and Evaluation
The significant differences between assessment and evaluation are discussed
in the points given below, summarized from Weir & Roberts (1994), Howard &
Donaghue (2015) and Kellaghan & Stufflebeam (2003):
1. The process of collecting, reviewing and using data, for the purpose of
improvement in the current performance, is called assessment. A process of
passing judgment on the basis of defined criteria and evidence is called
evaluation.
2. Assessment is diagnostic in nature as it tends to identify areas of
improvement. On the other hand, evaluation is judgemental, because it aims at
providing an overall grade.
3. The assessment provides feedback on performance and ways to enhance
performance in future. As against this, evaluation ascertains whether the
standards are met or not.
4. The purpose of assessment is formative, i.e. to increase quality whereas
evaluation is all about judging quality, therefore the purpose is summative.
5. Assessment is concerned with process, while evaluation focuses on product.
6. In an assessment, the feedback is based on observation and positive &
negative points. In an evaluation, by contrast, the feedback relies on the
level of quality as per the set standard.
7. In an assessment, the relationship between assessor and assessee is reflective,
i.e. the criteria are defined internally. On the contrary, the evaluator and
evaluatee share a prescriptive relationship, wherein the standards are imposed
externally.
8. The criteria for assessment are set by both parties jointly, whereas in
evaluation the criteria are set by the evaluator.
An effective, goal-oriented, teaching-learning sequence contains clearly
understood objectives, productive classroom activities, and a sufficient amount of
feedback to make students aware of the strengths and weaknesses of their
performances. Assessment and evaluation are related to both instructional objectives
and classroom learning activities and are indispensable elements in the learning
process. They are useful for gathering the data/information needed to serve various interests.
The data can be used to make decisions about the content and methods of instruction,
to make decisions about classroom climate, to help communicate what is important,
and to assign grades. Among other techniques for evaluation and assessment,
teachers can use tests, starting small,
incorporating evaluation into the class routine, setting up an easy and efficient record-
keeping system, establishing an evaluation plan, and personalizing the evaluation
Bachman, L. F. (1995). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Brown, D. H. (1990). Language assessment: Principles and classroom practices.
Darling-Hammond, L. (2006). Assessing teacher education: The usefulness of
multiple measures for assessing program outcomes. Journal of Teacher
Education, 57(2), 120-138.
Kellaghan, T., & Stufflebeam, D. L. (Eds.) (2003). International Handbook of
Educational Evaluation. Dordrecht: Kluwer Academic Publishers.
Kizlik, B. (2010). How to Write an Assessment Based on a Behaviorally Stated
Objective. [online Document] Available at
http://www.adprima.com/assessment.htm Accessed on September 15, 2017.
McAlpine, M. (2002). Principles of Assessment. Glasgow: University of Luton.
Available at http://caacentre.lboro.ac.uk/dldocs/Bluepaper1.pdf
Merriam-Webster’s collegiate dictionary (11th ed.). (2017). New York, NY:
Richards, J. C., & Schmidt, R. (2002). Longman Dictionary of Language Teaching and
Applied Linguistics (3rd ed.). Harlow, Essex: Pearson Education.
Shohamy, E. (1993). The Power of Tests. The Impact of Language Tests on Teaching
and Learning. Washington, DC: NFLC Occasional Papers.
Shohamy, E. (2001). The Power of Tests: A Critical Perspective on the Uses of
Language Tests. Harlow: Pearson Education.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. New
York, NY: Palgrave Macmillan.
Weir, C. J., & Roberts, J. (1994). Evaluation in ELT. Oxford: Blackwell.
Test Construction and Evaluation: A Brief Review
Article in Indian Journal of Applied Research · June 2015
Volume: 5 | Issue: 6 | June 2015 | ISSN 2249-555X | Research Paper
Shafaat Hussain, Assistant Professor of Communication, Madawalabu
University, Bale-Robe, Ethiopia
Sumaiya Sajid, Assistant Professor of English, Falahe Ummat Girls PG
College, Bhadohi, Uttar Pradesh, India
Keywords: test construction, test evaluation, item analysis
ABSTRACT Beginning from the intuitive, via the scientific, we are today in the communicative era of
testing. The pursuit of professionalism is evidenced by a host of standards or codes of practice which
have been developed, implemented and enforced by testing organizations from all over the world.
Creating professionally sound assessment requires both art and science. Engaging in fair and
meaningful assessments and producing relevant data about students’ achievement is an art.
Designing a test, formulating items, and processing grades is a complete science. This review
article reports a survey of test construction; its cyclic formulation process; the phases that it involves
(deciding content, specifying objectives, preparing table of specification and fixing items); and the
way it is evaluated (item difficulty and item discrimination). Teaching and testing are inseparable and
in order to be professionally sound in judging the student’s performance, it is significant to know the
norms, standards and ethics of test construction and evaluation.
Good testing practice has been discussed very extensively in
language testing literature, and has been approached from
different perspectives by language testing researchers. A common
approach to addressing this issue, for example, is to discuss
how a language test should be developed, administered,
and evaluated (Alderson, Clapham and Wall 1995; Li 1997;
Heaton 2000; Fulcher 2010). These discussions focus primarily
on good practice in each and every step in the testing
cycle, including, for instance, test specifications, item writing,
test administration, marking, reporting test results, and post
hoc test data analyses. Another common approach to discussing
good testing practice is to focus upon one particular dimension
of language testing, to develop theoretical models about
this particular dimension, and then to apply those theoretical
models to language testing practice. For example, Bachman
and Palmer (1996) developed a model of ‘test usefulness’,
which, as they argued, was ‘the most important consideration
in designing and developing a language test’. Other examples
adopting this approach are Cheng, Watanabe, and Curtis
(2004), focusing on test washback, Kunnan (2000, 2004) on
test fairness, Shohamy (2001a, b) on use-oriented testing and
the power of tests, and McNamara and Roever (2006) on the
social dimensions of language testing. Good testing practice
has also been considerably documented in the standards or
codes of practice which have been developed by testing or
research organizations from all over the world. Examples include the
ILTA guidelines (2007), the ALTE Code of Practice (1994), the
EALTA Guidelines (2006), the ETS Standards for Quality and
Fairness (2002) and the like (Boyd and Davies 2002; Fulcher and
Davidson 2007; Bachman and Palmer 2010).
Figure 1: Cyclic process of teaching and testing
2. TEST CONSTRUCTION
Ideally, effective tests have some characteristics. They are
valid (providing useful information about the concepts they
were designed to test), reliable (allowing consistent measurement
and discriminating between different levels of performance),
recognizable (instruction has prepared students
for the assessment), realistic (concerning time and effort
required to complete the assignment), practical and objective.
To achieve these, the teacher must draw up a test
blueprint or plan: specifying the objectives, preparing a
table of specifications, allocating the test length as per the time
limit, and deciding the types of items to be set (Wiggins
1998; Svinicki 1999).
Figure 2: Stages in Test construction
It should include details of test content in the specific
course. Moreover, each content area should be
weighted roughly in proportion to its judged importance.
Usually, the weights are assigned according to
the relative emphasis placed upon each topic in the
textbook. The median number of pages on a given
topic in the prescribed books is usually considered
as an index of its importance. To devise classroom
tests, the advice and assistance of fellow teachers can
prove to be of immense importance (Wiggins 1998;
2.2 Specifying the Objectives
Each subject demands a different set of instructional objectives.
For example, major objectives of subjects like
sciences, social sciences, and mathematics are: knowledge,
understanding, application and skill. On the other hand,
the major objectives of a language course are: knowledge,
comprehension and expression. The knowledge objective is
considered to be the lowest level of learning, whereas
understanding and application of knowledge are considered
higher levels of learning. As the basic objectives of education
are concerned with the modification of human behavior,
the teacher must determine measurable cognitive outcomes
of instruction at the beginning of the course. The
test determines the extent to which the objectives have
been attained, both for the individual students and for
the class as a whole. Some objectives are stated as broad,
general, long-range goals, e.g., ability to exercise the
mental functions of reasoning, imagination, critical appreciation.
These educational objectives are too general to be
measured by classroom tests and need to be operationally
defined by the class teacher (Wiggins 1998; Riaz 2008).
2.3 Preparing Table of Specifications
A table of specifications is a two-way table that represents,
along one axis, the content areas or topics that the teacher
has taught during the specified period and, along the other
axis, the cognitive level at which each is to be measured.
In other words, the table of specifications shows how
much emphasis is to be given to each objective or topic.
While writing the test items, it may not be possible to
adhere rigorously to the weights assigned to each cell.
Thus, the weights indicated in the original table may need
to be changed slightly during the course of test construction
if the teacher encounters sound reasons for such a change.
For instance, the teacher may find it appropriate to modify
the original test plan in view of data obtained from the
experimental try-out of the new test (Wiggins 1998; Riaz 2008).
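A two-way table of this kind can be drafted mechanically once the weights are fixed. The sketch below is one possible way to do it, under stated assumptions: the weights are whole percentages that sum to 100 on each axis, each cell receives the product of its topic weight and its cognitive-level weight, and leftover items are assigned by largest-remainder rounding. None of these choices comes from the cited sources, and the function name, topics and weights are hypothetical.

```python
def table_of_specifications(content_pct, level_pct, total_items):
    """Allocate total_items to (topic, level) cells in proportion to the
    product of the two percentage weights. Largest-remainder rounding
    keeps the cell counts adding up to total_items exactly."""
    scale = 100 * 100  # two percentages multiplied together
    cells = {}
    for topic, cp in content_pct.items():
        for level, lp in level_pct.items():
            # divmod gives (whole items, leftover fraction in units of 1/scale)
            cells[(topic, level)] = divmod(cp * lp * total_items, scale)
    counts = {cell: whole for cell, (whole, _) in cells.items()}
    shortfall = total_items - sum(counts.values())
    # Give the unassigned items to the cells with the largest remainders.
    by_remainder = sorted(cells, key=lambda c: cells[c][1], reverse=True)
    for cell in by_remainder[:shortfall]:
        counts[cell] += 1
    return counts

# Hypothetical weights (in percent) for a 40-item test.
content = {"Fractions": 50, "Decimals": 30, "Percentages": 20}
levels = {"Knowledge": 40, "Understanding": 40, "Application": 20}
spec = table_of_specifications(content, levels, 40)

for topic in content:
    print(topic, [spec[(topic, level)] for level in levels])
```

The resulting counts are only a starting plan. As noted above, the teacher may still adjust individual cells when there are sound reasons to do so.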
2.4 Deciding Test Length
The number of items that should constitute the final form of
a test is determined by the purpose of the test or its proposed
uses, and by the statistical characteristics of the items.
Three important considerations in setting test length are: (i)
the optimal number of items for a homogeneous test is lower
than for a highly heterogeneous test; (ii) items that are meant
to assess higher thought processes such as logical reasoning,
creativity and abstract thinking require more time than those
that depend on the ability to recall important information; and
(iii) the length of the test and the time required for it affect
the validity and reliability of the test. The teacher has to
determine the number of items that will yield maximum validity
and reliability for the particular test (Wiggins 1998; Riaz 2008).
2.5 Fixing Types of Items
Each type of exam item has its advantages and disadvantages
in terms of ease of design, implementation and scoring, and
in its ability to measure different aspects of students’
knowledge or skills. Multiple-choice and essay items are often
used in college-level assessment because they readily lend
themselves to measuring higher order thinking skills (e.g.,
application, justification, inference, analysis and evaluation).
Yet instructors often struggle to create, implement and score
these items (Worthen et al. 1993; Wiggins 1998; McMillan 2001).
The guidelines to be followed while designing the major types
of items (true-false, gap-filling, matching, multiple-choice
and essay) are examined here.
2.5.1 Constructing True-False Items
While constructing true-false items, trivial, broad, general
and negative statements should be avoided. When a negative
word is necessary and cannot be avoided, it should be
underlined or put in italics so that students do not overlook
it. Second, it is better not to include two ideas in one
statement unless there is a cause-effect relationship between
them. Third, opinion statements should not be used unless they
are attributed to some source or the ability to identify
opinion is being specifically measured. Fourth, true and false
statements should be equal in length. Fifth, there should be
proportionate numbers of true and false statements. Finally,
statements should be simple in language and easy to understand
(Gronlund and Linn 1990; Chase and Jacobs 1992; Wiggins 1998;
McMillan 2001).
2.5.2 Constructing Completion/Gap Filling Items
While constructing completion/gap-filling items, the item
should be worded so that the required answer is both brief
and specific. A direct question is generally more desirable
than an incomplete statement. Statements should not be lifted
directly from textbooks to serve as items. If the answer is to
be expressed in numerical units, the type of answer wanted
should be indicated. Blanks for answers (gap-filling spaces)
should be equal in length and arranged in a column to the right
of the question. Including too many blanks in one statement is
not advisable (Gronlund and Linn 1990; Chase and Jacobs 1992;
Wiggins 1998; McMillan 2001).
2.5.3 Constructing Matching Items
While constructing matching items, homogeneous material
should be used within a single matching exercise. It is
advisable to include an unequal number of responses and
premises and to instruct students that responses may be
used once, more than once, or not at all. The list in the
left column should be kept brief, and the shorter responses
should be placed on the right. The list of responses should
be arranged in a logical order, with words in alphabetical
order and numbers in sequence. The directions must indicate
the basis for matching the responses and premises. Ambiguity
should be avoided so that testing time is not wasted during
the examination. Finally, all the items of one matching
exercise should be placed on the same page
(Gronlund and Linn 1990; Chase and Jacobs 1992; Wiggins 1998;
McMillan 2001).
2.5.4 Constructing Multiple-choice items
The stem and the choices are the two parts of a multiple-choice
item. The stem should be clearly worded and complete in itself.
The options provided should be as short as possible. Only the
information needed to make the problem clear and specific should
be placed in the stem. The stem of the question should
communicate the nature of the task to the students and present
a clear problem or concept. The stem should provide only
information that is relevant to the problem or concept, and the
options (distracters) should be succinct. Avoid the use of
negatives in the stem (use them only when you are measuring
whether the respondent knows the exception to a rule or can
detect errors). You can word most concepts in positive terms and
thus avoid the possibility that students will overlook terms
such as “no”, “not” or “least” and choose an incorrect option
not because they lack knowledge of the concept but because they
have misread the question. Italicizing, capitalizing, bold-facing
or underlining the negative term makes it less likely to be
overlooked.
Regarding the choices or options, care should be taken to
have only one correct answer. Make certain that the item
has one correct answer. Multiple-choice items usually have
at least three incorrect options (distracters). Write the
correct response with no irrelevant clues. A common mistake
when designing multiple-choice questions is to write the
correct option with more elaboration or detail, using more
words, or using general terminology rather than technical
terminology. Write the distracters to be plausible yet
clearly wrong. An important, and sometimes difficult, task is
ensuring that the incorrect choices (distracters) appear to be
possibly correct. Distracters are best created from common
errors or misunderstandings about the concept being assessed,
and should be homogeneous in content and parallel in form and
grammar. We should refrain from “all of the above,” “none of
the above,” or other special distracters (use them only when an
answer can be classified as unequivocally correct or
incorrect). “None of the above” should be restricted to items
of factual knowledge with absolute standards of correctness.
It is inappropriate for questions where students are asked to
select “the best” answer. “All of the above” is awkward in that
many students will choose it if they can identify at least
one of the other options as correct and therefore assume that
all of the choices are correct, thereby obtaining a correct
answer based on partial knowledge of the concept or content
(Gronlund and Linn 1990). We must use each alternative
as the correct answer about the same number of times.
Check whether option “a” is correct about the same number of
times as option “b”, “c” or “d” across the instrument. It can
be surprising to find that one has created an exam in which
choice “a” is correct 90% of the time. Students quickly find
such patterns and increase their chances of “correct guessing”
by selecting that answer option by default (Gronlund and Linn
1990; Wiggins 1998; McMillan 2001).
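The check on the answer key described above is easy to automate. The following sketch counts how often each option is keyed as correct and flags options that are used noticeably more often than an even split would give. The function names, the 10-point tolerance and the sample key are illustrative assumptions, not values from the cited sources.

```python
from collections import Counter

def key_distribution(answer_key, options="abcd"):
    """Share of items for which each option is the keyed (correct) answer."""
    counts = Counter(answer_key)
    n = len(answer_key)
    return {opt: counts.get(opt, 0) / n for opt in options}

def overused_options(answer_key, options="abcd", tolerance=0.10):
    """Options keyed correct more often than an even split plus tolerance."""
    expected = 1 / len(options)
    shares = key_distribution(answer_key, options)
    return [opt for opt, share in shares.items() if share > expected + tolerance]

# Hypothetical 20-item key in which "a" is correct far too often.
skewed_key = ["a"] * 15 + ["b"] * 2 + ["c"] * 2 + ["d"]
balanced_key = list("abcd") * 5

print(key_distribution(skewed_key))
print(overused_options(skewed_key))
print(overused_options(balanced_key))
```

A flagged option is a prompt to reorder the choices within some items, not a reason to change which answer is correct.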
2.5.5 Constructing Essay Items
Essays can tap complex thinking by requiring students to
organize and integrate information, interpret information,
construct arguments, give explanations, evaluate the merit
of ideas, and carry out other types of reasoning. In practice,
we must restrict the use of essay questions to educational
outcomes that are difficult to measure using other formats.
Construct the item to elicit the skills and knowledge in the
educational outcomes. Write the item so that students clearly
understand the specific task. Other assessment formats are
better for measuring recall knowledge, but the essay is able to
measure deep understanding and mastery of complex information.
Once you have identified the specific skills and knowledge, you
should word the question clearly and concisely so that it
communicates to the students the specific task(s) you expect
them to complete (e.g., state, formulate, evaluate, use the
principle of, create a plan for, etc.). If the language is
ambiguous or students feel they are guessing at “what the
instructor wants me to do,” the ability of the item to measure
the intended skill or knowledge decreases. Indicate the amount
of time and effort students should spend on each essay item.
With essay items, especially when several are used or when they
are combined with other item formats, you should provide
students with a general time limit or time estimate to help
them structure their responses. Providing estimates of the
length of the written response to each item can also help
students manage their time, since it gives cues about the depth
and breadth of information required to complete the item. In
restricted-response items a few paragraphs are usually
sufficient to complete a task focusing on a single
We should avoid giving students options as to which
essay questions they will answer. A common structure in
many exams is to provide students with a choice of essay
items (e.g., “choose two out of the three essay questions
to complete…”). Instructors, and many students, often view
essay choice as a way to increase the flexibility and fairness
of the exam by allowing learners to focus on those items for
which they feel most prepared. However, the choice actually
decreases the validity and reliability of the instrument
because each student is essentially taking a different test.
Creating parallel essay items (from which students choose a
subset) that test the same educational objectives (skills,
knowledge) is very difficult, and unless students are answering
the same questions that measure the same outcomes, the scoring
of the essay items and the inferences made about student
ability are less valid. While allowing students a choice gives
them the perception that they have the opportunity to do their
best work, you must also recognize that choice makes it
difficult to draw consistent and valid conclusions about
student answers and performance. Consider using several
narrowly focused items rather than one broad item. For many
educational objectives aimed at higher order reasoning skills,
creating a series of essay items that elicit different aspects
of students’ skills and knowledge can be more efficient than
attempting to create one question to capture multiple
objectives. By using multiple essay items (which all students
complete), you can capture a variety of skills and knowledge
while also covering a greater breadth of course content
(Cashin 1987; Gronlund and Linn 1990; Worthen et al. 1993;
Wiggins 1998; McMillan 2001).
3. WHOLISTIC PERSPECTIVE
Different types of questions can be devised for an
achievement test, for instance, multiple-choice,
fill-in-the-blank, true-false, matching, short answer and
essay. Each type of question is constructed according to
different principles. Instructions for each type of question
must be simple and brief. Questions ought to be written in
simple language. If the language is difficult or ambiguous,
even a student with strong language skills and a good
vocabulary may answer incorrectly if his/her interpretation of
the question differs from the author’s intended meaning
(Worthen et al. 1993; Thorndike 1997; Wiggins 1998).
Test items must assess a specific ability or the comprehension
of content developed during the course of study (Gronlund and
Linn 1990). Write the questions as you teach so that your
teaching is aimed at significant learning outcomes. A tester
has to devise questions that call for comprehension and
application of knowledge and skills. Some of the questions must
aim at appraising examinees’ ability to analyze, synthesize and
evaluate novel instances of the concepts. If the instances are
the same as those used in instruction, students are only being
asked to recall (knowledge level). Questions should be written
in different formats, e.g., multiple-choice, completion,
true-false, short answer, etc., to maintain the interest and
motivation of the students. The teacher should prepare
alternate forms of the test to deter cheating and to provide
for make-up testing (if needed). The items should be phrased so
that the content rather than the format of the statements
determines the answer. Sometimes an item contains “specific
determiners” which provide an irrelevant cue to the correct
answer. For example, statements that contain terms like always,
never, entirely, absolutely and exclusively are much more
likely to be false than true. On the other hand, statements
that contain terms such as may, sometimes, as a rule and in
general are much more likely to be true.
Besides, care should be taken to avoid double negatives,
complicated sentence structures, and unusual words. The
difficulty level of the items should be appropriate for the
ability level of the group. Optimal difficulty for true-false
items is about 75 percent, for five-option multiple-choice
questions about 60 percent, and for completion items
approximately 50 percent. However, difficulty is not an end in
itself. The item content should be determined by the
importance of the subject matter. It is desirable to place
a few easy items at the beginning to motivate students,
particularly those of below-average ability (Wiggins 1998;
Halpern and Hakel 2003). The items should be devised in such a
manner that different taxonomy levels are evaluated. Besides,
items pertaining to a specific topic or of a particular type
should be placed together in the test. Such grouping
facilitates scoring and evaluation. It also helps examinees to
think about and answer items that are similar in content and
format without fluctuations of attention or changes of
mindset. Directions to the examinees should be as simple,
clear and precise as possible, so that even students of
below-average ability can clearly understand what they are
expected to do. Scoring procedures must be clearly defined
before the test is administered. The test constructor must
clearly state the optimal conditions for test administration.
Item analysis should be carried out so that necessary changes
can be made if any ambiguity is found in the
items (Gronlund and Linn 1990; Wiggins 1998; McMillan
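The three optimal difficulty figures quoted above (75, 60 and 50 percent) are consistent with a common rule of thumb that places optimal difficulty midway between the chance score and a perfect score. The cited sources do not state this rule, so the sketch below is an assumption that happens to reproduce those figures. The function name is hypothetical.

```python
def optimal_difficulty(n_options=None):
    """Midpoint, in percent, between the chance-level success rate and 100.
    n_options is the number of answer choices; None means a supply-type
    item (e.g., completion) where blind guessing is assumed to score 0."""
    chance = 0.0 if n_options is None else 100.0 / n_options
    return (100.0 + chance) / 2

print(optimal_difficulty(2))     # true-false
print(optimal_difficulty(5))     # five-option multiple choice
print(optimal_difficulty(None))  # completion
```

Under the same assumption, a four-option multiple-choice item would have an optimal difficulty of 62.5 percent.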
4. TEST EVALUATION
A good test has good items. Good test making requires
careful attention to the principles of item evaluation.
After taking an exam, students often judge whether the test
was fair and good. The teacher is also usually interested in
how the test worked for the students.
4.1 Item Analysis
Item analysis is concerned with how difficult an item is and
how well it discriminates between the good and the poor
students. In other words, item analysis provides a numerical
assessment of item difficulty and item discrimination.
It provides objective, external and empirical evidence of
the quality of the items. The objective of item analysis is to
identify problematic or poor items, which may confuse the
respondents, lack a clearly correct response, or contain a
distracter that competes with the keyed answer. Item analysis
comprises item difficulty and item discrimination (Wiggins
1998; Riaz 2008).
4.1.1 Item Difficulty
Item difficulty is determined from the proportion (p) of
students who answered each item correctly. Item difficulty
can range from 0% (no one solved the item) to 100% (everyone
solved it correctly). The goal is usually to have items of all
difficulty levels in the test so that the test can identify
poor, average and good students. However, most items are
designed to be of average difficulty because such items are
more useful. Item analysis gives us the difficulty level of
each item. Optimally difficult items are those that 50% to 75%
of students answer correctly. Items are considered low to
moderately difficult if (p) is between 70% and 85%. Items that
only 30% of students or fewer solve correctly are considered
difficult. The item difficulty percentage can also be expressed
as an item difficulty index in decimals, e.g., .40 for an item
solved by 40% of the test-takers. The index can thus range
from 0 to 1. Items should fall across a variety of difficulty
levels in order to differentiate between good and average as
well as between average and poor students. Easy items are
usually placed in the initial part of the test to motivate
students to take the test and to alleviate test anxiety. The
optimal item difficulty also depends on the question type and
the number of possible distracters (Wiggins 1998; Riaz 2008).
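The difficulty index can be computed directly from a scored response matrix. The sketch below assumes each student's responses are recorded as a list of 0/1 scores, one per item. The data are invented, the function names are hypothetical, and because the bands quoted above overlap between 70% and 75%, the code treats that region as optimal, which is an arbitrary choice.

```python
def item_difficulty(responses):
    """Proportion (p) of students answering each item correctly.
    responses: one list of 0/1 item scores per student."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(student[i] for student in responses) / n_students
            for i in range(n_items)]

def difficulty_band(p):
    """Label p using the bands quoted in the text (overlap resolved as optimal)."""
    if p <= 0.30:
        return "difficult"
    if 0.50 <= p <= 0.75:
        return "optimal"
    if 0.75 < p <= 0.85:
        return "low to moderate difficulty"
    return "outside the bands given in the text"

# Hypothetical scores: 5 students (rows) by 4 items (columns).
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
difficulty = item_difficulty(scores)
for number, p in enumerate(difficulty, start=1):
    print(f"Item {number}: p = {p:.2f} ({difficulty_band(p)})")
```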
4.1.2 Item Discrimination
Another way to evaluate items is to ask, “Who gets this
item correct: the good, the average or the weak students?”
Assessment of item discrimination answers this query. Item
discrimination refers to the percentage difference in correct
responses between the low-scoring and the high-scoring
students. In a small class of 30 students, one can administer
the test items, score them and then rank the students in
terms of their overall score. Next, we separate the upper
15 students and the lower 15 into two groups: the upper
and the lower groups. Finally, we find the percentage of
students in each of the two groups who answered each item
correctly (p). The discrimination (D) power of the item is
then the difference between the percentage for the upper
group and the percentage for the lower group. The higher the
difference, the greater the discrimination power of the item.
An item with a discrimination of 60% or greater is considered
a very good item, whereas a discrimination of less than 20%
indicates low discrimination and the item needs to be revised.
An item with a negative index of discrimination indicates that
the poor students answer correctly more often than the good
students do. Strange! Such items should be dropped from the
test. Difficult items with negative discrimination should
likewise be removed from the quiz. 100% discrimination would
occur if all those in the upper group answered correctly and
all those in the lower group answered incorrectly. Zero
discrimination occurs when equal numbers in both groups answer
correctly. Negative discrimination, a highly undesirable
condition, occurs when more students in the lower group than
in the upper group answer correctly. Items with 25% and above
discrimination are considered good (Wiggins 1998; Riaz
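The upper-lower group procedure described above can be sketched as follows. The example assumes 0/1 item scores per student and splits the ranked class into two halves. The function name and the data are hypothetical, and dropping the middle student when the class size is odd is an assumption rather than part of the procedure described in the text.

```python
def discrimination_index(responses, item):
    """D = percent correct in the upper half minus percent correct in the
    lower half, after ranking students by total score.
    responses: one list of 0/1 item scores per student."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2  # with an odd class size the middle student is left out
    upper, lower = ranked[:half], ranked[-half:]
    p_upper = 100 * sum(student[item] for student in upper) / half
    p_lower = 100 * sum(student[item] for student in lower) / half
    return p_upper - p_lower

# Hypothetical scores: 10 students (rows) by 5 items (columns).
scores = [
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
d_values = [discrimination_index(scores, item) for item in range(3)]
print(d_values)
```

In this invented data set the first item separates the two groups strongly, the second does not separate them at all, and the third is answered correctly more often by the lower group, which is the negative case the text says should be dropped.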
Tests have undergone radical changes in the past sixty
years due to improvements in measurement techniques
and a better understanding of learning processes. Compared
with a lengthy three-hour essay-type examination, a
thirty-minute objective-type paper can assess students more
comprehensively, covering not only knowledge but also
comprehension and application of knowledge. Additionally,
a well-prepared paper can evaluate students objectively and
quickly, so a large number of students in a class is not a
problem. Tests are the goal posts which act as guides and
motivators for students to learn. We all know from our
own experience how students prepare for examinations. They
learn not only what interests them most or what is presented
well, but also what suits the type of paper they expect from
the teacher. For this reason, a well-prepared examination paper
is a guarantee of an effective teaching-learning process.