Thursday, August 16, 2007


SCORING AND CONSISTENCY AMONG MULTIPLE RATERS

In large-scale assessments of academic performance, it is common to use different types of questions. While the national assessments initially included only closed, multiple-choice questions, since 2001 open-ended questions have also been used in the tests administered by the Ministry of Education. Various evaluation systems now use such questions: as Patz, Junker, Johnson and Mariano (2002) report, open-ended questions are common in large-scale educational assessments because they allow more complex forms of educational achievement to be evaluated.

This improvement in assessment tools brings additional complexity: automated systems cannot be used to score the open-ended questions. A test consisting only of closed questions can be scored by any computer program that compares each examinee's response to each question against a template and assigns the corresponding score. Scoring open-ended responses, in contrast, involves a different procedure: usually a group of independent judges [1] reads, evaluates, and scores the answers.
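To make the contrast concrete, here is a minimal Python sketch of that kind of template comparison; the question labels, answer key, and responses are invented for illustration:

ANSWER_KEY = {"q1": "b", "q2": "d", "q3": "a"}  # hypothetical scoring template

def score_closed_test(responses):
    # Compare each response against the key; one point per exact match.
    return sum(1 for q, key in ANSWER_KEY.items() if responses.get(q) == key)

examinee = {"q1": "b", "q2": "c", "q3": "a"}
print(score_closed_test(examinee))  # -> 2

No such mechanical shortcut exists for open-ended answers, which is why human judges are needed.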

Evaluations made by judges always involve some degree of subjectivity, which is reduced by training the raters and by creating scoring manuals with qualification criteria that seek to standardize the judges' assessments (Stemler, 2004). However important and useful such training may be, MacMillan (2000) warns that no matter how systematic the training given to judges, many investigations show that the variability between raters cannot be eliminated altogether. For example, Wolfe (2004) distinguishes three types of effects or biases that may occur even after the judges have been trained:

1. Accuracy / Inaccuracy:
- Involves how well trained or experienced a judge is at assigning accurate ratings. That is, it assumes that there is a standard, correct score, and the aim is to see how closely a rater's scores approach that standard.
- The ability to assign precise scores depends on many factors, both personal (education, training, thinking styles, etc.) and contextual (distractions, social interactions occurring in the rating environment, etc.).

2. Severity / Leniency:
- In these cases there may be raters who consistently give higher scores than the other raters (lenient) or who consistently give lower scores (severe).

3. Centrality / Extremism:
- This effect implies that raters tend to use mainly the intermediate grades (centrality) or only the highest and lowest ones (extremism). A toy numeric illustration of these effects follows the list.
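As a toy illustration (the numbers are invented, not taken from Wolfe), here is how leniency, severity, and centrality would distort a set of hypothetical "true" scores on a 1-to-5 scale:

true_scores = [1, 2, 3, 4, 5, 3, 2, 4]

def clamp(score):
    # Keep transformed scores inside the 1-5 rating scale.
    return max(1, min(5, score))

lenient     = [clamp(s + 1) for s in true_scores]  # consistently one point high
severe      = [clamp(s - 1) for s in true_scores]  # consistently one point low
centralized = [clamp(round(3 + 0.5 * (s - 3))) for s in true_scores]  # pulled toward 3

print(lenient, severe, centralized, sep="\n")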
Given the presence of these biases, it is very important to evaluate inter-rater reliability, which Stemler (2004) defines as the degree of agreement reached by a particular set of judges using a specific assessment instrument at a specific time. This is a property of the assessment situation, not of the instrument itself, and it must therefore be tested each time the assessment situation changes.

Analyses of inter-rater reliability have generally worked from three theoretical models: Classical Test Theory, Generalizability Theory, and the many-facet Rasch model. It is also important to note that inter-rater reliability is not a unitary concept, as there are different perspectives on how to conceptualize and analyze it. Stemler (2004) thus proposed a tripartite classification of the various types of inter-rater reliability estimates, synthesized below.

1. Consensus:
- Consensus estimates rest on the assumption that two or more independent observers should reach agreement about exactly how to apply the various levels of a rating scale to the observed behavior.
- These indices are most useful when the data are nominal in nature, i.e., when the rating scale represents qualitative differences.
- The methods used for this type of reliability involve calculating the percentage of agreement, sometimes called the rate of agreement (Drain, 1998), Cohen's kappa, and other less-used indices such as Jaccard's J, the G index, and the Delta index proposed by Martín and Femia (2004). These indices have the disadvantage of having to be applied to each question and each pair of judges. (A sketch of the first two follows this list.)
- It is important to note that there is a multi-rater variant of kappa, which can be applied when there are more than two raters (Watkins, 2002).
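A minimal Python sketch of two consensus estimates for a pair of judges, percentage of agreement and Cohen's kappa, using invented nominal ratings:

from collections import Counter

judge_a = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

n = len(judge_a)
p_observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n  # raw agreement

# Agreement expected by chance, from each judge's marginal proportions.
freq_a, freq_b = Counter(judge_a), Counter(judge_b)
p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(judge_a) | set(judge_b))

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"agreement = {p_observed:.2f}, kappa = {kappa:.2f}")  # 0.83 and 0.67

Kappa corrects the raw percentage for the agreement two judges would reach by chance alone, which is why it is usually preferred over the bare rate of agreement.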

2. Consistency:
- Consistency estimates assume that two judges need not show consensus in their use of a rating scale, provided that the departures from consensus are applied consistently. That is, judge A may always or almost always assign a score of 1 to certain types of responses, while judge B always or almost always assigns a score of 3 to the same type of response. The difference between their scores is then predictable and can be corrected by means of additive constants.
- This approach is used when the data are continuous and quantitative, but it can also be applied to ordinal variables if they are assumed to represent a continuum along a single dimension.
- Its advantage is that when ratings are consistent across judges, correction strategies can be applied for differences in severity. For example, if judge A consistently gives scores one point above those of judge B, the correlation between the two sets of ratings will be quite high, and the judges' scores can be equated by subtracting one point from the scores of all the people rated by judge A (a sketch follows this list).
- The procedures for consistency estimates include the Pearson correlation (for continuous variables) and the Spearman correlation (for ordinal variables). When there are several judges, Kendall's W can be used (Cairns, 2003; Legendre, 2005).
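The following sketch (with invented scores) shows a consistency estimate and the additive correction described above: judge A scores exactly one point above judge B, the Pearson correlation is perfect, and subtracting the mean difference puts B's scores on A's scale:

from statistics import mean, stdev

judge_a = [3, 4, 2, 5, 4, 3]
judge_b = [2, 3, 1, 4, 3, 2]  # same ordering of examinees, one point lower

def pearson(x, y):
    # Sample covariance divided by the product of sample standard deviations.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

r = pearson(judge_a, judge_b)             # 1.0: perfectly consistent judges
shift = mean(judge_a) - mean(judge_b)     # estimated severity difference (1.0)
equated_b = [s + shift for s in judge_b]  # judge B's scores on judge A's scale
print(r, equated_b)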

3. Measurement:
- Measurement estimates postulate that all the available information about the judges should be used when producing a final score for each person tested. For example, the effects of severity or leniency can be controlled when scores are assigned.
- They are used when the levels of the rating scale are intended to represent different levels of a unidimensional latent variable. They are also useful when there are several judges and it is impossible for every judge to rate every question, that is, when the data matrix is incomplete and connected through common persons (a person or group of persons is rated independently by different judges, but not all people are rated by all judges).
- The procedure most often used for such estimates is many-facet Rasch analysis. When the judges' ratings are modeled as a facet, the probability of a given response to a question is assumed to be a function of the ability of the person answering, the difficulty of the question, and the severity of the rater (Bond and Fox, 2001); a sketch of this relationship follows the list. An alternative proposed by Patz, Junker, Johnson and Mariano (2002) is the Hierarchical Rater Model (HRM), which, in the context of Generalizability Theory, uses the latent-ability distributions of Item Response Theory instead of true-score distributions and normality assumptions.
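As a minimal sketch of the many-facet idea (parameter values invented, all in logits), the probability of a correct or acceptable rating on a dichotomous item can be written as a logistic function of person ability minus item difficulty minus rater severity:

import math

def rasch_probability(ability, difficulty, severity):
    # P(score = 1) under a three-facet model: person, item, and rater.
    logit = ability - difficulty - severity
    return 1 / (1 + math.exp(-logit))

# Same person and same item, scored by a severe vs. a lenient judge:
print(rasch_probability(1.0, 0.5, 0.8))   # severe rater  -> about 0.43
print(rasch_probability(1.0, 0.5, -0.8))  # lenient rater -> about 0.79

This separation of facets is what allows the model to disentangle a rater's severity from an examinee's ability when the data matrix connects judges through common persons.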



REFERENCES

Barrett, P. (2001). Assessing the reliability of rating data. Retrieved April 25, 2005, from http://www.liv.ac.uk/~pbarrett/rater.pdf.
Bond, T. and Fox, Ch. (2001). Applying the Rasch Model: Fundamental Measurement in the Human Sciences. New Jersey: Lawrence Erlbaum Associates.
Cairns, P. (2003). MSc in Research Methods Statistics: Examples of Correlations. UCL Interaction Centre. Retrieved April 26, 2005, from http://www.uclic.ucl.ac.uk/paul/PsyStats/4NonParaCorrel/4Examples.pdf.
Drain, M. (1998). Quantification of content validity for judging criteria. Journal of Psychology, 6.
Legendre, P. (2005). Species associations: The Kendall coefficient of concordance revisited. Journal of Agricultural, Biological, and Environmental Statistics, 10(2), 226-245.
MacMillan, P. D. (2000). Classical, generalizability, and multifaceted Rasch detection of interrater variability in large, sparse data sets. The Journal of Experimental Education, 68(2), 167-190.
Martín, A. and Femia, M. (2004). Delta: A new measure of agreement between two raters. British Journal of Mathematical & Statistical Psychology, 57, 1-19.
Patz, R. J., Junker, B. W., Johnson, M. S. and Mariano, L. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27(4), 341-384.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Retrieved March 6, 2005.
Watkins, M. W. (2002). MacKappa [computer software]. Pennsylvania State University: Author.
Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science, 46(1), 35-51.
[1] In some cases, the people who perform this function are referred to as raters, assessors, evaluators, observers, or coders.
