It is important to note that in each of the three situations in Table 1, the pass percentages of the two examiners are identical; if the two examiners were compared with the usual test for paired 2 × 2 data (McNemar's test), no difference between their performances would be detected. Yet the agreement between the observers differs markedly across the three situations. The key concept here is that "agreement" quantifies the concordance between the two examiners for each of the pairs of grades, not the similarity of the examiners' overall pass rates. The statistical methods used to assess agreement vary with the type of variable studied and the number of observers between whom agreement is sought; these are summarized in Table 2 and explained below. Cohen's kappa (κ) quantifies the agreement between observers while taking into account the agreement expected by chance. Consider two examiners, A and B, who evaluate the answer sheets of the same 20 students in a class and rate each of them as "passed" or "failed", with each examiner passing half of the students. Table 1 shows three different situations that can occur. In situation 1 of this table, eight students receive a "passed" grade from both examiners, eight receive a "failed" grade from both, and four receive "passed" from one examiner but "failed" from the other (two passed by A and the other two by B). Thus, the two examiners agree on 16 of the 20 students (agreement = 16/20 = 0.80; disagreement = 4/20 = 0.20). That sounds quite good.
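As a rough illustration, the observed agreement in situation 1 can be tallied directly from the paired grades. This is a minimal sketch: the data below simply restate the counts given above, and the variable names are ours.

```python
# Hypothetical sketch: observed agreement for the paired pass/fail grades
# in situation 1 of Table 1.
from collections import Counter

# Each tuple is (examiner A's grade, examiner B's grade) for one student.
pairs = (
    [("pass", "pass")] * 8    # both examiners pass the student
    + [("fail", "fail")] * 8  # both examiners fail the student
    + [("pass", "fail")] * 2  # passed by A only
    + [("fail", "pass")] * 2  # passed by B only
)

counts = Counter(pairs)
n = len(pairs)

# Observed agreement: proportion of students with concordant grades.
observed_agreement = (counts[("pass", "pass")] + counts[("fail", "fail")]) / n
print(observed_agreement)  # 0.8
```

Note that the two discordant counts are equal (2 and 2), which is exactly why McNemar's test, which looks only at the discordant pairs, would find no difference between the examiners here.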

However, this does not take into account the fact that some of the grades may have been guesses and that agreement may have arisen by chance alone. Cohen's kappa corrects for this: κ = (observed agreement [Po] − expected agreement [Pe]) / (1 − expected agreement [Pe]). The κ statistic can take values from −1 to 1 and is interpreted somewhat arbitrarily as follows: 0 = agreement equivalent to chance; 0.10–0.20 = slight agreement; 0.21–0.40 = fair agreement; 0.41–0.60 = moderate agreement; 0.61–0.80 = substantial agreement; 0.81–0.99 = near-perfect agreement; and 1.00 = perfect agreement. Negative values indicate that the observed agreement is worse than would be expected by chance. Another interpretation is that kappa values below 0.60 indicate a substantial degree of disagreement. Readers are referred to the following publications on measures of agreement: Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999;8:135-60; Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement.
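The formula above can be sketched in a few lines of Python, applied to the situation-1 grades; the helper name `cohens_kappa` is ours, and the data are the hypothetical counts from Table 1.

```python
# A minimal sketch of Cohen's kappa for two raters with binary grades;
# the lists reproduce situation 1 of Table 1 (8 + 8 concordant, 2 + 2 discordant).
def cohens_kappa(a, b):
    """kappa = (Po - Pe) / (1 - Pe) for two equal-length rating lists."""
    n = len(a)
    categories = set(a) | set(b)
    # Observed agreement Po: proportion of identical ratings.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement Pe: sum over categories of the product of the marginals.
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (po - pe) / (1 - pe)

rater_a = ["pass"] * 8 + ["fail"] * 8 + ["pass"] * 2 + ["fail"] * 2
rater_b = ["pass"] * 8 + ["fail"] * 8 + ["fail"] * 2 + ["pass"] * 2
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.6
```

Here Po = 0.80 and, because each examiner passes half the students, Pe = 0.5 × 0.5 + 0.5 × 0.5 = 0.50, giving κ = 0.30/0.50 = 0.60, i.e. only moderate agreement despite the reassuring 80% raw agreement.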

Lancet 1986;1:307-10; Chen CC, Barnhart HX. Comparison of ICC and CCC for assessing agreement for data without and with replications. Comput Stat Data Anal 2008;53:554-64. An alternative to the Pearson correlation that is better suited to comparing diagnostic tests is the intraclass correlation coefficient (ICC). It was first proposed by Fisher4 and is defined by assuming that the diagnostic test results follow a one-way ANOVA model with a random subject effect; this random effect accounts for the repeated measurements on each subject. The ICC is defined as the ratio of the between-subject variance (\( \sigma_{\alpha}^{2} \)) to the total variance, which is the sum of the between-subject and within-subject (\( \sigma_{e}^{2} \)) variances: \( \mathrm{ICC} = \sigma_{\alpha}^{2} / (\sigma_{\alpha}^{2} + \sigma_{e}^{2}) \). For ordinal data with more than two categories, it is useful to know whether the scores of the different raters differed by a small or a large amount. For example, microbiologists may grade bacterial growth on culture plates as: none, occasional, moderate, or confluent. Here, ratings of a particular plate by two assessors as "occasional" and "moderate" would imply a lower degree of disagreement than ratings of "none" and "confluent".
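Under the one-way random-effects model described above, the variance components can be estimated from the ANOVA mean squares, giving the familiar estimator ICC = (MSB − MSW) / (MSB + (k − 1)·MSW) for k measurements per subject. A sketch, using hypothetical paired measurements (two raters per subject; the function name and data are ours):

```python
# Sketch of the one-way random-effects ICC; data are hypothetical
# measurements by two raters (k = 2) on six subjects.
def icc_oneway(scores):
    """scores: list of per-subject lists of k measurements each.
    Estimates ICC = sigma_alpha^2 / (sigma_alpha^2 + sigma_e^2) via the
    one-way ANOVA mean squares: (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    # Between-subject mean square.
    msb = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square.
    msw = sum((x - m) ** 2
              for row, m in zip(scores, subject_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

measurements = [[10, 11], [12, 12], [8, 9], [15, 14], [11, 13], [9, 8]]
print(round(icc_oneway(measurements), 3))
```

An ICC near 1 indicates that most of the total variance is between subjects, i.e. the raters are nearly interchangeable.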

The weighted kappa statistic takes this difference into account. It therefore gives a higher value when the raters' scores are closer, with the maximum value for perfect agreement; conversely, a larger difference between two ratings yields a lower weighted kappa. The schemes for weighting the distance between categories (linear, quadratic) can vary. When comparing two measurement methods, it is almost always a mistake to present a scatter plot and correlation as a measure of agreement between the paired data. Strongly correlated results can still disagree badly; indeed, large changes in the measurement scale can leave the correlation coefficient unchanged. A genuine measure of agreement is therefore needed: limits of agreement = mean observed difference ± 1.96 × standard deviation of the observed differences. StatsDirect provides a plot of the difference against the mean for each pair of measurements. This plot also shows the overall mean difference, bounded by the limits of agreement.
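A hedged sketch of weighted kappa for ordinal categories follows, using the bacterial-growth scale from the example above; the function name, the plate data, and the weighting helper are ours (linear weights by default, quadratic as an option).

```python
# Sketch of weighted kappa for ordinal ratings. Disagreement weights are
# w_ij = |i - j| (linear) or (i - j)^2 (quadratic), so adjacent categories
# are penalized less than distant ones.
def weighted_kappa(a, b, categories, scheme="linear"):
    """kappa_w = 1 - sum(w * observed) / sum(w * expected)."""
    n = len(a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    power = 1 if scheme == "linear" else 2
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) ** power
            # Observed proportion in cell (i, j) and chance-expected proportion.
            obs = sum(1 for x, y in zip(a, b) if idx[x] == i and idx[y] == j) / n
            exp = (sum(idx[x] == i for x in a) / n) * (sum(idx[y] == j for y in b) / n)
            num += w * obs
            den += w * exp
    return 1 - num / den

scale = ["none", "occasional", "moderate", "confluent"]
plates_a = ["none", "occasional", "moderate", "moderate", "confluent", "none"]
plates_b = ["none", "moderate", "moderate", "occasional", "confluent", "occasional"]
print(round(weighted_kappa(plates_a, plates_b, scale), 3))
```

Because every disagreement here is between adjacent categories, the weighted kappa is higher than an unweighted kappa on the same data would be; with quadratic weights (`scheme="quadratic"`), near-misses are penalized even less relative to distant ones.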

A good overview of this topic is given by Bland and Altman (Bland and Altman, 1986; Altman, 1991). The paired t-test provides a hypothesis test of the difference between population means for a pair of random samples whose differences are approximately normally distributed. Note that a pair of samples, each of which is not drawn from a normal distribution, often yields differences that are approximately normally distributed. Two methods are available for assessing agreement on a continuous variable between observers, instruments, occasions, and so on (Table 2 summarizes the methods by type of variable measured and number of observers). One, the intraclass correlation coefficient (ICC), provides a single measure of the extent of agreement; the other, the Bland-Altman plot, additionally provides a quantitative estimate of how closely the values from the two measurements agree. If the primary purpose of studying a pair of samples is to see how closely the samples agree, rather than to look for evidence of a difference, then limits of agreement are useful (Bland and Altman 1986, 1996a, 1996b). StatsDirect displays these limits on an agreement plot if you tick the agreement box before running a paired t-test. For a more detailed analysis of this kind, see Agreement analysis.
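The paired t-test described above reduces to a one-sample test on the differences. A minimal sketch, assuming the hypothetical measurement pairs below (all names and values are ours; for a p-value the statistic would be referred to a t distribution with n − 1 degrees of freedom):

```python
# Sketch of the paired t statistic computed on the per-subject differences.
import math
from statistics import mean, stdev

method_1 = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.3, 5.6]
method_2 = [5.0, 4.9, 5.8, 5.6, 5.7, 4.6, 5.2, 5.4]

diffs = [x - y for x, y in zip(method_1, method_2)]
n = len(diffs)
d_bar = mean(diffs)     # mean of the paired differences
sd = stdev(diffs)       # their sample standard deviation
se = sd / math.sqrt(n)  # standard error of the mean difference
t_stat = d_bar / se     # paired t statistic, df = n - 1
print(round(d_bar, 4), round(t_stat, 3))
```

Only the differences need to be approximately normal, which is why the test can be reasonable even when neither sample is.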

Note that if one of the tests is a reference or "gold standard", the bias is based on the difference between the result of the new test and the "true value" of the quantity being measured, and is therefore a measure of accuracy.10 In such cases, the CCC can be said to measure both accuracy and precision. But if neither test is a gold standard, it is not appropriate to say that the CCC also provides a measure of accuracy. The question often arises of whether measurements made by two (sometimes more than two) different observers, or by two different techniques, produce similar results. This is referred to as agreement, concordance, or reproducibility between measurements. Such an analysis examines pairs of measurements, either both categorical or both numerical, each pair taken on the same individual (or the same pathology slide or radiograph). Although we cannot "prove the null" that there is no difference between test results, we can use equivalence tests to determine whether the mean difference between test results is small enough to be considered (clinically) unimportant. Bland and Altman's limits of agreement (LOA) approach the problem in this way by providing an estimate of the range within which 95% of the differences between test results should fall (assuming those differences are roughly normally distributed).2,3 The LOA are calculated as \( \bar{d} \pm 1.96 \cdot s_{d} \), where \( \bar{d} \) is the sample mean of the differences and \( s_{d} \) is their sample standard deviation. If the LOA range contains differences that would be considered clinically important, this result indicates that the agreement between the tests is not satisfactory.
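The LOA formula above can be sketched directly; the paired results below are hypothetical and purely illustrative.

```python
# Sketch of Bland-Altman limits of agreement: d_bar +/- 1.96 * s_d.
from statistics import mean, stdev

test_a = [102, 98, 110, 95, 101, 99, 104, 97, 106, 100]
test_b = [100, 99, 107, 96, 103, 98, 101, 99, 104, 102]

diffs = [a - b for a, b in zip(test_a, test_b)]
d_bar = mean(diffs)   # estimated bias between the two tests
s_d = stdev(diffs)    # sample standard deviation of the differences

# About 95% of differences are expected to fall between these limits,
# assuming the differences are roughly normally distributed.
lower = d_bar - 1.96 * s_d
upper = d_bar + 1.96 * s_d
print(round(lower, 2), round(upper, 2))
```

Whether the resulting interval is acceptable is a clinical judgment, not a statistical one: the limits must be compared against the largest difference that would still be clinically unimportant.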

The LOA are also often displayed graphically by plotting, for each subject, the average of the two results against the difference between them. Figure 1 illustrates this tool with a hypothetical example. Bland and Altman caution that the LOA are meaningful only if the mean and variance of the differences between test results are constant across the range of test results.3 In other words, the LOA should not be used if the agreement between the tests varies with the magnitude being measured. Such a situation can arise if the tests give similar results for subjects whose values lie within the normal range but agree poorly for subjects outside that range. This function gives a paired Student's t-test, confidence intervals for the difference between a pair of means, and, optionally, limits of agreement for a pair of samples (Armitage and Berry, 1994; Altman, 1991). As stated above, correlation is not synonymous with agreement.
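The constancy caveat can be checked crudely by asking whether the differences trend with the magnitude of the measurement, e.g. by correlating each pair's average with its difference. This is only a rough screen (a formal check would regress difference on average); the data and helper name are ours, and the data are constructed to show a positive trend of the kind that argues against a single set of LOA.

```python
# Crude check for proportional bias: correlate the per-pair average
# (x axis of a Bland-Altman plot) with the per-pair difference (y axis).
from statistics import mean

test_a = [102, 98, 110, 95, 101, 99, 104, 97, 106, 100]
test_b = [100, 99, 107, 96, 103, 98, 101, 99, 104, 102]

avgs = [(a + b) / 2 for a, b in zip(test_a, test_b)]
diffs = [a - b for a, b in zip(test_a, test_b)]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# A value far from 0 suggests the disagreement grows (or shrinks) with the
# measured quantity, so a single pair of LOA would be misleading.
r = pearson_r(avgs, diffs)
print(round(r, 3))
```

Here the correlation is clearly positive, so for data like these Bland and Altman's warning applies: the agreement varies with the magnitude measured, and a transformation or a regression-based approach would be preferable to fixed LOA.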