Estimating the reliability of cognitive task datasets is commonly done via split-half methods. We review four methods that differ in how the trials are split into parts: a first-second half split, an odd-even trial split, a permutated split, and a Monte Carlo-based split. Additionally, each splitting method could be combined with stratification by task design. These methods are reviewed in terms of the degree to which they are confounded with four effects that may occur in cognitive tasks: effects of time, task design, trial sampling, and non-linear scoring. Based on the theoretical review, we recommend Monte Carlo splitting (possibly in combination with stratification by task design) as being the most robust method with respect to the four confounds considered. Next, we estimated the reliabilities of the main outcome variables from four cognitive task datasets, each (typically) scored with a different non-linear algorithm, by systematically applying each splitting method. Differences between methods were interpreted in terms of confounding effects inflating or attenuating reliability estimates. For three task datasets, our findings were consistent with our model of confounding effects. Evidence for confounding effects was strong for time and task design and weak for non-linear scoring. When confounding effects occurred, they attenuated reliability estimates. For one task dataset, findings were inconsistent with our model but they may offer indicators for assessing whether a split-half reliability estimate is appropriate. Additionally, we make suggestions on further research of reliability estimation, supported by a compendium R package that implements each of the splitting methods reviewed here.
Split Half Reliability Pdf Download
An alternative approach that has been popular with cognitive tasks is split-half reliability. Trials are distributed across two parts, a score is calculated per part, and a two-part reliability coefficient, such as a Spearman-Brown adjusted Pearson correlation, is calculated between the two sets of part scores. Because split-half reliability is estimated via scores of aggregates of trials, practical issues with models for individual trials can be sidestepped. However, is the coefficient obtained an accurate estimate of reliability? One factor that may affect the accuracy of split-half reliability estimates is the method by which the task is split. A variety of methods have been proposed and applied to cognitive task data. To the best of our knowledge, these splitting methods have not yet been comprehensively reviewed nor systematically assessed, so we aimed to conduct such an examination.
The first effect we review is time: This effect can manifest, for example, when participants learn or become fatigued throughout a task. First-second halves splitting assigns trials to each part based on whether they belonged to the first or second half of the sequence administered to a participant. Since first-second splitting assigns early trials to one part and late trials to the other, first-second splitting is confounded with time effects. This confound has been used to argue against first-second splitting (Webb et al., 2006). Time effects have also been used to explain comparatively low reliability estimates found with first-second splitting of a Go/No-Go (GNG) task (Williams & Kaufmann, 2012), a Wisconsin Card Sorting Test (Kopp et al., 2021), and a comparable splitting method of a learning referent task (Green et al., 2016). Time effects can be controlled by balancing early and late trials between parts, which can be achieved by splitting trials based on whether their position in the sequence was odd or even.
Both overestimation and underestimation of reliability can occur depending on how such confounds violate the measurement model assumed by a reliability coefficient. For instance, in a model of essential tau equivalence, correlated errors may inflate reliability while unequal loadings of trials on a true score may attenuate it (Green et al., 2016). Regardless, we assume that any splitting method that controls for task design yields a more accurate reliability estimate, and that confounds between task design and splitting method may yield either overestimations or underestimations in a given task dataset. We conceptualize controlling for task design by balancing conditions between parts as stratified splitting: strata are constructed from the trials that belong to each condition. Each stratum is then split into two parts by applying another splitting method. This approach ensures that strata are balanced between parts and allows direct comparison of splitting methods that are confounded with task design and spitting methods that are not confounded with task design.
Each of the splitting methods reviewed so far (first-second and odd-even in combination with stratification) constructs a single pair of parts for each participant. Regarding this pair as a sample, we will collectively refer to these methods as single-sample methods. Single-sample splits have been popular, perhaps in part for being relatively easy to perform. However, besides that they may be confounded with time and task design, they can also be subject to trial-sampling effects; any single split may yield anomalously high or low reliability estimates, purely by chance. In contrast, resampling methods estimate reliability by averaging multiple coefficients calculated from parts that are composed of randomly sampled trials. Some of these coefficients may yield overestimations of reliability and some underestimations. On the aggregate level, trial sampling effects are attenuated by averaging coefficients from each random sample of trials. Resampled splitting may benefit less from stratification than single-sample splitting since anomalously high or low reliability estimates that are an artifact from a confound between one particular split and the task design are already averaged out. However, stratification could still benefit resampled splitting by making splits more equivalent.
However, acknowledging that resampled distributions can be informative, we have made an explorative assessment of their range, skew, and kurtosis. Additionally, we explored whether reliability estimates of resampled coefficients varied as a function of stratification. We have used the distributions of permutated coefficients to qualify reliability estimates of single-sample splitting methods. The percentile of a reliability estimate of a single-sample splitting in the empirical cumulative distributions of non-stratified permutated coefficients indicates how extreme that estimate is compared to any random splitting method.
Continuing to sampling with replacement, we first note that each splitting method reviewed above constructs parts that are half of the length of the original task. Reliability coefficients estimate the reliability for the full-length task based on half-length scores and certain measurement model assumptions. An assumption shared by various reliability coefficients is a linear relation between task score and trial scores. Linearity is assumed by alpha (Green et al., 2016), Spearman-Brown-adjusted intraclass correlation coefficient (de Vet et al., 2017; Hedge et al., 2018; Warrens, 2017), and Spearman-Brown-adjusted Pearson correlation (Abacioglu et al., 2019; Chapman et al., 2019; Cooper et al., 2017; de Hullu et al., 2011; Enock et al., 2014; Lancee et al., 2017; MacLeod et al., 2010; Schmitz et al., 2019; Waechter et al., 2014). However, task-scoring algorithms may apply a range of non-linear transformations, thereby being incompatible with these coefficients by design.
In summary, we reviewed four effects that may confound splitting methods and so affect the accuracy of split-half reliability estimates: time, task design, trial sampling, and non-linear scoring. We listed two splitting methods that produce a single sample of parts based on trial sequence (first-second and odd-even) and two that randomly resample parts (permutated and Monte Carlo). A fifth method, stratified splitting, balances conditions between parts by constructing strata that are next split with one of the other methods. We have argued that confounding effects are likely to interact with splitting methods as follows: Time is confounded with first-second and controlled for by odd-even. Task design can be confounded with single-sample splitting and controlled for by stratification. Resampled splitting controls for confounds by averaging them out. Non-linear scoring is confounded with first-second, odd-even, and permutated splitting, and controlled for by Monte Carlo splitting. Hence, based on our theoretical review, Monte Carlo splitting, perhaps stratified by task design, could be considered as most robust against the four confounding effects that we listed.
To examine to what degree interactions between confounding effects and splitting methods affected reliability estimates, we systematically applied each splitting method to four different cognitive task datasets. Datasets were selected from three published studies in the fields of gambling-related cognitive processes (Boffo et al., 2018), ethnic stereotypes (Abacioglu et al., 2019), and executive functioning (Hedge et al., 2018). Each dataset was scored with a different non-linear algorithm, commonly used in that literature. Task datasets were from four different paradigms: Approach Avoidance Task (AAT), GNG, IAT, and SST. We compared reliability estimates between splitting methods and task datasets to assess to what degree confounding effects of time, task design, and non-linear scoring were likely to be present. When a splitting method that was confounded with time, task design, or non-linear scoring yielded a different reliability estimate than a splitting method that controlled for these effects, we considered that evidence for the presence of that confounding effect. Note that our hypotheses are non-directional; we did not expect any reliability estimate to be larger or smaller than another estimate, but only expected differences being present between reliability estimates. Exploratively, we examined the combination of resampled splitting and stratification, as well as the shapes of the distributions of the coefficients from resampled splits. 2ff7e9595c
コメント