Keywords
higher education, novice, interrater reliability, study selection, study within a review, SWAR, systematic review, training
Systematic reviews can be resource-intensive and require timely completion, yet the limited availability of experienced reviewers often necessitates incorporating novice members into review teams. The purpose of this Study Within A Review (SWAR) will be to determine whether training and the level of experience within the screening pair affect the reliability of decisions made by novice screeners during study selection for a systematic review.
A 2 (training: task-specific, minimal guidance) × 2 (experience level of screening partner, ‘Reviewer 1’: moderate experience, minimal experience) pilot randomised trial will be conducted within a host systematic review in the topic area of depression and psychosocial functioning. Participants (N = 12), consisting of higher education students with no prior experience in evidence synthesis, will be randomised to one of the four conditions to complete a standardised study selection task at title/abstract level (k = 219 records) on Covidence systematic review screening software, blindly and independently. Total participation time is estimated at 5 hours. Screening decisions made by participants will be assessed for reliability against the consensus-based decisions of two reviewers with content and methodological expertise (expert standard), through calculation of chance-corrected Cohen’s kappa and percentage agreement, and then compared across the conditions. Secondary outcomes will include reliability within the screening pair (participant and allocated screening partner), validity of screening decisions (false positives, false negatives, sensitivity, specificity), feasibility measures, including time taken to complete the study selection task and success of blinding, as well as acceptability.
Findings will be used to inform the design of subsequent trial work to determine the efficacy of training and screener pairing for study selection. Ultimately, these insights will help to build capacity among novice screeners to engage with evidence synthesis and work alongside experienced review teams.
Northern Ireland Network for Trials Methodology Research SWAR Repository Store: SWAR 38.
Systematic reviews have gained prominence, with a 20-fold increase in the number of systematic reviews produced from 2009 to 2019 (Hoffmann et al., 2021). This type of review involves a rigorous synthesis of studies on a chosen topic, aiming to assess the current state of knowledge and identify gaps in the literature through a systematic approach spanning literature searching, study selection, data extraction, quality appraisal or assessment and, finally, synthesis (Siddaway et al., 2019). For this reason, systematic reviews are highly regarded as a basis for evidence-based policy and practice recommendations. However, systematic reviews require time and significant human resources for completion due to their comprehensive and rigorous approach to identifying and synthesising all relevant literature in a given topic area (Borah et al., 2017). This can present a barrier to timely completion and necessitates that consideration be given to how efficiencies in the process can be improved, whilst also maintaining high quality standards.
Specifically, study selection is a resource-intensive step in the process. As per international best practice recommendations (Edwards et al., 2002; Waffenschmidt et al., 2019), study selection should involve at least two independent reviewers who ‘screen’ study records and who can thus be referred to as ‘screeners’ for this specific stage of the systematic review (Stoll et al., 2019). By accurately removing irrelevant records at the earliest stages, study selection is key to efficiency in the process. Moreover, study selection is integral to ensuring that relevant studies proceed through to inclusion in the systematic review, supporting a complete and valid synthesis of evidence relevant to the review question (Polanin et al., 2019). The use of two screeners offers several benefits, such as ensuring consistent application of the eligibility criteria (i.e., determining which records should be included or excluded in order to address the review question) to avoid systematic errors, as well as reducing the likelihood of random errors by one screener compromising the accurate selection of records (Waffenschmidt et al., 2019). Having at least two independent screeners on each study record helps to ensure a transparent and reproducible method in which the eligibility criteria have been applied appropriately and consistently throughout, as agreed by both screeners (Edwards et al., 2002). This can be evaluated through the calculation of interrater reliability, or a chance-corrected estimate known as Cohen’s kappa (Cohen, 1960), which is more conservative and accounts for spurious, chance agreement.
However, the recommendation of two or more reviewers to screen each study record is not always feasible. Researchers with content and methodological expertise (Lasserson et al., 2021) may not be available to complete vast amounts of screening for study selection within defined project periods. Furthermore, systematic reviews are often constrained by tight deadlines and limited budgets, which can hinder their completion within the required timeframe (Waffenschmidt et al., 2019). Consequently, systematic review teams may comprise not only experienced researchers (Lasserson et al., 2021) but also more novice team members, such as student supervisees who undertake research as part of their programme of study (e.g., Tendal et al., 2009). Although rapid review methods present a solution, allowing for a timely response to inform evidence-based decision making, these methods are shaped by situational constraints and the urgency of the need for evidence (Moons et al., 2021). For example, only one database may be searched or study selection undertaken by just one screener. Rapid reviews, although adhering to core standards of systematic review (Garritty et al., 2021; Nussbaumer-Streit et al., 2023), are not a substitute for the comprehensiveness and rigour of the full systematic review process. Thus, understanding how efficiencies can be achieved in the systematic review process is still of utmost interest.
Study selection methods are relatively underexplored (k = 11; Robson et al., 2019) compared to other steps in the systematic review process, such as data extraction and study appraisal. Current guidance primarily focuses on the study selection procedure itself (e.g., single vs. dual screening, Gartlehner et al., 2020; Waffenschmidt et al., 2019; titles-first vs. title and abstract, Mateen et al., 2013), while practical guidelines on the composition of the screening team, particularly in terms of experience level, have received little attention. The few studies that address this, such as Cooper et al. (2006) and Ng et al. (2014), have found that the experience level of screeners can impact performance during study selection. Cooper and colleagues (2006), for example, examined how novice screeners, either professional dieticians or graduate nutrition students, compare with an expert standard (PhD-trained nutritional scientists) during study selection for records regarding the diet-disease relationship. Findings indicated that interrater reliability within the pairs of professionals or pairs of students did not differ significantly; nonetheless, important differences between these pairs and the expert standard were observed. Novice screener pairs, whether professionals or students, may remove potentially eligible studies during title and abstract screening and thus impact the accuracy of study selection. This is further underscored by Ng et al. (2014), who found that study selection undertaken by medical students with no prior experience in systematic reviews varied to a large degree, with only a modest level of agreement (sensitivity) with the experienced reviewer.
Further, no studies have been identified on the role of training to improve study selection outcomes among novice screeners, although training has been assessed for other steps of the systematic review process, such as quality appraisal (e.g., Acosta et al., 2020; da Costa et al., 2017; Oremus et al., 2012). For example, Acosta and colleagues (2020) explored the effect of a 2-hour structured training on reducing errors and bias during the quality appraisal stage of a systematic review among 14 doctoral students with no prior experience in systematic reviews. This training included a practice activity in which the novices compared their scores across the nine criteria of the Methodological Quality Questionnaire for a given study record to those of the expert, discussing hits (agreements) and misses (disagreements) for the practice activity. The student reviewers were randomly assigned one of three study records of varying quality for appraisal. Following independent completion of the study quality appraisal task, and when compared to expert raters, 43% of the novice student reviewers accurately assessed 7 out of 9 (78%) methodological quality criteria when evaluating the quality of the study record. Training-based benefits have not always been observed among reviewers with already established expertise (Fourcade et al., 2007) or even among novice student reviewers (Oremus et al., 2012) during the quality assessment phase. Nonetheless, when formats of training (e.g., intensive vs. minimal) have been investigated among novice student reviewers, findings have suggested that the content of training could influence the likelihood of positive training effects. da Costa et al. (2017) compared minimal training, consisting of an introductory 1-hour lecture on risk of bias assessment, with intensive training, which included the same lecture but with specific assessment instructions provided by an experienced reviewer (vs. reading material provided to the minimal training condition). Further, the student reviewers who received intensive training then undertook a practice assessment in a purposely selected sample of 10 records, with assessment decisions then discussed with the experienced reviewer. Overall, this produced more favourable kappa estimates for both within- and between-group interrater reliability, as novice reviewers who received the intensive training were afforded an opportunity to calibrate their assessments to those of the experienced reviewer during the practice task. This underscores the value of specific training in enhancing the ability of novice reviewers to engage with systematic review processes.
Alongside the role of training, there is reason to believe that pairing novice screeners with more experienced review team members when undertaking study selection could help improve reliability estimates. As evidenced by da Costa et al. (2017), interrater reliability estimates can be improved when novice reviewers are afforded the opportunity to align their decisions with those of the experienced reviewer. During study selection, the use of software such as Covidence (Veritas Health Innovation) can help to streamline the process among screeners and promptly notify them when a conflict (disagreement) between screeners is detected, offering an opportunity for novice screeners to subsequently adjust and re-calibrate their performance. As such, the experience level of the screening partner is also of interest. To the best of our knowledge, studies that have examined novice reviewer performance in dual assessment tasks have only ever investigated novice-novice pairings (e.g., da Costa et al., 2017; Oremus et al., 2012). Although experience level has not always been associated with more valid and reliable performance during systematic review processes (e.g., data extraction; Horton et al., 2010), it is plausible that novice reviewers could use conflict detection to recognise patterns of disagreement during study selection as a result of exposure to more expert decision-making and adjust their approach for subsequent study records accordingly. As per guidelines, explicit discussions should take place within the review team to resolve conflicts and consolidate the rationale behind the decision to include or exclude a study record (Lefebvre et al., 2024). Nonetheless, the active pairing of a novice reviewer with a more experienced reviewer could offer a referential standard through real-time feedback during the screening process, helping to improve overall efficiency. Real-time feedback on conflict detection during dual screening with an experienced reviewer could help to consolidate the impact of training; however, this has not yet been investigated.
Systematic reviews are considered the highest form of evidence, guiding policy and practice through the transparent and rigorous synthesis of research (Haddaway & Pullin, 2014). The reliability of study selection is integral to this process, particularly when review teams include more novice screeners. Inconsistent selection processes can directly influence which studies are included, ultimately shaping the available evidence base upon which recommendations are developed. Thus, the aim of this pilot study is to determine whether training and the level of experience within the screening pair affect the reliability of decisions made by novice screeners during study selection for a systematic review. Drawing on the Study Within A Review (SWAR; Devane et al., 2022) methodology, whereby a study is embedded within a systematic review in order to address questions of methodological uncertainty during the review process, we propose the following objectives:
(i) To examine whether the experience level of the screening partner and the provision of training affect the reliability of screening decisions made by novice, student screeners (between-group reliability).
(ii) To examine whether pairing a novice, student screener with an experienced screener (vs. novice screener), alongside the provision of task-specific training (vs. minimal guidance), improves reliability estimates within the screening pair during the study selection process (within-group reliability).
(iii) To explore the feasibility and acceptability of training materials for study selection by novice, student screeners.
Specifically, we hypothesise that reliability in study selection will be highest among the group of novice screeners that receive task-specific training and are paired with an experienced screener.
This pilot study will be conducted in accordance with SWAR guidance (Devane et al., 2022) and is pre-registered on the Northern Ireland Network for Trials Methodology Research SWAR Repository Store (SWAR 38; see ‘Extended Data’ section below, Ahern et al., 2025), with notice of acceptance on 4 February 2025. The protocol is reported according to SPIRIT guidelines for trial protocols (Standard Protocol Items: Recommendations for Interventional Trials; Chan et al., 2013), and supplemented by the CONSORT extension to pilot trials (Eldridge et al., 2016), as is recommended for the reporting of pilot and feasibility study protocols (Thabane & Lancaster, 2019). See the completed SPIRIT and CONSORT extension checklists, available from the OSF link provided in the ‘Reporting Guidelines’ section below. Any modifications to the protocol will be recorded and subsequently presented in the pilot study report.
This SWAR is a pilot study that will employ a 2 × 2 factorial randomised controlled trial (RCT) design. Participants will be randomised, using a random number generator for simple randomisation, to one of four conditions: (i) task-specific training with a novice screening partner, (ii) task-specific training with an experienced screening partner, (iii) minimal guidance training with a novice screening partner, and (iv) minimal guidance training with an experienced screening partner (see Figure 1). A member of the research team (AW) has computer-generated the randomisation sequence (conditions A, B, C, D) in advance of recruitment and will be kept blind to treatment allocation as well as data analysis. Recruitment and onboarding of participants will be separately managed by an unblinded research assistant coordinator (TA), who, following participant informed consent, will contact the team member handling randomisation to determine the next pre-determined allocation in the sequence. The team member involved in data analysis (EA) will remain blinded to allocation to minimise the risk of bias in the analysis. Participants will remain blinded to their allocated group throughout the study. Ethical approval for this study has been obtained from the Faculty of Education and Health Sciences Research Ethics Committee at the University of Limerick (EHS Approval Number: 2024_06_24_EHS).
Figure Note. Participants (N =12) will be randomised to a training condition (task-specific or minimal guidance) and screening partner (novice screener or experienced screener). The consensus-based decisions of the expert reviewers (E1, E2) will inform the expert reference standard. E1 = Expert reviewer 1 with content and methodological expertise; E2 = Expert reviewer 2 with content and methodological expertise.
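To illustrate the simple randomisation procedure described above, a minimal R sketch is provided below. The seed, condition labels, and output format are illustrative assumptions only and do not represent the exact sequence-generation code used by the team.

```r
# Illustrative sketch of simple randomisation for the 2 x 2 factorial design.
# Condition labels and the seed are assumptions for demonstration only.
set.seed(2025)  # hypothetical seed for reproducibility

conditions <- c(
  A = "task-specific training + novice partner",
  B = "task-specific training + experienced partner",
  C = "minimal guidance + novice partner",
  D = "minimal guidance + experienced partner"
)

# Simple randomisation: each of the 12 participants is independently
# assigned one of the four conditions (group sizes are not forced to be equal).
allocation <- sample(names(conditions), size = 12, replace = TRUE)

# The pre-generated sequence would then be applied in order of enrolment.
data.frame(participant = 1:12, condition = allocation)
```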
Study records. An on-going systematic review project led by the PI (PROSPERO registration: CRD42022324367) will be used to host the SWAR. This host review addresses non-pharmacological treatments for psychosocial functioning in major depressive disorder. Please see ‘Extended Data’ (Ahern et al., 2025) for a sample database search strategy. A comprehensive search of the literature has been undertaken across 6 databases, identifying in excess of 12,500 potentially eligible study records from inception through to the most recent updated search in August 2024. From this, a sample of 2% (k = 219) has been identified for use in the study selection task for this SWAR. This sample has been selected by the PI to represent the full breadth of eligibility criteria so as to best capture the efficacy of training to promote high reliability during title/abstract study selection. Based on calculations using the kappaSize package (Rotondi, 2018) and the PowerBinary function in R, 219 records were determined for the study selection task in order to achieve 80% power to detect a significant difference in reliability, assuming kappa = 0.51 (weak/fair agreement; McHugh, 2012; Orwin, 1994) between the expert standard and the minimal guidance training + novice screening partner condition relative to kappa = 0.71 (moderate/good agreement; McHugh, 2012; Orwin, 1994) between the expert standard and the task-specific training + experienced screening partner condition, with an alpha level of .05, and assuming a 20% prevalence of ‘include’ decisions at title/abstract level. Consistent with other SWAR methodologies assessing rater agreement (Cooper et al., 2006; da Costa et al., 2017), power calculations are used to determine the adequate number of study records that need to be assessed to ensure the calculation of reliable kappa statistics. Prior to finalising the selection of records to be used for the screening task, a pilot sample of k = 20 records was selected and screened within the review team (EA, AW), with the rationale for inclusion/exclusion documented. Interrater reliability, that is, both reviewers agreeing ‘YES’ to include or ‘NO’ to exclude each respective study record, was 100% (κ = 1.00), suggesting perfect agreement (Cohen, 1960; McHugh, 2012). The reference list of study records for the study selection task is available on the project OSF; see ‘Extended Data’ (Ahern et al., 2025).
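As a sketch of the power calculation reported above, the following call to the kappaSize package reflects the stated parameters; the exact arguments used by the review team are an assumption rather than a record of their code.

```r
# Sketch of the power calculation described above, using the kappaSize package.
# Arguments reflect the reported parameters (kappa0 = 0.51, kappa1 = 0.71,
# 20% prevalence of 'include', two raters, alpha = .05, power = .80);
# the exact call used by the review team is assumed, not confirmed.
# install.packages("kappaSize")
library(kappaSize)

PowerBinary(
  kappa0 = 0.51,  # weak/fair agreement under the null
  kappa1 = 0.71,  # moderate/good agreement under the alternative
  props  = 0.20,  # assumed prevalence of 'include' at title/abstract level
  raters = 2,
  alpha  = 0.05,
  power  = 0.80
)
# The returned value indicates the number of records required,
# reported in the protocol as k = 219.
```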
Expert standard. All study records have been screened independently by two members of the review team (EA, AW) with several years of methodological experience in evidence synthesis and content expertise in the area of depression (M = 9.5, SD = 1.41 years of experience). As per the criteria proposed by Horton et al. (2010), EA and AW meet the standard of ‘substantial experience’ in systematic reviews (>7 years involved in systematic reviews, >7 reviews for which involved in at least study selection). Screening has been completed on separate Covidence project sites to ensure complete independence in decision making. Any screening discrepancies have been resolved through discussion, with the consensus-based decisions then used as the expert standard. The interrater reliability was 93.15% agreement (κ = 0.77, SE = 0.06, 95% CI [0.66, 0.88]).
Screening partner (‘Reviewer 1’). A novice member of the review team (TA) with ‘minimal experience’ (≤2 years involved in systematic reviews, ≤2 systematic reviews for which involved in at least study selection) and another member (SD) with ‘moderate experience’ in systematic review methodology (4–6 years involved in systematic reviews, 4–6 systematic reviews for which involved in at least study selection), as per the classifications proposed by Horton et al. (2010), have independently completed screening for the selected records on separate Covidence project sites to avoid influencing their respective screening decisions. The screening decisions made by TA and SD will then be used for the ‘Screening Partner’ condition, corresponding to the novice and experienced screening partner, respectively. The screening decisions have been loaded on to Covidence project sites in advance and described as the screening decisions made by ‘Reviewer 1’. Participants will be randomly allocated to one of two screening partners in combination with a training condition (task-specific, minimal guidance; see ‘Training’ section below).
A total of 12 participants (N = 12) will be recruited within an anticipated 3-month timeframe. Recruitment commenced in February 2025. As this SWAR is designed as a pilot study, power analyses were not used to determine sample size and it was deemed appropriate to recruit a small number of participants per trial condition, consistent with other SWAR methodologies (e.g., Acosta et al., 2020; Cooper et al., 2006; da Costa et al., 2017; McGuire et al., 1985; Oremus et al., 2012). Further, the small sample size is appropriate given that this pilot trial is not intended for hypothesis testing but rather for the exploration of hypotheses, and to assess the feasibility and acceptability of the proposed methods to inform the design of a subsequent fully powered trial.
Participants will be recruited from higher education institutions using convenience sampling of health sciences students from the University of Limerick in the first instance, given the small target sample size and the access afforded to the research team through their roles in teaching at undergraduate and postgraduate level within the discipline. Information regarding the study will be disseminated through course site postings on the virtual learning environment, with word-of-mouth and snowball sampling methods also employed. The eligibility criteria for participation include: (i) being at least 18 years of age; (ii) being enrolled as a student in higher education; (iii) having no prior training or experience in conducting evidence synthesis; and (iv) having access to a laptop or computer with a reliable internet connection. Participants will be entered into a draw to win a monetary prize (2 x €50 vouchers) as remuneration for their participation.
Both the task-specific training and minimal guidance training conditions will receive a standardised outline of eligibility criteria for study selection (see ‘Extended Data’, Ahern et al., 2025). Additionally, a recorded online training session with narration, embedded video content, and a slide deck will be provided, but the extent of training will differ between the conditions (task-specific vs. minimal guidance). Covidence Support (Veritas Health Innovation) provides a bank of online training videos which have served as a point of reference in developing the content for this session and the instructional guide on use of the software.
The content of the training has been piloted within the review team before finalisation. Both the training and control have been time-matched for an estimated completion time of 45 minutes, consisting of the following core content: an introduction to the review topic area, what is a systematic review, study selection and eligibility criteria, an introduction to Covidence systematic review software, guidelines for study selection, and instructions for the study selection task.
Control (minimal guidance training). In addition to the core training content, the control comparator training session includes extended content on what a systematic review is, considering when to (or when not to) conduct a systematic review, and instruction on some reporting guidelines, such as the PRISMA flow diagram to visualise the study selection process. Specifically, the content relevant to study selection and Covidence is generalised and not specific to the host review.
Intervention (task-specific training). In addition to the core training content, an extended discussion on the eligibility criteria for the host review is provided, offering clarification and recommendations for screening, followed by a demonstration of screening for a sample of 10 records in the review topic area (not included in the sample of records for the study selection task), with a clear rationale provided for the screening decisions made. Similar online training approaches with a sample demonstration of records, reflecting what is likely to be encountered during screening, have been implemented for novice screeners (‘crowd,’ non-specialist) on systematic review projects (Noel-Storr et al., 2021). The purpose of this training is to refine knowledge and understanding specific to engaging with study selection for the host review.
For ease of access, all details will be hosted on a virtual learning environment project site, separate for each training condition. A sample of the project site interface is available from the project OSF (see ‘Extended Data’, Ahern et al., 2025). Across both the task-specific and minimal guidance conditions, participants will start at the first unit, ‘TESS Study: Introduction’, followed by ‘The Research Team’, and then proceed through Steps 1–5 accordingly: (1) ‘Demographic and Background Questionnaire’, (2) ‘Training Materials’, (3) ‘Access the Study Selection Task’, (4) ‘The Study Selection Task’, (5) ‘Post-Task Questionnaire’.
Interrater reliability. All relevant performance outcome data will be recorded on Covidence systematic review screening software. Participants will be instructed to select between two options, ‘YES’ (include) and ‘NO’ (exclude). This will allow for the calculation of an unweighted, chance-corrected Cohen’s kappa (Cohen, 1960), as the response decisions are dichotomous. Based on the criteria set out by McHugh (2012), Cohen’s kappa values can be interpreted as no agreement (κ ≤ 0.20), minimal (κ = 0.21–0.39), weak (κ = 0.40–0.59), moderate (κ = 0.60–0.79), strong (κ = 0.80–0.90), or almost perfect (κ > 0.90). Furthermore, the percentage agreement will be reported alongside Cohen’s (1960) kappa, as per recommendations (McHugh, 2012), given that the kappa statistic can underestimate reliability due to its underlying assumptions about chance agreement between raters.
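For illustration, the McHugh (2012) interpretation bands described above could be applied programmatically as follows; the helper function name is hypothetical and not part of any package.

```r
# Illustrative helper mapping a kappa estimate to the McHugh (2012)
# interpretation bands used in this protocol. The function name is
# hypothetical and not part of any package.
interpret_kappa <- function(kappa) {
  cut(
    kappa,
    breaks = c(-Inf, 0.20, 0.39, 0.59, 0.79, 0.90, Inf),
    labels = c("none", "minimal", "weak", "moderate", "strong", "almost perfect")
  )
}

interpret_kappa(c(0.51, 0.71, 0.77))  # e.g., "weak", "moderate", "moderate"
```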
Primary and secondary outcomes are:
Primary (i) between-group reliability, defined as the reliability between the decisions of the participants and the expert standard, calculated as percentage agreement and unweighted, chance-corrected Cohen’s kappa (Cohen, 1960). The mean kappa estimate, standard error, and 95% CI will be generated for each of the four trial conditions (e.g., task-specific training and experienced screening partner; minimal guidance and novice screening partner) to enable comparison across the conditions. Between-group reliability will be calculated as κ(ns, expert standard) = (Po1 − Pe1) / (1 − Pe1), where Po1 is the proportion of records for which the decision by the participant novice screener is in agreement with the expert standard, and Pe1 is the proportion of records for which agreement between the participant novice screener and the expert standard would be expected by chance.
Secondary (i) within-group reliability, defined as the reliability within each pair of participant and screening partner, calculated as κ(ns, screening partner) = (Po2 − Pe2) / (1 − Pe2), where Po2 is the proportion of records for which the decision by the participant novice screener is in agreement with their allocated screening partner, and Pe2 is the proportion of records for which agreement with the allocated screening partner would be expected by chance.
Similarly, the mean kappa estimate, standard error, and 95% CI will be generated to allow for comparison across conditions; and (ii) the validity of screening decisions made by participants relative to the expert standard, including the total number of false positives (a ‘YES’ vote by the participant to include an irrelevant record), false negatives (a ‘NO’ vote by the participant to exclude a relevant record), sensitivity (the ability of the participant to vote ‘YES’ to include relevant records), and specificity (the ability of the participant to vote ‘NO’ to exclude irrelevant records).
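As a worked illustration of how these reliability and validity quantities relate to a participant’s decisions against the expert standard, a minimal R sketch is shown below; the example data and column names are hypothetical.

```r
# Sketch: deriving agreement, Cohen's kappa, and validity measures from a
# participant's title/abstract decisions against the expert standard.
# The data frame and its column names are hypothetical.
decisions <- data.frame(
  participant = c("YES", "NO", "YES", "NO", "NO", "YES"),
  expert      = c("YES", "NO", "NO",  "NO", "YES", "YES")
)

# 2 x 2 confusion table (participant decisions vs. expert standard)
tab <- table(participant = decisions$participant, expert = decisions$expert)

tp <- tab["YES", "YES"]  # true positives
fp <- tab["YES", "NO"]   # false positives: 'YES' vote on an irrelevant record
fn <- tab["NO", "YES"]   # false negatives: 'NO' vote on a relevant record
tn <- tab["NO", "NO"]    # true negatives
n  <- sum(tab)

# Observed agreement (Po) and chance agreement (Pe), as in the formulas above
po <- (tp + tn) / n
pe <- ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
kappa <- (po - pe) / (1 - pe)

sensitivity <- tp / (tp + fn)  # ability to include relevant records
specificity <- tn / (tn + fp)  # ability to exclude irrelevant records

round(c(agreement = po, kappa = kappa,
        sensitivity = sensitivity, specificity = specificity), 2)
```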
Several measures will be collected from participants through use of the online survey software tool, Qualtrics.
Sociodemographic, education, and experience level. A Demographic and Background Questionnaire will be administered at baseline to collect data relevant to experience in evidence synthesis, level of education, area of study, age, and gender.
Feasibility and Acceptability Outcomes. Following screening completion, participants will again be asked to complete a brief Qualtrics questionnaire to collect data on our primary feasibility and acceptability outcomes, namely:
Efficiency. Participants’ self-report of the time taken, in hours and minutes, to complete the study selection task.
Success of blinding. Participants will be asked to indicate what training condition they believe they were allocated to in order to determine the success of participant blinding procedures.
Perceived usefulness of training and screening partner decisions in study selection. The perceived usefulness of the training and pairing with the screening partner in the completion of study selection will be assessed using 7 items on a Likert scale from 1 (strongly disagree) to 7 (strongly agree). Items include, for example, ‘The screening decisions of ‘Screener 1’ (whether there was a conflict in decision or not) helped me to adjust my approach to the study selection process’ and ‘The training was useful in assisting me to complete study selection’.
Secondary feasibility measures include:
Motivation/interest. In response to the question ‘What was your main interest or motivation in participating in this study?’, participants will be asked to rank the options in order of their main motivation and/or interest, including: interest in the topic area, interest in systematic review methods, the chance to get involved in research, skill development/capacity building in systematic review methods, and other (please specify).
Recommendations for improvement. Participants will be provided with an open-ended question: ‘What, if anything, would have improved the overall experience?’
Individuals who express interest in participation will be provided with an information sheet and access to an online Qualtrics survey by the study research assistant (TA), where participants can record their provision of informed consent (see ‘Extended Data’, Ahern et al., 2025). Consenting participants will then be enrolled in the study following self-declaration that they meet the eligibility criteria. A unique identification number will be assigned to allow participant questionnaire responses and reliability data from the study selection task on Covidence to be linked. The study research assistant (TA) will generate a pseudonymisation key, only accessible to them during the study, and this will be stored securely and separately on the institution’s cloud-based storage. Participants will proceed to complete a brief demographic and background questionnaire on Qualtrics. Participants will then be randomised to a trial condition (for the randomisation process, see ‘Study Design’) and provided with instructions on how to access the training site on the virtual learning environment and their Covidence account, which has been set up on their behalf by the study research assistant (TA) through an institutional licence. Participants will be able to select their own log-in password, with their e-mail as the username.

Covidence will be set to ‘dual screener’ mode, meaning that each study record requires a decision by two independent screeners. For the purposes of this study, only the ‘YES’ and ‘NO’ screening options will be used to avoid difficulties in ascertaining the interrater reliability of responses assessed as ‘MAYBE’. The screening decisions of ‘Reviewer 1’ will be loaded onto the respective review page on Covidence prior to the participant commencing (as ‘Reviewer 2’); please refer to the section ‘Screening Partner (‘Reviewer 1’)’ for further detail. For each record screened, the participant will be notified by Covidence as to whether they are in agreement or in conflict with ‘Reviewer 1’. The participant need only complete the study selection task at the title/abstract level, with no requirement to proceed to conflict resolution or full-text screening.

Instructions on the study selection task will be provided in the training material (see ‘Extended Data’, Ahern et al., 2025). In brief, participants will be instructed to read the title and abstract of each study record and make a decision based on the eligibility criteria using their best judgement. If there is reason to believe that a study may be eligible, but this cannot be confirmed at the title/abstract stage, then a conservative decision should be made to retain the study record (‘YES’ vote). If a decision conflicts with that of Reviewer 1, this is a typical part of the study selection process; no further action is needed to resolve conflicts, as this will be managed by the review team. In order to maintain the integrity of the independent screening process (between Reviewer 1 and Reviewer 2), a decision should not be changed once made. Participants will also be asked not to discuss their training with others to minimise the potential risk of contamination across the trial conditions. It will be recommended that the study selection task be completed in focused, distraction-free blocks (e.g., 30 minutes) to avoid screening fatigue. In total, the recommendation is to complete 1 hour of screening each day, across 4 days.
Queries or any technical difficulties can be raised with the study team; however, queries directly related to the study selection task cannot be answered, to avoid potentially influencing the participant’s screening decisions. Upon completion of the study selection task, the participant will be invited to complete a brief questionnaire on Qualtrics to self-report on feasibility outcomes, such as the time taken to complete the study selection task and success of blinding, and acceptability outcomes, including perceived usefulness of the training and of the screening decisions of their screening partner, motivation/interest to engage in the task, and recommendations, before being debriefed on the study. A gentle e-mail reminder will be sent 1 week after enrolment, and then every week for up to 3 weeks if the participant has not yet completed the study selection task. Total participation time across the study conditions is estimated at 5 hours. A schematic summary of the participation timeline is provided in Figure 2.
Data will be collated from the Qualtrics survey and the respective project sites on Covidence systematic review screening software and then reviewed by EA for completeness and accuracy. Analysis will be conducted by EA who will remain blinded to participant allocation. Any queries related to interrater reliability will be directed to the study research assistant (TA) who will confirm the accuracy of the data from the respective Covidence project sites, ensuring that the integrity of blinding is upheld. The final dataset will be accessible via cloud-based sharing to members of the research team only.
As this SWAR is designed as a pilot study, the focus is on descriptive statistics and estimation, not hypothesis testing (Lee et al., 2014; Leon et al., 2011). Inferential statistics will be generated to explore whether differences exist between the trial conditions, but conclusions on the efficacy of training or screening partner pairing will not be drawn.
Cohen’s kappa will be calculated in R using the irr package (Gamer et al., 2019). A mean kappa will be computed for each trial condition, along with the 95% CI and standard error. Percentage agreement in screening decisions will also be calculated as a simple percentage and interpreted alongside the respective kappa statistics. All available screening data, including partial data (i.e., screening decisions available for 1 ≤ k < 219 records) in the instance of study selection task non-completion, will be used in the calculation of interrater reliability. As interrater reliability can be calculated from the completion of just k = 1 study record, a minimum threshold of 10% record completion for the study selection task (k = 22 records) is imposed for data to be included in the condition-level summaries (mean, 95% CI) and the comparison between conditions. The minimum 10% threshold is consistent with the Agency for Healthcare Research and Quality methods guidance on the pilot phase of study selection, which is used to reduce errors and as a form of calibration in the interpretation of eligibility criteria between two independent reviewers before a single reviewer subsequently proceeds to complete study selection (McDonagh et al., 2013). This criterion ensures that reported summary statistics and comparisons will be based on a meaningful amount of data while still allowing (partially) available data from this SWAR pilot study to be appropriately incorporated. To assess the potential impact of missing data, sensitivity analyses will be conducted using only complete cases (i.e., screening decisions available for k = 219 records) to determine the robustness of findings.
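A minimal sketch of the planned reliability calculations with the irr package is provided below, assuming a participant’s decisions and the expert-standard decisions are stored as a two-column data frame; the object names and simulated data are hypothetical.

```r
# Sketch of the planned reliability calculations using the irr package.
# 'screening' is a hypothetical two-column data frame: the participant's
# decisions and the expert-standard decisions ("YES"/"NO") for each record.
library(irr)

screening <- data.frame(
  participant = sample(c("YES", "NO"), 219, replace = TRUE, prob = c(0.2, 0.8)),
  expert      = sample(c("YES", "NO"), 219, replace = TRUE, prob = c(0.2, 0.8))
)

# Apply the 10% completion threshold (k >= 22 screened records) before
# including a participant in condition-level summaries.
if (sum(complete.cases(screening)) >= 22) {
  print(kappa2(screening, weight = "unweighted"))  # unweighted Cohen's kappa
  print(agree(screening))                          # simple percentage agreement
}
```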
To compare differences between the kappa estimates produced by participants (vs. the expert standard) in each trial condition and within the screening pairs, bootstrapping will be conducted using the boot package (Canty & Ripley, 2024; Davison & Hinkley, 1997) to generate bootstrapped CIs and p-values. Differences in kappa estimates will be visually presented using a forest plot. In instances where the 95% bootstrapped CI for the difference in kappa estimates crosses 0, it can be inferred that there is no evidence of a difference in the kappa estimates produced by the respective trial conditions.
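A sketch of the planned bootstrap comparison using the boot package is shown below; the data structure, column names, and number of resamples are assumptions for illustration rather than the team’s analysis code.

```r
# Sketch of bootstrapping the difference in kappa between two trial conditions,
# using the boot package. Data frames and column names are hypothetical:
# each row is one screened record with the participants' and expert's decisions.
library(boot)
library(irr)

# Statistic: difference in Cohen's kappa (condition A minus condition B),
# recomputed on each bootstrap resample of records.
kappa_diff <- function(data, indices) {
  d <- data[indices, ]
  k_a <- kappa2(d[, c("novice_a", "expert")])$value
  k_b <- kappa2(d[, c("novice_b", "expert")])$value
  k_a - k_b
}

# 'records' would hold, per record, the expert-standard decision and the
# decisions from a participant in each of the two conditions being compared.
records <- data.frame(
  expert   = sample(c("YES", "NO"), 219, replace = TRUE, prob = c(0.2, 0.8)),
  novice_a = sample(c("YES", "NO"), 219, replace = TRUE, prob = c(0.2, 0.8)),
  novice_b = sample(c("YES", "NO"), 219, replace = TRUE, prob = c(0.2, 0.8))
)

boot_out <- boot(data = records, statistic = kappa_diff, R = 2000)
boot.ci(boot_out, type = "perc")  # 95% percentile bootstrap CI for the difference
```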
Measures of validity, including false positives, sensitivity, and specificity, will be calculated and presented in tabular form. Rank-based responses on motivation/interest to engage in the study and categorical responses indicating the success of blinding will be reported as frequencies and percentages, with open-text responses narratively outlined. Furthermore, feasibility outcomes will be presented descriptively, with inferential statistics generated where appropriate. For Likert scale response outcomes on acceptability, total scores will be computed and reported using M and SD estimates, with a factorial between-group ANOVA generated, where appropriate, to explore potential differences in acceptability between the groups. Efficiency in the study selection task (time taken) will be assessed descriptively and inferentially to determine whether the time taken for completion is likely to vary across the trial conditions.
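As an illustration of the factorial between-group ANOVA on total acceptability scores, a minimal R sketch with hypothetical variable names and scores is provided below.

```r
# Sketch of the 2 x 2 between-group ANOVA on total acceptability scores.
# 'acceptability' is a hypothetical participant-level data frame.
acceptability <- data.frame(
  training = factor(rep(c("task-specific", "minimal guidance"), each = 6)),
  partner  = factor(rep(c("novice", "experienced"), times = 6)),
  total    = c(38, 41, 35, 44, 40, 39, 33, 36, 31, 42, 37, 34)  # illustrative scores
)

fit <- aov(total ~ training * partner, data = acceptability)
summary(fit)  # main effects of training and partner, plus their interaction
```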
It is intended that the findings will be disseminated in a peer-reviewed publication and knowledge exchange event, with associated materials from the design and reporting of the project made available on the OSF project site. Findings are anticipated to provide insights into further efficiencies that can be achieved in the systematic review process by determining whether novice screeners can achieve a high standard of reliability during study selection following training and pairing with a screening partner. Thus, novice screeners could successfully be integrated into review teams to assist with project completion. Alongside this is the impact of these potential findings on capacity building for evidence synthesis among student or early career researchers. It is intended that findings will be applied to help inform the development of training for student researchers which could potentially lead to a sustainable research culture, where students have the opportunity to work alongside experienced faculty on evidence synthesis projects. To this end, the findings of this SWAR pilot trial will be used to inform the design of subsequent trial work to appropriately test the efficacy of training and screener pairings for study selection.
At the time of protocol submission, recruitment had commenced with data collection on-going.
Open Science Framework: Training and Experience in Study Selection - The TESS Study. https://doi.org/10.17605/OSF.IO/WKHQJ. (Ahern et al., 2025).
This project contains the following extended data:
- Participant Information Sheet and Consent Form, including TESS Study Ethical Consent Form.pdf and TESS Study Participant Information Sheet.pdf
- Project Training Site, including TESS Study_VLE project site interface_sample.pdf
- Protocol Reporting Checklists, including CONSORT Extension Pilot and Feasibility Trials Checklist.pdf and SPIRIT Checklist.pdf
- Study Selection Task, including TESS Study Eligibility Criteria.pdf, TESS Study Eligibility Criteria_Supporting Materials.pdf, TESS Study Instructions for Study Selection Task.pdf, TESS Study Sample Database Search Strategy.pdf, and TESS Study Selection Task References.pdf
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
The Northern Ireland Network for Trials Methodology Research. SWAR Repository Store. SWAR registration for ‘Training and Experience in Study Selection (TESS): A Pilot Randomised Trial within a Systematic Review’. SWAR 38.
EA: Conceptualization; methodology; data curation; project administration; resources; supervision; writing original draft; writing review and editing; funding acquisition; visualization
TA: Methodology; resources; project administration; writing original draft; writing review and editing
AW: Conceptualization; methodology; resources; writing review and editing; funding acquisition
SD: Conceptualization; methodology; resources; writing review and editing; funding acquisition
FL: Conceptualization; methodology; writing review and editing; funding acquisition