Factors Affecting Reliability of Grip Strength Measurements in Middle Aged and Older Adults [version 1; peer review: 2 approved with reservations]

Background: Grip strength is a well-established marker of frailty and a good predictor of mortality that has been measured in a diverse range of samples including many population studies. The reliability of grip strength measurement in longitudinal studies is not well understood. Methods: Participants (n=130) completed a baseline and repeat health assessment in the Irish Longitudinal Study on Ageing. Grip strength was assessed using dominant and non-dominant hands (two trials on each). Repeat assessments were conducted 1-4 months later and participants were randomised into groups so that 50% changed time (morning or afternoon assessment) and 50% changed assessor between assessments. Intra-class correlation (ICC) and minimum detectable change (MDC95) were calculated and the effects of repeat assessment, time of day and assessor were determined. Results: Aggregated measures had little variation by repeat assessment or time of day; however, there was a significant effect of assessor (up to 2 kg depending on the measure used). Reliability between assessments was good (ICC>0.9) while MDC95 ranged from 5.59 to 7.96 kg. Non-aggregated measures taken on the non-dominant hand alone were susceptible to repeat assessment, time of day, assessor and repeated measures within-assessment effects, whereas those taken on the dominant hand were only affected by assessor. Conclusions: Mean and maximum grip strength had a higher ICC and lower MDC95 than measures on the dominant or non-dominant hands alone. The MDC95 is less than 8 kg regardless of the specific measure reported. However, changing assessor further increases variability, highlighting the need for comprehensive assessor training and avoiding changes within studies where possible.


Introduction
The use of maximum grip strength as a measure of muscle function and a proxy of overall body strength has become commonplace in clinical and epidemiological research. Grip strength deficits are used in models of frailty 1,2 and are predictive of fractures, disability, cognitive decline and mortality 3-9 . Grip strength is measured routinely in studies of aging 10-17 , where large numbers of participants necessitate quick, easily obtained and meaningful measurements. Clinically, grip strength is an important component in the assessment of sarcopenia recommended by the European Working Group on Sarcopenia in Older People 18 .
Grip strength is most often measured with a hand-held hydraulic dynamometer, which is inexpensive and portable. Several studies have examined grip strength test-retest reliability over time, positioning and assessor 19-24 . A recent review revealed a lack of standardization in epidemiological studies, which inhibits comparability of results 6 . Studies focusing on grip strength repeatability have stringent controls on assessment location and assessor types, and tend to have relatively small sample sizes. While these studies provide useful guidance, the real-world use of grip strength in epidemiological or clinical test batteries can be considerably different. In epidemiological studies, factors such as assessors, test environments and test order may not be strictly comparable to existing reliability studies and these factors may also change over time within the same study (e.g. see wave-to-wave changes in SHARE, ELSA, HRS).
In longitudinal studies, the ability to detect change over a long period of time is related to a measurement's inherent variability, which is a function of the measurement tool, the testing paradigm, as well as the day-to-day human variability caused by transient factors that are generally not of interest to these studies. Characterising this variability in the short and medium term allows us to distinguish between a genuine change in performance and measurement error. It also allows us to identify improvements in protocols to ensure best results.
The aim of this study was to estimate the reliability of grip strength measures, and to test the effects of repeat assessments, assessor and time of day on grip strength reliability using a representative sample of Irish adults aged 50 years and older. We report the intra-class correlation (ICC), limits of agreement and the minimum detectable change (MDC), quantify variability both within-assessment and between-assessments, and investigate the changes associated with using the dominant or non-dominant hand.

Participants
Participants were recruited from the Survey of Health, Ageing and Retirement in Europe (SHARE), a cross-European longitudinal study constructed to enable multinational comparison on factors affecting ageing 25 . Within Ireland, SHARE recruited 1,119 community-dwelling adults aged 50 years and older who participated in a baseline interview in 2006/2007. In 2010, the 827 remaining SHARE-Ireland participants were contacted and invited to participate in a detailed health assessment carried out in conjunction with The Irish Longitudinal Study on Ageing (TILDA), a large ongoing epidemiologic study of health and ageing that is independent of SHARE 11,26 . Further information was provided to 377 participants, of whom 253 attended an initial health assessment. Participants reporting pain, recent surgery or injury were excluded. Ethical approval for this study was obtained from the Trinity College Dublin research ethics committee.

Health assessments
The health assessment followed similar protocols to those used in TILDA 11 and took place within the dedicated TILDA health assessment centre at Trinity College Dublin. It consisted of a 3-hour battery of tests assessing anthropometric, cognitive, cardiovascular, gait, and visual function including a measure of hand-grip strength. All assessments were delivered in the same testing rooms using the same equipment by two trained research nurses, each with experience of delivering over 300 assessments in TILDA.

Grip strength measurement protocol
Grip strength was measured using a Baseline® hydraulic hand dynamometer. Two measurements were taken on both the dominant and non-dominant hands. Participants were asked to squeeze the dynamometer as hard as possible for a few seconds, while standing with their upper arm against their trunk and the elbow at 90 degrees. Participants who were unable to maintain this position could sit down or support the dynamometer with their free hand or a table. Mean and maximum grip strength were obtained across all four measurements and for the dominant and non-dominant hands separately. To estimate within-participant variation, 130 participants attended a repeat health assessment in which the following factors were varied:
• Time between assessments: This varied from approximately 1-4 months (median 88 days; range 28-141 days; interquartile range 70-104 days).
• Time of day (morning/afternoon): Assessments were conducted in either the morning or afternoon; 50% of participants completed the repeat assessment at a different time of day to their initial assessment.
• Assessor: 50% of participants changed nurse at the repeat assessment.
Assessment lag, time of day and assessor were randomised using a minimisation routine designed to achieve balance between all combinations of these covariates, age group and sex of the participants.
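The minimisation idea can be illustrated with a short sketch. The text does not describe the routine actually used, so everything below is an assumption: a deterministic, Pocock-Simon style rule that allocates a single factor (assessor kept or changed) while balancing two covariates. The function name `assign_by_minimisation`, the arm labels and the covariate keys are hypothetical.

```python
from collections import defaultdict

def assign_by_minimisation(participants, arms=("same", "changed"),
                           factors=("age_group", "sex")):
    """Allocate each participant to the arm that minimises marginal imbalance.

    Illustrative sketch only: for every arm, count the earlier participants
    in that arm who share each covariate level with the new participant,
    then choose the arm with the smallest total (ties go to the first arm).
    """
    counts = defaultdict(int)  # (factor, level, arm) -> participants so far
    allocation = []
    for person in participants:
        totals = {
            arm: sum(counts[(f, person[f], arm)] for f in factors)
            for arm in arms
        }
        chosen = min(arms, key=lambda arm: totals[arm])
        for f in factors:
            counts[(f, person[f], chosen)] += 1
        allocation.append(chosen)
    return allocation

# Hypothetical participants, not study data.
people = [
    {"age_group": "50-64", "sex": "F"},
    {"age_group": "50-64", "sex": "F"},
    {"age_group": "65+", "sex": "M"},
    {"age_group": "65+", "sex": "M"},
]
allocation = assign_by_minimisation(people)
```

In practice, minimisation routines often add a random element to the choice and would need to handle assessment lag, time of day and assessor jointly; this sketch omits both for brevity.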

Statistical analysis
The statistical analysis plan was based on previous work on the reliability of cognitive and cardiovascular measures from the same sample 27,28 . The 95% limits of agreement were calculated for mean and maximum grip strength between the first and repeat assessment. The differences were visualised using Bland-Altman plots, which graph the average of the two visits on the x-axis and the difference on the y-axis 29 . In order to assess the factors affecting repeatability and calculate p-values, two mixed-effects multi-level models were fitted using Stata 15 (StataCorp LLC, TX, USA). Fixed effects were used to estimate factor treatments, and random effects to estimate variance contributions within assessments (the residual), between assessments for the same participant and between participants. For the fixed effects reported in Table 1, models also included demographic factors (age, sex and height), to correct for potential imbalances in case distribution between assessors and between days. As this was only necessary for fixed effects, the intercepts and random effects reported do not include these corrections. The ICC between repeat and baseline measurements, which is an indicator of agreement derived from the random effects of the models, was calculated for the groups in each model to indicate reliability across time:
$$\mathrm{ICC} = \frac{SD_{\mathrm{between}}^{2}}{SD_{\mathrm{between}}^{2} + SD_{\mathrm{within}}^{2}}$$
where SD between is the between-individual standard deviation and SD within is the within-individual standard deviation. The MDC 95 , which indicates the value below which 95% of change scores are likely to lie if measurement error alone accounted for them, was calculated for each measure by:
$$\mathrm{MDC}_{95} = 1.96 \times \sqrt{2} \times SD_{\mathrm{within}}$$
For all models, confidence intervals for the ICC and MDC 95 values were calculated using bootstrapping with 1000 repetitions.
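As a rough illustration of the quantities described above, the sketch below computes 95% limits of agreement from paired measurements, and the ICC and MDC 95 from between- and within-individual standard deviations. The formulas are assumptions based on the standard definitions (ICC as the between-individual share of total variance, MDC 95 as 1.96 × √2 × SD within); the function names and the numbers are hypothetical and are not study data.

```python
import math
import statistics

def limits_of_agreement(baseline, repeat):
    """95% limits of agreement: mean difference +/- 1.96 SD of the differences."""
    diffs = [r - b for b, r in zip(baseline, repeat)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff

def icc(sd_between, sd_within):
    """Share of total variance attributable to differences between individuals."""
    return sd_between ** 2 / (sd_between ** 2 + sd_within ** 2)

def mdc95(sd_within):
    """Assumed form: 1.96 * sqrt(2) * within-individual standard deviation."""
    return 1.96 * math.sqrt(2) * sd_within

# Hypothetical grip strengths in kg for four people at two visits.
lower, upper = limits_of_agreement([30, 25, 40, 35], [31, 24, 42, 35])
example_icc = icc(sd_between=9.0, sd_within=3.0)   # 81 / 90
example_mdc = mdc95(sd_within=2.0)
```

In the study itself the standard deviations come from the random effects of the mixed models rather than from raw paired differences, so this sketch only shows how the summary statistics relate to those components.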
Bootstrapping was clustered by participant and, in order to mimic sampling processes, was stratified by 1) lag between assessments (less than/greater than median), 2) assessor (change/no change at repeat assessment), and 3) time of day (change/no change at repeat assessment).
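A clustered, stratified bootstrap of this kind might look like the following sketch. It assumes a percentile confidence interval and resamples whole participants (so both visits of a person stay together) with replacement within each stratum. The function name, the record layout and the example statistic are hypothetical; the authors' Stata implementation is not described in the text.

```python
import random
from collections import defaultdict

def stratified_cluster_bootstrap(participants, statistic, n_reps=1000, seed=0):
    """Percentile 95% CI from resampling participants within strata.

    Each participant is a dict holding all of their measurements plus a
    'stratum' key, e.g. (lag above median, assessor changed, time changed).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in participants:
        strata[person["stratum"]].append(person)

    estimates = []
    for _ in range(n_reps):
        resample = []
        for members in strata.values():
            # Draw as many participants as the stratum holds, with replacement.
            resample.extend(rng.choices(members, k=len(members)))
        estimates.append(statistic(resample))

    estimates.sort()
    return estimates[int(0.025 * n_reps)], estimates[int(0.975 * n_reps) - 1]

def mean_change(sample):
    """Example statistic: mean repeat-minus-baseline difference in kg."""
    return sum(p["repeat"] - p["baseline"] for p in sample) / len(sample)

# Hypothetical records, not study data.
participants = [
    {"stratum": (False, False, False), "baseline": 30.0, "repeat": 31.0},
    {"stratum": (False, False, False), "baseline": 25.0, "repeat": 24.0},
    {"stratum": (True, True, False), "baseline": 40.0, "repeat": 42.5},
    {"stratum": (True, True, False), "baseline": 35.0, "repeat": 35.5},
]
ci_low, ci_high = stratified_cluster_bootstrap(participants, mean_change)
```

Resampling within strata keeps the proportion of participants in each design cell fixed across replicates, which is one way to mimic the sampling process described above.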
Model 1: Estimating the reliability of aggregated measures between assessments
Model 1 was used to model the effects of participant and setting factors on mean and maximum grip strength measures. Mean and maximum were calculated across all four measurements (first and second measurements for dominant and non-dominant hands) for each assessment. Fixed effects included assessment (repeat/baseline), lag between assessments (in days), time of day (morning/afternoon) and assessor.
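The aggregated outcomes used in Model 1 can be illustrated with a minimal sketch that pools the two trials from each hand. The function name `aggregate_grip` and the example values are hypothetical, and the handling of incomplete trials is not covered because the text does not describe it.

```python
def aggregate_grip(dominant_trials, non_dominant_trials):
    """Mean and maximum grip strength (kg) across all trials from both hands."""
    trials = list(dominant_trials) + list(non_dominant_trials)
    return {
        "mean": sum(trials) / len(trials),
        "maximum": max(trials),
    }

# Hypothetical assessment: two trials per hand, in kg.
summary = aggregate_grip(dominant_trials=(30.0, 32.0),
                         non_dominant_trials=(27.0, 29.0))
```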

Results
This analysis is based on 130 participants (median age 66 years, range 50-89 years; 55% female). Grip strength data were available at baseline and repeat assessments for 123 participants, with 21 of these having incomplete data at one or both assessments due to injury (Figure 1). The 95% limits of agreement between baseline and repeat assessments were -6.2 to 7.0 kg for mean grip strength and -5.9 to 6.5 kg for maximum grip strength (Figure 2).

Mixed effects models
The results of the mixed effects models in Table 1 are separated into components associated with the fixed effects and random effects. For the fixed effects, a positive result indicates an increase in grip strength at the repeat assessment, in the morning, when the assessment was delivered by assessor 2, with every 1-month increase in the lag between assessments and in the second measure taken within the same assessment.
Figure 2. The plots allow for assessment of the agreement between initial and repeat assessments for mean and maximum grip strength. Disagreement between assessments was fairly even at both lower and higher strengths (X-axes). An increase in measurement value was slightly more frequent in the repeat measurement (Y-axes).

Discussion
We found that using the mean or maximum grip strength gives a higher ICC and lower MDC 95 compared to individual measurements from either the dominant or the non-dominant hand. This indicates that while reliability is high using both approaches, it will be more difficult to identify a genuine change in performance when using a non-aggregated measurement from one hand only. Although a range of studies and reviews have reported test-retest reliability 6,21,33 and normative values for grip strength 13,17 , the present work focusses on its application in longitudinal studies of aging and provides minimum detectable change values to guide interpretation of longitudinal changes. As expected, results suggest aggregated measures should be used where possible with the benefit of decreasing the MDC 95 from around 7-8 kg (non-aggregated measures) to 5.6 kg.
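To show how an MDC 95 value might guide interpretation of a longitudinal change, the sketch below flags a change as exceeding measurement error only when its magnitude is larger than the MDC 95. The thresholds of 5.6 kg (aggregated) and 8 kg (non-aggregated) are taken from the approximate figures quoted above; the function name and the example readings are hypothetical.

```python
def exceeds_mdc95(baseline_kg, follow_up_kg, mdc95_kg):
    """True if the absolute change is larger than the minimum detectable change."""
    return abs(follow_up_kg - baseline_kg) > mdc95_kg

# Hypothetical participant whose grip strength falls by 6.5 kg.
change_detectable_aggregated = exceeds_mdc95(30.0, 23.5, mdc95_kg=5.6)
change_detectable_single_hand = exceeds_mdc95(30.0, 23.5, mdc95_kg=8.0)
```

The same 6.5 kg decline would be read as a genuine change with an aggregated measure but would fall within measurement error for a non-aggregated measure with an MDC 95 near 8 kg.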
Differences in mean and maximum grip strength were small across repeat assessments and with varying lag times between assessments. There were differences between morning and afternoon measurements of up to 0.85 kg; however, only the non-dominant, non-aggregated measure was significant at an alpha level of 0.05. When examining non-aggregated grip strength measurements on the non-dominant hands only, performance was higher when it was obtained in the repeat assessment, in the morning, and in the first measure within an assessment. The magnitude of these changes is small, and there is little evidence to suggest either dominant or non-dominant hand is more affected by time of day. The interaction between hand used and repeated measures within-assessment suggests that the non-dominant hand may be more susceptible to fatigue when multiple measurements are taken on the same occasion.
The most striking result is the significant assessor effect of 1-2 kg observed across all grip strength measures (mean, maximum, dominant, non-dominant). Equipment and environmental factors confounding this effect were ruled out by careful experimental design, and as there is little overall difference between repeat assessments or time of day, we conclude that this effect is likely related to the nurse administering the test. The intensity of instruction, even using the same wording, can affect grip strength measurements 30,34 , and the positive consequences of effective encouragement are noted in other studies of exercise and physical activity 35,36 . Clinical studies have also found that the patient-clinician relationship affects healthcare outcomes 37 . Throughout the health assessment in this study (minimum 3 hours), it is likely that the nurse built a relationship with the participant. Regardless of the potential explanations, variation due to the assessor is a fixed effect which is additive to the reported MDC 95 and would increase variance and the bounds of detectable change in cases where the assessor is not held constant. Additional training of assessors to raise awareness of the effects of wording and encouragement may improve repeatability of grip strength tests.
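One simple reading of the additive assessor effect is that the threshold for a detectable change widens by the assessor offset whenever the assessor differs between visits. The sketch below illustrates that reading only; it is an assumption, not the authors' procedure. The 2 kg offset is the upper end of the range reported above, and the function name and example values are hypothetical.

```python
def detection_threshold(mdc95_kg, assessor_changed, assessor_offset_kg=2.0):
    """MDC95 widened by an assumed fixed assessor offset when the assessor changes."""
    return mdc95_kg + (assessor_offset_kg if assessor_changed else 0.0)

# Hypothetical 6.5 kg change judged against an aggregated MDC95 of 5.6 kg.
observed_change_kg = 6.5
same_assessor = observed_change_kg > detection_threshold(5.6, assessor_changed=False)
new_assessor = observed_change_kg > detection_threshold(5.6, assessor_changed=True)
```

Recording who administered each assessment, as recommended in the discussion, is what makes this kind of adjustment possible at the analysis stage.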
The differences observed here are calculated between only two nurses, each with considerable experience administering health assessments, and so it is likely that even larger variation would be observed among a larger group of assessors. This highlights the importance of comprehensive assessor training especially in large studies which require multiple assessors and where repeated measurements are common. It is often difficult to retain testers in long-term longitudinal studies; this does introduce a potential biasing factor, which should at least be recorded so that potential rater effects can be accounted for in analysis.
The strengths of this study include the strict study design which allows separation and quantification of the different sources of variation, and the relatively large number of participants. A limitation is that the health assessments were carried out by only two nurses. Most epidemiological studies would use a higher number of assessors: having more nurses would have allowed us to estimate a distribution of assessor offsets and to fully understand the impact of using multiple assessors.

Conclusions
Here we report the effects of time of day, assessor, and practice on measures of grip strength, along with their standard deviations within assessment, between assessments and between participants. We demonstrated that grip strength measurements are reliable when obtained at repeat assessments over 1-4 months and at different times of day but that they are potentially affected by different assessors. Grip strength measurements were more susceptible to repeat assessment, time of day and repeated measures effects when they were obtained using the non-dominant hand. We also derive estimates for MDC 95 which can be used to assess changes in grip strength performance in individuals drawn from comparable populations. These results suggest that longitudinal studies should use aggregated grip strength measures as well as minimising the number of assessors over the course of the study.

Underlying data
The data for this study is linked to the wider TILDA dataset containing sensitive, personal information and access is therefore granted on a case-by-case basis following application (including assurances that the data anonymity will be protected and a description of use) to The Irish Longitudinal Study on Ageing (email: tilda@tcd.ie).

Introduction
Second paragraph, second sentence: "A recent review, revealed a lack of standardization in epidemiological studies which inhibits comparability of results". Since this review was published in 2011, it would be appropriate to remove the term "recent". The rationale of the study is not clear from the two following sentences: "Studies focusing on grip strength repeatability have stringent controls on assessment location and assessor types, and tend to have relatively small sample sizes. While these studies provide useful guidance, the real-world use of grip strength in epidemiological or clinical test batteries can be considerably different".
It is not clear why it can be considerably different in epidemiological studies and why stringent protocols should not be used in this type of study to ensure the validity of the measurements.

Method
The description of the test protocol is incomplete. If possible, the following should be described:
1. The number of dynamometers used. Did each evaluator use a different dynamometer?
2. The handle and the arm positions.
3. The sequence of the measurements between the dominant and the non-dominant sides (Were the measurements taken consecutively on the same side?).
4. The sequence of the overall evaluation (Were the tests administered in the same order? If yes, were the grip strength measurements done at the beginning or the end?).
5. The training of the assessors.
6. The blinding of the assessors. As the authors wanted to assess the "real world" error of measurement, this would be an important factor to report.
7. The number of people who could not maintain the standing position or had support to hold the dynamometer? Why was it not chosen to perform the test sitting down and to support the dynamometer in the first place to ensure that everyone used the same position?
8. The values used in the analyses must be specified. The following sentence should be reworded to help clarity. "Mean and maximum grip strength was obtained across all four measurements and for dominant and non-dominant hand separately." It is quite unusual to use the mean of both left and right hands in grip strength studies. Whilst we understand why this was chosen in this type of study, it wasn't very clear that this was what happened.
It is not clear from the introduction or the method section why the authors chose the factors under consideration (time, assessors, etc.). Could it be justified why these factors were chosen in relation to the literature on grip strength measurements?
Calculation of the minimum detectable change is usually made from the SEM. To make it easier for less experienced readers to understand, could you explain why you chose the SDwithin?

Results
The distribution of the participants' age could be better estimated with the mean, standard deviation and range, or at least with the IQR. Are there any health problems that might affect strength that would be of interest to report in the sample description? This would help to establish the generalizability of the results.

Discussion
In light of their results and the need for standardized grip strength measurements when using more than one evaluator, the authors could add a reference to published standardized protocols.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? Yes