Background: The mechanisms contributing to the development of rheumatoid arthritis (RA) and to the resulting tissue damage are complex [1]. Candidate gene association and linkage studies have recently identified a role for genetics in predicting RA severity [2]. Whilst many genetic variants have been identified, few studies have attempted to fit a full predictive model allowing for the many possible gene-gene and gene-environment interactions. This lack of progress is mainly due to the technical limitations of the commonly employed approaches based on linear models. Partial least squares (PLS) is a dimension reduction technique which creates linear combinations of potential predictor (X) variables that can be used to predict one or many response (Y) variables. Unlike traditional linear regression methods, it makes no distributional assumptions, can fit models with more variables than subjects, is not affected by collinearity among the variables and can be used when data are incomplete or missing [3, 4].
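To make the idea of a PLS linear combination concrete, the following is a minimal sketch of a one-component PLS model for a single response (PLS1), written in plain Python. It is an illustration only and not the software used in this work; the function names and the toy data are hypothetical. The weight vector is taken as the direction of covariance between the centred X variables and the centred response, so the fit is still defined when there are more variables than subjects.

```python
# Minimal one-component PLS1 sketch (illustrative; not the authors' implementation).

def fit_pls1_one_component(X, y):
    """Fit a single latent component; X is a list of rows, y a list of responses."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centre predictors and response
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: covariance direction X'y, scaled to unit length
    # (assumes y is not constant, otherwise the norm is zero)
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Latent scores: one linear combination of the X variables per subject
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    # Regress the centred response on the scores
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "q": q}

def predict(model, X):
    """Predict responses for new rows using the fitted component."""
    return [
        model["y_mean"] + model["q"] * sum(
            (v - m) * wj for v, m, wj in zip(row, model["x_mean"], model["w"])
        )
        for row in X
    ]

# Hypothetical toy data: 3 subjects, 5 variables (more variables than subjects)
X_toy = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 2.0, 1.0],
    [2.0, 0.0, 1.0, 0.0, 2.0],
]
y_toy = [10.0, 20.0, 30.0]
toy_model = fit_pls1_one_component(X_toy, y_toy)
toy_pred = predict(toy_model, X_toy)
```

A full PLS analysis would normally extract several components and deflate X between them; the single component here is kept only to show how the linear combination is formed.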
Objectives: To use PLS to create a linear model that predicts erosive severity from single nucleotide polymorphisms (SNPs), environmental factors and non-erosive markers of disease severity; to quantify the predictive accuracy of the model; and to interpret the chosen variables in terms of possible functional relationships.
Methods: Using data from 912 subjects, 50 of the most predictive variables from 392 SNPs, 51 environmental variables and 14 other non-erosive markers of disease severity were chosen using a sparse PLS analysis to predict the Larsen score for each subject. The combined linear model was cross-validated to measure its predictive accuracy.
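The selection and cross-validation steps described above can be sketched as follows. This is a simplified stand-in, not the sparse PLS algorithm used in the analysis: sparsity is imposed here by keeping only the `keep` largest absolute weights (a hard threshold), whereas published sparse PLS methods typically use a penalised or soft-thresholded weight vector. All names, the synthetic data and the fold scheme are assumptions made for illustration. Variable selection is repeated inside each fold so that held-out subjects do not influence which variables are chosen.

```python
import random

def fit_sparse_pls1(X, y, keep):
    """One-component PLS1 keeping only the `keep` largest absolute weights."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Covariance-direction weights, as in ordinary PLS1
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    # Hard threshold (simplifying assumption): zero all but the top `keep` weights
    chosen = set(sorted(range(p), key=lambda j: abs(w[j]), reverse=True)[:keep])
    w = [w[j] if j in chosen else 0.0 for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores and regression of the centred response on the scores
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return x_mean, y_mean, w, q

def predict(model, X):
    x_mean, y_mean, w, q = model
    return [
        y_mean + q * sum((v - m) * wj for v, m, wj in zip(row, x_mean, w))
        for row in X
    ]

def cross_validate(X, y, keep, folds, seed=0):
    """Return out-of-fold predictions; selection is redone within each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    preds = [None] * len(X)
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = fit_sparse_pls1([X[i] for i in train], [y[i] for i in train], keep)
        for i, p_hat in zip(test, predict(model, [X[i] for i in test])):
            preds[i] = p_hat
    return preds

# Synthetic data (hypothetical): 60 subjects, 20 variables, 2 truly predictive
rng = random.Random(1)
X_sim = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(60)]
y_sim = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.5) for row in X_sim]
cv_pred = cross_validate(X_sim, y_sim, keep=2, folds=5)
full_model = fit_sparse_pls1(X_sim, y_sim, keep=2)
```

Comparing `cv_pred` with `y_sim` gives an out-of-sample estimate of accuracy, which is generally less optimistic than comparing fitted values from `full_model` with the data used to fit it.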
Results: Although the estimates are liable to overfitting because the same patient population was used to build and to validate the model, preliminary findings suggest that we can predict Larsen score with a median absolute difference between actual and predicted values of 16 Larsen score points (min=0, max=88). A correlation between predicted and actual Larsen score of r=0.684 indicates that almost 47% of the variation is explained by the model. Continuing work will assess the cross-validation of the prediction estimates, and future work will aim to validate the findings externally on independent data.
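The accuracy summaries reported above can be computed as in the short sketch below. The function names and the five toy values are hypothetical and are not study data; only the value r=0.684 is taken from the text. Squaring the correlation gives the proportion of variation explained, 0.684 × 0.684 ≈ 0.468, which is the "almost 47%" quoted.

```python
from statistics import median

def median_absolute_difference(actual, predicted):
    """Median of |actual - predicted| across subjects."""
    return median(abs(a - p) for a, p in zip(actual, predicted))

def pearson_r(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / (var_a * var_p) ** 0.5

# Hypothetical toy scores (not study data)
actual_toy = [10.0, 40.0, 25.0, 60.0, 5.0]
predicted_toy = [18.0, 30.0, 27.0, 45.0, 20.0]

toy_mad = median_absolute_difference(actual_toy, predicted_toy)  # 10.0
toy_r = pearson_r(actual_toy, predicted_toy)

# Proportion of variation explained for the reported correlation
reported_r = 0.684
variance_explained = reported_r ** 2  # about 0.468
```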
Conclusions: In this early work, sparse PLS modelling appears to be a promising approach which may be able to identify key variables contributing to the amount of erosive damage in patients with RA.
1. Arend WP. The innate immune system in rheumatoid arthritis. Arthritis and Rheumatism. 2001; 44(10):2224-2234.
2. Marinou I, Maxwell JR, Wilson AG. Genetic influences modulating the radiological severity of rheumatoid arthritis. Annals of the Rheumatic Diseases. 2010; 69(3):476-482.
3. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics. 2007; 8(1):32-44.
4. Eriksson L, Johansson E, Kettaneh-Wold N, Trygg J, Wikström C, Wold S. Multi- and Megavariate Data Analysis, Part 1, Basic Principles and Applications. 2nd revised and enlarged ed: Umetrics Academy, 2006.