I have a large dataset with ~500,000 columns and 1264 rows.  Each column 
represents the percent methylation at a given location in the genome.  I need 
to run 500,000 linear models for each of 4 predictors of interest in the form 
of:
Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
...and save only the p-value for the predictor.

The original methylation data file had methylation sites as row labels and the 
individuals as columns so I read the data in chunks and transposed it so I now 
have 5 csv files (chunks) with columns representing methylation sites and rows 
as individuals.
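For the next project, the read-and-transpose step itself can stay inside data.table. A minimal sketch (synthetic data standing in for a batch file; in practice you would start from `fread()` on the real path, e.g. "/home/alicia.m.ellis/batch1.csv"):

```r
library(data.table)

# Synthetic stand-in for one batch: methylation sites as rows,
# individuals as columns, as in the original files.
raw <- data.table(site = c("cg01", "cg02", "cg03"),
                  ind1 = c(0.10, 0.50, 0.90),
                  ind2 = c(0.20, 0.60, 0.80))

# transpose() flips to individuals-as-rows without leaving data.table:
# make.names = "site" turns the site column into the new column names, and
# keep.names stores the old column names (the individuals) in a new column.
meth <- transpose(raw, keep.names = "id", make.names = "site")
print(meth)
```

This avoids the intermediate CSV round-trip entirely; the five transposed chunks could then be combined with `cbind()` (same row order) into one wide table.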

I was able to get results for all of the regressions by running each chunk of 
methylation data separately on our supercomputer using the code below.  
However, I'm going to have to do this again for another project and I would 
really like to accomplish two things to make the whole process more 
computationally efficient:


1)      Work with data.tables instead of data.frames (reading and manipulating 
will be much easier and faster)

2)      Do the work in parallel using, say, 12 cores at once, and have the 
program divide the work among the cores rather than my having to split the 
data and run 5 separate jobs on the supercomputer.

I have some basic knowledge of the data.table package, but I wasn't able to 
modify the foreach code below to get it working, and the data.frame version 
didn't seem to be using all 12 cores that I created in the cluster.

Can anyone suggest modifications to the foreach code below that will let me 
run this in parallel with data.tables and not have to do it in chunks?


############# Set up cluster
library(snow)     # makeCluster
library(doSNOW)   # registerDoSNOW; also loads foreach

clus <- makeCluster(12, type = "SOCK")
registerDoSNOW(clus)
getDoParWorkers()  # should report 12
getDoParName()


################### The following code needs to be modified to run the full
################### dataset (batch1-batch5) in parallel.
### Currently I read in the chunks below and run each predictor separately
### for each chunk of data.

############### Methylation data in batches
## Each batch has about 100,000 columns and 1264 rows; I want to alter
## these to e.g. batch1 <- fread(file = "/home/alicia.m.ellis/batch1.csv")
batch1 <- read.csv(file = "/home/alicia.m.ellis/batch1.csv")
batch2 <- read.csv(file = "/home/alicia.m.ellis/batch2.csv")
batch3 <- read.csv(file = "/home/alicia.m.ellis/batch3.csv")
batch4 <- read.csv(file = "/home/alicia.m.ellis/batch4.csv")
batch5 <- read.csv(file = "/home/alicia.m.ellis/batch5.csv")

predictors  ## a data.frame with 4 columns and 1264 rows

covariates  ## a data.frame with 9 columns and 1264 rows
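Since the nine covariates enter every model the same way, it may be simpler to bind the response, predictor, and covariates into one data.frame and let the formula interface expand them with `~ .`. A sketch with a hypothetical `site_pval` helper and synthetic data (names and sizes are illustrative, not from the real dataset):

```r
# Synthetic data: 50 individuals, 1 predictor, 3 covariates
set.seed(1)
n <- 50
covariates <- data.frame(matrix(rnorm(n * 3), n, 3))
pred <- rnorm(n)
y <- 0.5 * pred + rnorm(n)   # methylation at one site, with a real effect

# Hypothetical helper: p-value for one predictor at one site.
# lm(y ~ ., data = d) regresses y on every other column of d, which
# replaces the long covariates[,1] + ... + covariates[,9] chain.
site_pval <- function(y, pred, covariates) {
  d <- data.frame(y = y, pred = pred, covariates)
  fit <- lm(y ~ ., data = d)
  summary(fit)$coefficients["pred", 4]   # p-value row is named "pred"
}

p <- site_pval(y, pred, covariates)
```

Indexing the coefficient table by the name "pred" is also safer than the positional `[2, 4]`, which silently breaks if the term order ever changes.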

fits <- as.data.table(batch1)[, list(MyFits = lapply(seq_len(ncol(batch1)),
  function(x) summary(lm(batch1[, x] ~ predictors[, 1] +
                           covariates[, 1] + covariates[, 2] +
                           covariates[, 3] + covariates[, 4] +
                           covariates[, 5] + covariates[, 6] +
                           covariates[, 7] + covariates[, 8] +
                           covariates[, 9]))$coefficients[2, 4]))]


###################################### This is what I was trying, but I
###################################### wasn't having much luck.
#### I'm having trouble getting the results merged into a single data.frame,
#### and the code below doesn't seem to divide the work among the 12 cores
#### in the cluster.

all.fits <- foreach(j = 1:ncol(predictors), i = 1:ncol(meth),
                    .combine = 'rbind', .inorder = TRUE) %dopar% {

  model <- lm(meth[, i] ~ predictors[, j] +
                covariates[, 1] + covariates[, 2] + covariates[, 3] +
                covariates[, 4] + covariates[, 5] + covariates[, 6] +
                covariates[, 7] + covariates[, 8] + covariates[, 9])
  summary(model)$coefficients[2, 4]
}
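For what it's worth, two likely culprits in the attempt above are the argument name (foreach expects `.combine=`; a misspelled `combine=` is silently ignored) and passing two index sequences, which foreach iterates in lockstep and stops at the shorter one rather than crossing them. A hedged sketch of one fix: parallelize over sites, loop the predictors inside each task, and rbind one row of p-values per site. Synthetic data and doParallel are used here to keep the example self-contained; doSNOW should behave the same, and foreach auto-exports `meth`, `predictors`, and `covariates` to the workers.

```r
library(foreach)
library(doParallel)   # stands in for doSNOW in this sketch

cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# Synthetic stand-ins: 40 individuals, 5 sites, 2 predictors, 3 covariates
set.seed(2)
n <- 40
meth       <- as.data.frame(matrix(runif(n * 5), n, 5))
predictors <- as.data.frame(matrix(rnorm(n * 2), n, 2))
covariates <- as.data.frame(matrix(rnorm(n * 3), n, 3))

# One parallel task per methylation site; each returns the vector of
# per-predictor p-values for that site (note the leading dot in .combine).
all.fits <- foreach(i = seq_len(ncol(meth)), .combine = rbind) %dopar% {
  sapply(seq_len(ncol(predictors)), function(j) {
    fit <- lm(meth[[i]] ~ ., data = cbind(p = predictors[[j]], covariates))
    summary(fit)$coefficients["p", 4]   # p-value for the predictor
  })
}
parallel::stopCluster(cl)
```

`all.fits` ends up as a sites-by-predictors matrix of p-values. Parallelizing over the ~500,000 sites (the long dimension) rather than the 4 predictors is what keeps all 12 cores busy.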


Alicia Ellis, Ph.D
Biostatistician
Pathology & Laboratory Medicine
Colchester Research Facility
360 South Park Drive, Room 209C
Colchester, VT  05446
802-656-9840



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.