[R] how to draw random numbers from many categorical distributions quickly?
Dear R helpers,

I have a question about drawing random numbers from many categorical distributions. Consider n individuals, each of whom follows a categorical distribution defined over k categories. Take a simple case with n=4 and k=3:

catDisMat <- rbind(c(0.1,0.2,0.7), c(0.2,0.2,0.6), c(0.1,0.2,0.7), c(0.1,0.2,0.7))
outVec <- rep(NA, nrow(catDisMat))
for (i in 1:nrow(catDisMat)) {
  outVec[i] <- sample(1:3, 1, prob=catDisMat[i,], replace=TRUE)
}

I can think of one way to speed this up (in reality my n is very large, so speed matters). The approach above samples only one value at a time. I could have sampled three values at once for c(0.1,0.2,0.7), because that row appears three times. So, with some manipulation, I think I can implement the idea sample(1:3, 3, prob=c(0.1,0.2,0.7), replace=TRUE) to improve speed a bit. But I wonder whether there is a better approach for speed?

Thanks in advance.
-Sean

[[alternative HTML version deleted]]
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
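For what it is worth, one vectorized possibility is inverse-CDF sampling: build the cumulative probabilities for each row, draw one uniform number per individual, and count how many cumulative thresholds the uniform exceeds. The sketch below is only an illustration, not benchmarked; the function name sample_rows is made up here, and it assumes every row of the matrix sums to 1.

```r
# Hypothetical helper (not from the original post): one categorical draw per row.
sample_rows <- function(probMat) {
  # Row-wise cumulative probabilities; t() restores the n x k layout.
  cumMat <- t(apply(probMat, 1, cumsum))
  # One uniform draw per individual (row).
  u <- runif(nrow(probMat))
  # u is recycled down the columns, so u[i] is compared with every entry of row i.
  draws <- rowSums(u > cumMat) + 1
  # Guard against floating-point rows whose cumulative sum ends just below 1.
  pmin(draws, ncol(probMat))
}

catDisMat <- rbind(c(0.1, 0.2, 0.7), c(0.2, 0.2, 0.6),
                   c(0.1, 0.2, 0.7), c(0.1, 0.2, 0.7))
set.seed(1)
outVec <- sample_rows(catDisMat)
```

This avoids calling sample() once per individual; whether it is actually faster for a given n and k would need to be timed.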
[R] how to efficiently extract elements of a list?
Dear R helpers,

I wonder whether there is a quick way to extract some elements of a list. For a vector we can do the following:

vec <- seq(3)
names(vec) <- LETTERS[1:3]
vec[c(1,3)]
vec[c('A','C')]

But for a list,

test.l <- list(c(1,3), array(NA,c(1,2)), array(0,c(2,3)))
names(test.l) <- LETTERS[1:3]

the following does not work. Is there some command (I was thinking of do.call) that can do the job?

test.l[[c('A','B')]]
test.l[[c(1,3)]]
do.call('[', c(test.l, c(1,3)))
do.call('[[', c(test.l, c(1,3)))
do.call('[', c(test.l, c('A','C')))
do.call('[[', c(test.l, c('A','C')))

Thanks in advance.
-Sean
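A short sketch of the usual answer: single brackets subset a list and return a (shorter) list, exactly as they do for a vector, while double brackets extract one element. The object names sub_by_pos and sub_by_name are just for illustration.

```r
test.l <- list(c(1, 3), array(NA, c(1, 2)), array(0, c(2, 3)))
names(test.l) <- LETTERS[1:3]

# Single brackets keep the list structure and accept several indices or names.
sub_by_pos  <- test.l[c(1, 3)]
sub_by_name <- test.l[c("A", "C")]

# Double brackets pull out exactly one element.
first_elem <- test.l[["A"]]
```

With a vector index, [[ ]] is interpreted as recursive indexing into nested elements, which is why test.l[[c(1,3)]] does not return two elements.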
[R] using character vector as input argument to setkey (data.table package)
Dear R helpers,

I wonder how to use a character vector as an input argument to setkey (data.table package). The following works:

library(data.table)
test.dt <- data.table(expand.grid(a=1:30, b=LETTERS), c=seq(30*26))
setkey(test.dt, a, b)

I would like a similar function that accepts c('a','b') as an input argument, as below:

setkey.wanted(test.dt, c('a','b'))

Your help will be highly appreciated.
-Sean
[R] how to cut a multidimensional array along a chosen dimension and store each piece into a list
Dear R-Helpers,

I wonder whether there is a function that cuts a multidimensional array along a chosen dimension and then stores each piece (still an array, with one dimension fewer) in a list. For example:

arr <- array(seq(1*2*3*4), dim=c(1,2,3,4))
# I made a point of setting the length of the first dimension to 1, to test whether I need to worry about the drop=F option.

brkArrIntoListAlong <- function(arr, alongWhichDim){
  return(outlist)
}

I have tried splitter_a in the plyr package but did not get what I want.

library(plyr)
plyr:::splitter_a(arr, 3)

I understand that I can write a for loop to make it happen, but I am searching for a better solution.

Thanks in advance.
-Sean
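One base-R way to fill in the stub, offered as a sketch rather than a definitive answer: index the array with TRUE in every position except the chosen dimension, keep drop=FALSE so that length-1 dimensions survive, then remove only the chosen dimension by resetting dim. Note that resetting dim discards any dimnames.

```r
brkArrIntoListAlong <- function(arr, alongWhichDim) {
  d <- dim(arr)
  lapply(seq_len(d[alongWhichDim]), function(i) {
    # TRUE selects everything along a dimension; only the chosen one gets index i.
    idx <- rep(list(TRUE), length(d))
    idx[[alongWhichDim]] <- i
    piece <- do.call(`[`, c(list(arr), idx, list(drop = FALSE)))
    # Drop only the chosen dimension, keeping other length-1 dimensions.
    dim(piece) <- d[-alongWhichDim]
    piece
  })
}

arr <- array(seq(1 * 2 * 3 * 4), dim = c(1, 2, 3, 4))
outlist <- brkArrIntoListAlong(arr, 3)
```

Here outlist has three elements, each with dimensions 1 x 2 x 4, so the leading dimension of length 1 is preserved.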
[R] question about deparse(substitute(...))
Dear R helpers:

I would like to apply deparse(substitute()) to multiple arguments, to collect the names of the arguments into a character vector. I used the function test.fun below. It works when there is only one input argument, but it does not work for multiple arguments. Can someone kindly help?

test.fun <- function(...){deparse(substitute(...))}
test.fun(x)      # this works
test.fun(x,y,z)  # I would like c('x','y','z') to be the output, but cannot get it.

Thanks in advance.
-Sean
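A common idiom, sketched here under the assumption that each argument is a short expression: wrap the dots in list(...) inside substitute(), drop the leading list symbol, and deparse each remaining piece.

```r
test.fun <- function(...) {
  # substitute(list(...)) yields the unevaluated call list(x, y, z);
  # as.list() splits it into pieces and [-1L] drops the function name `list`.
  args <- as.list(substitute(list(...)))[-1L]
  # paste(collapse) guards against deparse() returning several strings
  # for a long expression.
  vapply(args, function(a) paste(deparse(a), collapse = " "), character(1))
}

arg_names <- test.fun(x, y, z)
```

Because the arguments are never evaluated, x, y and z do not even need to exist.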
[R] how to replace a single forward slash with a double backward slash in a string?
Dear R-helpers,

Can someone kindly tell me how to replace a single forward slash with a double backward slash in a string? I.e., from a/b to a\\b.

Many thanks in advance.
-Sean
[R] how to replace a single backward slash with a double backward slash?
Dear R-helpers:

Hours ago, I asked how to replace a single forward slash with a double backward slash and received great help. Thanks again to everyone who replied.

In the meantime, I wonder how to replace a single backward slash with a double backward slash? E.g., I want to change c:\test into c:\\test. I tried the following, but it does not work.

gsub(\\\,,c:\test)

Can someone help? Thanks a lot in advance.
-Sean
Re: [R] how to replace a single backward slash with a double backward slash?
David and William,

Thanks for your replies, which introduced me to the concept of escape characters. As David guessed, I was trying to write a function that accepts a path copied from Windows Explorer, and as you know Windows Explorer uses \, e.g., c:\temp\function.r. I originally wanted the function to change the example path into c:/temp/function.r.

David's final comment seems to suggest this is impossible... If so, it is a limitation, because I have to change \ into / manually each time. But it is good to know this limitation. Correct me if I misunderstand and there is no such limitation.

Thanks again.
-Sean

On Sun, Dec 13, 2009 at 5:26 PM, David Winsemius dwinsem...@comcast.net wrote:

On Dec 13, 2009, at 5:11 PM, Sean Zhang wrote:

Dear R-helpers: Hours ago, I asked how to replace a single forward slash with a double backward slash and received great help. Thanks again to everyone who replied. In the meantime, I wonder how to replace a single backward slash with a double backward slash? E.g., I want to change c:\test into c:\\test. I tried the following, but it does not work.

gsub(\\\,,)

Can someone help?

Your problem may be that you think there actually is a \ in "c:\test". There isn't:

> grep("\\\\", "c:\test")   # which would have found a true \
integer(0)

It's an escaped t, which is the tab character = \t:

> grep("\\\t", "c:\test")
[1] 1
> cat("rr\tqq")
rr      qq

If your goal is to make file paths in Windows correctly, then you have two choices:
a) use doubled \\'s in the literal strings you type, or ...
b) use /'s

So maybe you should explain what you are doing? We don't request that background out of nosiness, but rather so we can give better answers.

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
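To make the escaping discussion concrete, here is a small sketch. It assumes the path already exists as a string containing real backslashes (typed with doubled \\ in source code, or read from a file or from keyboard input rather than typed as a literal); with fixed = TRUE neither the pattern nor the replacement is treated as a regular expression.

```r
# Typed with doubled backslashes; the stored string holds single ones.
p <- "c:\\temp\\function.r"

# Backslash -> forward slash, treating both strings literally.
fwd <- gsub("\\", "/", p, fixed = TRUE)

# Forward slash -> one real backslash (print() shows it doubled, cat() shows one).
back <- gsub("/", "\\", "a/b", fixed = TRUE)

cat(fwd, "\n")
cat(back, "\n")
```

The limitation David describes applies to string literals in source code, where the parser consumes the backslash before any function sees it; text that reaches R as data is not re-interpreted that way.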
[R] how to convert character string with only month and year into date
Dear R helpers,

I am new to plotting time data using R and wonder how to convert character time information into a date in R. I searched the web but did not find an answer. The input character string is something like 03_1993 or 03-1993, so the precision is at the month level. I tried the following but failed.

#R code below.
strptime(c("03_1993"), "%m_%Y")
strptime(c("03-1993"), "%m-%Y")

Can someone kindly show me how to do it?

Many thanks in advance!
-Sean
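A base-R workaround, sketched on the assumption that pinning each value to the first day of its month is acceptable: a Date needs a day, so prepend one before parsing.

```r
x <- c("03_1993", "11-2001")

# Normalise the separator, then add an arbitrary day (the 1st) so the
# string specifies a complete calendar date.
x_norm <- gsub("_", "-", x, fixed = TRUE)
d <- as.Date(paste0("01-", x_norm), format = "%d-%m-%Y")

format(d, "%Y-%m-%d")
```

The result is an ordinary Date vector, so it can be used directly on a plot axis.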
Re: [R] how to convert character string with only month and year into date
David, Gabor, and Henrique,

Thanks a lot for your help! Another (inelegant) way is to use ts() and then supply the start and end times. This inelegant way works (I guess at least for equally spaced data).

-Sean

On Tue, Sep 22, 2009 at 3:19 PM, David Winsemius dwinsem...@comcast.net wrote:

On Sep 22, 2009, at 3:03 PM, Sean Zhang wrote:

Dear R helpers. I am new to plotting time data using R and wonder how to convert character time information into a date in R. I searched the web but did not find an answer. The input character string is something like 03_1993 or 03-1993, so the precision is at the month level. I tried the following but failed.

#R code below.
strptime(c("03_1993"), "%m_%Y")
strptime(c("03-1993"), "%m-%Y")

Can someone kindly show me how to do it?

The usual R classes do not have a year-month version, but package zoo does:

> library(zoo)
> as.yearmon("03_1993", "%m_%Y")
[1] "Mar 1993"

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
[R] why need a new database to store results generated from another database in filehash?
Dear R-Helpers:

I am trying filehash and would like to know whether I have to create a new database to store results generated from another database. Example code is presented below to show the question.

#R code below
library(filehash)
dbCreate("myDB1")
db1 <- dbInit("myDB1")
dbDelete(db1, "a")
dbInsert(db1, "a", data.frame(id=I(LETTERS[1:3])))
dbInsert(db1, "b", data.frame(id=I(LETTERS[2:3])))
#the following line does NOT work; a_and_b will not be created
dbInsert(db1, "a_and_b", merge(db1$a, db1$b, by='id'))
dbList(db1)
#however, a new database (db2) can store a_and_b
dbCreate("myDB2")
db2 <- dbInit("myDB2")
dbInsert(db2, "a_and_b", merge(db1$a, db1$b, by='id'))
dbList(db2)
db2$a_and_b
#R code above

Is it possible to make dbInsert(db1, "a_and_b", merge(db1$a, db1$b, by='id')) work? I am interested in avoiding the creation of db2.

Many thanks in advance!
-Sean
[R] how to pass more than one argument to the function called by lapply?
Dear R helpers:

I wonder how to pass more than one argument to the function called by lapply. For example,

#R code below ---
indf <- data.frame(id=I(c('a','b')), y=c(1,10))
#I want to add an additional argument, cutoff, to the function called by lapply.
outside.fun <- function(indf, cutoff) {
  unlist(lapply(split(indf, indf[,'id']), function(.x, cutoff) {.x[,'y'] cutoff} ))
}
#but the next line does not work
outside.fun(indf, 3)
#as you would expect, hard-coding the cutoff works as below, but I do not like hard coding.
outside.fun.hardcode.cutoff <- function(indf, cutoff) {
  unlist(lapply(split(indf, indf[,'id']), function(.x, cutoff) {.x[,'y'] 3} ))
}
outside.fun.hardcode.cutoff(indf,)
#R code above

So, can someone kindly show me how to pass more than one argument to the function called by lapply?

Many thanks in advance.
-Sean
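A sketch of the usual fix: lapply() forwards any extra named arguments to the function it calls, so cutoff can be passed after the function. The comparison operator did not survive in the archived post, so the > used below is an assumption for illustration only, and the I() wrapper is dropped to keep the example minimal.

```r
indf <- data.frame(id = c("a", "b"), y = c(1, 10))

outside.fun <- function(indf, cutoff) {
  unlist(lapply(split(indf, indf[, "id"]),
                # `>` is assumed; the original operator was lost in the archive.
                function(.x, cutoff) .x[, "y"] > cutoff,
                cutoff = cutoff))   # extra argument forwarded by lapply()
}

res <- outside.fun(indf, 3)
```

Alternatively, dropping cutoff from the inner function's argument list lets it find the outer cutoff through lexical scoping; the original fails because the inner function declares its own cutoff that is never supplied.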
[R] how to use do.call together with cbind and get inside a function
Dear R-helpers:

I have a question related to using do.call to call cbind and get.

#the following works
vec1 <- c(1,2)
vec2 <- c(3,4)
ColNameVec <- c('vec1','vec2')
mat <- do.call(cbind, lapply(ColNameVec, get))
mat

#put the code above into a function and it no longer works
#before doing so, first remove vec1 and vec2 from the global environment
rm(vec1, vec2)
test <- function() {
  vec1 <- c(1,2)
  vec2 <- c(3,4)
  ColNameVec <- c('vec1','vec2')
  mat <- do.call(cbind, lapply(ColNameVec, get))
  return(mat)
}
test()

In my task, I have to run do.call(cbind, lapply(ColNameVec, get)) inside a function. Can someone kindly help?

Many thanks in advance!
-Sean
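A sketch of one way around this: when get is called from inside lapply it starts its search in lapply's frame, not in the frame of test(), so the local vectors are not found. Telling the lookup which environment to use fixes it; mget() does this for several names at once.

```r
test <- function() {
  vec1 <- c(1, 2)
  vec2 <- c(3, 4)
  ColNameVec <- c("vec1", "vec2")
  # environment() is evaluated here, inside test(), so the lookup
  # happens where vec1 and vec2 actually live.
  do.call(cbind, mget(ColNameVec, envir = environment()))
}

mat <- test()
```

The equivalent with the original structure is lapply(ColNameVec, get, envir = environment()); mget() has the side benefit of naming the columns.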
[R] Problem with storing a sequence of lmer() model fit into a list
Dear R-helpers:

May I ask a question related to storing a number of lmer model fits in a list? Basically, I have a for-loop (see towards the bottom of this email). Within the loop, I am very sure that the i-th model fit (i.e., fit_i) is successfully generated and that the character string (i.e., tmp_i) is created correctly. The problem stems from the following line in the for-loop:

#trouble-making line below
fit.list[[tmp_i]] <- fit_i

I tried the following example, which stores glm() model fits without a problem.

#the following code can store glm() model fits in a list ---
x1 <- runif(200)
x2 <- rnorm(200)
y <- x1+x2
testdf <- data.frame(y=y, x1=x1, x2=x2)
indepvec <- c('x1','x2')
fit.list <- NULL
fit_1 <- glm(y~x1, data=testdf)
fit_2 <- glm(y~x2, data=testdf)
fit.list[[paste('fit_', indepvec[1], sep='')]] <- fit_1
fit.list[[paste('fit_', indepvec[2], sep='')]] <- fit_2

So why can I not store an lmer() model fit in a list? Would someone kindly explain to me what the R error message (last line of this email) really means?

Your kind help will be highly appreciated!
-Sean

#the following for-loop intends to store lmer() random Poisson model output in a list (fit.list); it does not work ---
fit.list <- NULL
for (i in seq_along(depvar_vec)) {
  #I found that s_sex, ses1 and race are not useful
  fit_i <- lmer(as.formula(gen.ranpoisson.fml.jh(depvar_vec[i], offsetvar, factorindepvars, nonfactorindepvars, ranintvar)), family=quasipoisson(link=log), verbose=F, data=indf)
  tmp_i <- paste('ranpoi_', depvar_vec[i], sep='')
  fit.list[[tmp_i]] <- fit_i
  #assign also does not work
  #assign(fit.list$parse(text = tmp_i), fit_i)
}
---
#R gives the following error message.
Error in fit.list[[tmp_i]] <- fit_i : invalid type/length (S4/0) in vector allocation
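One thing worth trying, sketched with a toy S4 class standing in for the lmer fit (the class DemoFit and its slot are invented for this illustration): start from an empty list with list() rather than from NULL, so that [[<- is operating on a list from the first assignment. glm fits are plain lists, whereas lmer fits are S4 objects, which is the difference the error message points at.

```r
library(methods)

# Toy S4 class used only as a stand-in for an S4 model object.
setClass("DemoFit", representation(label = "character"))

fit.list <- list()                     # an empty list, not NULL
for (nm in c("y1", "y2")) {
  tmp_i <- paste("ranpoi_", nm, sep = "")
  fit.list[[tmp_i]] <- new("DemoFit", label = nm)
}

names(fit.list)
```

If the number of fits is known in advance, vector("list", length(depvar_vec)) would preallocate the list in the same spirit.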
[R] how to apply the dummy coding rule in a dataframe with complete factor levels to another dataframe with incomplete factor levels?
Dear R helpers:

Sorry to bother you with a basic question about model.matrix. Basically, I want to apply the dummy coding rule from a data frame with complete factor levels to another data frame with incomplete factor levels. I used model.matrix, but could not get what I want. The following is an example.

#Suppose I have two data frames, A and B
dfA = data.frame(f1=factor(c('a','b','c')), f2=factor(c('aa','bb','cc')))
dfB = data.frame(f1=factor(c('a','b','b')), f2=factor(c('aa','bb','bb')))
#dfB's factor variables have fewer levels
#use model.matrix on dfA
(matA <- model.matrix(~f1+f2, data=dfA))
#use model.matrix on dfB
(matB <- model.matrix(~f1+f2, data=dfB))
#I would actually like to dummy code dfB using the dummy coding rule defined in model.matrix(~f1+f2, data=dfA)
#matB_wanted is below
(matB_wanted <- rbind(c(1,0,0,0,0), c(1,1,0,1,0), c(1,1,0,1,0)))
colnames(matB_wanted) <- colnames(matA)
matB_wanted

Can someone kindly show me how to get matB_wanted?

Many thanks in advance!
-Sean
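A sketch of one approach: model.matrix builds its dummy columns from the levels stored in each factor, so giving dfB's factors the full level sets from dfA produces the wanted coding. The copy dfB_full is introduced here only for illustration.

```r
dfA <- data.frame(f1 = factor(c("a", "b", "c")), f2 = factor(c("aa", "bb", "cc")))
dfB <- data.frame(f1 = factor(c("a", "b", "b")), f2 = factor(c("aa", "bb", "bb")))

# Re-declare each factor in dfB with the complete level set from dfA.
dfB_full <- dfB
for (v in names(dfB_full)) {
  dfB_full[[v]] <- factor(dfB_full[[v]], levels = levels(dfA[[v]]))
}

matB <- model.matrix(~ f1 + f2, data = dfB_full)
```

The columns for the unobserved levels (f1c and f2cc) are present and all zero, matching the layout of the matrix built from dfA.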
[R] question related to fitting overdispersion count data using lmer quasipoisson
Dear R-helpers:

I have a question related to fitting overdispersed count data using lmer. Basically, I simulate an overdispersed data set by adding an observation-level normal random shock inside the exp() of the Poisson mean. Then I fit an lmer quasipoisson model. The estimates are far off (see the model output of fit.lmer.over.quasi below). Can someone kindly explain to me what went wrong?

Many thanks in advance.
-Sean

#data simulation (modified from code at http://markmail.org/message/j3zmgrklihe73p4p)
set.seed(100)
m <- 5
n <- 100
N <- n*m
#X <- cbind(1,runif(N))
X <- cbind(1,rnorm(N))
X <- cbind(runif(N),rnorm(N))
id <- rep(1:n,each=m)
#
Z <- kronecker(diag(n),rep(1,m))
#Poisson with group-level heterogeneity
z <- rpois(N, exp(X%*%matrix(c(1,2)) + Z%*%matrix(rnorm(n))))
#2*rnorm(n*m) is added to each observation to create overdispersion
z.overdis <- rpois(N, exp(X%*%matrix(c(1,2)) + Z%*%matrix(rnorm(n)) + 2*rnorm(n*m)))

#without the observation-level random shock, i.e., 2*rnorm(n*m), the estimates are very accurate
(fit.lmer <- lmer(z ~ X + (1|id), family=poisson, verbose=F))
#Generalized linear mixed model fit by the Laplace approximation
#Formula: z ~ X + (1 | id)
# AIC BIC logLik deviance
# 851 868   -422      843
#Random effects:
# Groups Name        Variance Std.Dev.
# id     (Intercept) 0.977    0.988
#Number of obs: 500, groups: id, 100
#
#Fixed effects:
#            Estimate Std. Error z value Pr(>|z|)
#(Intercept)  -0.0128     0.1116    -0.1      0.9
#X1            1.0615     0.0601    17.7   <2e-16 ***
#X2            2.0236     0.0214    94.7   <2e-16 ***
#---
#Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
#Correlation of Fixed Effects:
#   (Intr) X1
#X1 -0.349
#X2 -0.270  0.258

#Now you can see the results are far off
(fit.lmer.over.quasi <- lmer(z.overdis ~ X + (1|id), family=quasipoisson(link=log), verbose=F))
#Generalized linear mixed model fit by the Laplace approximation
#Formula: z.overdis ~ X + (1 | id)
#   AIC   BIC logLik deviance
# 41867 41888 -20929    41857
#Random effects:
# Groups   Name        Variance Std.Dev.
# id       (Intercept) 175.8    13.26
# Residual              72.9     8.54
#Number of obs: 500, groups: id, 100
#
#Fixed effects:
#            Estimate Std. Error t value
#(Intercept)   1.3530     1.3492    1.00
#X1            1.0834     0.2273    4.77
#X2            1.3501     0.0783   17.25
#
#Correlation of Fixed Effects:
#   (Intr) X1
#X1 -0.099
#X2 -0.055  0.070
[R] Sean / Re: question related to fitting overdispersion count data using lmer quasipoisson
Hey Buddy,

Hope you have been doing well since last contact. If you have the answer to the following question, please let me know. If you have chance to travel up north. let me know.

best,
-Sean

-- Forwarded message --
From: Sean Zhang seane...@gmail.com
Date: Sat, Apr 11, 2009 at 12:12 PM
Subject: question related to fitting overdispersion count data using lmer quasipoisson
To: r-help@r-project.org
Cc: seane...@gmail.com
[R] How to handle tabular form data in lmer without expanding the data into binary outcome form?
Dear R-gurus:

I have a question about lmer. Basically, I have a dataset in which each observation records the number of trials (N) and the number of events (Y) for a given covariate combination (X) and group id (grp_id). So, my dataset is in tabular form. (In case my explanation of tabular form is unclear, please see the link: http://www.stat.psu.edu/online/development/stat504/06_logreg/11_logreg_fitmodel.htm )

My question: what is the lmer syntax for tabular data (model Y/N = X is what SAS does, as seen in the link above)? Specifically, where can I add N (number of trials) to the following line of lmer code?

m1 <- lmer(Y ~ X + (1|grp_id), family=binomial(link=logit))

As you may expect, I am trying to avoid expanding the tabular data into binary (0,1) outcome form, because doing so produces a very large data matrix in my study. A similar question is at https://stat.ethz.ch/pipermail/r-help/2008-May/161072.html. It seems to me that the approach in that link is data expansion (they have only 1600 observations after expansion).

If someone knows a neat solution other than data expansion, please help.

Many thanks in advance!
-Sean
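For reference, R's binomial family accepts a two-column response of (successes, failures), so the trial counts can enter through the response rather than through expanded rows. The sketch below demonstrates this with plain glm() on made-up numbers, without the random effect, and checks that it matches the expanded 0/1 fit; using the same cbind(Y, N - Y) response with the mixed-model function is the natural thing to try, but that part is not run here.

```r
# Made-up tabular data: N trials and Y events per covariate value.
tab <- data.frame(X = c(0, 1, 2, 3), N = c(10, 12, 8, 15), Y = c(2, 5, 5, 12))

# Tabular form: response is cbind(successes, failures).
fit_tab <- glm(cbind(Y, N - Y) ~ X, family = binomial(link = "logit"), data = tab)

# Expanded 0/1 form, built only to show that the two fits agree.
expanded <- do.call(rbind, lapply(seq_len(nrow(tab)), function(i) {
  data.frame(X = tab$X[i],
             y = rep(c(1, 0), c(tab$Y[i], tab$N[i] - tab$Y[i])))
}))
fit_bin <- glm(y ~ X, family = binomial(link = "logit"), data = expanded)

rbind(tabular = coef(fit_tab), expanded = coef(fit_bin))
```

An equivalent specification is the proportion Y/N as the response with weights = N.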
Re: [R] What is the best package for large data cleaning (not statistical analysis)?
Dear Jim:

Thanks for your reply. It looks to me as though you were using batching. I have used batching to digest large data in Matlab before. I still wonder about the answers to my two specific questions without resorting to batching.

Thanks.
-Sean

On Sat, Mar 14, 2009 at 10:13 PM, jim holtman jholt...@gmail.com wrote:

Exactly what type of cleaning do you want to do on them? Can you read in the data a block at a time (e.g., 1M records), clean them up and then write them back out? You would have the choice of putting them back as a text file or possibly storing them using 'filehash'. I have used that technique to segment a year's worth of data that was probably 3GB of text into monthly objects that were about 70MB dataframes that I stored using filehash. These I then read back in to do processing where I could summarize by month.

So it all depends on what you want to do. You could read in the chunks, clean them and then reshape them into dataframes that you could process later. You will still probably have the problem that all the data still won't fit in memory. Now one thing I did was that since the dataframes were stored as binary objects in filehash, it was pretty fast to retrieve them, pick out the data I needed from each month and create a subset of just the data I needed that would now fit in memory. So it all depends ...

On Sat, Mar 14, 2009 at 8:46 PM, Sean Zhang seane...@gmail.com wrote:

Dear R helpers: I am a newbie to R and have a question related to cleaning large data frames in R. So far, I have been using SAS for data cleaning because my data sets are relatively large (handling multiple files, each of which could be as large as 5-10 GB). I am not a fan of SAS at all and am eager to move my data cleaning tasks into R completely. It seems to me that there are 3 options: SQL, ff, or filehash. I do not want to learn SQL, so my question is more about ff and filehash. Specifically,
(1) for merging two large data frames, which one is better, ff or filehash?
(2) for reshaping a large data frame (say from long to wide or the opposite), which one is better, ff or filehash?
If you can provide examples, that will be even better. Many thanks in advance.
-Sean

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
[R] What is the best package for large data cleaning (not statistical analysis)?
Dear R helpers:

I am a newbie to R and have a question related to cleaning large data frames in R. So far, I have been using SAS for data cleaning because my data sets are relatively large (handling multiple files, each of which could be as large as 5-10 GB). I am not a fan of SAS at all and am eager to move my data cleaning tasks into R completely.

It seems to me that there are 3 options: SQL, ff, or filehash. I do not want to learn SQL, so my question is more about ff and filehash. Specifically,

(1) for merging two large data frames, which one is better, ff or filehash?
(2) for reshaping a large data frame (say from long to wide or the opposite), which one is better, ff or filehash?

If you can provide examples, that will be even better.

Many thanks in advance.
-Sean
Re: [R] Is there any difference between <- and =
Dear Jens and Wacek:

I appreciate your answers very much. I came up with an example based on your comments. I feel the example helped me to understand... (I could be missing your points though :( ) If so, please let me know.

Simon pointed out the following link: http://www.stat.auckland.ac.nz/mail/archive/r-downunder/2008-October/000300.html
I am still trying to understand it...

My question is how my conclusion (see the end of the example below), drawn from a lexical scope perspective, relates to an understanding from an environment perspective (if a valid understanding from an environment perspective exists).

Thank you all again very much!
-Sean Zhang

#My little example is listed below
f1 <- function(a=1,b=2) {print(a); print(b); print(a-b) }
f1()     #get 3, makes sense
f1(2,)   #get 0, makes sense

a <- 10
b <- 20
f1(a=a+1, b=a)
a   #get 10, a is not changed outside function scope
b   #get 20, b is not changed outside function scope

a <- 10
b <- 20
f1(a <- a+1, b <- a)
a   #a is now 11, a is changed outside function
b   #b is now 11, b is changed outside function

a <- 10
b <- 20
f1({a=a+1}, {b = a})
a   #a is changed into 11
b   #b is changed into a (i.e., 11)

a <- 10
b <- 20
f1((a=a+1), (b = a))
a   #a is changed into 11
b   #b is changed into a (i.e., 11)

#my conclusion based on testing the example above is below
#say the argument is a; when used inside the parentheses of whatever.fun <- function()
#a<-something, (a=something), and {a<-something}
#are the same. They all change the values outside the function's scope.
#Typically, this breaks the desired lexical scope convention, so it is dangerous.
#Correct me if my understanding is off.
#Also, how should the above test results be interpreted from an environment perspective? environment vs. scope?
#big thanks. -Sean

On Thu, Mar 12, 2009 at 11:29 AM, Jens Oehlschlägel oehl_l...@gmx.de wrote:

Sean,

would like to receive expert opinion to avoid potential trouble [..]

i think the following is the most secure way if one really really has to do assignment in a function call f({a=3}) and if one keeps this convention, <- can be dropped altogether.

secure is relative, since due to R's lazy evaluation you never know whether a function's argument is being evaluated, look at:

> f <- function(x)TRUE
> x <- 1
> f((x=2)) # obscured attempt to assign in a function call
[1] TRUE
> x
[1] 1

Thus there is dangerous advice in the referenced blog which reads:

f(x <- 3) which means assign 3 to x, and call f with the first argument set to the value 3

This might be the case in C but not in R. Actually in R f(x <- 3) means: call f with a first unevaluated argument x <- 3, and if and only if f decides to evaluate its first argument, then the assignment is done. To make this very clear:

> f <- function(x)if(runif(1)0.5) TRUE else x
> x <- 1
> print(f(x <- x + 1))
[1] TRUE
> print(f(x <- x + 1))
[1] 2
> print(f(x <- x + 1))
[1] 3
> print(f(x <- x + 1))
[1] TRUE
> print(f(x <- x + 1))
[1] 4
> print(f(x <- x + 1))
[1] 5
> print(f(x <- x + 1))
[1] TRUE
> print(f(x <- x + 1))
[1] 6
> print(f(x <- x + 1))
[1] TRUE

Here it is unpredictable whether your assignment takes place. Thus assigning like f({x=1}) or f((x=1)) is the maximum dangerous thing to do: even if you have a code-reviewer and the guy is aware of the danger of f(x<-1) he will probably miss it because f((x=1)) does look too similar to a standard call f(x=1).

According to help("<-"), R's assignment operator is rather <- than =:

The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.

So my recommendation is
1) use R's assignment operator with two spaces around (or assign()) and don't obscure assignments by using C's assignment operator (or other languages' equality operator)
2) do not assign in function arguments unless you have good reasons like in system.time(x <- something)

HTH

Jens Oehlschlägel

P.S. Disclaimer: you can consider me biased towards <-, never trust experts, whether experienced or not.

P.P.S. a puzzle, following an old tradition: What is going on here? (and what would you need to do to prove it?)

> search()
[1] ".GlobalEnv" "package:stats" "package:graphics" "package:grDevices" "package:utils" "package:datasets" "package:methods"
[8] "Autoloads" "package:base"
> ls(all.names = TRUE)
[1] "y"
> y
[1] 1 2 3
> identical(y, 1:3)
[1] TRUE
> y[] <- 1 # assigning 1 fails
> y
[1] 1 2 3
> y[] <- 2 # assigning 2 works
> y
[1] 2 2 2

# Tip: no standard packages modified, no extra packages loaded, neither classes nor methods defined, no print methods hiding anything, if you would investigate my R you would not find any false bottom anymore

> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system
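The lazy-evaluation point can also be shown without randomness. In this sketch (the function name and the use_arg flag are invented for the illustration), the assignment written in the call only happens when the function actually touches that argument.

```r
# f evaluates its first argument only when use_arg is TRUE.
f <- function(x, use_arg) if (use_arg) x else TRUE

x <- 1
f(x <- x + 1, use_arg = FALSE)   # argument never forced: no assignment happens
x_after_skip <- x

f(x <- x + 1, use_arg = TRUE)    # argument forced: the assignment runs now
x_after_use <- x

c(x_after_skip, x_after_use)
```

The assignment runs in the caller's environment at the moment the promise is forced, which is why the effect is visible outside the function.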
[R] Is there any difference between <- and =
Dear R-helpers: I have a question related to <- and =. I have seen very experienced R programmers use = rather than <- quite consistently. However, I have heard from others that one should not use = but always stick to <- when assigning values. I personally like = because I was using Matlab, but I would like to receive expert opinion to avoid potential trouble. Many thanks in advance. -Sean
[R] How to write a function that accepts unlimited number of input arguments?
Dear R-helpers: I am an R newbie and have a question about writing functions that accept an unlimited number of input arguments. (I tried to peek into functions such as paste and cbind, but failed; I cannot see their code.) Can someone kindly show me through a summation example? Say we have the input scalars 1 2 3 4 5; then the ideal function, say sum.test, would satisfy

(1+2+3+4+5)==sum.test(1,2,3,4,5)

Also, sum.test should keep working as the number of input scalars changes. Many thanks in advance! -sean
Re: [R] How to write a function that accepts unlimited number of input arguments?
Big thanks for your help and suggestion for email communication.
-s

On Mon, Mar 9, 2009 at 12:18 PM, baptiste auguie ba...@exeter.ac.uk wrote:

> On 9 Mar 2009, at 16:04, Sean Zhang wrote:
>
>> Dear Baptiste:
>> Many thanks for your help! Using the Reduce way, it works almost perfectly. I ran into this problem when thinking of appending vectors. Is it possible to not use list() within add() so add(vec1,vec2,vec3) below can work?
>
> add <- function(...) Reduce("+", list(...))
> add(1, 2, 3)
>
>> Also, do you have some quick hints on using '...'? Many thanks in advance.
>
> I'm not sure of a good reference for this. I'd strongly suggest you read the Introduction to R manual (also check the R project webpage for many other resources). Also, it'd be better if you could Cc R-help next time you ask for further information.
>
> Hope this helps,
> baptiste
>
>> vec1 <- c(0,1)
>> vec2 <- c(2,3)
>> vec3 <- c(4,5)
>> add <- function(x) Reduce(append, x)
>> add(list(vec1, vec2))
>> #add(vec1,vec2) does not work at the moment
>> -sean
>>
>> On Mon, Mar 9, 2009 at 11:50 AM, baptiste auguie ba...@exeter.ac.uk wrote:
>>
>>> Hi,
>>> On 9 Mar 2009, at 15:32, Sean Zhang wrote:
>>>
>>>> Dear R-helpers: I am an R newbie and have a question related to writing functions that accept unlimited number of input arguments.
>>>
>>> It's usually through the ... argument, e.g. in paste(...).
>>>
>>>> (I tried to peek into functions such as paste and cbind, but failed, I cannot see their codes..)
>>>
>>> Simply type their name at the R prompt:
>>>
>>> paste
>>> function (..., sep = " ", collapse = NULL)
>>> .Internal(paste(list(...), sep, collapse))
>>> <environment: namespace:base>
>>>
>>> etc... but that's not very useful here.
>>>
>>>> Can someone kindly show me through a summation example? Say, we have input scalar, 1 2 3 4 5 then the ideal function, say sum.test, can do (1+2+3+4+5)==sum.test(1,2,3,4,5)
>>>
>>> See ?Reduce for one way to do this:
>>>
>>> add <- function(x) Reduce("+", x)
>>> add(list(1, 2, 3))
>>>
>>>> Also sum.test can work as the number of input scalar changes. Many thanks in advance!
>>>> -sean
>
> _
> Baptiste Auguié
> School of Physics
> University of Exeter
> Stocker Road, Exeter, Devon, EX4 4QL, UK
> Phone: +44 1392 264187
> http://newton.ex.ac.uk/research/emag
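Putting the pieces of this thread together, a variadic version might look like the sketch below. It is not taken from any of the messages as written: sum.test is the name the original question asked for, and append.all is a made-up name for the vector-appending case.

```r
# Sum any number of scalar arguments: `...` is captured with list(...)
# and folded with `+`, as suggested via ?Reduce in the thread.
sum.test <- function(...) Reduce(`+`, list(...))

# Append any number of vectors: c() itself already accepts `...`,
# so no explicit list() wrapper is needed (append.all is a
# hypothetical name used only for this sketch).
append.all <- function(...) c(...)

vec1 <- c(0, 1)
vec2 <- c(2, 3)
vec3 <- c(4, 5)

total    <- sum.test(1, 2, 3, 4, 5)
combined <- append.all(vec1, vec2, vec3)
```

For plain numeric input, base R's sum(...) already does what sum.test does; the Reduce form is shown only because it generalises to other binary operations.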
[R] How to make warning message colorful (or have sound)?
Dear R-helpers: I am new to R and wonder how to make a warning message colorful (if possible, having sound is also welcome). I did some research and failed to find options that allow this functionality. Is this a technical limitation so far, or am I missing some information? Many thanks in advance. -sean
[R] how to NULL multiple variables of a df efficiently?
Dear R-helpers: I am an R novice and would appreciate an answer to the following question. I want to delete many variables in a dataframe. I am able to delete one variable by assigning NULL to it, but I have a large number of variables and would like to delete them without using a for loop. Is there a command/function which does this job? Many thanks in advance. -Sean

#Small Example:
df <- data.frame(var.a=rnorm(10), var.b=rnorm(10), var.c=rnorm(10))
df[,'var.a'] <- NULL #this works for one single variable
df[,c('var.a','var.b')] <- NULL #does not work for multiple variables
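One possible loop-free approach, offered only as a sketch: either keep the columns whose names are not in a drop list, or assign list(NULL) to the columns to be removed. The names drop.vars and df.kept are made up for this example.

```r
# Example data frame from the question.
df <- data.frame(var.a = rnorm(10), var.b = rnorm(10), var.c = rnorm(10))

# Hypothetical vector naming the columns to delete.
drop.vars <- c("var.a", "var.b")

# Option 1: build a new data frame from the columns that are NOT listed.
# drop = FALSE keeps the result a data frame even if one column remains.
df.kept <- df[, !(names(df) %in% drop.vars), drop = FALSE]

# Option 2: delete in place by assigning list(NULL) to the named columns.
df[drop.vars] <- list(NULL)
```

Option 1 leaves the original object untouched, which is convenient when the full data frame is still needed later.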
[R] How to transfer a list of space delimited character elements into a char vector?
My dear R-helpers: I am a novice in R and have the following text string manipulation question. Is there a function that performs the job described below? Say,

wanted_output <- c("ab", "cd", "ef")
#the function_wanted can generate c("ab", "cd", "ef") using "ab cd ef" as the single input argument
wanted_output <- function_wanted("ab cd ef")

Motivation: I have a very long list of character elements (like ab cd ef gg ww kwfl ..), and when using them to generate a character vector I try to avoid typing "," between two adjacent elements, typing " in front of the first element, and typing " right after the last element. Many thanks in advance. -Sean
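A minimal sketch of one way to do this with base R's strsplit(), assuming the elements are separated by one or more spaces. function_wanted is simply the placeholder name used in the question.

```r
# Split a single space-delimited string into a character vector.
# trimws() removes leading/trailing blanks so no empty first element
# is produced; " +" treats runs of spaces as a single separator.
function_wanted <- function(x) strsplit(trimws(x), " +")[[1]]

wanted_output <- function_wanted("ab cd ef")
```

strsplit() returns a list with one element per input string, which is why the result is extracted with [[1]].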
[R] How to apply table() on subdata and stack outputs
Dear R helpers: I am an R novice and have a question about using table() to extract frequencies over many sub-datasets. A small example input dataframe and the wanted output dataframe are provided below. The real data is very large, so a for loop is what I try to avoid. Can someone enlighten me on how to use sapply or the like to achieve it? Many thanks in advance! -Sean

#example input dataframe
id <- c('tom', 'tom', 'tom', 'jack', 'jack', 'jack', 'jack')
var_interest <- c("happy", "unhappy", "", "happy", "unhappy", 'soso', 'happy')
input.df <- data.frame(id=id, var_interest=var_interest)
input.df

#output dataframe I want
id_unique <- c('tom','jack')
happy_freq <- c(1,2)
unhappy_freq <- c(1,1)
soso_freq <- c(0,1)
miss_freq <- c(1,0)
output.df <- data.frame(id_unique=id_unique, happy_freq=happy_freq, unhappy_freq=unhappy_freq, soso_freq=soso_freq, miss_freq=miss_freq)
output.df
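For what it is worth, table() accepts two factors at once, so a cross-tabulation of id against var_interest gives all per-id frequencies in one call, without sapply or a loop. The sketch below assumes the empty string stands for a missing answer and relabels it "miss"; freq.tab and freq.df are made-up names.

```r
# Example input from the question.
id <- c("tom", "tom", "tom", "jack", "jack", "jack", "jack")
var_interest <- c("happy", "unhappy", "", "happy", "unhappy", "soso", "happy")

# Assumption: "" means a missing answer; give it an explicit label.
var_interest[var_interest == ""] <- "miss"
input.df <- data.frame(id = id, var_interest = var_interest)

# Two-way frequency table: one row per id, one column per category.
freq.tab <- table(input.df$id, input.df$var_interest)

# Convert the table to a data frame with ids as row names.
freq.df <- as.data.frame.matrix(freq.tab)
```

Rows and columns come out in alphabetical order of the labels, so the layout differs slightly from the hand-built output.df in the question.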
[R] 3d scatter plot with both error bars and a flexibly fitted surface
Dear R-helpers: I, an entry-level R user, wonder how to make a 3d scatter plot with both error bars and a flexibly fitted surface. Can anyone enlighten me? Many thanks in advance. -Sean
[R] misalignment of x-axis when overlaying two plots using latticeExtra
Dear R-helpers: I am an entry-level R user and have a question related to overlaying a barchart and an xyplot using latticeExtra. My problem is that when I overlay them I fail to align their x-axes. I show my problem below through an example.

#the example data frame is provided below
vec <- c(1,5.056656,0.5977967,0.06126587,0.08557778,
2,4.601049,0.5995989,0.05002188,0.11410027,
3,4.932008,0.5502283,0.06727938,0.12531825,
4,4.763798,0.5499489,0.06473846,0.10752641,
5,4.944967,0.5328129,0.05445327,0.13663951,
6,5.063504,0.5267245,0.06477738,0.12380332,
7,4.735251,0.5528205,0.06851714,0.12196075,
8,5.141733,0.5304151,0.07965567,0.15123277,
9,5.215678,0.5219224,0.06694207,0.16476356,
10,4.930439,0.5712519,0.08591549,0.09710933,
11,5.075990,0.5615573,0.05778996,0.15361845,
12,4.909847,0.5683740,0.08711699,0.11189277,
13,4.863164,0.5652511,0.0727,0.12071060,
14,5.173818,0.5564918,0.09830620,0.11831926,
15,4.762325,0.5345888,0.08792658,0.11738642,
16,5.046225,0.5268459,0.09574746,0.13254236,
17,4.902188,0.5370394,0.07194955,0.13164327,
18,4.865935,0.5446562,0.06894994,0.12645103,
19,5.204060,0.5650887,0.06726925,0.09242551,
20,5.208138,0.5765187,0.09282935,0.11053842)
df <- as.data.frame(t(matrix(vec,nrow=5,ncol=20)))
names(df) <- c("group","outcome","proportion_1","proportion_2","proportion_3")
library(latticeExtra)
library(lattice)
#First, generate a barchart to plot the 3 proportions
prop.data <- subset(df,select=c(proportion_1,proportion_2,proportion_3))
prop.tab <- as.table(as.matrix(prop.data))
barchart.obj <- barchart(prop.tab, stack=TRUE, horizontal = FALSE)
#Second, generate the dots of outcome (I could have used type="l" but using type="p" makes the
#misalignment of the x-axis more obvious).
dot.outcome <- xyplot(outcome~group,df,type="p", col="blue")
#Last, overlay the two plots
barchart.obj + as.layer(dot.outcome,style=2,axes=c("y"), outside=TRUE)
#Now, you should be able to see that the x-axes of the two plots do not match,
#i.e., a dot is not at the center of its corresponding bar.
How can I fix this? Your help will be highly appreciated. Many thanks in advance. -Sean
[R] Object name vector as function input argument?
Dear R-helpers: I am new to R, ran into the following question, and would appreciate your advice very much.

My question: how can I use a character vector that records object names as a function input argument? I asked this question very recently and was advised to use get(). get() works when passing one single object name, but it does not work when passing multiple object names. For example, I want to rbind many dfs into one df. Below, I use 3 data frames for illustration.

df.1 <- data.frame(v1=rnorm(5), v2=rnorm(5))
df.2 <- data.frame(v1=rnorm(5), v2=rnorm(5))
df.3 <- data.frame(v1=rnorm(5), v2=rnorm(5))
all.dfs <- c("df.1","df.2","df.3")
# all.dfs is a character vector recording all object names, and I would like to use all.dfs as
# an input argument for a function that performs rbind

# The following works, but I do not know how to use all.dfs as its input argument
output <- do.call(rbind,list(df.1,df.2,df.3))

# The desired function has the following form:
output <- desired.function(all.dfs)

# Some homework I have done is shown below.
# I tried the following things and they do not work
do.call(rbind,list(all.dfs))
one.string <- paste(all.dfs,collapse=",")
do.call(rbind,list(one.string))
do.call(rbind,list(get(one.string)))
do.call(rbind,list(parse(one.string)))

# By the way, the following loop.fun works, but it is not what I like because I may have a large number of dfs
loop.fun <- function (all.dfs) {
for (i in 1:length(all.dfs) ) ifelse ( i==1, output <- get(all.dfs[i]), output <- rbind(output,get(all.dfs[i])) )
return(output)
}
output <- loop.fun(all.dfs)

#Your help is highly appreciated. Many thanks in advance.
-Sean Zhang, Ann Arbor
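A possible shape for desired.function, sketched under the assumption that all named data frames live in the environment the function is called from. mget() is the multi-name counterpart of get() and returns a list of the objects, which do.call() can hand to rbind. The envir argument is an addition made for this sketch.

```r
# Look up several objects by name and row-bind them.
# envir defaults to the caller's environment (an assumption of this
# sketch), so the function finds data frames defined where it is called.
desired.function <- function(df.names, envir = parent.frame()) {
  df.list <- mget(df.names, envir = envir)   # named list of data frames
  do.call(rbind, unname(df.list))            # unname() keeps row names plain
}

df.1 <- data.frame(v1 = rnorm(5), v2 = rnorm(5))
df.2 <- data.frame(v1 = rnorm(5), v2 = rnorm(5))
df.3 <- data.frame(v1 = rnorm(5), v2 = rnorm(5))
all.dfs <- c("df.1", "df.2", "df.3")

output <- desired.function(all.dfs)
```

Keeping the data frames in a list from the start, rather than as separately named objects, would avoid the name lookup altogether.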
[R] How to skip re-installing CRAN packages when updating R?
Dear R-helpers: I am new to R and would like to seek your expert opinion on an installation tip. Many thanks in advance. I want to update my R to the newest version and have the following two questions.

Question 1: How can I install R and its contributed packages in such a way that, when updating R in the future, I do NOT need to re-install the contributed packages used by the previous version of R?

Question 2: Is it an OK practice to just install all the CRAN packages (i.e., install.packages(available.packages()[,1]))? Does anyone do so?

The reason I ask the second question is that if installing all available packages does not consume too much time (say, less than 2 hours) or too many computer resources (I have a big hard drive, so disk space is probably not a concern; I guess computing speed will not be affected, but I am not sure), then I do not need to bother with Question 1 and will just install all available packages when updating R. Many thanks in advance. Merry Christmas! -Sean
[R] quotation problem/dataframe names as function input argument.
Dear R friends: Can someone help me with the following problem? Many thanks in advance.

# Problem Description:
# I want to write functions which take a (character) vector of dataframe names as input argument.
# For example, I want to extract the number of observations from a number of dataframes.
# I tried the following:
nobs.fun <- function (dframe.vec) {
nobs.vec <- array(NA,c(length(dframe.vec),1))
for (i in 1:length(dframe.vec)) {
nobs.vec[i] <- dim(dframe.vec[i])[1]
}
return(nobs.vec)
}
# To show the problem, I create a fake dataframe and store its name (i.e., "dframe.1")
# in a vector (i.e., dframe.vec) of length 1.
# creation of fake dataframe
dframe.1 <- as.data.frame(matrix(seq(1:2),c(1,2)))
# store the dataframe name into a vector using c() function
dframe.vec <- c("dframe.1")
# The problem is that the following line does not work
nobs.fun(dframe.vec)
# Seems to me, the problem stems from the fact that dframe.vec[1] is interpreted by R as "dframe.1" (note: it is quoted)
# and dim("dframe.1")[1] gives NULL.
# Also, I realize the following line works as expected (note: dframe.1 is not quoted any more):
dim(dframe.1)[1]

So my question is then: how can I pass dataframe names as an input argument for another function without running into the quotation mark issue above? Any hint? Thank you in advance. -Sean
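One way around the quoting issue, as a sketch only: look each name up with mget(), which turns the character vector into a list of the actual data frames, and then apply nrow() to each. The envir argument and the second data frame dframe.2 are additions made for this illustration.

```r
# Return the number of rows of each data frame named in dframe.vec.
# envir defaults to the caller's environment (an assumption of this sketch).
nobs.fun <- function(dframe.vec, envir = parent.frame()) {
  dframes <- mget(dframe.vec, envir = envir)  # names -> actual objects
  vapply(dframes, nrow, integer(1))           # one row count per data frame
}

dframe.1 <- as.data.frame(matrix(1:2, nrow = 1))
dframe.2 <- as.data.frame(matrix(1:6, nrow = 3))  # hypothetical second example

nobs <- nobs.fun(c("dframe.1", "dframe.2"))
```

The result is a named integer vector, so each count stays labelled with the data frame it came from.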
[R] Extract values based on indexes without looping?
Dear R-Helpers: I am an entry-level user of R and have the following question. Many thanks in advance.

# value.vec stores values
value.vec <- c('a','b','c')
# which.vec stores the locations/indexes of values in value.vec.
which.vec <- c(3, 2, 2, 1)
# How can I obtain the following vector based on the value.vec and which.vec mentioned above?
# vector.I.want <- c('c', 'b', 'b', 'a')
# (from indexes 3 2 2 1)
# I try to avoid using the following loop to achieve the goal because the which.vec in reality will be very long
vector.I.want <- rep(NA,length(which.vec))
for (i in 1:length(which.vec)) {
vector.I.want[i] <- value.vec[which.vec[i]]
}
# Is there a faster way than looping? Thanks in advance.
-Sean
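For reference, R's square-bracket indexing is already vectorised, so a vector of positions can be used as the index directly. A minimal sketch with the objects from the question:

```r
# Values and the positions to extract, as in the question.
value.vec <- c("a", "b", "c")
which.vec <- c(3, 2, 2, 1)

# An index vector may repeat positions; the result has one element
# per entry of which.vec, in the same order.
vector.I.want <- value.vec[which.vec]
```

This replaces the whole loop with a single subsetting operation.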