Re: [R] adding rows without loops
Thank you Blaser: this is the exact solution I came up with, but when comparing 8M rows, even an 8 GB machine runs out of memory. To run this effectively, I have to break the DF into smaller DFs, loop through them, and then do a massive merge at the end. That's what takes 8+ hours to compute. Even the bigmemory package is causing OOM issues.

-----Original Message-----
From: Blaser Nello [mailto:nbla...@ispm.unibe.ch]
Sent: Thursday, May 23, 2013 12:15 AM
To: Adeel Amin; r-help@r-project.org
Subject: RE: [R] adding rows without loops

Merge should do the trick. How best to use it will depend on what you want to do with the data afterwards. The following is an example of what you could do. This will perform best if the missing rows occur at random and do not cluster.

DF1 <- data.frame(X.DATE=rep("01052007", 7), X.TIME=c(2:5, 7:9)*100,
                  VALUE=c(37, 42, 45, 45, 45, 42, 45),
                  VALUE2=c(29, 24, 28, 27, 35, 32, 32))
DF2 <- data.frame(X.DATE=rep("01052007", 7), X.TIME=c(2:8)*100,
                  VALUE=c(37, 42, 45, 45, 45, 42, 45),
                  VALUE2=c(29, 24, 28, 27, 35, 32, 32))

DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"), all=TRUE)

while (any(is.na(DFm))) {
  if (any(is.na(DFm[1, ]))) stop("Complete first row required!")
  ind <- which(is.na(DFm), arr.ind=TRUE)
  prind <- matrix(c(ind[, "row"] - 1, ind[, "col"]), ncol=2)
  DFm[is.na(DFm)] <- DFm[prind]
}
DFm

Best,
Nello

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Adeel Amin
Sent: Donnerstag, 23. Mai 2013 07:01
To: r-help@r-project.org
Subject: [R] adding rows without loops

I'm comparing a variety of datasets with over 4M rows each. I've solved this problem 5 different ways using a for/while loop, but the processing time is murder (over 8 hours doing this row by row per data set). As such, I'm trying to find out whether this is possible without a loop, or with one in which the processing time is much faster.
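Nello's while loop is a last-observation-carried-forward (LOCF) fill, and each pass rescans the whole merged frame. The same merge-then-fill idea can be sketched with zoo::na.locf, which fills each column in a single vectorized pass (this assumes the zoo package is available; it is not the code anyone in the thread actually ran):

```r
# Sketch: merge as above, then fill the NA rows column-wise with
# zoo::na.locf instead of the repeated whole-frame while loop.
# Assumes the zoo package is installed.
library(zoo)

DF1 <- data.frame(X.DATE=rep("01052007", 7), X.TIME=c(2:5, 7:9)*100,
                  VALUE=c(37, 42, 45, 45, 45, 42, 45),
                  VALUE2=c(29, 24, 28, 27, 35, 32, 32))
DF2 <- data.frame(X.DATE=rep("01052007", 7), X.TIME=c(2:8)*100,
                  VALUE=c(37, 42, 45, 45, 45, 42, 45),
                  VALUE2=c(29, 24, 28, 27, 35, 32, 32))

DFm <- merge(DF1, DF2, by=c("X.DATE", "X.TIME"), all=TRUE)
DFm[-(1:2)] <- lapply(DFm[-(1:2)], na.locf)   # carry last value down into NA rows
```

As with the while loop, this requires the first merged row to be complete, since there is nothing earlier to carry forward.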
Each dataset is a time series, as such:

DF1:
  X.DATE   X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0700    45     35
6 01052007   0800    42     32
7 01052007   0900    45     32
... ... ... n

DF2:
  X.DATE   X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0600    45     35
6 01052007   0700    42     32
7 01052007   0800    45     32
... ... ... n+4000

In other words, there are 4000 more rows in DF2 than in DF1; thus the datasets are of unequal length. I'm trying to ensure that all dataframes have the same number of X.DATE and X.TIME entries. Where they are missing, I'd like to insert a new row. In the above example, when comparing DF2 to DF1, the 01052007 0600 entry is missing in DF1. The solution would add a row to DF1 at the appropriate index, so the new dataframe would be:

  X.DATE   X.TIME VALUE VALUE2
1 01052007   0200    37     29
2 01052007   0300    42     24
3 01052007   0400    45     28
4 01052007   0500    45     27
5 01052007   0600    45     27
6 01052007   0700    45     35
7 01052007   0800    42     32
8 01052007   0900    45     32

VALUE and VALUE2 would be the same as row 4. Of course this is simple to accomplish using a row-by-row analysis, but with over 4M rows the processing time of destroying and rebinding the datasets is very time consuming, and I believe highly un-R'ish. What am I missing?

Thanks!

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
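The insertion described above (new row at 0600 carrying row 4's values) can also be done without any loop via match() and cummax(). This is a sketch under stated assumptions, not code from the thread: it assumes both frames are sorted chronologically, X.DATE/X.TIME are zero-padded character columns as shown in the tables, and the first timestamp of the union is present in DF1 (the same "complete first row" requirement as the merge approach).

```r
# Hypothetical sketch: align DF1 to the union of timestamps, copying each
# inserted row's values from the last earlier DF1 row (LOCF semantics).
key1 <- paste(DF1$X.DATE, DF1$X.TIME)
key2 <- paste(DF2$X.DATE, DF2$X.TIME)
allk <- unique(c(key1, key2))
allk <- allk[order(as.POSIXct(allk, format = "%m%d%Y %H%M"))]

m   <- match(allk, key1)                  # NA where DF1 lacks a timestamp
idx <- cummax(ifelse(is.na(m), 0L, m))    # index of last available DF1 row
DF1new <- DF1[idx, ]
DF1new$X.DATE <- sub(" .*", "", allk)     # restore the union's date/time keys
DF1new$X.TIME <- sub(".* ", "", allk)
rownames(DF1new) <- NULL
```

Because everything here is a vectorized index operation, memory cost is one extra copy of DF1 rather than the repeated destroy-and-rebind of the row-by-row version.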
Re: [R] adding rows...
Hi Rainer:

Thanks for the reply. Posting the large dataset is a task: there are 8M rows between the two of them, and the first discrepancy in the data doesn't happen until at least the 40,000th row of each dataframe. The examples I posted are a pretty good abstraction of the root of the issue.

The problem isn't the data. The problem is out-of-memory issues when doing any operations like merge, rbind, etc. The solution that Blaser suggested in his post works great, but the system quickly runs out of memory. What does work without OOM issues are for/while loops, but on average they take an inordinate time to compute and tie up a machine for hours and hours at a time. Essentially I break the data apart, add rows and rebind. It's a brute-force type of approach, and run times are in excess of 48 hours for one full iteration across 25 data frames. Terrible.

I am about to go down the road of using the data.table class, as it's far more memory-efficient, but the documentation is cryptic. Your idea of creating a superset has some merit, and it's what I was experimenting with prior to my original post.

-----Original Message-----
From: Rainer Schuermann [mailto:rainer.schuerm...@gmx.net]
Sent: Thursday, May 23, 2013 12:19 AM
To: Adeel Amin
Subject: adding rows...

Can I suggest that you post the output of

dput( DF1 )
dput( DF2 )

rather than pictures of your data? Any solution attempt will depend upon the data types...

Just shooting in the dark: have you tried just row-binding the missing 4k lines to DF1 and then ordering DF1 as you like? It looks as if the data are ordered by time / date?

Rgds,
Rainer
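Since data.table comes up here as the next thing to try, a minimal sketch of its rolling-join idiom may help decode the documentation. This is an illustration under assumptions (data.table installed, DF1/DF2 shaped as in the earlier example with X.TIME numeric so it can roll), not a tested solution for the 8M-row case:

```r
# Hypothetical sketch: for every timestamp in DT2, pull the DT1 row at or
# immediately before it (last observation carried forward), in one keyed join.
# Assumes the data.table package; X.TIME must be orderable (e.g. numeric).
library(data.table)
DT1 <- data.table(DF1)          # DF1/DF2 as in the small example above
DT2 <- data.table(DF2)
setkey(DT1, X.DATE, X.TIME)

aligned <- DT1[DT2[, .(X.DATE, X.TIME)], roll = TRUE]
```

The join is performed on DT1's key without materializing a full cartesian merge, which is what makes this considerably lighter on memory than merge() on data frames.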
Re: [R] R issue with unequal large data frames with multiple columns
Thank you Arun (and everyone else) -- this is in the right direction. I'll post the code that worked shortly for everyone else, in case you were curious.

-----Original Message-----
From: arun [mailto:smartpink...@yahoo.com]
Sent: Thursday, May 02, 2013 7:09 AM
To: Adeel Amin
Cc: R help
Subject: Re: [R] R issue with unequal large data frames with multiple columns

Hi,

Maybe this helps:

dat1 <- structure(list(X.DATE = c("01052007", "01072007", "01072007",
"02182007", "02182007", "02242007", "03252007"), X.TIME = c("0230",
"0330", "0440", "0440", "0440", "0330", "0230"), VALUE = c(37,
42, 45, 45, 45, 42, 45), VALUE2 = c(29, 24, 28, 27, 35, 32, 32)),
.Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"),
class = "data.frame", row.names = c(NA, -7L))

dat2 <- structure(list(X.DATE = c("01052007", "01182007", "01242007",
"02142007", "02182007", "03242007", "03252007"), X.TIME = c("0230",
"0330", "0430", "0330", "0440", "0230", "0230"), VALUE = c(34,
41, 42, 44, 45, 21, 42), VALUE2 = c(28, 25, 26, 28, 32, 35, 36)),
.Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"),
class = "data.frame", row.names = c(NA, -7L))

dat3 <- structure(list(X.DATE = c("01052007", "01182007", "01252007",
"02142007", "02182007", "03222007", "03252007"), X.TIME = c("0230",
"0330", "0430", "0330", "0440", "0230", "0230"), VALUE = c(32,
42, 44, 44, 47, 42, 46), VALUE2 = c(24, 29, 32, 34, 38, 39, 42)),
.Names = c("X.DATE", "X.TIME", "VALUE", "VALUE2"),
class = "data.frame", row.names = c(NA, -7L))

library(xts)
lst1 <- lapply(list(dat1, dat2, dat3), function(x) {
  xts(x[, -c(1, 2)],
      order.by = as.POSIXct(paste0(x[, 1], x[, 2]), format = "%m%d%Y%H%M"))})

# subset by date and time
lapply(lst1, function(x) x['2007-01-05 02:30:00/2007-01-25 04:30:00'])
#[[1]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    37     29
#2007-01-07 03:30:00    42     24
#2007-01-07 04:40:00    45     28
#
#[[2]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    34     28
#2007-01-18 03:30:00    41     25
#2007-01-24 04:30:00    42     26
#
#[[3]]
#                    VALUE VALUE2
#2007-01-05 02:30:00    32     24
#2007-01-18 03:30:00    42     29
#2007-01-25 04:30:00    44     32

# subset by time
lapply(lst1, function(x) x['T02:30/T03:30'])

res <- na.omit(Reduce(function(...) merge(...), lst1))
res
#                    VALUE VALUE2 VALUE.1 VALUE2.1 VALUE.2 VALUE2.2
#2007-01-05 02:30:00    37     29      34       28      32       24
#2007-02-18 04:40:00    45     27      45       32      47       38
#2007-03-25 02:30:00    45     32      42       36      46       42

lst2 <- as.list(res)
lst3 <- lapply(list(c("VALUE", "VALUE2"), c("VALUE.1", "VALUE2.1"),
                    c("VALUE.2", "VALUE2.2")),
               function(x) do.call(cbind, lst2[x]))
# or
lst3 <- lapply(split(names(lst2), ((seq_along(names(lst2)) - 1) %/% 2) + 1),
               function(x) do.call(cbind, lst2[x]))
# change according to the number of columns
lst3
#$`1`
#                    VALUE VALUE2
#2007-01-05 02:30:00    37     29
#2007-02-18 04:40:00    45     27
#2007-03-25 02:30:00    45     32
#
#$`2`
#                    VALUE.1 VALUE2.1
#2007-01-05 02:30:00      34       28
#2007-02-18 04:40:00      45       32
#2007-03-25 02:30:00      42       36
#
#$`3`
#                    VALUE.2 VALUE2.2
#2007-01-05 02:30:00      32       24
#2007-02-18 04:40:00      47       38
#2007-03-25 02:30:00      46       42

A.K.

----- Original Message -----
From: Adeel Amin adeel.a...@gmail.com
To: r-help@r-project.org
Cc:
Sent: Thursday, May 2, 2013 2:28 AM
Subject: [R] R issue with unequal large data frames with multiple columns

I'm a bit of an amateur R programmer. I can do simple R scenarios, but my handle on complex grammatical issues isn't steady.

I have 12 CSV files that I've read into dataframes. Each has 8 columns and over 200 rows. Each dataframe has its data associated by a time component and a date component, in the format of X.DATE and then X.TIME. X.DATE is in the format MMDDYYYY and X.TIME is in the format HHMM.

The issue is that even though each dataframe begins and ends with the same X.DATE and X.TIME values, each data frame has a different number of rows. One may have as many as 10 rows more than the other.

I want to do two things:

1) I want to extract a certain portion of data depending on date and time (easy).

2) In lock step with number 1, I want to eliminate values from the data frame that are a) redundant or b) do not appear in the other data sets.

When step 2 is done, all the time/date data within all 12 dataframes will be the same.

Suggestions?
Thanks R Community
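Step 2 of the original question -- dropping timestamps that are redundant or absent from the other frames -- can also be sketched in base R without xts, assuming each frame carries the X.DATE/X.TIME columns used above (an illustration, not code anyone in the thread posted):

```r
# Hypothetical base-R sketch: keep only the date/time keys common to every
# frame, and drop duplicated keys within each frame.
frames <- list(dat1, dat2, dat3)   # extend to all 12 dataframes
keys   <- lapply(frames, function(d) paste(d$X.DATE, d$X.TIME))
common <- Reduce(intersect, keys)
aligned <- Map(function(d, k) d[k %in% common & !duplicated(k), ],
               frames, keys)
```

After this, every element of aligned contains exactly the same set of timestamps, which is the precondition the original post asks for.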