Re: [R] adding rows without loops

2013-05-23 Thread Adeel - SafeGreenCapital
Thank you Blaser:

This is the exact solution I came up with but when comparing 8M rows even on
an 8G machine, one runs out of memory.  To run this effectively, I have to
break the DF into smaller DFs, loop through them and then do a massive
rmerge at the end.  That's what takes 8+ hours to compute.

Even the bigmemory package is causing OOM issues.  

-Original Message-
From: Blaser Nello [mailto:nbla...@ispm.unibe.ch] 
Sent: Thursday, May 23, 2013 12:15 AM
To: Adeel Amin; r-help@r-project.org
Subject: RE: [R] adding rows without loops

Merge should do the trick. How to best use it will depend on what you
want to do with the data after. 
The following is an example of what you could do. This will perform
best, if the rows are missing at random and do not cluster.

DF1 - data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:5,7:9)*100,
VALUE=c(37, 42, 45, 45, 45, 42, 45), VALE2=c(29,24,28,27,35,32,32))
DF2 - data.frame(X.DATE=rep(01052007, 7), X.TIME=c(2:8)*100,
VALUE=c(37, 42, 45, 45, 45, 42, 45), VALE2=c(29,24,28,27,35,32,32))

DFm - merge(DF1, DF2, by=c(X.DATE, X.TIME), all=TRUE)

while(any(is.na(DFm))){
  if (any(is.na(DFm[1,]))) stop(Complete first row required!)
  ind - which(is.na(DFm), arr.ind=TRUE)
  prind - matrix(c(ind[,row]-1, ind[,col]), ncol=2)
  DFm[is.na(DFm)] - DFm[prind]
}
DFm

Best,
Nello

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
On Behalf Of Adeel Amin
Sent: Donnerstag, 23. Mai 2013 07:01
To: r-help@r-project.org
Subject: [R] adding rows without loops

I'm comparing a variety of datasets with over 4M rows.  I've solved this
problem 5 different ways using a for/while loop but the processing time
is murder (over 8 hours doing this row by row per data set).  As such
I'm trying to find whether this solution is possible without a loop or
one in which the processing time is much faster.

Each dataset is a time series as such:

DF1:

X.DATE X.TIME VALUE VALUE2
1 01052007   020037 29
2 01052007   030042 24
3 01052007   040045 28
4 01052007   050045 27
5 01052007   070045 35
6 01052007   080042 32
7 01052007   090045 32
...
...
...
n

DF2

X.DATE X.TIME VALUE VALUE2
1 01052007   020037 29
2 01052007   030042 24
3 01052007   040045 28
4 01052007   050045 27
5 01052007   060045 35
6 01052007   070042 32
7 01052007   080045 32

...
...
n+4000

In other words there are 4000 more rows in DF2 then DF1 thus the
datasets are of unequal length.

I'm trying to ensure that all dataframes have the same number of X.DATE
and X.TIME entries.  Where they are missing, I'd like to insert a new
row.

In the above example, when comparing DF2 to DF1, entry 01052007 0600
entry is missing in DF1.  The solution would add a row to DF1 at the
appropriate index.

so new dataframe would be


X.DATE X.TIME VALUE VALUE2
1 01052007   020037 29
2 01052007   030042 24
3 01052007   040045 28
4 01052007   050045 27
5 01052007   060045 27
6 01052007   070045 35
7 01052007   080042 32
8 01052007   090045 32

Value and Value2 would be the same as row 4.

Of course this is simple to accomplish using a row by row analysis but
with of 4M rows the processing time destroying and rebinding the
datasets is very time consuming and I believe highly un-R'ish.  What am
I missing?

Thanks!

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] adding rows...

2013-05-23 Thread Adeel - SafeGreenCapital
Hi Rainer:

Thanks for the reply.  Posting the large dataset is a task.  There are 8M
rows between the two of them and the first discrepancy in the data doesn't
happen until at least the 40,000th row on each dataframe.  The examples I
posted are a pretty good abstraction of the root of the issue.  

The problem isn't the data.  The problem is Out Of Memory issues when doing
any operations like merge, rbind, etc.  The solution that Blaser suggested
in his post works great, but the systems quickly run out of memory.  What
does work without OOM issues are for/while loops but on average take an
inordinate time to compute and tie up a machine for hours and hours at time.
Essentially I break the data apart, add rows and rebind.  It's a brute force
type of approach and run times are in excess of 48 hours for one full
iteration across 25 data frames.  Terrible.

I am about to go down the road of using data.tables class as its far more
memory efficient, but the documentation is cryptic. Your idea of creating a
super set has some merit and it's what I was experimenting with prior to my
original post.  

-Original Message-
From: Rainer Schuermann [mailto:rainer.schuerm...@gmx.net] 
Sent: Thursday, May 23, 2013 12:19 AM
To: Adeel Amin
Subject: adding rows...

Can I suggest that you post the output of

dput( DF1 )
dput( DF2 )

rather than pictures of your data? Any solution attempt will depend upon
the data types...

Just shooting in the dark: Have you tried just row-binding the missing 4k
lines to DF1 and then order DF1 as you like? It looks as if the data are
ordered by time / date? 

Rgds,
Rainer

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R issue with unequal large data frames with multiple columns

2013-05-02 Thread Adeel - SafeGreenCapital
Thank you Arun (and everyone else)-- this is in the right the direction.
Ill post the code that worked shortly for everyone else in case you were
curious.

-Original Message-
From: arun [mailto:smartpink...@yahoo.com] 
Sent: Thursday, May 02, 2013 7:09 AM
To: Adeel Amin
Cc: R help
Subject: Re: [R] R issue with unequal large data frames with multiple
columns

Hi,May be this helps:


dat1-structure(list(X.DATE = c(01052007, 01072007, 01072007, 
02182007, 02182007, 02242007, 03252007), X.TIME = c(0230, 
0330, 0440, 0440, 0440, 0330, 0230), VALUE = c(37, 
42, 45, 45, 45, 42, 45), VALUE2 = c(29, 24, 28, 27, 35, 32, 32
)), .Names = c(X.DATE, X.TIME, VALUE, VALUE2), class = data.frame,
row.names = c(NA, 
-7L))
dat2- structure(list(X.DATE = c(01052007, 01182007, 01242007, 
02142007, 02182007, 03242007, 03252007), X.TIME = c(0230, 
0330, 0430, 0330, 0440, 0230, 0230), VALUE = c(34, 
41, 42, 44, 45, 21, 42), VALUE2 = c(28, 25, 26, 28, 32, 35, 36
)), .Names = c(X.DATE, X.TIME, VALUE, VALUE2), class = data.frame,
row.names = c(NA, 
-7L))
dat3- structure(list(X.DATE = c(01052007, 01182007, 01252007, 
02142007, 02182007, 03222007, 03252007), X.TIME = c(0230, 
0330, 0430, 0330, 0440, 0230, 0230), VALUE = c(32, 
42, 44, 44, 47, 42, 46), VALUE2 = c(24, 29, 32, 34, 38, 39, 42
)), .Names = c(X.DATE, X.TIME, VALUE, VALUE2), class = data.frame,
row.names = c(NA, 
-7L))


library(xts)
lst1-lapply(list(dat1,dat2,dat3),function(x){ xts(x[,-c(1,2)],
order.by=as.POSIXct(paste0(x[,1],x[,2]),format=%m%d%Y%H%M))})

#subset by date and time
 lapply(lst1,function(x) x['2007-01-05 02:30:00/2007-01-25 04:30:00'])
#[[1]]
#    VALUE VALUE2
#2007-01-05 02:30:00    37 29
#2007-01-07 03:30:00    42 24
#2007-01-07 04:40:00    45 28
#
#[[2]]
#    VALUE VALUE2
#2007-01-05 02:30:00    34 28
#2007-01-18 03:30:00    41 25
#2007-01-24 04:30:00    42 26
#
#[[3]]
#    VALUE VALUE2
#2007-01-05 02:30:00    32 24
#2007-01-18 03:30:00    42 29
#2007-01-25 04:30:00    44 32

#subset by time
lapply(lst1,function(x) x['T02:30/T03:30'])

res-na.omit(Reduce(function(...) merge(...),lst1))
res
#    VALUE VALUE2 VALUE.1 VALUE2.1 VALUE.2 VALUE2.2
#2007-01-05 02:30:00    37 29  34   28  32   24
#2007-02-18 04:40:00    45 27  45   32  47   38
#2007-03-25 02:30:00    45 32  42   36  46   42

lst2-as.list(res)
lst3-
lapply(list(c(VALUE,VALUE2),c(VALUE.1,VALUE2.1),c(VALUE.2,VALUE2.
2)),function(x) do.call(cbind,lst2[x]))
#or
lst3-
lapply(split(names(lst2),((seq_along(names(lst2))-1)%/%2)+1),function(x)
do.call(cbind,lst2[x])) #change according to the number of columns

lst3
#$`1`
#    VALUE VALUE2
#2007-01-05 02:30:00    37 29
#2007-02-18 04:40:00    45 27
#2007-03-25 02:30:00    45 32
#
#$`2`
#    VALUE.1 VALUE2.1
#2007-01-05 02:30:00  34   28
#2007-02-18 04:40:00  45   32
#2007-03-25 02:30:00  42   36
#
#$`3`
#    VALUE.2 VALUE2.2
#2007-01-05 02:30:00  32   24
#2007-02-18 04:40:00  47   38
#2007-03-25 02:30:00  46   42
A.K.




- Original Message -
From: Adeel Amin adeel.a...@gmail.com
To: r-help@r-project.org
Cc: 
Sent: Thursday, May 2, 2013 2:28 AM
Subject: [R] R issue with unequal large data frames with multiple columns

I'm a bit of an amateur R programmer.  I can do simple R scenarios but my
handle on complex grammatical issues isn't steady.

I have 12 CSV files that I've read into dataframes.  Each has 8 columns and
over 200 rows.  Each dataframe has data associated by time component
and a date component in the format of:

X.DATE and then X.TIME

X.DATE is in the format of MMDD and X.TIME is format HHMM.  The issue
is that even though each dataframe begins and ends with the same X.DATE and
X.TIME values, each data frame has different number of rows.  One may have
as many 10 rows more than the other.

I want to do two things:

1) I want to extract a certain portion of data depending on date and time
(easy)

2) In lock step with number 2 I want to eliminate values from the data
frame that are a) redundant or b) do not appear in the other data sets.

When step 2 is done, all the time/date data within all 12 dataframes will
be the same.

Suggestions?  Thanks R Community --

    [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.