Re: [R] How do I combine lists of data.frames into a single data frame?

2010-07-15 Thread Marc Schwartz
On Jul 15, 2010, at 2:18 PM, Ted Byers wrote:

 The data.frame is constructed by one of the following functions:
 
 funweek <- function(df)
   if (length(df$elapsed_time) > 5) {
     rv = fitdist(df$elapsed_time, "exp")
     rv$year = df$sale_year[1]
     rv$sample = df$sale_week[1]
     rv$granularity = "week"
     rv
   }
 funmonth <- function(df)
   if (length(df$elapsed_time) > 5) {
     rv = fitdist(df$elapsed_time, "exp")
     rv$year = df$sale_year[1]
     rv$sample = df$sale_month[1]
     rv$granularity = "month"
     rv
   }
 
 It is basically the data.frame created by fitdist extended to include the
 variables used to distinguish one sample from another.
 
 I have the following statement that gets me a set of IDs from my db:
 
 ids <- dbGetQuery(con, "SELECT DISTINCT m_id FROM risk_input")
 
 And then I have a loop that allows me to analyze one dataset after another:
 
 for (i in 1:length(ids[,1])) {
  print(i)
  print(ids[i,1])
 
 Then, after a set of statements that give me information about the dataset
 (such as its size), within a conditional block that ensures I apply the
 analysis only on sufficiently large samples, I have the following:
 
 z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_week),
                   drop = TRUE), funweek)
 
 or
 
 z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_month),
                   drop = TRUE), funmonth)
 
 followed by:
 
 str(z)
 
 Of course, I close the loop and disconnect from my db.
 
 NB: I don't see any way to get rid of the loop by adding ID as a factor to
 split because I have to query the DB for several key bits of data in order
 to determine whether or not there is sufficient data to work on.
 
 I have everything working, except the final step of storing the results back
 into the db.  Storing data in the Db is easy enough.  But I am at a loss as
 to how to combine the lists placed in z in most of the iterations through
 the ID loop into a single data.frame.
 
 Now, I did take a look at rbind and cbind, but it isn't clear to me if
 either is appropriate.  All the data frames have the same structure, but the
 lists are of variable length, and I am not certain how either might be used
 inside the IDs loop.
 
 So, what is the best way to combine all lists assigned to z into a single
 data.frame?
 
 Thanks
 
 Ted


Ted,

If each of the data frames in the list 'z' has the same column structure, you 
can use:

  do.call(rbind, z)

The result of which will be a single data frame containing all of the rows from 
each of the data frames in the list.
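
As a minimal sketch of this pattern on built-in data: the helper name summarise_group, the use of iris, and the summary columns are illustrative assumptions, not Ted's actual code. It splits a data frame into groups, returns a one-row data frame per group (or NULL for a group that is too small), and stacks the pieces.

```r
# Hypothetical helper: a one-row summary per group, or NULL if the group is too small
summarise_group <- function(df) {
  if (nrow(df) > 5) {
    data.frame(species = as.character(df$Species[1]),
               n = nrow(df),
               mean_sepal_length = mean(df$Sepal.Length))
  }
}

# One list element per species, each a data frame with the same columns
z <- lapply(split(iris, iris$Species, drop = TRUE), summarise_group)

# Stack all list elements into a single data frame
result <- do.call(rbind, z)
str(result)
```

Because every element of z has the same columns, rbind can stack them directly; the row names of the result are taken from the names of the list.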

HTH,

Marc Schwartz

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How do I combine lists of data.frames into a single data frame?

2010-07-15 Thread Ted Byers
Thanks Marc

The next part of the question, though, involves the fact that there is a new
'z' list made in almost every iteration through the ID loop.

I guess there are two parts to the question.  First, how would I make a list
containing all the data frames created by a call to rbind?  I assume, then,
that I could call rbind again to make that new list into a single
data.frame.  Second, is it possible to just append one list of objects to
another list of objects, and would doing that and calling rbind on that
master list be more efficient than calling rbind on each z list and then
calling rbind after the loop on the list of such data.frames?

Thanks again,

Ted





-- 
R.E.(Ted) Byers, Ph.D.,Ed.D.
t...@merchantservicecorp.com
CTO
Merchant Services Corp.
350 Harry Walker Parkway North, Suite 8
Newmarket, Ontario
L3Y 8L3



Re: [R] How do I combine lists of data.frames into a single data frame?

2010-07-15 Thread Marc Schwartz
Ted,

I may not be completely clear on how you have your processes implemented, but 
some thoughts:

If you will be creating multiple lists initially, where each list (say z1...z4) 
contains 1 or more data frames and all of the data frames have the same column 
structure, you can use:

  do.call(rbind, c(z1, z2, z3, z4))

For example, using the iris data set:

  list1 <- list(head(iris), head(iris), head(iris))

  list2 <- list(head(iris), head(iris))

So these now have 3 and 2 copies, respectively, of 6 rows from the iris data 
set. You can then do:

DF <- do.call(rbind, c(list1, list2))

> str(DF)
'data.frame':   30 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 5.1 4.9 4.7 4.6 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.5 3 3.2 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.4 1.3 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


So DF now contains 30 rows (6 rows * 5 data frames).

I am not sure if that will spark some thoughts, but ideally, if you can figure 
out a way such that the result of all of your operations will be a single list 
(e.g., within a loop construct), you can avoid the copying of objects, which adds 
both time and RAM overhead. Then you can just use the do.call(rbind, YourList) 
construct on the single 'all inclusive' list.  If you need to preallocate a 
'master' list object, which you can then index in a loop, presuming that you 
know ahead of time how many total data frames will be created, you can use 
vector("list", N), where N is the number of total list elements that you will 
require. For example:

> vector("list", 5)
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

will preallocate a list of 5 elements, each of which can then be indexed to 
contain a data frame that is a result of your looping operation.
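
A minimal sketch of that preallocate-and-index approach follows. The ids vector, the even-number test, and the use of iris rows are stand-ins assumed for illustration; they are not taken from Ted's code.

```r
ids <- 1:5                               # hypothetical IDs
results <- vector("list", length(ids))   # preallocated: five NULL slots

for (i in seq_along(ids)) {
  # Stand-in for the "is there enough data?" check: only even IDs qualify
  if (ids[i] %% 2 == 0) {
    results[[i]] <- data.frame(id = ids[i], iris[1:2, ])
  }
  # Slots that are never assigned simply stay NULL
}

# NULL slots are skipped when the list is row-bound
combined <- do.call(rbind, results)
```

Here the number of slots is known up front (one per ID), so the loop only fills existing slots rather than growing the list.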


HTH,

Marc



Re: [R] How do I combine lists of data.frames into a single data frame?

2010-07-15 Thread Ted Byers
Thanks Marc

Part of the challenge here is that EVERYTHING is dynamic.  New data is being
added to the DB all the time.  Each active ID produces a new sample every day,
or at a minimum every week, and new IDs are added every week.  So I can't hard
code anything.  If, for a given ID, I had 50 weekly samples last week, I'll
have 51 samples this week.

But some of the IDs have sample sizes that are so small, it would be pure
BS to try to use fitdist on their data.

I have figured out a way to handle this for a given ID, and so I have the
loop that iterates over the IDs, and processes the data for that ID IF there
is sufficient data.  And to make things interesting, the number of IDs I
need to process this week is greater than the number of IDs I had to process
last week.

So, I iterate over IDs, from 1 up through perhaps 500.  If a given ID has
sufficient data, I get the z lists.  And I have checked, applying rbind to
these works great!  Of all the IDs' datasets I have examined, perhaps 10% do
not yet have enough data to work with (but that, too, changes over time).

From what you have said, it would seem that I ought to make a master list.
So, I need to learn how to make a master list grow from nothing to include
all these z lists.  That reduces to a question of how can one append
dynamically created lists of varying size (from just a few list elements to
a few hundred list elements) to such a master list.

Actually, when it gets right down to it, I think I am ignorant of a key
piece of the puzzle (I have probably missed the key part of the
documentation dealing with this).  I do not yet know how to add even one
element to a list within a loop where the list does not exist (or at least
is empty) at the beginning of the loop.

I get your example do.call(rbind, c(z1, z2, z3, z4)), but what do you do
if there is no list at the beginning of a loop and you need to handle
something like:

# n is some large number, and in about 10% of values of 'i' (not known a
# priori) creation of x and y is skipped
for (i in 1:n) {
  if (test that returns true only 90% of the time) {
    x = function_that_makes_a_data_frame()
    y = function_that_makes_a_list_of_data_frames()
  }
}

We have not created any lists on entry into the loop.  How do we create a
list containing all instances of x and another that contains all elements
that had been in each instance of y?  If I can learn how to do that, then I
can call  do.call(rbind,x_list) and do.call(rbind,y_element_list).

If you know C++, and specifically the STL containers and algorithms, one can
grow vectors or lists using a function called 'push_back' which is defined
on most stl containers.  I am looking for the R equivalent for objects, and
the R equivalent of the C++ STL algorithm std::copy (passed the begin and
end iterators of the source list and a back inserter for the recipient
container), for appending a source list to a master list.
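
For what it is worth, the closest base R idioms to these two operations can be sketched as follows. The loop bounds, the skip test, and the iris-based stand-ins for x and y are assumptions made only for illustration.

```r
x_list <- list()   # start empty, much like an empty std::vector
y_list <- list()

for (i in 1:6) {
  if (i %% 3 != 0) {                      # stand-in test; skips some values of i
    x <- iris[i, ]                        # stand-in for one data frame
    y <- list(iris[1:2, ], iris[3:4, ])   # stand-in for a list of data frames

    # Roughly push_back: store x as ONE new element.
    # ([[ ]] is used because c(x_list, x) would splice in x's columns.)
    x_list[[length(x_list) + 1]] <- x

    # Roughly std::copy with a back inserter: append every element of y
    y_list <- c(y_list, y)
  }
}

x_all <- do.call(rbind, x_list)
y_all <- do.call(rbind, y_list)
```

Growing a list this way can involve repeated copying, so for long loops a preallocated list or lapply() is usually the safer choice.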

Thanks

Ted


Re: [R] How do I combine lists of data.frames into a single data frame?

2010-07-15 Thread Marc Schwartz
Ted,

Based upon your code below, you might be better off using two lapply() 
constructs to create the x and y results separately, taking advantage of 
lapply()'s built-in ability to create lists 'on the fly', while returning a 
NULL when the function will not be applied to the data based upon your test.

For example:

lapply(seq(n), function(i) if (test on ID[i]) funcX() else NULL)

and something like:

lapply(seq(n), function(i) if (test on ID[i]) do.call(rbind, funcY()) else NULL)


and then you can use the do.call() approach on the results of both.


Consider:

# Only return data if 'i' is even

Res1 <- lapply(1:5, function(i) if (i %% 2 == 0) iris[1:i, ] else NULL)

> Res1
[[1]]
NULL

[[2]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa

[[3]]
NULL

[[4]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

[[5]]
NULL



When we use do.call() here the elements that are NULL do not result in any 
problems creating the result:

> do.call(rbind, Res1)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          5.1         3.5          1.4         0.2  setosa
4          4.9         3.0          1.4         0.2  setosa
5          4.7         3.2          1.3         0.2  setosa
6          4.6         3.1          1.5         0.2  setosa



Now consider the second example, where your function would return a list of 
data frames. I'll use replicate() with 'simplify = FALSE' so that the result 
within lapply() is either a single list of data frames or NULL. If the result 
would be a list of data frames, we'll use do.call() within the loop so that 
lapply() returns a single data frame rather than a list of data frames. 
Consider:


> replicate(3, iris[1:3, ], simplify = FALSE)
[[1]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

[[2]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

[[3]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa



> do.call(rbind, replicate(3, iris[1:3, ], simplify = FALSE))
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          5.1         3.5          1.4         0.2  setosa
5          4.9         3.0          1.4         0.2  setosa
6          4.7         3.2          1.3         0.2  setosa
7          5.1         3.5          1.4         0.2  setosa
8          4.9         3.0          1.4         0.2  setosa
9          4.7         3.2          1.3         0.2  setosa



So now:

Res2 <- lapply(1:5, function(i) if (i %% 2 == 0)
                 do.call(rbind, replicate(i, iris[1:i, ], simplify = FALSE))
               else NULL)

> Res2
[[1]]
NULL

[[2]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          5.1         3.5          1.4         0.2  setosa
4          4.9         3.0          1.4         0.2  setosa

[[3]]
NULL

[[4]]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.1         3.5          1.4         0.2  setosa
6           4.9         3.0          1.4         0.2  setosa
7           4.7         3.2          1.3         0.2  setosa
8           4.6         3.1          1.5         0.2  setosa
9           5.1         3.5          1.4         0.2  setosa
10          4.9         3.0          1.4         0.2  setosa
11          4.7         3.2          1.3         0.2  setosa
12  4.6 3.1  1.5