Re: [R] How do I combine lists of data.frames into a single data frame?
On Jul 15, 2010, at 2:18 PM, Ted Byers wrote:

> The data.frame is constructed by one of the following functions:
>
>   funweek <- function(df)
>     if (length(df$elapsed_time) > 5) {
>       rv = fitdist(df$elapsed_time, "exp")
>       rv$year = df$sale_year[1]
>       rv$sample = df$sale_week[1]
>       rv$granularity = "week"
>       rv
>     }
>
>   funmonth <- function(df)
>     if (length(df$elapsed_time) > 5) {
>       rv = fitdist(df$elapsed_time, "exp")
>       rv$year = df$sale_year[1]
>       rv$sample = df$sale_month[1]
>       rv$granularity = "month"
>       rv
>     }
>
> It is basically the data.frame created by fitdist extended to include the variables used to distinguish one sample from another.
>
> I have the following statement that gets me a set of IDs from my db:
>
>   ids <- dbGetQuery(con, "SELECT DISTINCT m_id FROM risk_input")
>
> And then I have a loop that allows me to analyze one dataset after another:
>
>   for (i in 1:length(ids[,1])) {
>     print(i)
>     print(ids[i,1])
>
> Then, after a set of statements that give me information about the dataset (such as its size), within a conditional block that ensures I apply the analysis only on sufficiently large samples, I have the following:
>
>   z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_week), drop = TRUE), funweek)
>
> or
>
>   z <- lapply(split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_month), drop = TRUE), funmonth)
>
> followed by:
>
>   str(z)
>
> Of course, I close the loop and disconnect from my db.
>
> NB: I don't see any way to get rid of the loop by adding ID as a factor to split, because I have to query the DB for several key bits of data in order to determine whether or not there is sufficient data to work on.
>
> I have everything working, except the final step of storing the results back into the db. Storing data in the DB is easy enough. But I am at a loss as to how to combine the lists placed in z in most of the iterations through the ID loop into a single data.frame.
>
> Now, I did take a look at rbind and cbind, but it isn't clear to me if either is appropriate. All the data frames have the same structure, but the lists are of variable length, and I am not certain how either might be used inside the IDs loop.
>
> So, what is the best way to combine all lists assigned to z into a single data.frame?
>
> Thanks
>
> Ted

Ted,

If each of the data frames in the list 'z' has the same column structure, you can use:

  do.call(rbind, z)

The result will be a single data frame containing all of the rows from each of the data frames in the list.

HTH,

Marc Schwartz

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
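A minimal sketch of the split / lapply / do.call(rbind, ...) pattern discussed above. The toy `moreinfo` values and the `summarise_group()` helper are hypothetical stand-ins (the real code calls fitdist on database data); the sketch only assumes that each group yields a data frame with the same columns.

```r
# Toy stand-in for 'moreinfo' (hypothetical values, not from the thread)
moreinfo <- data.frame(
  sale_year    = c(2009, 2009, 2010, 2010, 2010),
  sale_week    = c(1, 1, 1, 2, 2),
  elapsed_time = c(3.2, 1.5, 4.1, 0.7, 2.9)
)

# Hypothetical per-group function returning a one-row data frame
summarise_group <- function(df) {
  data.frame(
    year      = df$sale_year[1],
    sample    = df$sale_week[1],
    n         = nrow(df),
    mean_time = mean(df$elapsed_time)
  )
}

# One list element per (year, week) combination that actually occurs
z <- lapply(
  split(moreinfo, list(moreinfo$sale_year, moreinfo$sale_week), drop = TRUE),
  summarise_group
)

# Stack all list elements into a single data frame
result <- do.call(rbind, z)
print(result)
```

Because every element of `z` has the same columns, `do.call(rbind, z)` behaves like calling `rbind(z[[1]], z[[2]], ...)` without having to know the length of the list in advance.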
Re: [R] How do I combine lists of data.frames into a single data frame?
Thanks Marc

The next part of the question, though, involves the fact that there is a new 'z' list made in almost every iteration through the ID loop. I guess there are two parts to the question.

First, how would I make a list containing all the data frames created by a call to rbind? I assume, then, that I could call rbind again to make that new list into a single data.frame.

Second, is it possible to just append one list of objects to another list of objects, and would doing that and calling rbind on that master list be more efficient than calling rbind on each z list and then calling rbind after the loop on the list of such data.frames?

Thanks again,

Ted

On Thu, Jul 15, 2010 at 3:27 PM, Marc Schwartz marc_schwa...@me.com wrote:

> Ted,
>
> If each of the data frames in the list 'z' have the same column structure, you can use:
>
>   do.call(rbind, z)
>
> The result of which will be a single data frame containing all of the rows from each of the data frames in the list.
>
> HTH,
>
> Marc Schwartz

-- 
R.E.(Ted) Byers, Ph.D.,Ed.D.
t...@merchantservicecorp.com
CTO
Merchant Services Corp.
350 Harry Walker Parkway North, Suite 8
Newmarket, Ontario
L3Y 8L3
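On the second question, one list can be appended to another with c(). The sketch below uses small slices of the built-in iris data as hypothetical stand-ins for two 'z' lists; it shows the mechanics only and makes no claim about which route is faster on real data.

```r
# Two hypothetical 'z' lists of data frames with identical columns
z1 <- list(head(iris, 2), head(iris, 3))
z2 <- list(head(iris, 4))

# Route 1: append the lists, then rbind once at the end
master <- list()
master <- c(master, z1)   # c() concatenates lists element by element
master <- c(master, z2)
combined <- do.call(rbind, master)

# Route 2: rbind each 'z' first, then rbind the intermediate data frames
per_iteration <- list(do.call(rbind, z1), do.call(rbind, z2))
combined2 <- do.call(rbind, per_iteration)

length(master)   # number of data frames collected
nrow(combined)   # total rows from route 1
nrow(combined2)  # total rows from route 2
```

Note that `c(master, some_data_frame)` would splice the data frame's columns into the list; to append a single data frame, wrap it first, as in `c(master, list(some_data_frame))`.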
Re: [R] How do I combine lists of data.frames into a single data frame?
Ted,

I may not be completely clear on how you have your processes implemented, but some thoughts:

If you will be creating multiple lists initially, where each list (say z1...z4) contains 1 or more data frames and all of the data frames have the same column structure, you can use:

  do.call(rbind, c(z1, z2, z3, z4))

For example, using the iris data set:

  list1 <- list(head(iris), head(iris), head(iris))
  list2 <- list(head(iris), head(iris))

So these now have 3 and 2 copies, respectively, of 6 rows from the iris data set. You can then do:

  DF <- do.call(rbind, c(list1, list2))

  str(DF)
  'data.frame':   30 obs. of  5 variables:
   $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 5.1 4.9 4.7 4.6 ...
   $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.5 3 3.2 3.1 ...
   $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.4 1.3 1.5 ...
   $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 ...
   $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

So DF now contains 30 rows (6 rows * 5 data frames).

I am not sure if that will spark some thoughts, but ideally, if you can figure out a way such that the result of all of your operations will be a single list (e.g. within a loop construct), you can avoid the copying of objects, which adds both time and RAM overhead. Then you can just use the do.call(rbind, YourList) construct on the single 'all inclusive' list.

If you need to preallocate a 'master' list object, which you can then index in a loop, presuming that you know ahead of time how many total data frames will be created, you can use vector("list", N), where N is the number of total list elements that you will require. For example:

  vector("list", 5)
  [[1]]
  NULL

  [[2]]
  NULL

  [[3]]
  NULL

  [[4]]
  NULL

  [[5]]
  NULL

will preallocate a list of 5 elements, each of which can then be indexed to contain a data frame that is a result of your looping operation.

HTH,

Marc

On Jul 15, 2010, at 2:58 PM, Ted Byers wrote:

> Thanks Marc
>
> The next part of the question, though, involves the fact that there is a new 'z' list made in almost every iteration through the ID loop. I guess there are two parts to the question.
>
> First, how would I make a list containing all the data frames created by a call to rbind? I assume, then, that I could call rbind again to make that new list into a single data.frame.
>
> Second, is it possible to just append one list of objects to another list of objects, and would doing that and calling rbind on that master list be more efficient than calling rbind on each z list and then calling rbind after the loop on the list of such data.frames?
>
> Thanks again,
>
> Ted
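The preallocation idea above can be sketched as follows. The count `N` and the use of iris slices are assumptions for illustration; slots for skipped iterations are simply left as NULL, which rbind ignores when the list is combined.

```r
N <- 5                         # assumed known ahead of time
master <- vector("list", N)    # list of N NULL elements

for (i in seq_len(N)) {
  # Stand-in for "only process IDs with enough data": skip even i
  if (i %% 2 == 1) {
    master[[i]] <- head(iris, i)   # fill slot i with this iteration's data frame
  }
}

# Slots 2 and 4 are still NULL and contribute no rows
DF <- do.call(rbind, master)
nrow(DF)
```

Avoid writing `master[[i]] <- NULL` for a skipped iteration: assigning NULL with `[[<-` deletes the element and shifts the rest of the list, so leaving the slot untouched is the safer choice.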
Re: [R] How do I combine lists of data.frames into a single data frame?
Thanks Marc

Part of the challenge here is that EVERYTHING is dynamic. New data is being added to the DB all the time. Each active ID makes a new sample every day, or at a minimum every week, and new IDs are added every week. So I can't hard code anything. If, for a given ID, I had 50 weekly samples last week, I'll have 51 samples this week. But some of the IDs have sample sizes that are so small, it would be pure BS to try to use fitdist on their data. I have figured out a way to handle this for a given ID, and so I have the loop that iterates over the IDs, and processes the data for that ID IF there is sufficient data. And to make things interesting, the number of IDs I need to process this week is greater than the number of IDs I had to process last week.

So, I iterate over IDs, from 1 up through perhaps 500. If a given ID has sufficient data, I get the z lists. And I have checked, applying rbind to these works great! Of all the IDs' datasets I have examined, perhaps 10% do not yet have enough data to work with (but that, too, changes through time).

From what you have said, it would seem that I ought to make a master list. So, I need to learn how to make a master list grow from nothing to include all these z lists. That reduces to a question of how one can append dynamically created lists of varying size (from just a few list elements to a few hundred list elements) to such a master list.

Actually, when it gets right down to it, I think I am ignorant of a key piece of the puzzle (I have probably missed the key part of the documentation dealing with this). I do not yet know how to add even one element to a list within a loop where the list does not exist (or at least is empty) at the beginning of the loop. I get your example do.call(rbind, c(z1, z2, z3, z4)), but what do you do if there is no list at the beginning of a loop and you need to handle something like:

  # n is some large number, and in about 10% of values of 'i' (not known
  # a priori) creation of x and y is skipped
  for (i in 1:n) {
    if (test that returns TRUE only 90% of the time) {
      x = function_that_makes_a_data_frame()
      y = function_that_makes_a_list_of_data_frames()
    }
  }

We have not created any lists on entry into the loop. How do we create a list containing all instances of x, and another that contains all elements that had been in each instance of y? If I can learn how to do that, then I can call do.call(rbind, x_list) and do.call(rbind, y_element_list).

If you know C++, and specifically the STL containers and algorithms, one can grow vectors or lists using a function called 'push_back', which is defined on most STL containers. I am looking for the R equivalent for objects, and the R equivalent of the C++ STL algorithm std::copy (passed the begin and end iterators of the source list and a back inserter for the recipient container), for appending a source list to a master list.

Thanks

Ted

On Thu, Jul 15, 2010 at 4:52 PM, Marc Schwartz marc_schwa...@me.com wrote:

> If you need to preallocate a 'master' list object, which you can then index in a loop, presuming that you know ahead of time how many total data frames will be created, you can use vector("list", N), where N is the number of total list elements that you will require.
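One way to get push_back-like behaviour in R is to start from an empty list() and grow it inside the loop. In the sketch below the loop bound, the skip test, and the iris slices standing in for `function_that_makes_a_data_frame()` and `function_that_makes_a_list_of_data_frames()` are all hypothetical; only the two append idioms are the point.

```r
n <- 10
x_list <- list()   # will hold one data frame per processed i
y_list <- list()   # will hold every data frame from every y

for (i in seq_len(n)) {
  if (i %% 10 != 0) {   # stand-in test: skips i = 10
    x <- head(iris, 1)                                  # a single data frame
    y <- replicate(2, head(iris, 2), simplify = FALSE)  # a list of data frames

    # "push_back": add one element at the end of the list
    x_list[[length(x_list) + 1]] <- x

    # "std::copy with a back inserter": append all elements of y
    y_list <- c(y_list, y)
  }
}

x_all <- do.call(rbind, x_list)
y_all <- do.call(rbind, y_list)
nrow(x_all)
nrow(y_all)
```

Growing a list this way copies it as it gets longer, so preallocating is preferable when the final size is known; for a few hundred elements the difference is unlikely to matter much.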
Re: [R] How do I combine lists of data.frames into a single data frame?
Ted,

Based upon your code below, you might be better off using two lapply() constructs to create the x and y results separately, taking advantage of lapply()'s built-in ability to create lists 'on the fly', while returning a NULL when the function will not be applied to the data based upon your test.

For example:

  lapply(seq(n), function(i) if (test on ID[i]) funcX() else NULL)

and something like:

  lapply(seq(n), function(i) if (test on ID[i]) do.call(rbind, funcY()) else NULL)

and then you can use the do.call() approach on the results of both.

Consider:

  # Only return data if 'i' is even
  Res1 <- lapply(1:5, function(i) if (i %% 2 == 0) iris[1:i, ] else NULL)

  Res1
  [[1]]
  NULL

  [[2]]
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1          5.1         3.5          1.4         0.2  setosa
  2          4.9         3.0          1.4         0.2  setosa

  [[3]]
  NULL

  [[4]]
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1          5.1         3.5          1.4         0.2  setosa
  2          4.9         3.0          1.4         0.2  setosa
  3          4.7         3.2          1.3         0.2  setosa
  4          4.6         3.1          1.5         0.2  setosa

  [[5]]
  NULL

When we use do.call() here, the elements that are NULL do not cause any problems in creating the result:

  do.call(rbind, Res1)
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1          5.1         3.5          1.4         0.2  setosa
  2          4.9         3.0          1.4         0.2  setosa
  3          5.1         3.5          1.4         0.2  setosa
  4          4.9         3.0          1.4         0.2  setosa
  5          4.7         3.2          1.3         0.2  setosa
  6          4.6         3.1          1.5         0.2  setosa

Now consider the second example, where your function would return a list of data frames. I'll use replicate() with 'simplify = FALSE' so that the result within lapply() is either a single list of data frames or NULL. If the result would be a list of data frames, we'll use do.call() within the loop so that lapply() returns a single data frame rather than a list of data frames.

Consider:

  replicate(3, iris[1:3, ], simplify = FALSE)
  [[1]]
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1          5.1         3.5          1.4         0.2  setosa
  2          4.9         3.0          1.4         0.2  setosa
  3          4.7         3.2          1.3         0.2  setosa

  [[2]]
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1          5.1         3.5          1.4         0.2  setosa
  2          4.9         3.0          1.4         0.2  setosa
  3          4.7         3.2          1.3         0.2  setosa

  [[3]]
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1          5.1         3.5          1.4         0.2  setosa
  2          4.9         3.0          1.4         0.2  setosa
  3          4.7         3.2          1.3         0.2  setosa

  do.call(rbind, replicate(3, iris[1:3, ], simplify = FALSE))
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1          5.1         3.5          1.4         0.2  setosa
  2          4.9         3.0          1.4         0.2  setosa
  3          4.7         3.2          1.3         0.2  setosa
  4          5.1         3.5          1.4         0.2  setosa
  5          4.9         3.0          1.4         0.2  setosa
  6          4.7         3.2          1.3         0.2  setosa
  7          5.1         3.5          1.4         0.2  setosa
  8          4.9         3.0          1.4         0.2  setosa
  9          4.7         3.2          1.3         0.2  setosa

So now:

  Res2 <- lapply(1:5, function(i) if (i %% 2 == 0) do.call(rbind, replicate(i, iris[1:i, ], simplify = FALSE)) else NULL)

  Res2
  [[1]]
  NULL

  [[2]]
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1          5.1         3.5          1.4         0.2  setosa
  2          4.9         3.0          1.4         0.2  setosa
  3          5.1         3.5          1.4         0.2  setosa
  4          4.9         3.0          1.4         0.2  setosa

  [[3]]
  NULL

  [[4]]
     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1           5.1         3.5          1.4         0.2  setosa
  2           4.9         3.0          1.4         0.2  setosa
  3           4.7         3.2          1.3         0.2  setosa
  4           4.6         3.1          1.5         0.2  setosa
  5           5.1         3.5          1.4         0.2  setosa
  6           4.9         3.0          1.4         0.2  setosa
  7           4.7         3.2          1.3         0.2  setosa
  8           4.6         3.1          1.5         0.2  setosa
  9           5.1         3.5          1.4         0.2  setosa
  10          4.9         3.0          1.4         0.2  setosa
  11          4.7         3.2          1.3         0.2  setosa
  12          4.6         3.1          1.5