Re: [R] Lookups in R
the problem I have is that userid's are not just sequential from 1:n_users. if they were, of course I'd have made a big matrix that was n_users x n_fields and that would be that. but, I think what I cando is just use the hash to store the index into the result matrix, nothing more. then the rest of it will be easy. but please tell me more about eliminating loops. In many cases in R I have used lapply and derivatives to avoid loops, but in this case they seem to give me extra overhead simply by the generation of their result lists: system.time(lapply(1:10^4, mean)) user system elapsed 1.310.001.31 system.time(for(i in 1:10^4) mean(i)) user system elapsed 0.330.000.32 thanks, mike I don't think that's a fair comparison--- much of the overhead comes from the use of data frames and the creation of the indexing vector. I get n_accts - 10^3 n_trans - 10^4 t - list() t$amt - runif(n_trans) t$acct - as.character(round(runif(n_trans, 1, n_accts))) uhash - new.env(hash=TRUE, parent=emptyenv(), size=n_accts) for (acct in as.character(1:n_accts)) uhash[[acct]] - list(amt=0, n=0) system.time(for (i in seq_along(t$amt)) { + acct - t$acct[i] + x - uhash[[acct]] + uhash[[acct]] - list(amt=x$amt + t$amt[i], n=x$n + 1) + }, gcFirst = TRUE) user system elapsed 0.508 0.008 0.517 udf - matrix(0, nrow = n_accts, ncol = 2) rownames(udf) - as.character(1:n_accts) colnames(udf) - c(amt, n) system.time(for (i in seq_along(t$amt)) { + idx - t$acct[i] + udf[idx, ] - udf[idx, ] + c(t$amt[i], 1) + }, gcFirst = TRUE) user system elapsed 1.872 0.008 1.883 The loop is still going to be the problem for realistic examples. -Deepayan __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lookups in R
You are getting two very different results in what you are comparing. system.time(lapply(1:10^4, mean)) user system elapsed 1.310.001.31 is returning a list with 10,000 values in it. It is taking time to allocate the space and such. system.time(for(i in 1:10^4) mean(i)) user system elapsed 0.330.000.32 is just returning a single value (mean(10^4)) and is not having to allocate space and setup the structure for a list. Typically you use 'lapply' not only for 'looping', but more importantly returning the values associated with the processing. So again the timing will be dependent on what you are doing. If you have a large transaction table that you want consolidated to some processing on userID, then lapply will probably be very efficient for that. On 7/5/07, Michael Frumin [EMAIL PROTECTED] wrote: the problem I have is that userid's are not just sequential from 1:n_users. if they were, of course I'd have made a big matrix that was n_users x n_fields and that would be that. but, I think what I cando is just use the hash to store the index into the result matrix, nothing more. then the rest of it will be easy. but please tell me more about eliminating loops. In many cases in R I have used lapply and derivatives to avoid loops, but in this case they seem to give me extra overhead simply by the generation of their result lists: system.time(lapply(1:10^4, mean)) user system elapsed 1.310.001.31 system.time(for(i in 1:10^4) mean(i)) user system elapsed 0.330.000.32 thanks, mike I don't think that's a fair comparison--- much of the overhead comes from the use of data frames and the creation of the indexing vector. I get n_accts - 10^3 n_trans - 10^4 t - list() t$amt - runif(n_trans) t$acct - as.character(round(runif(n_trans, 1, n_accts))) uhash - new.env(hash=TRUE, parent=emptyenv(), size=n_accts) for (acct in as.character(1:n_accts)) uhash[[acct]] - list(amt=0, n=0) system.time(for (i in seq_along(t$amt)) { + acct - t$acct[i] + x - uhash[[acct]] + uhash[[acct]] - list(amt=x$amt + t$amt[i], n=x$n + 1) + }, gcFirst = TRUE) user system elapsed 0.508 0.008 0.517 udf - matrix(0, nrow = n_accts, ncol = 2) rownames(udf) - as.character(1:n_accts) colnames(udf) - c(amt, n) system.time(for (i in seq_along(t$amt)) { + idx - t$acct[i] + udf[idx, ] - udf[idx, ] + c(t$amt[i], 1) + }, gcFirst = TRUE) user system elapsed 1.872 0.008 1.883 The loop is still going to be the problem for realistic examples. -Deepayan __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve? [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lookups in R
On 7/5/07, jim holtman [EMAIL PROTECTED] wrote: You are getting two very different results in what you are comparing. system.time(lapply(1:10^4, mean)) user system elapsed 1.310.001.31 is returning a list with 10,000 values in it. It is taking time to allocate the space and such. system.time(for(i in 1:10^4) mean(i)) user system elapsed 0.330.000.32 is just returning a single value (mean(10^4)) and is not having to allocate space and setup the structure for a list. Typically you use 'lapply' not only for 'looping', but more importantly returning the values associated with the processing. The point still holds: system.time(lapply(1:10^4, mean)) user system elapsed 3.748 2.404 6.161 system.time({ a = numeric(10^4); for (i in 1:10^4) a[i] = mean(i) }) user system elapsed 0.716 0.004 0.720 To really get rid of the for loop, you need to move the loop to pure C code, e.g. system.time(rowMeans(matrix(1:10^4, ncol = 1))) user system elapsed 0.004 0.000 0.004 Sometimes you can do this using functions available in R, e.g. using tapply() in your original question and rowMeans() in this example. Sometimes you cannot, and the only way to gain efficiency is to write custom C code (we do not have enough information to decide which is the case in your real example, since we don't know what it is). -Deepayan On 7/5/07, Michael Frumin [EMAIL PROTECTED] wrote: the problem I have is that userid's are not just sequential from 1:n_users. if they were, of course I'd have made a big matrix that was n_users x n_fields and that would be that. but, I think what I cando is just use the hash to store the index into the result matrix, nothing more. then the rest of it will be easy. but please tell me more about eliminating loops. In many cases in R I have used lapply and derivatives to avoid loops, but in this case they seem to give me extra overhead simply by the generation of their result lists: system.time(lapply(1:10^4, mean)) user system elapsed 1.310.001.31 system.time(for(i in 1:10^4) mean(i)) user system elapsed 0.330.000.32 thanks, mike I don't think that's a fair comparison--- much of the overhead comes from the use of data frames and the creation of the indexing vector. I get n_accts - 10^3 n_trans - 10^4 t - list() t$amt - runif(n_trans) t$acct - as.character(round(runif(n_trans, 1, n_accts))) uhash - new.env(hash=TRUE, parent=emptyenv(), size=n_accts) for (acct in as.character(1:n_accts)) uhash[[acct]] - list(amt=0, n=0) system.time(for (i in seq_along(t$amt)) { + acct - t$acct[i] + x - uhash[[acct]] + uhash[[acct]] - list(amt=x$amt + t$amt[i], n=x$n + 1) + }, gcFirst = TRUE) user system elapsed 0.508 0.008 0.517 udf - matrix(0, nrow = n_accts, ncol = 2) rownames(udf) - as.character(1:n_accts) colnames(udf) - c(amt, n) system.time(for (i in seq_along(t$amt)) { + idx - t$acct[i] + udf[idx, ] - udf[idx, ] + c(t$amt[i], 1) + }, gcFirst = TRUE) user system elapsed 1.872 0.008 1.883 The loop is still going to be the problem for realistic examples. -Deepayan __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve? __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Lookups in R
Hey all; I'm a beginner++ user of R, trying to use it to do some processing of data sets of over 1M rows, and running into a snafu. imagine that my input is a huge table of transactions, each linked to a specif user id. as I run through the transactions, I need to update a separate table for the users, but I am finding that the traditional ways of doing a table lookup are way too slow to support this kind of operation. i.e: for(i in 1:100) { userid = transactions$userid[i]; amt = transactions$amounts[i]; users[users$id == userid,'amt'] += amt; } I assume this is a linear lookup through the users table (in which there are 10's of thousands of rows), when really what I need is O(constant time), or at worst O(log(# users)). is there any way to manage a list of ID's (be they numeric, string, etc) and have them efficiently mapped to some other table index? I see the CRAN package for SQLite hashes, but that seems to be going a bit too far. thanks, Mike Intern, Oyster Card Group, Transport for London (feel free to email back to this address, I'm posting through NAbble so I hope it works). -- View this message in context: http://www.nabble.com/Lookups-in-R-tf4026062.html#a11435994 Sent from the R help mailing list archive at Nabble.com. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lookups in R
mfrumin wrote: Hey all; I'm a beginner++ user of R, trying to use it to do some processing of data sets of over 1M rows, and running into a snafu. imagine that my input is a huge table of transactions, each linked to a specif user id. as I run through the transactions, I need to update a separate table for the users, but I am finding that the traditional ways of doing a table lookup are way too slow to support this kind of operation. i.e: for(i in 1:100) { userid = transactions$userid[i]; amt = transactions$amounts[i]; users[users$id == userid,'amt'] += amt; } I assume this is a linear lookup through the users table (in which there are 10's of thousands of rows), when really what I need is O(constant time), or at worst O(log(# users)). is there any way to manage a list of ID's (be they numeric, string, etc) and have them efficiently mapped to some other table index? I see the CRAN package for SQLite hashes, but that seems to be going a bit too far. Sometimes you need a bit of lateral thinking. I suspect that you could do it like this: tbl - with(transactions, tapply(amount, userid, sum)) users$amt - users$amt + tbl[users$id] one catch is that there could be users with no transactions, in which case you may need to replace userid by factor(userid, levels=users$id). None of this is tested, of course. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lookups in R
Michael, A hash provides constant-time access, though the resulting perl-esque data structures (a hash of lists, e.g.) are not convenient for other manipulations n_accts - 10^3 n_trans - 10^4 t - list() t$amt - runif(n_trans) t$acct - as.character(round(runif(n_trans, 1, n_accts))) uhash - new.env(hash=TRUE, parent=emptyenv(), size=n_accts) ## keys, presumably account ids for (acct in as.character(1:n_accts)) uhash[[acct]] - list(amt=0, n=0) system.time(for (i in seq_along(t$amt)) { + acct - t$acct[i] + x - uhash[[acct]] + uhash[[acct]] - list(amt=x$amt + t$amt[i], n=x$n + 1) + }) user system elapsed 0.264 0.000 0.262 udf - data.frame(amt=0, n=rep(0L, n_accts), + row.names=as.character(1:n_accts)) system.time(for (i in seq_along(t$amt)) { + idx - row.names(udf)==t$acct[i] + udf[idx, ] - c(udf[idx,amt], udf[idx, n]) + c(t$amt[i], 1) + }) user system elapsed 18.398 0.000 18.394 Peter Dalgaard [EMAIL PROTECTED] writes: mfrumin wrote: Hey all; I'm a beginner++ user of R, trying to use it to do some processing of data sets of over 1M rows, and running into a snafu. imagine that my input is a huge table of transactions, each linked to a specif user id. as I run through the transactions, I need to update a separate table for the users, but I am finding that the traditional ways of doing a table lookup are way too slow to support this kind of operation. i.e: for(i in 1:100) { userid = transactions$userid[i]; amt = transactions$amounts[i]; users[users$id == userid,'amt'] += amt; } I assume this is a linear lookup through the users table (in which there are 10's of thousands of rows), when really what I need is O(constant time), or at worst O(log(# users)). is there any way to manage a list of ID's (be they numeric, string, etc) and have them efficiently mapped to some other table index? I see the CRAN package for SQLite hashes, but that seems to be going a bit too far. Sometimes you need a bit of lateral thinking. I suspect that you could do it like this: tbl - with(transactions, tapply(amount, userid, sum)) users$amt - users$amt + tbl[users$id] one catch is that there could be users with no transactions, in which case you may need to replace userid by factor(userid, levels=users$id). None of this is tested, of course. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Martin Morgan Bioconductor / Computational Biology http://bioconductor.org __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lookups in R
i wish it were that simple. unfortunately the logic i have to do on each transaction is substantially more complicated, and involves referencing the existing values of the user table through a number of conditions. any other thoughts on how to get better-than-linear performance time? is there a recommended binary searching/sorting (i.e. BTree) module that I could use to maintain my own index? thanks, mike Peter Dalgaard wrote: mfrumin wrote: Hey all; I'm a beginner++ user of R, trying to use it to do some processing of data sets of over 1M rows, and running into a snafu. imagine that my input is a huge table of transactions, each linked to a specif user id. as I run through the transactions, I need to update a separate table for the users, but I am finding that the traditional ways of doing a table lookup are way too slow to support this kind of operation. i.e: for(i in 1:100) { userid = transactions$userid[i]; amt = transactions$amounts[i]; users[users$id == userid,'amt'] += amt; } I assume this is a linear lookup through the users table (in which there are 10's of thousands of rows), when really what I need is O(constant time), or at worst O(log(# users)). is there any way to manage a list of ID's (be they numeric, string, etc) and have them efficiently mapped to some other table index? I see the CRAN package for SQLite hashes, but that seems to be going a bit too far. Sometimes you need a bit of lateral thinking. I suspect that you could do it like this: tbl - with(transactions, tapply(amount, userid, sum)) users$amt - users$amt + tbl[users$id] one catch is that there could be users with no transactions, in which case you may need to replace userid by factor(userid, levels=users$id). None of this is tested, of course. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lookups in R
__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lookups in R
Michael Frumin wrote: i wish it were that simple. unfortunately the logic i have to do on each transaction is substantially more complicated, and involves referencing the existing values of the user table through a number of conditions. any other thoughts on how to get better-than-linear performance time? is there a recommended binary searching/sorting (i.e. BTree) module that I could use to maintain my own index? The point remains: To do anything efficient in R, you need to get rid of that for loop and use something vectorized. Notice that you can expand values from the user table into the transaction table by indexing with transactions$userid, or you can use a merge operation. thanks, mike Peter Dalgaard wrote: mfrumin wrote: Hey all; I'm a beginner++ user of R, trying to use it to do some processing of data sets of over 1M rows, and running into a snafu. imagine that my input is a huge table of transactions, each linked to a specif user id. as I run through the transactions, I need to update a separate table for the users, but I am finding that the traditional ways of doing a table lookup are way too slow to support this kind of operation. i.e: for(i in 1:100) { userid = transactions$userid[i]; amt = transactions$amounts[i]; users[users$id == userid,'amt'] += amt; } I assume this is a linear lookup through the users table (in which there are 10's of thousands of rows), when really what I need is O(constant time), or at worst O(log(# users)). is there any way to manage a list of ID's (be they numeric, string, etc) and have them efficiently mapped to some other table index? I see the CRAN package for SQLite hashes, but that seems to be going a bit too far. Sometimes you need a bit of lateral thinking. I suspect that you could do it like this: tbl - with(transactions, tapply(amount, userid, sum)) users$amt - users$amt + tbl[users$id] one catch is that there could be users with no transactions, in which case you may need to replace userid by factor(userid, levels=users$id). None of this is tested, of course. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lookups in R
On 7/4/07, Martin Morgan [EMAIL PROTECTED] wrote: Michael, A hash provides constant-time access, though the resulting perl-esque data structures (a hash of lists, e.g.) are not convenient for other manipulations n_accts - 10^3 n_trans - 10^4 t - list() t$amt - runif(n_trans) t$acct - as.character(round(runif(n_trans, 1, n_accts))) uhash - new.env(hash=TRUE, parent=emptyenv(), size=n_accts) ## keys, presumably account ids for (acct in as.character(1:n_accts)) uhash[[acct]] - list(amt=0, n=0) system.time(for (i in seq_along(t$amt)) { + acct - t$acct[i] + x - uhash[[acct]] + uhash[[acct]] - list(amt=x$amt + t$amt[i], n=x$n + 1) + }) user system elapsed 0.264 0.000 0.262 udf - data.frame(amt=0, n=rep(0L, n_accts), + row.names=as.character(1:n_accts)) system.time(for (i in seq_along(t$amt)) { + idx - row.names(udf)==t$acct[i] + udf[idx, ] - c(udf[idx,amt], udf[idx, n]) + c(t$amt[i], 1) + }) user system elapsed 18.398 0.000 18.394 I don't think that's a fair comparison--- much of the overhead comes from the use of data frames and the creation of the indexing vector. I get n_accts - 10^3 n_trans - 10^4 t - list() t$amt - runif(n_trans) t$acct - as.character(round(runif(n_trans, 1, n_accts))) uhash - new.env(hash=TRUE, parent=emptyenv(), size=n_accts) for (acct in as.character(1:n_accts)) uhash[[acct]] - list(amt=0, n=0) system.time(for (i in seq_along(t$amt)) { + acct - t$acct[i] + x - uhash[[acct]] + uhash[[acct]] - list(amt=x$amt + t$amt[i], n=x$n + 1) + }, gcFirst = TRUE) user system elapsed 0.508 0.008 0.517 udf - matrix(0, nrow = n_accts, ncol = 2) rownames(udf) - as.character(1:n_accts) colnames(udf) - c(amt, n) system.time(for (i in seq_along(t$amt)) { + idx - t$acct[i] + udf[idx, ] - udf[idx, ] + c(t$amt[i], 1) + }, gcFirst = TRUE) user system elapsed 1.872 0.008 1.883 The loop is still going to be the problem for realistic examples. -Deepayan __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.