
Re: [R] Lookups in R

2007-07-05 Thread Michael Frumin

the problem I have is that userid's are not just sequential from
1:n_users.  if they were, of course I'd have made a big matrix that was
n_users x n_fields and that would be that.  but, I think what I can do is
just use the hash to store the index into the result matrix, nothing
more. then the rest of it will be easy.
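
One way to read the idea above, as a minimal sketch only: a hashed environment stores nothing but the mapping from userid to row number, and a plain numeric matrix holds the per-user fields. The ids, the field names, the sample transactions, and the `row_of` helper are all made up for illustration and are not from the thread.

```r
# Hypothetical, non-sequential user ids (illustration only)
user_ids <- c(1007, 42, 99813, 5)
fields   <- c("amt", "n")

# Result matrix: one row per user, one column per field
result <- matrix(0, nrow = length(user_ids), ncol = length(fields),
                 dimnames = list(NULL, fields))

# Hashed environment that stores only userid -> row index
idx <- new.env(hash = TRUE, parent = emptyenv(), size = length(user_ids))
for (k in seq_along(user_ids)) {
  assign(as.character(user_ids[k]), k, envir = idx)
}

# Assumed helper: look up the matrix row for a single userid
row_of <- function(userid) get(as.character(userid), envir = idx)

# Process a few made-up transactions
trans <- data.frame(userid = c(42, 5, 42), amt = c(10, 2.5, 1))
for (i in seq_len(nrow(trans))) {
  r <- row_of(trans$userid[i])
  result[r, "amt"] <- result[r, "amt"] + trans$amt[i]
  result[r, "n"]   <- result[r, "n"] + 1
}
```

Updating a numeric matrix cell in place avoids rebuilding a list per transaction, which is presumably where part of the per-iteration cost goes; that is an expectation, not a measurement.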

but please tell me more about eliminating loops.  In many cases in R I
have used lapply and derivatives to avoid loops, but in this case they
seem to give me extra overhead simply by the generation of their result
lists:

> system.time(lapply(1:10^4, mean))
   user  system elapsed 
   1.31    0.00    1.31 
> system.time(for(i in 1:10^4) mean(i))
   user  system elapsed 
   0.33    0.00    0.32 


thanks,
mike


> I don't think that's a fair comparison; much of the overhead comes
> from the use of data frames and the creation of the indexing vector. I
> get
>
> > n_accts <- 10^3
> > n_trans <- 10^4
> > t <- list()
> > t$amt <- runif(n_trans)
> > t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
> > uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
> > for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
> > system.time(for (i in seq_along(t$amt)) {
> + acct <- t$acct[i]
> + x <- uhash[[acct]]
> + uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
> + }, gcFirst = TRUE)
>    user  system elapsed
>   0.508   0.008   0.517
> > udf <- matrix(0, nrow = n_accts, ncol = 2)
> > rownames(udf) <- as.character(1:n_accts)
> > colnames(udf) <- c("amt", "n")
> > system.time(for (i in seq_along(t$amt)) {
> + idx <- t$acct[i]
> + udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
> + }, gcFirst = TRUE)
>    user  system elapsed
>   1.872   0.008   1.883
>
> The loop is still going to be the problem for realistic examples.
>
> -Deepayan
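
For comparison, here is a hedged sketch of the same `amt` and `n` totals computed without an explicit loop, using `match()`, `tabulate()` and `rowsum()` from base R. It only applies when the per-transaction update is a plain sum and count, which may not hold for more complicated logic. The seed and the names `accts`, `pos` and `sums` are assumptions added for the example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n_accts <- 10^3
n_trans <- 10^4
t <- list()
t$amt  <- runif(n_trans)
t$acct <- as.character(round(runif(n_trans, 1, n_accts)))

# Map every transaction to its account's row in one vectorized call
accts <- as.character(1:n_accts)
pos   <- match(t$acct, accts)

udf <- matrix(0, nrow = n_accts, ncol = 2,
              dimnames = list(accts, c("amt", "n")))

# Per-account transaction counts
udf[, "n"] <- tabulate(pos, nbins = n_accts)

# Per-account sums; rowsum() returns one row per group that occurs,
# with the group value (here the row position) as its row name
sums <- rowsum(t$amt, pos)
udf[as.integer(rownames(sums)), "amt"] <- sums[, 1]
```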

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Lookups in R

2007-07-04 Thread Michael Frumin

i wish it were that simple.  unfortunately the logic i have to do on 
each transaction is substantially more complicated, and involves 
referencing the existing values of the user table through a number of 
conditions.

any other thoughts on how to get better-than-linear lookup time?  
is there a recommended binary searching/sorting (e.g. B-tree) module that 
I could use to maintain my own index?
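
A rough sketch of what a hand-maintained sorted index could look like in base R, using `findInterval()` (which does an interval search over a sorted vector) rather than a separate B-tree package. The ids and the `lookup` helper are hypothetical. For a fixed set of ids, `match()` may be the simpler choice.

```r
# Hypothetical user ids, in arbitrary table order
ids <- c(1007L, 42L, 99813L, 5L)

# Build the index once: sorted ids plus the permutation back to the table
ord        <- order(ids)
sorted_ids <- ids[ord]

# Assumed helper: row of each id in the original table, NA if absent
lookup <- function(x) {
  k   <- findInterval(x, sorted_ids)   # position of the largest id <= x
  kk  <- pmax(k, 1)                    # guard against k == 0
  hit <- k > 0 & sorted_ids[kk] == x   # exact match only
  ifelse(hit, ord[kk], NA_integer_)
}

lookup(c(42L, 5L, 99813L))  # rows 2, 4, 3 of the original table
lookup(6L)                  # NA: no such id
```

The index has to be rebuilt (or merged) whenever ids are added, so this fits best when the user table is known up front.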

thanks,
mike

Peter Dalgaard wrote:
> mfrumin wrote:
> > Hey all; I'm a beginner++ user of R, trying to use it to do some
> > processing of data sets of over 1M rows, and running into a snafu.
> > imagine that my input is a huge table of transactions, each linked to a
> > specific user id.  as I run through the transactions, I need to update a
> > separate table for the users, but I am finding that the traditional ways
> > of doing a table lookup are way too slow to support this kind of
> > operation.
> >
> > i.e:
> >
> > for (i in 1:100) {
> >     userid <- transactions$userid[i]
> >     amt <- transactions$amounts[i]
> >     # R has no += operator, so the update is spelled out
> >     hit <- users$id == userid
> >     users[hit, 'amt'] <- users[hit, 'amt'] + amt
> > }
> >
> > I assume this is a linear lookup through the users table (in which
> > there are 10's of thousands of rows), when really what I need is
> > O(constant time), or at worst O(log(# users)).
> >
> > is there any way to manage a list of ID's (be they numeric, string,
> > etc) and have them efficiently mapped to some other table index?
> >
> > I see the CRAN package for SQLite hashes, but that seems to be going
> > a bit too far.
   
> Sometimes you need a bit of lateral thinking. I suspect that you could
> do it like this:
>
> tbl <- with(transactions, tapply(amount, userid, sum))
> users$amt <- users$amt + tbl[users$id]
>
> one catch is that there could be users with no transactions, in which
> case you may need to replace userid by factor(userid,
> levels=users$id). None of this is tested, of course.
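
A small worked version of the suggestion above, as a sketch on made-up data. It assumes the amount column is called `amounts`, as in the original loop. It also uses the `factor(userid, levels=users$id)` variant so the sums come back in the same order as the users table. That avoids indexing `tbl` by numeric ids, which R would treat as positions rather than names.

```r
# Made-up data for illustration
users <- data.frame(id = c(1007, 42, 99813, 5), amt = 0)
transactions <- data.frame(userid  = c(42, 5, 42, 1007),
                           amounts = c(10, 2.5, 1, 4))

# Sum per user; the factor levels give one entry per user,
# in the same order as the users table
tbl <- with(transactions,
            tapply(amounts, factor(userid, levels = users$id), sum))

# Users with no transactions come back as NA; treat them as 0
tbl[is.na(tbl)] <- 0

# Same order as users$id, so the sums can be added positionally
users$amt <- users$amt + as.vector(tbl)
```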
