Re: [R] Lookups in R

2007-07-05 Thread Michael Frumin
the problem I have is that userid's are not just sequential from
1:n_users.  if they were, of course I'd have made a big matrix that was
n_users x n_fields and that would be that.  but, I think what I can do is
just use the hash to store the index into the result matrix, nothing
more. then the rest of it will be easy.
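
(roughly, an untested sketch of what I mean, with made-up example ids:)

 uids <- c("u107", "u093", "u412")                 # non-sequential user ids
 res <- matrix(0, nrow = length(uids), ncol = 2,
               dimnames = list(uids, c("amt", "n")))
 row_of <- seq_along(uids); names(row_of) <- uids  # id -> row index lookup
 i <- row_of[["u093"]]
 res[i, "amt"] <- res[i, "amt"] + 5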

but please tell me more about eliminating loops.  In many cases in R I
have used lapply and derivatives to avoid loops, but in this case they
seem to give me extra overhead simply by the generation of their result
lists:

 system.time(lapply(1:10^4, mean))
   user  system elapsed 
   1.31    0.00    1.31 
 system.time(for(i in 1:10^4) mean(i))
   user  system elapsed 
   0.33    0.00    0.32 


thanks,
mike


 I don't think that's a fair comparison--- much of the overhead comes
 from the use of data frames and the creation of the indexing vector. I
 get
 
  n_accts <- 10^3
  n_trans <- 10^4
  t <- list()
  t$amt <- runif(n_trans)
  t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
  uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
  for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
  system.time(for (i in seq_along(t$amt)) {
 + acct <- t$acct[i]
 + x <- uhash[[acct]]
 + uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
 + }, gcFirst = TRUE)
    user  system elapsed
   0.508   0.008   0.517
  udf <- matrix(0, nrow = n_accts, ncol = 2)
  rownames(udf) <- as.character(1:n_accts)
  colnames(udf) <- c("amt", "n")
  system.time(for (i in seq_along(t$amt)) {
 + idx <- t$acct[i]
 + udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
 + }, gcFirst = TRUE)
    user  system elapsed
   1.872   0.008   1.883
 
 The loop is still going to be the problem for realistic examples.
 
 -Deepayan



Re: [R] Lookups in R

2007-07-05 Thread jim holtman
You are getting two very different results in what you are comparing.

 system.time(lapply(1:10^4, mean))
  user  system elapsed
  1.31    0.00    1.31
is returning a list with 10,000 values in it.  It is taking time to allocate
the space and such.

 system.time(for(i in 1:10^4) mean(i))
  user  system elapsed
  0.33    0.00    0.32
is just returning a single value (mean(10^4)) and does not have to allocate
space and set up the structure for a list.  Typically you use 'lapply' not
only for 'looping' but, more importantly, for returning the values associated
with the processing.

So again, the timing will depend on what you are doing.  If you have a
large transaction table that you want to consolidate with some per-userID
processing, then lapply will probably be very efficient for that.
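
For illustration (untested, and assuming columns named 'userid' and 'amount'),
something like:

 ## one consolidated value per userid, names taken from the grouping variable
 per.user <- tapply(transactions$amount, transactions$userid, sum)
 ## or, keeping the list form that lapply-style code returns
 per.user.list <- lapply(split(transactions$amount, transactions$userid), sum)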


On 7/5/07, Michael Frumin [EMAIL PROTECTED] wrote:

 the problem I have is that userid's are not just sequential from
 1:n_users.  if they were, of course I'd have made a big matrix that was
 n_users x n_fields and that would be that.  but, I think what I can do is
 just use the hash to store the index into the result matrix, nothing
 more. then the rest of it will be easy.

 but please tell me more about eliminating loops.  In many cases in R I
 have used lapply and derivatives to avoid loops, but in this case they
 seem to give me extra overhead simply by the generation of their result
 lists:

  system.time(lapply(1:10^4, mean))
   user  system elapsed
   1.31    0.00    1.31
  system.time(for(i in 1:10^4) mean(i))
   user  system elapsed
   0.33    0.00    0.32


 thanks,
 mike


  I don't think that's a fair comparison--- much of the overhead comes
  from the use of data frames and the creation of the indexing vector. I
  get
 
   n_accts <- 10^3
   n_trans <- 10^4
   t <- list()
   t$amt <- runif(n_trans)
   t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
   uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
   for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
   system.time(for (i in seq_along(t$amt)) {
  + acct <- t$acct[i]
  + x <- uhash[[acct]]
  + uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
  + }, gcFirst = TRUE)
     user  system elapsed
    0.508   0.008   0.517
   udf <- matrix(0, nrow = n_accts, ncol = 2)
   rownames(udf) <- as.character(1:n_accts)
   colnames(udf) <- c("amt", "n")
   system.time(for (i in seq_along(t$amt)) {
  + idx <- t$acct[i]
  + udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
  + }, gcFirst = TRUE)
     user  system elapsed
    1.872   0.008   1.883
 
  The loop is still going to be the problem for realistic examples.
 
  -Deepayan





-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?




Re: [R] Lookups in R

2007-07-05 Thread deepayan . sarkar
On 7/5/07, jim holtman [EMAIL PROTECTED] wrote:
 You are getting two very different results in what you are comparing.

  system.time(lapply(1:10^4, mean))
   user  system elapsed
   1.31    0.00    1.31
 is returning a list with 10,000 values in it.  It is taking time to allocate
 the space and such.

  system.time(for(i in 1:10^4) mean(i))
   user  system elapsed
   0.33    0.00    0.32
 is just returning a single value (mean(10^4)) and is not having to allocate
 space and setup the structure for a list.  Typically you use 'lapply' not
 only for 'looping', but more importantly returning the values associated
 with the processing.

The point still holds:

 system.time(lapply(1:10^4, mean))
   user  system elapsed
  3.748   2.404   6.161
 system.time({ a = numeric(10^4); for (i in 1:10^4) a[i] = mean(i) })
   user  system elapsed
  0.716   0.004   0.720

To really get rid of the for loop, you need to move the loop to pure C
code, e.g.

 system.time(rowMeans(matrix(1:10^4, ncol = 1)))
   user  system elapsed
  0.004   0.000   0.004

Sometimes you can do this using functions available in R, e.g. using
tapply() in your original question and rowMeans() in this example.
Sometimes you cannot, and the only way to gain efficiency is to write
custom C code (we do not have enough information to decide which is
the case in your real example, since we don't know what it is).
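
(As an untested illustration with the toy t$amt / t$acct data above, the whole
per-account loop collapses to grouped operations whose looping happens in C:)

 amt.by.acct <- rowsum(t$amt, t$acct)   # per-account totals of amt
 n.by.acct <- table(t$acct)             # per-account transaction counts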

-Deepayan

 On 7/5/07, Michael Frumin [EMAIL PROTECTED] wrote:
 
  the problem I have is that userid's are not just sequential from
  1:n_users.  if they were, of course I'd have made a big matrix that was
  n_users x n_fields and that would be that.  but, I think what I can do is
  just use the hash to store the index into the result matrix, nothing
  more. then the rest of it will be easy.
 
  but please tell me more about eliminating loops.  In many cases in R I
  have used lapply and derivatives to avoid loops, but in this case they
  seem to give me extra overhead simply by the generation of their result
  lists:
 
   system.time(lapply(1:10^4, mean))
     user  system elapsed
     1.31    0.00    1.31
   system.time(for(i in 1:10^4) mean(i))
     user  system elapsed
     0.33    0.00    0.32
 
 
  thanks,
  mike
 
 
   I don't think that's a fair comparison--- much of the overhead comes
   from the use of data frames and the creation of the indexing vector. I
   get
  
    n_accts <- 10^3
    n_trans <- 10^4
    t <- list()
    t$amt <- runif(n_trans)
    t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
    uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
    for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
    system.time(for (i in seq_along(t$amt)) {
   + acct <- t$acct[i]
   + x <- uhash[[acct]]
   + uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
   + }, gcFirst = TRUE)
      user  system elapsed
     0.508   0.008   0.517
    udf <- matrix(0, nrow = n_accts, ncol = 2)
    rownames(udf) <- as.character(1:n_accts)
    colnames(udf) <- c("amt", "n")
    system.time(for (i in seq_along(t$amt)) {
   + idx <- t$acct[i]
   + udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
   + }, gcFirst = TRUE)
      user  system elapsed
     1.872   0.008   1.883
  
   The loop is still going to be the problem for realistic examples.
  
   -Deepayan
 
 



 --
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390

 What is the problem you are trying to solve?




[R] Lookups in R

2007-07-04 Thread mfrumin

Hey all; I'm a beginner++ user of R, trying to use it to do some processing
of data sets of over 1M rows, and running into a snafu.  imagine that my
input is a huge table of transactions, each linked to a specific user id.  as
I run through the transactions, I need to update a separate table for the
users, but I am finding that the traditional ways of doing a table lookup
are way too slow to support this kind of operation.

i.e:

for(i in 1:100) {
   userid = transactions$userid[i];
   amt = transactions$amounts[i];
   users[users$id == userid,'amt'] = users[users$id == userid,'amt'] + amt;
}

I assume this is a linear lookup through the users table (in which there are
10's of thousands of rows), when really what I need is O(constant time), or
at worst O(log(# users)).

is there any way to manage a list of ID's (be they numeric, string, etc) and
have them efficiently mapped to some other table index?

I see the CRAN package for SQLite hashes, but that seems to be going a bit
too far.

thanks,
Mike

Intern, Oyster Card Group, Transport for London
(feel free to email back to this address, I'm posting through Nabble so I
hope it works).



Re: [R] Lookups in R

2007-07-04 Thread Peter Dalgaard
mfrumin wrote:
 Hey all; I'm a beginner++ user of R, trying to use it to do some processing
 of data sets of over 1M rows, and running into a snafu.  imagine that my
 input is a huge table of transactions, each linked to a specific user id.  as
 I run through the transactions, I need to update a separate table for the
 users, but I am finding that the traditional ways of doing a table lookup
 are way too slow to support this kind of operation.

 i.e:

 for(i in 1:100) {
userid = transactions$userid[i];
amt = transactions$amounts[i];
 users[users$id == userid,'amt'] = users[users$id == userid,'amt'] + amt;
 }

 I assume this is a linear lookup through the users table (in which there are
 10's of thousands of rows), when really what I need is O(constant time), or
 at worst O(log(# users)).

 is there any way to manage a list of ID's (be they numeric, string, etc) and
 have them efficiently mapped to some other table index?

 I see the CRAN package for SQLite hashes, but that seems to be going a bit
 too far.
   
Sometimes you need a bit of lateral thinking. I suspect that you could 
do it like this:

tbl <- with(transactions, tapply(amount, userid, sum))
users$amt <- users$amt + tbl[users$id]

one catch is that there could be users with no transactions, in which 
case you may need to replace userid by factor(userid, levels=users$id). 
None of this is tested, of course.
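
(Spelled out, equally untested, the factor() variant might be:)

tbl <- with(transactions, tapply(amount, factor(userid, levels=users$id), sum))
tbl[is.na(tbl)] <- 0             # users with no transactions contribute nothing
users$amt <- users$amt + tbl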



Re: [R] Lookups in R

2007-07-04 Thread Martin Morgan
Michael,

A hash provides constant-time access, though the resulting perl-esque
data structures (a hash of lists, e.g.) are not convenient for other
manipulations:

 n_accts <- 10^3
 n_trans <- 10^4
 t <- list()
 t$amt <- runif(n_trans)
 t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
 
 uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
 ## keys, presumably account ids
 for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
 
 system.time(for (i in seq_along(t$amt)) {
+ acct <- t$acct[i]
+ x <- uhash[[acct]]
+ uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
+ })
   user  system elapsed 
  0.264   0.000   0.262 
 udf <- data.frame(amt=0, n=rep(0L, n_accts),
+   row.names=as.character(1:n_accts))
 system.time(for (i in seq_along(t$amt)) {
+ idx <- row.names(udf)==t$acct[i]
+ udf[idx, ] <- c(udf[idx, "amt"], udf[idx, "n"]) + c(t$amt[i], 1)
+ })
   user  system elapsed 
 18.398   0.000  18.394 
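
(One untested way to flatten the hash back into a table afterwards:)

 res <- do.call(rbind, eapply(uhash, unlist))  # rows named by account id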

Peter Dalgaard [EMAIL PROTECTED] writes:

 mfrumin wrote:
 Hey all; I'm a beginner++ user of R, trying to use it to do some processing
 of data sets of over 1M rows, and running into a snafu.  imagine that my
 input is a huge table of transactions, each linked to a specific user id.  as
 I run through the transactions, I need to update a separate table for the
 users, but I am finding that the traditional ways of doing a table lookup
 are way too slow to support this kind of operation.

 i.e:

 for(i in 1:100) {
userid = transactions$userid[i];
amt = transactions$amounts[i];
 users[users$id == userid,'amt'] = users[users$id == userid,'amt'] + amt;
 }

 I assume this is a linear lookup through the users table (in which there are
 10's of thousands of rows), when really what I need is O(constant time), or
 at worst O(log(# users)).

 is there any way to manage a list of ID's (be they numeric, string, etc) and
 have them efficiently mapped to some other table index?

 I see the CRAN package for SQLite hashes, but that seems to be going a bit
 too far.
   
 Sometimes you need a bit of lateral thinking. I suspect that you could 
 do it like this:

 tbl <- with(transactions, tapply(amount, userid, sum))
 users$amt <- users$amt + tbl[users$id]

 one catch is that there could be users with no transactions, in which 
 case you may need to replace userid by factor(userid, levels=users$id). 
 None of this is tested, of course.


-- 
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org



Re: [R] Lookups in R

2007-07-04 Thread Michael Frumin
i wish it were that simple.  unfortunately the logic i have to do on 
each transaction is substantially more complicated, and involves 
referencing the existing values of the user table through a number of 
conditions.

any other thoughts on how to get better-than-linear lookup time?  
is there a recommended binary searching/sorting (e.g. B-tree) package that 
I could use to maintain my own index?

thanks,
mike

Peter Dalgaard wrote:
 mfrumin wrote:
 Hey all; I'm a beginner++ user of R, trying to use it to do some 
 processing
 of data sets of over 1M rows, and running into a snafu.  imagine that my
 input is a huge table of transactions, each linked to a specific user 
 id.  as
 I run through the transactions, I need to update a separate table for 
 the
 users, but I am finding that the traditional ways of doing a table 
 lookup
 are way too slow to support this kind of operation.

 i.e:

 for(i in 1:100) {
userid = transactions$userid[i];
amt = transactions$amounts[i];
 users[users$id == userid,'amt'] = users[users$id == userid,'amt'] + amt;
 }

 I assume this is a linear lookup through the users table (in which 
 there are
 10's of thousands of rows), when really what I need is O(constant 
 time), or
 at worst O(log(# users)).

 is there any way to manage a list of ID's (be they numeric, string, 
 etc) and
 have them efficiently mapped to some other table index?

 I see the CRAN package for SQLite hashes, but that seems to be going 
 a bit
 too far.
   
 Sometimes you need a bit of lateral thinking. I suspect that you could 
 do it like this:

 tbl <- with(transactions, tapply(amount, userid, sum))
 users$amt <- users$amt + tbl[users$id]

 one catch is that there could be users with no transactions, in which 
 case you may need to replace userid by factor(userid, 
 levels=users$id). None of this is tested, of course.




Re: [R] Lookups in R

2007-07-04 Thread Peter Dalgaard
Michael Frumin wrote:
 i wish it were that simple.  unfortunately the logic i have to do on 
 each transaction is substantially more complicated, and involves 
 referencing the existing values of the user table through a number of 
 conditions.

 any other thoughts on how to get better-than-linear performance time?  
 is there a recommended binary searching/sorting (i.e. BTree) module that 
 I could use to maintain my own index?
   
The point remains: To do anything efficient in R, you need to get rid of 
that for loop and use something vectorized. Notice that you can expand 
values from the user table into the transaction table by indexing with 
transactions$userid, or you can use a merge operation.
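
For example (untested, and assuming users$id holds the ids):

m <- match(transactions$userid, users$id)  # users-table row for each transaction
amt.so.far <- users$amt[m]                 # user values expanded per transaction
## or carry the whole user record along:
tx <- merge(transactions, users, by.x = "userid", by.y = "id")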

 thanks,
 mike

 Peter Dalgaard wrote:
   
 mfrumin wrote:
 
 Hey all; I'm a beginner++ user of R, trying to use it to do some 
 processing
 of data sets of over 1M rows, and running into a snafu.  imagine that my
 input is a huge table of transactions, each linked to a specific user 
 id.  as
 I run through the transactions, I need to update a separate table for 
 the
 users, but I am finding that the traditional ways of doing a table 
 lookup
 are way too slow to support this kind of operation.

 i.e:

 for(i in 1:100) {
userid = transactions$userid[i];
amt = transactions$amounts[i];
 users[users$id == userid,'amt'] = users[users$id == userid,'amt'] + amt;
 }

 I assume this is a linear lookup through the users table (in which 
 there are
 10's of thousands of rows), when really what I need is O(constant 
 time), or
 at worst O(log(# users)).

 is there any way to manage a list of ID's (be they numeric, string, 
 etc) and
 have them efficiently mapped to some other table index?

 I see the CRAN package for SQLite hashes, but that seems to be going 
 a bit
 too far.
   
   
 Sometimes you need a bit of lateral thinking. I suspect that you could 
 do it like this:

 tbl <- with(transactions, tapply(amount, userid, sum))
 users$amt <- users$amt + tbl[users$id]

 one catch is that there could be users with no transactions, in which 
 case you may need to replace userid by factor(userid, 
 levels=users$id). None of this is tested, of course.
 





Re: [R] Lookups in R

2007-07-04 Thread deepayan . sarkar
On 7/4/07, Martin Morgan [EMAIL PROTECTED] wrote:
 Michael,

 A hash provides constant-time access, though the resulting perl-esque
 data structures (a hash of lists, e.g.) are not convenient for other
 manipulations

  n_accts <- 10^3
  n_trans <- 10^4
  t <- list()
  t$amt <- runif(n_trans)
  t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
 
  uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
  ## keys, presumably account ids
  for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
 
  system.time(for (i in seq_along(t$amt)) {
 + acct <- t$acct[i]
 + x <- uhash[[acct]]
 + uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
 + })
    user  system elapsed
   0.264   0.000   0.262
  udf <- data.frame(amt=0, n=rep(0L, n_accts),
 +   row.names=as.character(1:n_accts))
  system.time(for (i in seq_along(t$amt)) {
 + idx <- row.names(udf)==t$acct[i]
 + udf[idx, ] <- c(udf[idx, "amt"], udf[idx, "n"]) + c(t$amt[i], 1)
 + })
    user  system elapsed
  18.398   0.000  18.394

I don't think that's a fair comparison--- much of the overhead comes
from the use of data frames and the creation of the indexing vector. I
get

 n_accts <- 10^3
 n_trans <- 10^4
 t <- list()
 t$amt <- runif(n_trans)
 t$acct <- as.character(round(runif(n_trans, 1, n_accts)))
 uhash <- new.env(hash=TRUE, parent=emptyenv(), size=n_accts)
 for (acct in as.character(1:n_accts)) uhash[[acct]] <- list(amt=0, n=0)
 system.time(for (i in seq_along(t$amt)) {
+ acct <- t$acct[i]
+ x <- uhash[[acct]]
+ uhash[[acct]] <- list(amt=x$amt + t$amt[i], n=x$n + 1)
+ }, gcFirst = TRUE)
   user  system elapsed
  0.508   0.008   0.517
 udf <- matrix(0, nrow = n_accts, ncol = 2)
 rownames(udf) <- as.character(1:n_accts)
 colnames(udf) <- c("amt", "n")
 system.time(for (i in seq_along(t$amt)) {
+ idx <- t$acct[i]
+ udf[idx, ] <- udf[idx, ] + c(t$amt[i], 1)
+ }, gcFirst = TRUE)
   user  system elapsed
  1.872   0.008   1.883

The loop is still going to be the problem for realistic examples.

-Deepayan

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.