Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-08 Thread Magnus Thor Torfason

Thanks to Thomas, Martin, Jim and William,

Your input was very informative, and thanks for the reference to Sedgewick.

In the end, it does seem to me that all these algorithms require fast 
lookup by ID of nodes to access data, and that conditional on such fast 
lookup, algorithms are possible with efficiency O(n) or O(n*log(n)) 
(depending on whether lookup time is constant or logarithmic). I believe 
my original algorithm achieves that.


We come back to the fact that I assumed that R environments, implemented 
as hash tables, would give me that fast lookup. But on my systems, their 
performance (for insert and lookup) seems to degrade quickly once the table 
reaches several million entries, certainly much faster than either O(1) or 
O(log(n)) would suggest. I do not believe this is due to disk access time: 
for example, when I tested this on my desktop computer running a pure hash 
insert loop, I observed 100% processor use but no disk access as the size 
of the hash table approached millions of entries.
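
For concreteness, a stripped-down version of the kind of pure insert loop 
described above might look like the following (a sketch only; the key format, 
the block size, and the insert_benchmark name are illustrative rather than the 
script that was actually run):

insert_benchmark <- function(n_total, block = 1e5) {
    e <- new.env(hash = TRUE, parent = emptyenv())
    keys <- sprintf("k%09d", seq_len(n_total))   # fixed-length string keys
    done <- 0L
    while (done < n_total) {
        idx <- (done + 1L):min(done + block, n_total)
        elapsed <- system.time(
            for (k in keys[idx]) assign(k, k, envir = e)
        )["elapsed"]
        done <- done + length(idx)
        cat(sprintf("%10d entries: %6.2f s for last block\n", done, elapsed))
    }
    invisible(e)
}

## e.g. insert_benchmark(2e6): if the per-block times grow as the table fills,
## the degradation is in the environment itself rather than in the algorithm.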


I have tested this on two systems, but have not gone into the 
implementation of the hashed environments to look at this in detail. If 
others have had the same (or different) experience using hashed 
environments with millions of entries, it would be very useful to know.


Barring a solution to the hashed environment slowdown, it seems the way to 
speed this algorithm up (within the confines of R) would be to move away 
from hash tables and towards a numerically indexed array.
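
One way to do that (a sketch of the idea only; the variable names are 
illustrative, and the pairs are assumed to sit in a data frame d with 
character columns a and b, as elsewhere in the thread) is to map every 
string to an integer once, and then keep the points-to structure in a plain 
integer vector, where lookup and update are constant-time indexing:

nodes  <- sort(unique(c(d$a, d$b)))   # all distinct strings
a_id   <- match(d$a, nodes)           # string-to-integer mapping, done once
b_id   <- match(d$b, nodes)
parent <- seq_along(nodes)            # parent[i] == i marks a "main entry"

parent[b_id[1]] <- a_id[1]            # e.g. record that the strings of pair 1 are equivalent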


Thanks again for all of the help,
Magnus

On 11/4/2013 8:20 PM, Thomas Lumley wrote:

On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:

Sure,

I was attempting to be concise and boiling it down to what I saw
as the root
issue, but you are right, I could have taken it a step further.
So here goes.

I have a set of around 20M string pairs. A given string
(say, A) can
either be equivalent to another string (B) or not. If A and B
occur together in
the same pair, they are equivalent. But equivalence is
transitive, so if A and B
occur together in one pair, and A and C occur together in
another pair, then A
and C are also equivalent. I need a way to quickly determine if
any two strings
from my data set are equivalent or not.


Do you mean that if A,B occur together and B,C occur together, then
A,B and A,C are equivalent?

Here's a function that returns a unique identifier (not well
tested!), allowing for transitive relations but not circularity.

uid <- function(x, y)
{
    i <- seq_along(x)   # global index
    xy <- paste0(x, y)  # make unique identifiers
    idx <- match(xy, xy)

    repeat {
        ## transitive look-up
        y_idx <- match(y[idx], x)   # look up 'y' in 'x'
        keep <- !is.na(y_idx)
        if (!any(keep))             # no transitive relations, done!
            break
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]

        ## create new index of values
        xy <- paste0(x, y)
        idx <- match(xy, xy)
    }
    idx
}

Values with the same index are identical. Some tests

  x <- c(1, 2, 3, 4)
  y <- c(2, 3, 5, 6)
  uid(x, y)
 [1] 1 1 1 4
  i <- sample(x); uid(x[i], y[i])
 [1] 1 1 3 1
  uid(as.character(x), as.character(y))  ## character() ok
 [1] 1 1 1 4
  uid(1:10, 1 + 1:10)
  [1] 1 1 1 1 1 1 1 1 1 1
  uid(integer(), integer())
 integer(0)
  x <- c(1, 2, 3)
  y <- c(2, 3, 1)
  uid(x, y)  ## circular!
   C-c C-c

I think this will scale well enough, but the worst-case scenario can
be made to be log(longest chain) and copying can be reduced by using
an index i and subsetting the original vector on each iteration. I
think you could test for circularity by checking that the updated x
are not a permutation of the kept x, all(x[y_idx[keep]] %in% x[keep])

Martin



This problem (union-find) is discussed in Chapter 1 of Sedgewick's
Algorithms.  There's an algorithm given that takes linear time to
build the structure, worst-case logarithmic time to query, and
effectively constant average time to query (inverse-Ackermann amortized
complexity).

-thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
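
A minimal R sketch of the union-find structure described above (weighted union 
plus path compression; the uf_* names are made up here, and strings are assumed 
to have been mapped to the integers 1..n beforehand, e.g. with match() against 
the vector of unique strings):

uf_new <- function(n) list(parent = seq_len(n), size = rep(1L, n))

uf_find <- function(uf, i) {
    root <- i
    while (uf$parent[root] != root) root <- uf$parent[root]
    while (uf$parent[i] != root) {           # path compression
        nxt <- uf$parent[i]; uf$parent[i] <- root; i <- nxt
    }
    list(uf = uf, root = root)
}

uf_union <- function(uf, i, j) {
    fi <- uf_find(uf, i); uf <- fi$uf
    fj <- uf_find(uf, j); uf <- fj$uf
    if (fi$root != fj$root) {
        a <- fi$root; b <- fj$root
        if (uf$size[a] > uf$size[b]) { tmp <- a; a <- b; b <- tmp }
        uf$parent[a] <- b                    # hang the smaller tree under the larger
        uf$size[b]   <- uf$size[b] + uf$size[a]
    }
    uf
}

## usage sketch, for pairs in a data frame d with columns a and b:
## nodes <- unique(c(d$a, d$b))
## a_id  <- match(d$a, nodes); b_id <- match(d$b, nodes)
## uf <- uf_new(length(nodes))
## for (k in seq_along(a_id)) uf <- uf_union(uf, a_id[k], b_id[k])
## two strings are equivalent iff uf_find() returns the same root for both.

Because R copies on modification, the structure has to be passed in and 
returned explicitly; the sketch is about the logic, not the constant factors.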



Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-04 Thread Magnus Thor Torfason
There are around 16M unique values. After accounting for equivalence, 
the number is much smaller (I don't know how much smaller, since my 
program has not completed yet :-)


Yes, I meant that B and C are also equivalent. The original version 
was a typo.


Best,
Magnus

On 11/1/2013 3:45 PM, jim holtman wrote:

in the 20M pairs, how many unique values are there?  In your statement
above, "But equivalence is transitive, so if A and B occur together in
one pair, and A and C occur together in another pair, then A and C are
also equivalent.", did you mean that B and C are also equivalent?

Jim Holtman
Data Munger Guru





Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-04 Thread Magnus Thor Torfason



On 11/1/2013 10:12 PM, Martin Morgan wrote:

Do you mean that if A,B occur together and B,C occur together, then A,B
and A,C are equivalent?


Yes, that's what I meant, sorry, typo.

I like your uid() function. It avoids the 20M times loop, and the issue 
of circular references can be solved by ensuring that y is always 
(weakly) smaller than x.


I can see how the log(longest chain) would be the limiting case, since 
each round should always halve the length of the longest chain.


I have yet to test whether that intuition (about running time) works in 
practice.


I did test the program on a different data set, and it does seem that it 
fails to detect equivalence on the following input:


  a b
1 B A
2 C B
3 O G
4 O M
5 Y C
6 Y M

Everything should be equivalent here, but

 uid(d$a, d$b)
[1] 1 1 3 4 1 6

I'm looking to see what can be done about this. The problem seems to be with
entries of the form:

O-G
O-M

Which imply that all of O/M/G are equivalent, but they are not detected 
as such. Will consider whether there is a good way around this.
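
For what it's worth, a quick cross-check with the igraph package (suggested 
further down in the thread) puts every string in this example into a single 
connected component; a small sketch, assuming igraph is installed:

library(igraph)
d <- data.frame(a = c("B", "C", "O", "O", "Y", "Y"),
                b = c("A", "B", "G", "M", "C", "M"),
                stringsAsFactors = FALSE)
g <- graph.data.frame(d, directed = FALSE)
clusters(g)$membership   # a single component id for every string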


Best,
Magnus


Here's a function that returns a unique identifier (not well tested!),
allowing for transitive relations but not circularity.

uid <- function(x, y)
{
    i <- seq_along(x)   # global index
    xy <- paste0(x, y)  # make unique identifiers
    idx <- match(xy, xy)

    repeat {
        ## transitive look-up
        y_idx <- match(y[idx], x)   # look up 'y' in 'x'
        keep <- !is.na(y_idx)
        if (!any(keep))             # no transitive relations, done!
            break
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]

        ## create new index of values
        xy <- paste0(x, y)
        idx <- match(xy, xy)
    }
    idx
}

Values with the same index are identical. Some tests

  x <- c(1, 2, 3, 4)
  y <- c(2, 3, 5, 6)
  uid(x, y)
 [1] 1 1 1 4
  i <- sample(x); uid(x[i], y[i])
 [1] 1 1 3 1
  uid(as.character(x), as.character(y))  ## character() ok
 [1] 1 1 1 4
  uid(1:10, 1 + 1:10)
  [1] 1 1 1 1 1 1 1 1 1 1
  uid(integer(), integer())
 integer(0)
  x <- c(1, 2, 3)
  y <- c(2, 3, 1)
  uid(x, y)  ## circular!
   C-c C-c

I think this will scale well enough, but the worst-case scenario can be
made to be log(longest chain) and copying can be reduced by using an
index i and subsetting the original vector on each iteration. I think
you could test for circularity by checking that the updated x are not a
permutation of the kept x, all(x[y_idx[keep]] %in% x[keep])

Martin



The way I do this currently is to designate the smallest
(alphabetically) string
in each known equivalence set as the main entry. For each pair, I
therefore
insert two entries into the hash table, both pointing at the main
value. So
assuming the input data:

A,B
B,C
D,E

I would then have:

A->A
B->A
C->B
D->D
E->D

Except that I also follow each chain until I reach the end
(key==value), and
insert pointers to the main value for every value I find along the
way. After
doing that, I end up with:

A->A
B->A
C->A
D->D
E->D

And I can very quickly check equivalence, either by comparing the hash
of two
strings, or simply by transforming each string into its hash, and then
I can use
simple comparison from then on. The code for generating the final hash
table is
as follows:

h : Empty hash table created with hash.new()
d : Input data
hash.deep.get : Function that iterates through the hash table until it
finds a
key whose value is equal to itself (until hash.get(X)==X), then
returns all the
values in a vector


h = hash.new()
for ( i in 1:nrow(d) )
{
 deep.a  = hash.deep.get(h, d$a[i])
 deep.b  = hash.deep.get(h, d$b[i])
 equivalents = sort(unique(c(deep.a,deep.b)))
 equiv.id = min(equivalents)
 for ( equivalent in equivalents )
 {
 hash.put(h, equivalent, equiv.id)
 }
}


I would so much appreciate if there was a simpler and faster way to do
this.
Keeping my fingers crossed that one of the R-help geniuses who sees
this is
sufficiently interested to crack the problem

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least
showed the script that was creating your entries now so that we
understand the problem you are trying to solve.  A bit more
explanation of why you want this would be useful.  This gets to the
second part of my tag line:  Tell me what you want to do, not how you
want to do it.  There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
zulutime@gmail.com wrote:

Pretty much what the subject says:

I used an env as the basis for a Hashtable in R, based on information
that this is in fact the way environments are implemented under the hood.

Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-04 Thread Magnus Thor Torfason
Yes, I'm pretty familiar with igraph, but had not thought of using it 
for this. Interesting idea.


So I presume I'd do something like (pseudocode):

:Create igraph object from my data set as an edgelist
:Create a list of connected subgraphs using clusters()
:Loop through that list of clusters to create final data set

That might work ...
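
Something along those lines could be (a sketch only, untested at the 20M-pair 
scale; equivalence_ids is a made-up name, and d is assumed to be a data frame 
with character columns a and b):

library(igraph)

equivalence_ids <- function(d) {
    g  <- graph.data.frame(d[, c("a", "b")], directed = FALSE)
    cl <- clusters(g)              # connected components = equivalence sets
    nodes <- V(g)$name             # one entry per unique string
    ## use the alphabetically smallest member of each component as its id,
    ## mirroring the "main entry" convention used elsewhere in the thread
    main <- tapply(nodes, cl$membership, min)
    setNames(main[as.character(cl$membership)], nodes)
}

## equivalence_ids(data.frame(a = c("B","C","O","O","Y","Y"),
##                            b = c("A","B","G","M","C","M"),
##                            stringsAsFactors = FALSE))
## should map every string, including O, G and M, to "A".

The component search runs inside igraph's compiled code, so the millions of 
per-entry environment lookups disappear.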

Thanks!

Magnus

On 11/1/2013 4:52 PM, William Dunlap wrote:

Have you looked into the 'igraph' package?

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com



-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf
Of Magnus Thor Torfason
Sent: Friday, November 01, 2013 8:23 AM
To: r-help@r-project.org
Subject: Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries 
taking 5+
days

Sure,

I was attempting to be concise and boiling it down to what I saw as the
root issue, but you are right, I could have taken it a step further. So
here goes.

I have a set of around 20M string pairs. A given string (say, A)
can either be equivalent to another string (B) or not. If A and B occur
together in the same pair, they are equivalent. But equivalence is
transitive, so if A and B occur together in one pair, and A and C occur
together in another pair, then A and C are also equivalent. I need a way
to quickly determine if any two strings from my data set are equivalent
or not.

The way I do this currently is to designate the smallest
(alphabetically) string in each known equivalence set as the main
entry. For each pair, I therefore insert two entries into the hash
table, both pointing at the main value. So assuming the input data:

A,B
B,C
D,E

I would then have:

A->A
B->A
C->B
D->D
E->D

Except that I also follow each chain until I reach the end (key==value),
and insert pointers to the main value for every value I find along the
way. After doing that, I end up with:

A->A
B->A
C->A
D->D
E->D

And I can very quickly check equivalence, either by comparing the hash
of two strings, or simply by transforming each string into its hash, and
then I can use simple comparison from then on. The code for generating
the final hash table is as follows:

h : Empty hash table created with hash.new()
d : Input data
hash.deep.get : Function that iterates through the hash table until it
finds a key whose value is equal to itself (until hash.get(X)==X), then
returns all the values in a vector


h = hash.new()
for ( i in 1:nrow(d) )
{
  deep.a  = hash.deep.get(h, d$a[i])
  deep.b  = hash.deep.get(h, d$b[i])
  equivalents = sort(unique(c(deep.a,deep.b)))
  equiv.id = min(equivalents)
  for ( equivalent in equivalents )
  {
  hash.put(h, equivalent, equiv.id)
  }
}


I would so much appreciate if there was a simpler and faster way to do
this. Keeping my fingers crossed that one of the R-help geniuses who
sees this is sufficiently interested to crack the problem

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least
showed the script that was creating your entries now so that we
understand the problem you are trying to solve.  A bit more
explanation of why you want this would be useful.  This gets to the
second part of my tag line:  Tell me what you want to do, not how you
want to do it.  There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
zulutime@gmail.com wrote:

Pretty much what the subject says:

I used an env as the basis for a Hashtable in R, based on information that
this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it
has seemed to be scaling more or less linearly, as expected.

But as I went from 17 million entries to 34 million entries, the completion
time has gone from 18 hours, to 5 days and counting.


The keys and values are in all cases strings of equal length.

One might suspect that the slow-down might have to do with the memory being
swapped to disk, but from what I know about my computing environment, that
should not be the case.

So my first question:
Is anyone familiar with anything in the implementation of environments that
would limit their use or slow them down (faster than O(n*log(n)) as the
number of entries is increased)?

And my second question:
I realize that this is not strictly what R environments were designed for,
but this is what my algorithm requires: I must go through these millions of
entries, storing them in the hash table and sometimes retrieving them along
the way, in a more or less random manner, which is contingent on the data I
am encountering, and on the contents of the hash table at each moment.

Does anyone have a good recommendation for alternatives to implement huge,
fast, table-like structures in R?

Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-04 Thread Thomas Lumley
On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan mtmor...@fhcrc.org wrote:

 On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:

 Sure,

 I was attempting to be concise and boiling it down to what I saw as the
 root
 issue, but you are right, I could have taken it a step further. So here
 goes.

 I have a set of around 20M string pairs. A given string (say, A)
 can
 either be equivalent to another string (B) or not. If A and B occur
 together in
 the same pair, they are equivalent. But equivalence is transitive, so if
 A and B
 occur together in one pair, and A and C occur together in another pair,
 then A
 and C are also equivalent. I need a way to quickly determine if any two
 strings
 from my data set are equivalent or not.


 Do you mean that if A,B occur together and B,C occur together, then A,B
 and A,C are equivalent?

 Here's a function that returns a unique identifier (not well tested!),
 allowing for transitive relations but not circularity.

 uid <- function(x, y)
 {
     i <- seq_along(x)   # global index
     xy <- paste0(x, y)  # make unique identifiers
     idx <- match(xy, xy)

     repeat {
         ## transitive look-up
         y_idx <- match(y[idx], x)   # look up 'y' in 'x'
         keep <- !is.na(y_idx)
         if (!any(keep))             # no transitive relations, done!
             break
         x[idx[keep]] <- x[y_idx[keep]]
         y[idx[keep]] <- y[y_idx[keep]]

         ## create new index of values
         xy <- paste0(x, y)
         idx <- match(xy, xy)
     }
     idx
 }

 Values with the same index are identical. Some tests

   x <- c(1, 2, 3, 4)
   y <- c(2, 3, 5, 6)
   uid(x, y)
  [1] 1 1 1 4
   i <- sample(x); uid(x[i], y[i])
  [1] 1 1 3 1
   uid(as.character(x), as.character(y))  ## character() ok
  [1] 1 1 1 4
   uid(1:10, 1 + 1:10)
   [1] 1 1 1 1 1 1 1 1 1 1
   uid(integer(), integer())
  integer(0)
   x <- c(1, 2, 3)
   y <- c(2, 3, 1)
   uid(x, y)  ## circular!
    C-c C-c

 I think this will scale well enough, but the worst-case scenario can be
 made to be log(longest chain) and copying can be reduced by using an index
 i and subsetting the original vector on each iteration. I think you could
 test for circularity by checking that the updated x are not a permutation
 of the kept x, all(x[y_idx[keep]] %in% x[keep])

 Martin



This problem (union-find) is discussed in Chapter 1 of Sedgewick's
Algorithms.  There's an algorithm given that takes linear time to build
the structure, worst-case logarithmic time to query, and effectively
constant average time to query (inverse-Ackermann amortized complexity).

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland




[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-01 Thread Magnus Thor Torfason

Pretty much what the subject says:

I used an env as the basis for a Hashtable in R, based on information 
that this is in fact the way environments are implemented under the hood.


I've been experimenting with doubling the number of entries, and so far 
it has seemed to be scaling more or less linearly, as expected.


But as I went from 17 million entries to 34 million entries, the 
completion time has gone from 18 hours, to 5 days and counting.



The keys and values are in all cases strings of equal length.

One might suspect that the slow-down might have to do with the memory 
being swapped to disk, but from what I know about my computing 
environment, that should not be the case.


So my first question:
Is anyone familiar with anything in the implementation of environments 
that would limit their use or slow them down (faster than O(n*log(n)) as
the number of entries is increased)?


And my second question:
I realize that this is not strictly what R environments were designed 
for, but this is what my algorithm requires: I must go through these 
millions of entries, storing them in the hash table and sometimes 
retrieving them along the way, in a more or less random manner, which is 
contingent on the data I am encountering, and on the contents of the 
hash table at each moment.


Does anyone have a good recommendation for alternatives to implement 
huge, fast, table-like structures in R?


Best,
Magnus



Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-01 Thread jim holtman
It would be nice if you followed the posting guidelines and at least
showed the script that was creating your entries now so that we
understand the problem you are trying to solve.  A bit more
explanation of why you want this would be useful.  This gets to the
second part of my tag line:  Tell me what you want to do, not how you
want to do it.  There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
zulutime@gmail.com wrote:
 Pretty much what the subject says:

 I used an env as the basis for a Hashtable in R, based on information that
 this is in fact the way environments are implemented under the hood.

 I've been experimenting with doubling the number of entries, and so far it
 has seemed to be scaling more or less linearly, as expected.

 But as I went from 17 million entries to 34 million entries, the completion
 time has gone from 18 hours, to 5 days and counting.


 The keys and values are in all cases strings of equal length.

 One might suspect that the slow-down might have to do with the memory being
 swapped to disk, but from what I know about my computing environment, that
 should not be the case.

 So my first question:
 Is anyone familiar with anything in the implementation of environments that
 would limit their use or slow them down (faster than O(n*log(n)) as the
 number of entries is increased)?

 And my second question:
 I realize that this is not strictly what R environments were designed for,
 but this is what my algorithm requires: I must go through these millions of
 entries, storing them in the hash table and sometimes retrieving them along
 the way, in a more or less random manner, which is contingent on the data I
 am encountering, and on the contents of the hash table at each moment.

 Does anyone have a good recommendation for alternatives to implement huge,
 fast, table-like structures in R?

 Best,
 Magnus



Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-01 Thread Magnus Thor Torfason

Sure,

I was attempting to be concise and boiling it down to what I saw as the 
root issue, but you are right, I could have taken it a step further. So 
here goes.


I have a set of around 20M string pairs. A given string (say, A) 
can either be equivalent to another string (B) or not. If A and B occur 
together in the same pair, they are equivalent. But equivalence is 
transitive, so if A and B occur together in one pair, and A and C occur 
together in another pair, then A and C are also equivalent. I need a way 
to quickly determine if any two strings from my data set are equivalent 
or not.


The way I do this currently is to designate the smallest 
(alphabetically) string in each known equivalence set as the main 
entry. For each pair, I therefore insert two entries into the hash 
table, both pointing at the main value. So assuming the input data:


A,B
B,C
D,E

I would then have:

A->A
B->A
C->B
D->D
E->D

Except that I also follow each chain until I reach the end (key==value), 
and insert pointers to the main value for every value I find along the 
way. After doing that, I end up with:


A->A
B->A
C->A
D->D
E->D

And I can very quickly check equivalence, either by comparing the hash 
of two strings, or simply by transforming each string into its hash, and 
then I can use simple comparison from then on. The code for generating 
the final hash table is as follows:


h : Empty hash table created with hash.new()
d : Input data
hash.deep.get : Function that iterates through the hash table until it 
finds a key whose value is equal to itself (until hash.get(X)==X), then 
returns all the values in a vector



h = hash.new()
for ( i in 1:nrow(d) )
{
deep.a  = hash.deep.get(h, d$a[i])
deep.b  = hash.deep.get(h, d$b[i])
equivalents = sort(unique(c(deep.a,deep.b)))
equiv.id = min(equivalents)
for ( equivalent in equivalents )
{
hash.put(h, equivalent, equiv.id)
}
}
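
The hash.* helpers are described above but not shown; a minimal 
environment-backed version of them (an assumption about their shape, with 
hash.contains added for the sketch) might look like this:

hash.new <- function() new.env(hash = TRUE, parent = emptyenv())

hash.put <- function(h, key, value) assign(key, value, envir = h)

hash.get <- function(h, key) get(key, envir = h, inherits = FALSE)

hash.contains <- function(h, key) exists(key, envir = h, inherits = FALSE)

## follow key -> value -> ... until a self-mapping (or missing) entry,
## returning every string seen along the way
hash.deep.get <- function(h, key) {
    seen <- character(0)
    repeat {
        seen <- c(seen, key)
        if (!hash.contains(h, key)) return(seen)
        nxt <- hash.get(h, key)
        if (identical(nxt, key)) return(seen)
        key <- nxt
    }
}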


I would so much appreciate if there was a simpler and faster way to do 
this. Keeping my fingers crossed that one of the R-help geniuses who 
sees this is sufficiently interested to crack the problem


Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least
showed the script that was creating your entries now so that we
understand the problem you are trying to solve.  A bit more
explanation of why you want this would be useful.  This gets to the
second part of my tag line:  Tell me what you want to do, not how you
want to do it.  There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
zulutime@gmail.com wrote:

Pretty much what the subject says:

I used an env as the basis for a Hashtable in R, based on information that
this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it
has seemed to be scaling more or less linearly, as expected.

But as I went from 17 million entries to 34 million entries, the completion
time has gone from 18 hours, to 5 days and counting.


The keys and values are in all cases strings of equal length.

One might suspect that the slow-down might have to do with the memory being
swapped to disk, but from what I know about my computing environment, that
should not be the case.

So my first question:
Is anyone familiar with anything in the implementation of environments that
would limit their use or slow them down (faster than O(n*log(n)) as the
number of entries is increased)?

And my second question:
I realize that this is not strictly what R environments were designed for,
but this is what my algorithm requires: I must go through these millions of
entries, storing them in the hash table and sometimes retrieving them along
the way, in a more or less random manner, which is contingent on the data I
am encountering, and on the contents of the hash table at each moment.

Does anyone have a good recommendation for alternatives to implement huge,
fast, table-like structures in R?

Best,
Magnus



Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-01 Thread William Dunlap
Have you looked into the 'igraph' package?

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


 -----Original Message-----
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
 Behalf
 Of Magnus Thor Torfason
 Sent: Friday, November 01, 2013 8:23 AM
 To: r-help@r-project.org
 Subject: Re: [R] Inserting 17M entries into env took 18h, inserting 34M 
 entries taking 5+
 days
 
 Sure,
 
 I was attempting to be concise and boiling it down to what I saw as the
 root issue, but you are right, I could have taken it a step further. So
 here goes.
 
  I have a set of around 20M string pairs. A given string (say, A)
 can either be equivalent to another string (B) or not. If A and B occur
 together in the same pair, they are equivalent. But equivalence is
 transitive, so if A and B occur together in one pair, and A and C occur
 together in another pair, then A and C are also equivalent. I need a way
 to quickly determine if any two strings from my data set are equivalent
 or not.
 
 The way I do this currently is to designate the smallest
 (alphabetically) string in each known equivalence set as the main
 entry. For each pair, I therefore insert two entries into the hash
  table, both pointing at the main value. So assuming the input data:
 
 A,B
 B,C
 D,E
 
 I would then have:
 
  A->A
  B->A
  C->B
  D->D
  E->D
 
 Except that I also follow each chain until I reach the end (key==value),
 and insert pointers to the main value for every value I find along the
 way. After doing that, I end up with:
 
  A->A
  B->A
  C->A
  D->D
  E->D
 
 And I can very quickly check equivalence, either by comparing the hash
 of two strings, or simply by transforming each string into its hash, and
 then I can use simple comparison from then on. The code for generating
 the final hash table is as follows:
 
 h : Empty hash table created with hash.new()
 d : Input data
 hash.deep.get : Function that iterates through the hash table until it
 finds a key whose value is equal to itself (until hash.get(X)==X), then
 returns all the values in a vector
 
 
 h = hash.new()
 for ( i in 1:nrow(d) )
 {
  deep.a  = hash.deep.get(h, d$a[i])
  deep.b  = hash.deep.get(h, d$b[i])
  equivalents = sort(unique(c(deep.a,deep.b)))
   equiv.id = min(equivalents)
  for ( equivalent in equivalents )
  {
  hash.put(h, equivalent, equiv.id)
  }
 }
 
 
 I would so much appreciate if there was a simpler and faster way to do
 this. Keeping my fingers crossed that one of the R-help geniuses who
 sees this is sufficiently interested to crack the problem
 
 Best,
 Magnus
 
 On 11/1/2013 1:49 PM, jim holtman wrote:
  It would be nice if you followed the posting guidelines and at least
  showed the script that was creating your entries now so that we
  understand the problem you are trying to solve.  A bit more
  explanation of why you want this would be useful.  This gets to the
  second part of my tag line:  Tell me what you want to do, not how you
  want to do it.  There may be other solutions to your problem.
 
  Jim Holtman
  Data Munger Guru
 
  What is the problem that you are trying to solve?
  Tell me what you want to do, not how you want to do it.
 
 
  On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
  zulutime@gmail.com wrote:
  Pretty much what the subject says:
 
  I used an env as the basis for a Hashtable in R, based on information that
  this is in fact the way environments are implemented under the hood.
 
  I've been experimenting with doubling the number of entries, and so far it
  has seemed to be scaling more or less linearly, as expected.
 
  But as I went from 17 million entries to 34 million entries, the completion
  time has gone from 18 hours, to 5 days and counting.
 
 
  The keys and values are in all cases strings of equal length.
 
  One might suspect that the slow-down might have to do with the memory being
  swapped to disk, but from what I know about my computing environment, that
  should not be the case.
 
  So my first question:
  Is anyone familiar with anything in the implementation of environments that
  would limit their use or slow them down (faster than O(n*log(n)) as the
  number of entries is increased)?
 
  And my second question:
  I realize that this is not strictly what R environments were designed for,
  but this is what my algorithm requires: I must go through these millions of
  entries, storing them in the hash table and sometimes retrieving them along
  the way, in a more or less random manner, which is contingent on the data I
  am encountering, and on the contents of the hash table at each moment.
 
  Does anyone have a good recommendation for alternatives to implement huge,
  fast, table-like structures in R?
 
  Best,
  Magnus
 

Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

2013-11-01 Thread Martin Morgan

On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:

Sure,

I was attempting to be concise and boiling it down to what I saw as the root
issue, but you are right, I could have taken it a step further. So here goes.

I have a set of around 20M string pairs. A given string (say, A) can
either be equivalent to another string (B) or not. If A and B occur together in
the same pair, they are equivalent. But equivalence is transitive, so if A and B
occur together in one pair, and A and C occur together in another pair, then A
and C are also equivalent. I need a way to quickly determine if any two strings
from my data set are equivalent or not.


Do you mean that if A,B occur together and B,C occur together, then A,B and A,C 
are equivalent?


Here's a function that returns a unique identifier (not well tested!), allowing 
for transitive relations but not circularity.


uid <- function(x, y)
{
    i <- seq_along(x)   # global index
    xy <- paste0(x, y)  # make unique identifiers
    idx <- match(xy, xy)

    repeat {
        ## transitive look-up
        y_idx <- match(y[idx], x)   # look up 'y' in 'x'
        keep <- !is.na(y_idx)
        if (!any(keep))             # no transitive relations, done!
            break
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]

        ## create new index of values
        xy <- paste0(x, y)
        idx <- match(xy, xy)
    }
    idx
}

Values with the same index are identical. Some tests

 x <- c(1, 2, 3, 4)
 y <- c(2, 3, 5, 6)
 uid(x, y)
[1] 1 1 1 4
 i <- sample(x); uid(x[i], y[i])
[1] 1 1 3 1
 uid(as.character(x), as.character(y))  ## character() ok
[1] 1 1 1 4
 uid(1:10, 1 + 1:10)
 [1] 1 1 1 1 1 1 1 1 1 1
 uid(integer(), integer())
integer(0)
 x <- c(1, 2, 3)
 y <- c(2, 3, 1)
 uid(x, y)  ## circular!
  C-c C-c

I think this will scale well enough, but the worst-case scenario can be made to 
be log(longest chain) and copying can be reduced by using an index i and 
subsetting the original vector on each iteration. I think you could test for 
circularity by checking that the updated x are not a permutation of the kept x, 
all(x[y_idx[keep]] %in% x[keep])


Martin



The way I do this currently is to designate the smallest (alphabetically) string
in each known equivalence set as the main entry. For each pair, I therefore
insert two entries into the hash table, both pointing at the main value. So
assuming the input data:

A,B
B,C
D,E

I would then have:

A->A
B->A
C->B
D->D
E->D

Except that I also follow each chain until I reach the end (key==value), and
insert pointers to the main value for every value I find along the way. After
doing that, I end up with:

A->A
B->A
C->A
D->D
E->D

And I can very quickly check equivalence, either by comparing the hash of two
strings, or simply by transforming each string into its hash, and then I can use
simple comparison from then on. The code for generating the final hash table is
as follows:

h : Empty hash table created with hash.new()
d : Input data
hash.deep.get : Function that iterates through the hash table until it finds a
key whose value is equal to itself (until hash.get(X)==X), then returns all the
values in a vector


h = hash.new()
for ( i in 1:nrow(d) )
{
 deep.a  = hash.deep.get(h, d$a[i])
 deep.b  = hash.deep.get(h, d$b[i])
 equivalents = sort(unique(c(deep.a,deep.b)))
 equiv.id = min(equivalents)
 for ( equivalent in equivalents )
 {
 hash.put(h, equivalent, equiv.id)
 }
}


I would so much appreciate if there was a simpler and faster way to do this.
Keeping my fingers crossed that one of the R-help geniuses who sees this is
sufficiently interested to crack the problem

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least
showed the script that was creating your entries now so that we
understand the problem you are trying to solve.  A bit more
explanation of why you want this would be useful.  This gets to the
second part of my tag line:  Tell me what you want to do, not how you
want to do it.  There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
zulutime@gmail.com wrote:

Pretty much what the subject says:

I used an env as the basis for a Hashtable in R, based on information that
this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it
has seemed to be scaling more or less linearly, as expected.

But as I went from 17 million entries to 34 million entries, the completion
time has gone from 18 hours, to 5 days and counting.