Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Thanks to Thomas, Martin, Jim and William. Your input was very informative, and thanks for the reference to Sedgewick.

In the end, it does seem to me that all these algorithms require fast lookup of nodes by ID to access data, and that given such fast lookup, algorithms with O(n) or O(n*log(n)) efficiency are possible (depending on whether lookup time is constant or logarithmic). I believe my original algorithm achieves that.

We come back to the fact that I assumed that R environments, implemented as hash tables, would give me that fast lookup. But on my systems, their efficiency (for insert and lookup) seems to degrade quickly at several million entries; certainly much faster than either O(1) or O(log(n)).

I believe this is not due to disk access time. For example, I tested this on my desktop computer, running a pure hash insert loop: I observe 100% processor use but no disk access as the size of the hash table approaches millions of entries. I have tested this on two systems, but have not gone into the implementation of the hashed environments to look at this in detail. If others have the same (or different) experiences with hashed environments holding millions of entries, it would be very useful to know.

Barring a solution to the hashed environment speed, it seems the way to speed this algorithm up (within the confines of R) would be to move away from hash tables and towards a numerically indexed array (see the sketch after the quoted thread below).

Thanks again for all of the help,
Magnus

On 11/4/2013 8:20 PM, Thomas Lumley wrote:

On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:

Sure, I was attempting to be concise and boil it down to what I saw as the root issue, but you are right, I could have taken it a step further. So here goes.

I have a set of around 20M string pairs. A given string (say, A) can either be equivalent to another string (B) or not. If A and B occur together in the same pair, they are equivalent. But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent. I need a way to quickly determine if any two strings from my data set are equivalent or not.

Do you mean that if A,B occur together and B,C occur together, then A,B and A,C are equivalent?

Here's a function that returns a unique identifier (not well tested!), allowing for transitive relations but not circularity.

    uid <- function(x, y) {
        i <- seq_along(x)               # global index
        xy <- paste0(x, y)              # make unique identifiers
        idx <- match(xy, xy)
        repeat {
            ## transitive look-up
            y_idx <- match(y[idx], x)   # look up 'y' in 'x'
            keep <- !is.na(y_idx)
            if (!any(keep))             # no transitive relations, done!
                break
            x[idx[keep]] <- x[y_idx[keep]]
            y[idx[keep]] <- y[y_idx[keep]]
            ## create new index of values
            xy <- paste0(x, y)
            idx <- match(xy, xy)
        }
        idx
    }

Values with the same index are identical. Some tests:

    > x <- c(1, 2, 3, 4)
    > y <- c(2, 3, 5, 6)
    > uid(x, y)
    [1] 1 1 1 4
    > i <- sample(x); uid(x[i], y[i])
    [1] 1 1 3 1
    > uid(as.character(x), as.character(y))   ## character() ok
    [1] 1 1 1 4
    > uid(1:10, 1 + 1:10)
    [1] 1 1 1 1 1 1 1 1 1 1
    > uid(integer(), integer())
    integer(0)
    > x <- c(1, 2, 3)
    > y <- c(2, 3, 1)
    > uid(x, y)   ## circular!
    C-c C-c

I think this will scale well enough, but the worst-case scenario can be made to be log(longest chain), and copying can be reduced by using an index i and subsetting the original vector on each iteration.
I think you could test for circularity by checking that the updated x are not a permutation of the kept x:

    all(x[y_idx[keep]] %in% x[keep])

Martin

This problem (union-find) is discussed in Chapter 1 of Sedgewick's Algorithms. There's an algorithm given that takes linear time to build the structure, worst-case logarithmic time to query, and effectively constant average time to query (inverse-Ackermann amortized complexity).

-thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
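As a concrete version of the "numerically indexed array" idea above, here is a minimal sketch (mine, not from the thread; it assumes the pairs sit in a data frame d with string columns a and b). The point is that match() does its hashing once, in compiled code, after which every lookup is plain O(1) integer indexing rather than a per-access hash operation:

    ## Map every distinct string to a small integer once, up front.
    keys <- unique(c(d$a, d$b))   # all distinct strings (~16M in this case)
    ia   <- match(d$a, keys)      # integer id of each left-hand string
    ib   <- match(d$b, keys)      # integer id of each right-hand string

    ## An integer vector indexed by string id replaces the hashed environment;
    ## parent[k] can hold the id of the representative of string k's class.
    parent <- seq_along(keys)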
Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
There are around 16M unique values. After accounting for equivalence, the number is much smaller (I don't know how much smaller, since my program has not completed yet :-)

Yes, I meant that B and C are also equivalent. The original version was a typo.

Best,
Magnus

On 11/1/2013 3:45 PM, jim holtman wrote:

In the 20M pairs, how many unique values are there? In your statement above, "But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent.", did you mean that B and C are also equivalent?

Jim Holtman
Data Munger Guru
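As an aside, the count of distinct strings discussed above can be computed in a single pass (a sketch, assuming the pairs are in a data frame d with columns a and b):

    length(unique(c(d$a, d$b)))   # number of distinct strings across both columns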
Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
On 11/1/2013 10:12 PM, Martin Morgan wrote:

Do you mean that if A,B occur together and B,C occur together, then A,B and A,C are equivalent?

Yes, that's what I meant; sorry, typo.

I like your uid() function. It avoids looping 20M times, and the issue of circular references can be solved by ensuring that y is always (weakly) smaller than x. I can see how log(longest chain) would be the limiting case, since each round should halve the length of the longest chain. I have yet to test whether that intuition (about running time) holds in practice.

I did test the program on a different data set, and it does seem that it fails to detect equivalence on the following input:

      a b
    1 B A
    2 C B
    3 O G
    4 O M
    5 Y C
    6 Y M

Everything should be equivalent here, but:

    > uid(d$a, d$b)
    [1] 1 1 3 4 1 6

I'm looking to see what can be done about this. The problem seems to be with entries of the form:

    O-G
    O-M

which imply that all of O/M/G are equivalent, but they are not detected as such. Will consider whether there is a good way around this (one possibility is sketched after this message).

Best,
Magnus

Here's a function that returns a unique identifier (not well tested!), allowing for transitive relations but not circularity.

    uid <- function(x, y) {
        i <- seq_along(x)               # global index
        xy <- paste0(x, y)              # make unique identifiers
        idx <- match(xy, xy)
        repeat {
            ## transitive look-up
            y_idx <- match(y[idx], x)   # look up 'y' in 'x'
            keep <- !is.na(y_idx)
            if (!any(keep))             # no transitive relations, done!
                break
            x[idx[keep]] <- x[y_idx[keep]]
            y[idx[keep]] <- y[y_idx[keep]]
            ## create new index of values
            xy <- paste0(x, y)
            idx <- match(xy, xy)
        }
        idx
    }

Values with the same index are identical. Some tests:

    > x <- c(1, 2, 3, 4)
    > y <- c(2, 3, 5, 6)
    > uid(x, y)
    [1] 1 1 1 4
    > i <- sample(x); uid(x[i], y[i])
    [1] 1 1 3 1
    > uid(as.character(x), as.character(y))   ## character() ok
    [1] 1 1 1 4
    > uid(1:10, 1 + 1:10)
    [1] 1 1 1 1 1 1 1 1 1 1
    > uid(integer(), integer())
    integer(0)
    > x <- c(1, 2, 3)
    > y <- c(2, 3, 1)
    > uid(x, y)   ## circular!
    C-c C-c

I think this will scale well enough, but the worst-case scenario can be made to be log(longest chain), and copying can be reduced by using an index i and subsetting the original vector on each iteration.

I think you could test for circularity by checking that the updated x are not a permutation of the kept x:

    all(x[y_idx[keep]] %in% x[keep])

Martin

The way I do this currently is to designate the smallest (alphabetically) string in each known equivalence set as the main entry. For each pair, I therefore insert two entries into the hash table, both pointing at the main value. So assuming the input data:

    A,B
    B,C
    D,E

I would then have:

    A -> A
    B -> A
    C -> B
    D -> D
    E -> D

Except that I also follow each chain until I reach the end (key == value), and insert pointers to the main value for every value I find along the way. After doing that, I end up with:

    A -> A
    B -> A
    C -> A
    D -> D
    E -> D

And I can very quickly check equivalence, either by comparing the hash of two strings, or simply by transforming each string into its hash, and then I can use simple comparison from then on.
The code for generating the final hash table is as follows:

    h             : Empty hash table created with hash.new()
    d             : Input data
    hash.deep.get : Function that iterates through the hash table until it
                    finds a key whose value is equal to itself (until
                    hash.get(X) == X), then returns all the values in a vector

    h = hash.new()
    for ( i in 1:nrow(d) ) {
        deep.a      = hash.deep.get(h, d$a[i])
        deep.b      = hash.deep.get(h, d$b[i])
        equivalents = sort(unique(c(deep.a, deep.b)))
        equiv.id    = min(equivalents)
        for ( equivalent in equivalents ) {
            hash.put(h, equivalent, equiv.id)
        }
    }

I would so much appreciate it if there was a simpler and faster way to do this. Keeping my fingers crossed that one of the R-help geniuses who sees this is sufficiently interested to crack the problem.

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least showed the script that was creating your entries now, so that we understand the problem you are trying to solve. A bit more explanation of why you want this would be useful. This gets to the second part of my tag line: "Tell me what you want to do, not how you want to do it." There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason <zulutime@gmail.com> wrote:

Pretty much what the subject says: I used an env as the basis for a Hashtable in R, based on information that this is in fact the way environments are implemented under the hood.
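Returning to the O-G / O-M failure above: uid() only follows relations from y into x, so two pairs that share only a left-hand string never get merged. A symmetric alternative is to propagate the smallest label across each pair in both directions until nothing changes. The sketch below is mine, not from the thread; it is vectorized within each round, though long chains still need several rounds:

    ## Vectorized label propagation: every node starts as its own group;
    ## each round, both endpoints of every pair adopt the smaller label.
    lab_prop <- function(a, b) {
        nodes <- sort(unique(c(a, b)))
        ia  <- match(a, nodes)
        ib  <- match(b, nodes)
        lab <- seq_along(nodes)          # label = index of a representative node
        repeat {
            m <- pmin(lab[ia], lab[ib])  # smaller current label within each pair
            ## smallest label seen at each node over all pairs it occurs in;
            ## every node occurs in some pair, so this covers nodes 1..n in order
            upd <- as.integer(tapply(c(m, m), c(ia, ib), min))
            new <- pmin(lab, upd)
            new <- new[new]              # follow labels one extra step per round
            if (identical(new, lab))
                break
            lab <- new
        }
        setNames(lab, nodes)
    }

On the failing input above, it links everything through O, as required:

    > lab_prop(d$a, d$b)
    A B C G M O Y
    1 1 1 1 1 1 1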
Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Yes, I'm pretty familiar with igraph, but had not thought of using it for this. Interesting idea. So I presume I'd do something like (pseudocode):

    - Create an igraph object from my data set as an edgelist
    - Create a list of connected subgraphs using clusters()
    - Loop through that list of clusters to create the final data set

That might work (see the sketch after this message) ... Thanks!

Magnus

On 11/1/2013 4:52 PM, William Dunlap wrote:

Have you looked into the 'igraph' package?

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Magnus Thor Torfason
Sent: Friday, November 01, 2013 8:23 AM
To: r-help@r-project.org
Subject: Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Sure, I was attempting to be concise and boil it down to what I saw as the root issue, but you are right, I could have taken it a step further. So here goes.

I have a set of around 20M string pairs. A given string (say, A) can either be equivalent to another string (B) or not. If A and B occur together in the same pair, they are equivalent. But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent. I need a way to quickly determine if any two strings from my data set are equivalent or not.

The way I do this currently is to designate the smallest (alphabetically) string in each known equivalence set as the main entry. For each pair, I therefore insert two entries into the hash table, both pointing at the main value. So assuming the input data:

    A,B
    B,C
    D,E

I would then have:

    A -> A
    B -> A
    C -> B
    D -> D
    E -> D

Except that I also follow each chain until I reach the end (key == value), and insert pointers to the main value for every value I find along the way. After doing that, I end up with:

    A -> A
    B -> A
    C -> A
    D -> D
    E -> D

And I can very quickly check equivalence, either by comparing the hash of two strings, or simply by transforming each string into its hash, and then I can use simple comparison from then on.

The code for generating the final hash table is as follows:

    h             : Empty hash table created with hash.new()
    d             : Input data
    hash.deep.get : Function that iterates through the hash table until it
                    finds a key whose value is equal to itself (until
                    hash.get(X) == X), then returns all the values in a vector

    h = hash.new()
    for ( i in 1:nrow(d) ) {
        deep.a      = hash.deep.get(h, d$a[i])
        deep.b      = hash.deep.get(h, d$b[i])
        equivalents = sort(unique(c(deep.a, deep.b)))
        equiv.id    = min(equivalents)
        for ( equivalent in equivalents ) {
            hash.put(h, equivalent, equiv.id)
        }
    }

I would so much appreciate it if there was a simpler and faster way to do this. Keeping my fingers crossed that one of the R-help geniuses who sees this is sufficiently interested to crack the problem.

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least showed the script that was creating your entries now, so that we understand the problem you are trying to solve. A bit more explanation of why you want this would be useful. This gets to the second part of my tag line: "Tell me what you want to do, not how you want to do it." There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason <zulutime@gmail.com> wrote:

Pretty much what the subject says: I used an env as the basis for a Hashtable in R, based on information that this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it has seemed to be scaling more or less linearly, as expected. But as I went from 17 million entries to 34 million entries, the completion time has gone from 18 hours to 5 days and counting.

The keys and values are in all cases strings of equal length. One might suspect that the slow-down might have to do with the memory being swapped to disk, but from what I know about my computing environment, that should not be the case.

So my first question: Is anyone familiar with anything in the implementation of environments that would limit their use or slow them down (faster than O(n*log(n))) as the number of entries is increased?

And my second question: I realize that this is not strictly what R environments were designed for, but this is what my algorithm requires: I must go through these millions of entries, storing them in the hash table and sometimes retrieving them along the way, in a more or less random manner, which is contingent on the data I am encountering, and on the contents of the hash table at each moment. Does anyone have a good recommendation for alternatives to implement huge, fast, table-like structures in R?
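Picking up the pseudocode from Magnus's message above, here is a minimal sketch of the igraph route (mine, not from the thread; it assumes the pairs are in a data frame d with columns a and b, and uses the current igraph function names, which were graph.data.frame() and clusters() in 2013-era igraph):

    library(igraph)

    ## Vertices are the strings, edges are the pairs; equivalence classes
    ## are exactly the connected components of this undirected graph.
    g  <- graph_from_data_frame(d, directed = FALSE)
    cl <- components(g)

    ## cl$membership is an integer vector named by vertex (string):
    ## two strings are equivalent iff they map to the same component id.
    same_class <- function(s1, s2) cl$membership[s1] == cl$membership[s2]

Since the component search runs in compiled code, this should cope with tens of millions of edges, memory permitting.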
Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:

Sure, I was attempting to be concise and boil it down to what I saw as the root issue, but you are right, I could have taken it a step further. So here goes.

I have a set of around 20M string pairs. A given string (say, A) can either be equivalent to another string (B) or not. If A and B occur together in the same pair, they are equivalent. But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent. I need a way to quickly determine if any two strings from my data set are equivalent or not.

Do you mean that if A,B occur together and B,C occur together, then A,B and A,C are equivalent?

Here's a function that returns a unique identifier (not well tested!), allowing for transitive relations but not circularity.

    uid <- function(x, y) {
        i <- seq_along(x)               # global index
        xy <- paste0(x, y)              # make unique identifiers
        idx <- match(xy, xy)
        repeat {
            ## transitive look-up
            y_idx <- match(y[idx], x)   # look up 'y' in 'x'
            keep <- !is.na(y_idx)
            if (!any(keep))             # no transitive relations, done!
                break
            x[idx[keep]] <- x[y_idx[keep]]
            y[idx[keep]] <- y[y_idx[keep]]
            ## create new index of values
            xy <- paste0(x, y)
            idx <- match(xy, xy)
        }
        idx
    }

Values with the same index are identical. Some tests:

    > x <- c(1, 2, 3, 4)
    > y <- c(2, 3, 5, 6)
    > uid(x, y)
    [1] 1 1 1 4
    > i <- sample(x); uid(x[i], y[i])
    [1] 1 1 3 1
    > uid(as.character(x), as.character(y))   ## character() ok
    [1] 1 1 1 4
    > uid(1:10, 1 + 1:10)
    [1] 1 1 1 1 1 1 1 1 1 1
    > uid(integer(), integer())
    integer(0)
    > x <- c(1, 2, 3)
    > y <- c(2, 3, 1)
    > uid(x, y)   ## circular!
    C-c C-c

I think this will scale well enough, but the worst-case scenario can be made to be log(longest chain), and copying can be reduced by using an index i and subsetting the original vector on each iteration.

I think you could test for circularity by checking that the updated x are not a permutation of the kept x:

    all(x[y_idx[keep]] %in% x[keep])

Martin

This problem (union-find) is discussed in Chapter 1 of Sedgewick's Algorithms. There's an algorithm given that takes linear time to build the structure, worst-case logarithmic time to query, and effectively constant average time to query (inverse-Ackermann amortized complexity).

-thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
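For concreteness, here is a sketch of the union-find structure Thomas describes, written in R (mine, not from the thread). It uses union by size plus path halving, which is what gives the near-constant amortized query time; parent and size are assumed to live in the global environment, which is why the functions assign with <<-:

    n      <- 5L              # number of nodes (string ids 1..n)
    parent <- seq_len(n)      # each node starts as its own root
    size   <- rep(1L, n)

    uf_find <- function(i) {
        while (parent[i] != i) {
            parent[i] <<- parent[parent[i]]   # path halving
            i <- parent[i]
        }
        i
    }

    uf_union <- function(i, j) {
        ri <- uf_find(i)
        rj <- uf_find(j)
        if (ri == rj) return(invisible(NULL))
        if (size[ri] < size[rj]) { t <- ri; ri <- rj; rj <- t }
        parent[rj] <<- ri                     # attach smaller tree under larger
        size[ri]   <<- size[ri] + size[rj]
        invisible(NULL)
    }

    uf_union(1, 2); uf_union(2, 3)
    uf_find(1) == uf_find(3)   # TRUE: nodes 1 and 3 are equivalent

The caveat is that an interpreted R loop over 20M pairs would still be slow; the structure itself is the point, and it maps directly onto the integer-array idea at the head of this thread (or onto a small piece of C code).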
[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Pretty much what the subject says: I used an env as the basis for a Hashtable in R, based on information that this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it has seemed to be scaling more or less linearly, as expected. But as I went from 17 million entries to 34 million entries, the completion time has gone from 18 hours to 5 days and counting.

The keys and values are in all cases strings of equal length. One might suspect that the slow-down might have to do with the memory being swapped to disk, but from what I know about my computing environment, that should not be the case.

So my first question: Is anyone familiar with anything in the implementation of environments that would limit their use or slow them down (faster than O(n*log(n))) as the number of entries is increased?

And my second question: I realize that this is not strictly what R environments were designed for, but this is what my algorithm requires: I must go through these millions of entries, storing them in the hash table and sometimes retrieving them along the way, in a more or less random manner, which is contingent on the data I am encountering, and on the contents of the hash table at each moment. Does anyone have a good recommendation for alternatives to implement huge, fast, table-like structures in R?

Best,
Magnus
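One cheap experiment along these lines (a hypothetical sketch of mine, not from the thread): new.env() accepts a size argument, so both pre-sizing and the growth behaviour can be probed directly with a pure insert loop like the one described above:

    ## Time pure inserts into a hashed environment; if the cost per entry
    ## is roughly O(1), a 10x larger run should take roughly 10x as long.
    time_inserts <- function(n, presize = n) {
        e    <- new.env(hash = TRUE, size = presize)  # pre-sized hash table
        keys <- sprintf("k%09d", seq_len(n))          # fixed-length string keys
        system.time(for (k in keys) assign(k, k, envir = e))[["elapsed"]]
    }

    time_inserts(1e5)
    time_inserts(1e6)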
Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
It would be nice if you followed the posting guidelines and at least showed the script that was creating your entries now, so that we understand the problem you are trying to solve. A bit more explanation of why you want this would be useful. This gets to the second part of my tag line: "Tell me what you want to do, not how you want to do it." There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason <zulutime@gmail.com> wrote:

Pretty much what the subject says: I used an env as the basis for a Hashtable in R, based on information that this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it has seemed to be scaling more or less linearly, as expected. But as I went from 17 million entries to 34 million entries, the completion time has gone from 18 hours to 5 days and counting.

The keys and values are in all cases strings of equal length. One might suspect that the slow-down might have to do with the memory being swapped to disk, but from what I know about my computing environment, that should not be the case.

So my first question: Is anyone familiar with anything in the implementation of environments that would limit their use or slow them down (faster than O(n*log(n))) as the number of entries is increased?

And my second question: I realize that this is not strictly what R environments were designed for, but this is what my algorithm requires: I must go through these millions of entries, storing them in the hash table and sometimes retrieving them along the way, in a more or less random manner, which is contingent on the data I am encountering, and on the contents of the hash table at each moment. Does anyone have a good recommendation for alternatives to implement huge, fast, table-like structures in R?

Best,
Magnus
Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Sure, I was attempting to be concise and boil it down to what I saw as the root issue, but you are right, I could have taken it a step further. So here goes.

I have a set of around 20M string pairs. A given string (say, A) can either be equivalent to another string (B) or not. If A and B occur together in the same pair, they are equivalent. But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent. I need a way to quickly determine if any two strings from my data set are equivalent or not.

The way I do this currently is to designate the smallest (alphabetically) string in each known equivalence set as the main entry. For each pair, I therefore insert two entries into the hash table, both pointing at the main value. So assuming the input data:

    A,B
    B,C
    D,E

I would then have:

    A -> A
    B -> A
    C -> B
    D -> D
    E -> D

Except that I also follow each chain until I reach the end (key == value), and insert pointers to the main value for every value I find along the way. After doing that, I end up with:

    A -> A
    B -> A
    C -> A
    D -> D
    E -> D

And I can very quickly check equivalence, either by comparing the hash of two strings, or simply by transforming each string into its hash, and then I can use simple comparison from then on.

The code for generating the final hash table is as follows:

    h             : Empty hash table created with hash.new()
    d             : Input data
    hash.deep.get : Function that iterates through the hash table until it
                    finds a key whose value is equal to itself (until
                    hash.get(X) == X), then returns all the values in a vector

    h = hash.new()
    for ( i in 1:nrow(d) ) {
        deep.a      = hash.deep.get(h, d$a[i])
        deep.b      = hash.deep.get(h, d$b[i])
        equivalents = sort(unique(c(deep.a, deep.b)))
        equiv.id    = min(equivalents)
        for ( equivalent in equivalents ) {
            hash.put(h, equivalent, equiv.id)
        }
    }

I would so much appreciate it if there was a simpler and faster way to do this. Keeping my fingers crossed that one of the R-help geniuses who sees this is sufficiently interested to crack the problem.

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least showed the script that was creating your entries now, so that we understand the problem you are trying to solve. A bit more explanation of why you want this would be useful. This gets to the second part of my tag line: "Tell me what you want to do, not how you want to do it." There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason <zulutime@gmail.com> wrote:

Pretty much what the subject says: I used an env as the basis for a Hashtable in R, based on information that this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it has seemed to be scaling more or less linearly, as expected. But as I went from 17 million entries to 34 million entries, the completion time has gone from 18 hours to 5 days and counting.

The keys and values are in all cases strings of equal length. One might suspect that the slow-down might have to do with the memory being swapped to disk, but from what I know about my computing environment, that should not be the case.

So my first question: Is anyone familiar with anything in the implementation of environments that would limit their use or slow them down (faster than O(n*log(n))) as the number of entries is increased?

And my second question: I realize that this is not strictly what R environments were designed for, but this is what my algorithm requires: I must go through these millions of entries, storing them in the hash table and sometimes retrieving them along the way, in a more or less random manner, which is contingent on the data I am encountering, and on the contents of the hash table at each moment. Does anyone have a good recommendation for alternatives to implement huge, fast, table-like structures in R?

Best,
Magnus
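The helpers in Magnus's listing above are described only in prose. One possible rendering on top of a hashed environment (my interpretation of that description, not Magnus's actual code; the names hash.new, hash.put, hash.get and hash.deep.get are his):

    hash.new <- function() new.env(hash = TRUE)

    hash.put <- function(h, key, value) assign(key, value, envir = h)

    ## Unseen keys act as chain ends, so that hash.get(X) == X holds for them
    hash.get <- function(h, key)
        if (exists(key, envir = h, inherits = FALSE))
            get(key, envir = h, inherits = FALSE)
        else key

    ## Follow key -> value -> ... until a self-mapping entry is reached,
    ## returning every string visited along the way
    hash.deep.get <- function(h, key) {
        chain <- key
        repeat {
            nxt <- hash.get(h, key)
            if (identical(nxt, key))
                return(chain)
            chain <- c(chain, nxt)
            key <- nxt
        }
    }

With these definitions, running the main loop on the A,B / B,C / D,E example reproduces the final table Magnus lists (A -> A, B -> A, C -> A, D -> D, E -> D).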
Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Have you looked into the 'igraph' package?

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Magnus Thor Torfason
Sent: Friday, November 01, 2013 8:23 AM
To: r-help@r-project.org
Subject: Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Sure, I was attempting to be concise and boil it down to what I saw as the root issue, but you are right, I could have taken it a step further. So here goes.

I have a set of around 20M string pairs. A given string (say, A) can either be equivalent to another string (B) or not. If A and B occur together in the same pair, they are equivalent. But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent. I need a way to quickly determine if any two strings from my data set are equivalent or not.

The way I do this currently is to designate the smallest (alphabetically) string in each known equivalence set as the main entry. For each pair, I therefore insert two entries into the hash table, both pointing at the main value. So assuming the input data:

    A,B
    B,C
    D,E

I would then have:

    A -> A
    B -> A
    C -> B
    D -> D
    E -> D

Except that I also follow each chain until I reach the end (key == value), and insert pointers to the main value for every value I find along the way. After doing that, I end up with:

    A -> A
    B -> A
    C -> A
    D -> D
    E -> D

And I can very quickly check equivalence, either by comparing the hash of two strings, or simply by transforming each string into its hash, and then I can use simple comparison from then on.

The code for generating the final hash table is as follows:

    h             : Empty hash table created with hash.new()
    d             : Input data
    hash.deep.get : Function that iterates through the hash table until it
                    finds a key whose value is equal to itself (until
                    hash.get(X) == X), then returns all the values in a vector

    h = hash.new()
    for ( i in 1:nrow(d) ) {
        deep.a      = hash.deep.get(h, d$a[i])
        deep.b      = hash.deep.get(h, d$b[i])
        equivalents = sort(unique(c(deep.a, deep.b)))
        equiv.id    = min(equivalents)
        for ( equivalent in equivalents ) {
            hash.put(h, equivalent, equiv.id)
        }
    }

I would so much appreciate it if there was a simpler and faster way to do this. Keeping my fingers crossed that one of the R-help geniuses who sees this is sufficiently interested to crack the problem.

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least showed the script that was creating your entries now, so that we understand the problem you are trying to solve. A bit more explanation of why you want this would be useful. This gets to the second part of my tag line: "Tell me what you want to do, not how you want to do it." There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason <zulutime@gmail.com> wrote:

Pretty much what the subject says: I used an env as the basis for a Hashtable in R, based on information that this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it has seemed to be scaling more or less linearly, as expected. But as I went from 17 million entries to 34 million entries, the completion time has gone from 18 hours to 5 days and counting.

The keys and values are in all cases strings of equal length. One might suspect that the slow-down might have to do with the memory being swapped to disk, but from what I know about my computing environment, that should not be the case.

So my first question: Is anyone familiar with anything in the implementation of environments that would limit their use or slow them down (faster than O(n*log(n))) as the number of entries is increased?

And my second question: I realize that this is not strictly what R environments were designed for, but this is what my algorithm requires: I must go through these millions of entries, storing them in the hash table and sometimes retrieving them along the way, in a more or less random manner, which is contingent on the data I am encountering, and on the contents of the hash table at each moment. Does anyone have a good recommendation for alternatives to implement huge, fast, table-like structures in R?

Best,
Magnus
Re: [R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:

Sure, I was attempting to be concise and boil it down to what I saw as the root issue, but you are right, I could have taken it a step further. So here goes.

I have a set of around 20M string pairs. A given string (say, A) can either be equivalent to another string (B) or not. If A and B occur together in the same pair, they are equivalent. But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent. I need a way to quickly determine if any two strings from my data set are equivalent or not.

Do you mean that if A,B occur together and B,C occur together, then A,B and A,C are equivalent?

Here's a function that returns a unique identifier (not well tested!), allowing for transitive relations but not circularity.

    uid <- function(x, y) {
        i <- seq_along(x)               # global index
        xy <- paste0(x, y)              # make unique identifiers
        idx <- match(xy, xy)
        repeat {
            ## transitive look-up
            y_idx <- match(y[idx], x)   # look up 'y' in 'x'
            keep <- !is.na(y_idx)
            if (!any(keep))             # no transitive relations, done!
                break
            x[idx[keep]] <- x[y_idx[keep]]
            y[idx[keep]] <- y[y_idx[keep]]
            ## create new index of values
            xy <- paste0(x, y)
            idx <- match(xy, xy)
        }
        idx
    }

Values with the same index are identical. Some tests:

    > x <- c(1, 2, 3, 4)
    > y <- c(2, 3, 5, 6)
    > uid(x, y)
    [1] 1 1 1 4
    > i <- sample(x); uid(x[i], y[i])
    [1] 1 1 3 1
    > uid(as.character(x), as.character(y))   ## character() ok
    [1] 1 1 1 4
    > uid(1:10, 1 + 1:10)
    [1] 1 1 1 1 1 1 1 1 1 1
    > uid(integer(), integer())
    integer(0)
    > x <- c(1, 2, 3)
    > y <- c(2, 3, 1)
    > uid(x, y)   ## circular!
    C-c C-c

I think this will scale well enough, but the worst-case scenario can be made to be log(longest chain), and copying can be reduced by using an index i and subsetting the original vector on each iteration.

I think you could test for circularity by checking that the updated x are not a permutation of the kept x:

    all(x[y_idx[keep]] %in% x[keep])

Martin

The way I do this currently is to designate the smallest (alphabetically) string in each known equivalence set as the main entry. For each pair, I therefore insert two entries into the hash table, both pointing at the main value. So assuming the input data:

    A,B
    B,C
    D,E

I would then have:

    A -> A
    B -> A
    C -> B
    D -> D
    E -> D

Except that I also follow each chain until I reach the end (key == value), and insert pointers to the main value for every value I find along the way. After doing that, I end up with:

    A -> A
    B -> A
    C -> A
    D -> D
    E -> D

And I can very quickly check equivalence, either by comparing the hash of two strings, or simply by transforming each string into its hash, and then I can use simple comparison from then on.

The code for generating the final hash table is as follows:

    h             : Empty hash table created with hash.new()
    d             : Input data
    hash.deep.get : Function that iterates through the hash table until it
                    finds a key whose value is equal to itself (until
                    hash.get(X) == X), then returns all the values in a vector

    h = hash.new()
    for ( i in 1:nrow(d) ) {
        deep.a      = hash.deep.get(h, d$a[i])
        deep.b      = hash.deep.get(h, d$b[i])
        equivalents = sort(unique(c(deep.a, deep.b)))
        equiv.id    = min(equivalents)
        for ( equivalent in equivalents ) {
            hash.put(h, equivalent, equiv.id)
        }
    }

I would so much appreciate it if there was a simpler and faster way to do this. Keeping my fingers crossed that one of the R-help geniuses who sees this is sufficiently interested to crack the problem.

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:

It would be nice if you followed the posting guidelines and at least showed the script that was creating your entries now, so that we understand the problem you are trying to solve. A bit more explanation of why you want this would be useful. This gets to the second part of my tag line: "Tell me what you want to do, not how you want to do it." There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason <zulutime@gmail.com> wrote:

Pretty much what the subject says: I used an env as the basis for a Hashtable in R, based on information that this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it has seemed to be scaling more or less linearly, as expected. But as I went from 17 million entries to 34 million entries, the completion time has gone from 18 hours to 5 days and counting.
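To tie uid() back to the running example: applied to the A,B / B,C / D,E input that Magnus works through by hand above, it groups the first two pairs together and leaves the third apart (my run of the function above, shown as a console transcript):

    > d <- data.frame(a = c("A", "B", "D"),
    +                 b = c("B", "C", "E"), stringsAsFactors = FALSE)
    > uid(d$a, d$b)
    [1] 1 1 3

Pairs 1 and 2 share the identifier for the class {A, B, C}, and pair 3 gets its own, matching the A -> A, B -> A, C -> A, D -> D, E -> D table built by the hash-chain approach.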