Re: [R] Fast multiple match function
Hi Jeff,

Indeed the data.table package does provide a much cleaner way to achieve the same functionality, and a lot of other functionality as a bonus. Thanks for letting me know about it.

On Tue, 7 Apr 2015 at 15:41 Jeff Newmiller jdnew...@dcn.davis.ca.us wrote:

> You might find the data.table package helpful. It uses an index sorted
> with a radix sort and minimizes moving the data around in memory.
Re: [R] Fast multiple match function
Hi Keshav,

findMatches() in the S4Vectors/IRanges packages (Bioconductor) I think does what you want:

    library(IRanges)
    y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
    x <- c(unique(y), 999L)
    hits <- findMatches(x, y)

Then:

    > hits
    Hits object with 9 hits and 0 metadata columns:
          queryHits subjectHits
          <integer>   <integer>
      [1]         1           1
      [2]         2           2
      [3]         3           3
      [4]         3           9
      [5]         4           4
      [6]         4           5
      [7]         4           8
      [8]         5           6
      [9]         6           7
      -------
      queryLength: 7
      subjectLength: 9

The Hits object can be turned into a list with:

    > as.list(hits)
    [[1]]
    [1] 1

    [[2]]
    [1] 2

    [[3]]
    [1] 3 9

    [[4]]
    [1] 4 5 8

    [[5]]
    [1] 6

    [[6]]
    [1] 7

    [[7]]
    integer(0)

H.

    > sessionInfo()
    R version 3.2.0 beta (2015-04-05 r68151)
    Platform: x86_64-unknown-linux-gnu (64-bit)
    Running under: Ubuntu 14.04.2 LTS

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] parallel  stats4    stats     graphics  grDevices utils     datasets
    [8] methods   base

    other attached packages:
    [1] IRanges_2.1.43       S4Vectors_0.5.22     BiocGenerics_0.13.11

    loaded via a namespace (and not attached):
    [1] tools_3.2.0

-- 
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
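For readers without Bioconductor installed, the list printed by as.list(hits) can be approximated in base R. This is only a sketch of the same result, not how findMatches() is implemented; the helper name all_matches is made up for illustration, and it assumes x contains no duplicated values.

```r
# Hypothetical helper (not part of any package): for each element of x,
# return the positions in y where that value occurs.
all_matches <- function(x, y) {
  # match(y, x) gives, for each element of y, the index of its value in x
  # (NA when the value is not in x; split() ignores those positions).
  grp <- factor(match(y, x), levels = seq_along(x))
  # Keeping every level means a query with no match yields integer(0).
  unname(split(seq_along(y), grp))
}

y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
x <- c(unique(y), 999L)
res <- all_matches(x, y)

res[[4]]  # positions of 15 in y
res[[7]]  # 999 does not occur in y
```

Like findMatches(), this needs the whole query vector x up front, so it does not address one-at-a-time queries.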
Re: [R] Fast multiple match function
On Mon, 06 Apr 2015, Keshav Dhandhania kshav...@gmail.com writes:

> Hi, I know that one can find all occurrences of x in a vector v by
> doing which(x == v). However, if I need to do this again and again,
> where v is remaining the same, then this is quite inefficient. In my
> particular case, I need to do this millions of times, and
> length(v) = 100 million. Does anyone have a suggestion on how to go
> about it? I know of a package called fmatch that does the above for
> the match function. But they don't handle multiple matches.

Perhaps 'match(x, v)' is what you want? In which 'x' may be a vector of length 1.

In any case, have you actually tried package 'fastmatch'? The function 'fmatch', which that package provides, is very fast for repeated lookups in a table 'v'.

-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
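To make the difference between the two calls concrete, here is a small base-R sketch using the example vector that appears elsewhere in this thread. match() reports only the first position of each query value, whereas which() reports every position but scans the whole vector on each call. fmatch() in package fastmatch is documented as a faster replacement for match(), so it shares the "first match only" behaviour.

```r
y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

first_only <- match(15, y)    # position of the first occurrence only
all_pos    <- which(y == 15)  # every occurrence, at the cost of a full scan

first_only
all_pos
```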
Re: [R] Fast multiple match function
You might find the data.table package helpful. It uses an index sorted with a radix sort and minimizes moving the data around in memory.

---
Jeff Newmiller
DCN: jdnew...@dcn.davis.ca.us
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---
Sent from my phone. Please excuse my brevity.
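A minimal sketch of what a keyed data.table look-up might look like for this problem. The column names val and pos and the helper lookup are assumptions made for illustration; nothing here comes from the original posts. A keyed look-up is a binary search on the sorted key, so its cost is roughly O(log n + number of matches) rather than strictly O(number of matches).

```r
library(data.table)

y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

# Store each value next to its original position, then key (sort) by value.
dt <- data.table(val = y, pos = seq_along(y))
setkey(dt, val)

# Hypothetical helper: all positions of `query` in y.
# nomatch = 0L drops unmatched queries, so the result is integer(0).
lookup <- function(query) dt[.(query), pos, nomatch = 0L]

lookup(15)
lookup(-2)
lookup(999)
```

The key is built once; later queries can be issued one at a time without knowing them in advance.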
Re: [R] Fast multiple match function
Hi all,

Thanks for the responses. Herve's example is a good small-size example of what I wanted.

    y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
    > someCoolFunc(-2, y)
    [1] 3 9
    > someCoolFunc(15, y)
    [1] 4 5 8

The requirement is that I want someCoolFunc() to run in O(number of matches) time, instead of O(size of y). This is because y is big, and I don't know all the queries I want to do up front. The results of some queries might also change the queries I want to do in the future.

@David: I hope the above description is clearer.

@Enrico, Herve: I want the functionality of both in one function.
- On repeated calls, fmatch() does give O(1) performance, but it does not give all matches.
- findMatches() gives all matches, but I need to know the entire vector x beforehand. I don't have that luxury.

I do have something that works now, using split and fmatch (package fastmatch). I'm posting it in case anyone has the same problem in the future.

    library(fastmatch)
    y.unique <- unique(y)
    # create a map from the unique elements of y to the locations of
    # all occurrences of each element
    y.map <- split(1:length(y), match(y, y.unique))
    # write a wrapper function that does a look-up on the unique list
    # and then returns all matches using the map
    # (y.unique and y.map are taken from the enclosing environment,
    # so the call is someCoolFunc(15) rather than someCoolFunc(15, y))
    someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }
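For comparison, here is a base-R-only variant of the same idea (build the index once, then answer each query from it), using an environment as a hash table instead of fastmatch. It is a sketch under stated assumptions, not the poster's code: the names pos_by_value, index and find_all are invented here, and the keys are the values converted with as.character(), which is safe for integers and short strings but may be unreliable for arbitrary floating-point values.

```r
y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

# One pass over y: group positions by value. The list names are the
# distinct values converted to character strings.
pos_by_value <- split(seq_along(y), y)

# Copy the list into a hashed environment so that a look-up by name
# does not have to scan all the names.
index <- list2env(pos_by_value, hash = TRUE, parent = emptyenv())

# Hypothetical helper: all positions of x in y, or integer(0) if absent.
find_all <- function(x) {
  get0(as.character(x), envir = index, inherits = FALSE,
       ifnotfound = integer(0))
}

find_all(15)
find_all(-2)
find_all(999)
```

Unlike indexing a list with a missing position, this version returns an empty integer vector for a value that does not occur in y.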
[R] Fast multiple match function
Hi,

I know that one can find all occurrences of x in a vector v by doing which(x == v). However, if I need to do this again and again, where v is remaining the same, then this is quite inefficient. In my particular case, I need to do this millions of times, and length(v) = 100 million.

Does anyone have a suggestion on how to go about it? I know of a package called fmatch that does the above for the match function. But they don't handle multiple matches.

Thanks

[[alternative HTML version deleted]]
Re: [R] Fast multiple match function
split() might help, but you should give a more complete explanation of your problem.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
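One way to read the split() suggestion, sketched on the small example vector from elsewhere in this thread: split the positions of the vector by its values, so each distinct value maps to all of its positions. The variable name positions is just for illustration.

```r
y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

# split() groups the positions 1..length(y) by the value found there.
# The result is a list named by the distinct values (as strings).
positions <- split(seq_along(y), y)

names(positions)
positions[["15"]]
positions[["-2"]]
```

The split is a single pass over the data; afterwards each query only has to locate the right list element.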
Re: [R] Fast multiple match function
On Apr 6, 2015, at 1:56 PM, Keshav Dhandhania wrote:

> Hi, I know that one can find all occurrences of x in a vector v by
> doing which(x == v). However, if I need to do this again and again,
> where v is remaining the same, then this is quite inefficient. In my
> particular case, I need to do this millions of times, and
> length(v) = 100 million. Does anyone have a suggestion on how to go
> about it? I know of a package called fmatch that does the above for
> the match function. But they don't handle multiple matches.

You should explain why you need to do it millions of times and you should pose a small sample problem that presents the level of complexity needed in a minimal size.

> Thanks
>
> [[alternative HTML version deleted]]

And you should read the Posting Guide where it is strongly advised that you not post in HTML format. I have used gmail and I do know that it is fairly easy to post in plain text.

-- 
David Winsemius
Alameda, CA, USA