Re: [R] Fast multiple match function

2015-04-17 Thread Keshav Dhandhania
Hi Jeff,

Indeed, the data.table package provides a much cleaner way to achieve
the same functionality, with a lot of other functionality as a bonus.

Thanks for letting me know about it.

On Tue, 7 Apr 2015 at 15:41 Jeff Newmiller jdnew...@dcn.davis.ca.us wrote:

> You might find the data.table package helpful. It uses an index sorted
> with a radix sort and minimizes moving the data around in memory.

Re: [R] Fast multiple match function

2015-04-07 Thread Hervé Pagès

Hi Keshav,

I think findMatches() in the S4Vectors/IRanges packages (Bioconductor)
does what you want:

  library(IRanges)
  y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
  x <- c(unique(y), 999L)
  hits <- findMatches(x, y)

Then:

  > hits
  Hits object with 9 hits and 0 metadata columns:
        queryHits subjectHits
        <integer>   <integer>
    [1]         1           1
    [2]         2           2
    [3]         3           3
    [4]         3           9
    [5]         4           4
    [6]         4           5
    [7]         4           8
    [8]         5           6
    [9]         6           7
    ---
    queryLength: 7
    subjectLength: 9

The Hits object can be turned into a list with:

  > as.list(hits)
  [[1]]
  [1] 1

  [[2]]
  [1] 2

  [[3]]
  [1] 3 9

  [[4]]
  [1] 4 5 8

  [[5]]
  [1] 6

  [[6]]
  [1] 7

  [[7]]
  integer(0)
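
For readers without Bioconductor installed, a similar list can be sketched in base R. This is not how findMatches() is implemented; it simply uses x as the factor levels so that split() returns one entry per query value, in the order of x. The name `hits_list` is made up for illustration, and the approach assumes x contains no duplicates (factor levels must be unique).

```r
y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
x <- c(unique(y), 999L)

# Using x as the factor levels makes split() return one group per query
# value, in the order of x, with integer(0) for values absent from y.
hits_list <- unname(split(seq_along(y), factor(y, levels = x)))

hits_list[[3]]  # positions of -2 in y
hits_list[[7]]  # 999 does not occur in y
```

Like findMatches(), this still needs the whole query vector x up front.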

H.

> sessionInfo()
R version 3.2.0 beta (2015-04-05 r68151)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IRanges_2.1.43       S4Vectors_0.5.22     BiocGenerics_0.13.11

loaded via a namespace (and not attached):
[1] tools_3.2.0

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Fast multiple match function

2015-04-07 Thread Enrico Schumann
On Mon, 06 Apr 2015, Keshav Dhandhania kshav...@gmail.com writes:

> Hi,
>
> I know that one can find all occurrences of x in a vector v by doing
> which(x == v).
>
> However, if I need to do this again and again while v remains the
> same, this is quite inefficient. In my particular case, I need to do
> this millions of times, and length(v) = 100 million.
>
> Does anyone have a suggestion on how to go about it?
> I know of a package called fmatch that does the above for the match
> function, but it doesn't handle multiple matches.


Perhaps 'match(x, v)' is what you want? Here 'x' may be a vector of
length > 1.

In any case, have you actually tried package 'fastmatch'? The function
'fmatch', which that package provides, is very fast for repeated
lookups in a table 'v'.
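
As a small base-R illustration of the difference, using the example vector that appears elsewhere in this thread: match() returns only the first position of each value, whereas which() returns every position. The variable names are chosen here for illustration.

```r
v <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

first_pos <- match(15, v)         # position of the first 15 only
all_pos   <- which(v == 15)       # every position holding 15
both      <- match(c(15, -2), v)  # vectorised over x, still first matches only

first_pos
all_pos
both
```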


-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net



Re: [R] Fast multiple match function

2015-04-07 Thread Jeff Newmiller
You might find the data.table package helpful. It uses an index sorted with a 
radix sort and minimizes moving the data around in memory.
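
A minimal sketch of what a keyed data.table lookup might look like for this problem, assuming the data.table package is installed (the block is skipped otherwise). The column names `value` and `pos` and the helper `find_all()` are invented for illustration. setkey() sorts the table once; later joins on the key use binary search instead of scanning every row.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

  # Keep the original positions in their own column, because setkey()
  # physically reorders the rows by value.
  dt <- data.table(value = y, pos = seq_along(y))
  setkey(dt, value)

  # Keyed join on the sorted column; nomatch = 0L drops queries that have
  # no match instead of returning NA.
  find_all <- function(x) dt[.(x), pos, nomatch = 0L]

  print(find_all(-2))
  print(find_all(15))
  print(find_all(999))
}
```

The one-off cost is the sort in setkey(); each query afterwards touches only the matching rows.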
---
Jeff Newmiller                        The     .       .          Go Live...
DCN: jdnew...@dcn.davis.ca.us         Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---
Sent from my phone. Please excuse my brevity.


Re: [R] Fast multiple match function

2015-04-07 Thread Keshav Dhandhania
Hi all,

Thanks for the responses.
Herve's example is a good small example of what I wanted.

> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
> someCoolFunc(-2, y)
[1] 3 9
> someCoolFunc(15, y)
[1] 4 5 8

The requirement is that someCoolFunc() run in O(number of matches)
time instead of O(size of y), because y is big. I also don't know all
the queries up front, and the results of some queries may change which
queries I run later.

@David: I hope the above description is clearer.
@Enrico, Herve: I want the functionality of both in one function.
- On repeated calls, fmatch() does give O(1) performance, but it does
not return all matches.
- findMatches() returns all matches, but it needs the entire
vector x beforehand. I don't have that luxury.


I now have something that works, using split() and fmatch() (package
fastmatch). I'm posting it in case anyone has the same problem in the
future.

 library(fastmatch)
 y.unique <- unique(y)

 # Create a map from the unique elements of y to the locations of all
 # occurrences of each element.
 y.map <- split(seq_along(y), match(y, y.unique))

 # Wrapper that looks up x in the unique list and then returns all
 # matches using the map.
 someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }
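
For readers who would rather avoid the fastmatch dependency, a similar index can be sketched in base R with an environment used as a hash table. This is an alternative to the code above, not the same implementation; `build_index()` and `lookup()` are hypothetical names. It assumes the values convert to distinct character keys, which holds for integers and strings but may not hold for arbitrary doubles.

```r
build_index <- function(y) {
  # Group the positions of y by value; the list names are the values
  # converted to character strings.
  groups <- split(seq_along(y), y)
  # Copy the groups into a hashed environment so lookups by key do not
  # scan the whole list of names.
  list2env(groups, envir = new.env(hash = TRUE, parent = emptyenv()))
}

lookup <- function(index, x) {
  hit <- index[[as.character(x)]]
  # A missing key yields NULL; report it as "no positions".
  if (is.null(hit)) integer(0) else hit
}

y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
idx <- build_index(y)

lookup(idx, -2)
lookup(idx, 15)
lookup(idx, 999)
```

The index is built once in a single pass over y; each later query costs a hash lookup plus the size of the result.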




[R] Fast multiple match function

2015-04-06 Thread Keshav Dhandhania
Hi,

I know that one can find all occurrences of x in a vector v by doing
which(x == v).

However, if I need to do this again and again while v remains the
same, this is quite inefficient. In my particular case, I need to do
this millions of times, and length(v) = 100 million.

Does anyone have a suggestion on how to go about it?
I know of a package called fmatch that does the above for the match
function, but it doesn't handle multiple matches.

Thanks

[[alternative HTML version deleted]]



Re: [R] Fast multiple match function

2015-04-06 Thread William Dunlap
split() might help, but you should give a more complete
explanation of your problem.
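
One way split() could be applied here, shown as a sketch on the small example vector used elsewhere in this thread: grouping the positions of v by value yields a named list that is built once and can then be queried by name. The name `pos_by_value` is chosen for illustration.

```r
v <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

# One pass over v: positions grouped by the value stored at each position.
pos_by_value <- split(seq_along(v), v)

names(pos_by_value)   # the distinct values, as character strings
pos_by_value[["15"]]  # all positions of 15
```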

Bill Dunlap
TIBCO Software
wdunlap tibco.com



Re: [R] Fast multiple match function

2015-04-06 Thread David Winsemius

On Apr 6, 2015, at 1:56 PM, Keshav Dhandhania wrote:

> Hi,
>
> I know that one can find all occurrences of x in a vector v by doing
> which(x == v).
>
> However, if I need to do this again and again while v remains the
> same, this is quite inefficient. In my particular case, I need to do
> this millions of times, and length(v) = 100 million.
>
> Does anyone have a suggestion on how to go about it?
> I know of a package called fmatch that does the above for the match
> function, but it doesn't handle multiple matches.

You should explain why you need to do it millions of times, and you should
pose a small sample problem that presents the level of complexity needed in
a minimal size.

> Thanks
>
> [[alternative HTML version deleted]]

And you should read the Posting Guide, where it is strongly advised that you
not post in HTML format. I have used Gmail, and I know that it is fairly easy
to post in plain text.

-- 
David Winsemius
Alameda, CA, USA
