[Numpy-discussion] Overlapping ranges

2009-03-16 Thread Peter Saffrey

I'm trying to file a set of data points, defined by genome coordinates, into 
bins, also based on genome coordinates. Each data point is (chromosome, start, 
end, point) and each bin is (chromosome, start, end). I have about 140 million 
points to file into around 100,000 bins. Both are (roughly) evenly distributed 
over the 24 chromosomes (1-22, X and Y). Genome coordinates are integers and my 
data points are floats. For each data point, (end - start) is roughly 1000, but 
the bins are are of uneven widths. Bins might have also overlap - in that case, 
I need to know all the bins that a point overlaps.

By overlap, I mean the start or end of the data point (or both) is inside the 
bin or that the point entirely covers the bin.

At the moment, I'm using a fairly naive approach that finds roughly in the 
genome (which gene) each point might be and then checking it against the bins 
in that gene. If I split the problem into chromosomes, I feel sure there must 
be some super-fast matrix approach I can apply using numpy, but I'm struggling 
a bit. Can anybody suggest something?

Peter

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Overlapping ranges

2009-03-16 Thread Robert Kern
2009/3/16 Peter Saffrey p...@dcs.gla.ac.uk:

 At the moment, I'm using a fairly naive approach that finds roughly in the
 genome (which gene) each point might be and then checking it against the
 bins in that gene. If I split the problem into chromosomes, I feel sure
 there must be some super-fast matrix approach I can apply using numpy, but
 I'm struggling a bit. Can anybody suggest something?

You probably need something algorithmically better, like interval
trees. There are a couple of C/Python implementations floating around.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Overlapping ranges

2009-03-16 Thread josef . pktd
On Mon, Mar 16, 2009 at 5:29 PM, Robert Kern robert.k...@gmail.com wrote:
 2009/3/16 Peter Saffrey p...@dcs.gla.ac.uk:

 At the moment, I'm using a fairly naive approach that finds roughly in the
 genome (which gene) each point might be and then checking it against the
 bins in that gene. If I split the problem into chromosomes, I feel sure
 there must be some super-fast matrix approach I can apply using numpy, but
 I'm struggling a bit. Can anybody suggest something?

 You probably need something algorithmically better, like interval
 trees. There are a couple of C/Python implementations floating around.


If I understand your problem correctly, then with a smaller scaled
problem something like this should work
{{{
import numpy as np

B = np.array([[1,3],[2,5],[7,10], [6,15],[14,20]]) # bins
P = np.c_[np.arange(1,16), 4+np.arange(1,16)]  # points

#mask = (~(P[:,0:1]D[:,1:2].T)) * (~(P[:,1:2]D[:,0:1].T))
# if the bin ended before the start of the point interval,then it is discarded
# if the bin started after the end of the point interval, then it is discarded
mask =  ~np.logical_or((P[:,0:1]B[:,1:2].T), (P[:,1:2]B[:,0:1].T))
indices = mask*np.arange(1,6)
print B
print P
print mask
print indices
}}}

However it creates a result matrix with dimension (number of points)
times (number of bins). If this doesn't fit into memory some looping
is necessary.

Tested on example only.

Josef
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion