Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance

2012-04-05 Thread Gael Varoquaux
On Thu, Apr 05, 2012 at 01:05:01PM -0700, Abhishek Pratap wrote:
> Also, in my case I don't really have a good estimate of the value of K
> in K-means.

That's a hard problem, for which I have no answer, sorry :$

G
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance

2012-04-05 Thread Abhishek Pratap
Also, in my case I don't really have a good estimate of the value of K
in K-means.

-A

On Thu, Apr 5, 2012 at 8:06 AM, Abhishek Pratap wrote:
> Hi Gael
>
> The MemoryError exception I am getting comes from scikit-learn's DBSCAN
> implementation. I can check the mini-batch implementation of KMeans.
>
> Best,
> -Abhi
>
> On Wed, Apr 4, 2012 at 10:33 PM, Gael Varoquaux wrote:
>> On Wed, Apr 04, 2012 at 04:41:51PM -0700, Abhishek Pratap wrote:
>>> Thanks Chris. So I guess the question becomes: how can I efficiently
>>> cluster 1 million (x, y) coordinates?
>>
>> Did you try scikit-learn's implementation of DBSCAN
>> (http://scikit-learn.org/stable/modules/clustering.html#dbscan)?
>> I am not sure that it scales, but it's worth trying.
>>
>> Alternatively, the best way to cluster massive datasets is to use the
>> mini-batch implementation of KMeans:
>> http://scikit-learn.org/stable/modules/clustering.html#mini-batch-k-means
>>
>> Hope this helps,
>>
>> Gael


Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance

2012-04-05 Thread Abhishek Pratap
Hi Gael

The MemoryError exception I am getting comes from scikit-learn's DBSCAN
implementation. I can check the mini-batch implementation of KMeans.

Best,
-Abhi

On Wed, Apr 4, 2012 at 10:33 PM, Gael Varoquaux wrote:
> On Wed, Apr 04, 2012 at 04:41:51PM -0700, Abhishek Pratap wrote:
>> Thanks Chris. So I guess the question becomes: how can I efficiently
>> cluster 1 million (x, y) coordinates?
>
> Did you try scikit-learn's implementation of DBSCAN
> (http://scikit-learn.org/stable/modules/clustering.html#dbscan)?
> I am not sure that it scales, but it's worth trying.
>
> Alternatively, the best way to cluster massive datasets is to use the
> mini-batch implementation of KMeans:
> http://scikit-learn.org/stable/modules/clustering.html#mini-batch-k-means
>
> Hope this helps,
>
> Gael


Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance

2012-04-04 Thread Gael Varoquaux
On Wed, Apr 04, 2012 at 04:41:51PM -0700, Abhishek Pratap wrote:
> Thanks Chris. So I guess the question becomes: how can I efficiently
> cluster 1 million (x, y) coordinates?

Did you try scikit-learn's implementation of DBSCAN
(http://scikit-learn.org/stable/modules/clustering.html#dbscan)?
I am not sure that it scales, but it's worth trying.

Alternatively, the best way to cluster massive datasets is to use the
mini-batch implementation of KMeans:
http://scikit-learn.org/stable/modules/clustering.html#mini-batch-k-means
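
A minimal sketch of the mini-batch route, assuming a reasonably recent
scikit-learn is installed. The synthetic blobs, n_clusters=3, batch_size
and random_state are placeholders picked for illustration only; none of
them come from this thread, and K still has to be chosen by the user.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for the real (x, y) coordinates: three 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack(
    [c + rng.normal(scale=1.0, size=(2000, 2)) for c in centers]
)

# MiniBatchKMeans updates the centroids from small random batches, so it
# never builds an all-pairs distance matrix.
model = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=3,
                        random_state=0)
labels = model.fit_predict(points)

print(labels.shape)                   # one label per input point
print(model.cluster_centers_.shape)   # (n_clusters, 2)
```

Memory use here grows with the number of points times the number of
clusters per batch, not with the square of the number of points, which
is what makes this kind of approach usable at the million-point scale.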

Hope this helps,

Gael


Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance

2012-04-04 Thread Abhishek Pratap
Thanks Chris. So I guess the question becomes: how can I efficiently
cluster 1 million (x, y) coordinates?

-Abhi

On Wed, Apr 4, 2012 at 4:35 PM, Chris Barker wrote:
> On Wed, Apr 4, 2012 at 4:17 PM, Abhishek Pratap wrote:
>> close to 900K points using the DBSCAN algorithm. My input is a list of
>> ~900k tuples, each holding the (x, y) coordinates of one point. I am
>> converting them to a NumPy array and passing it to the pdist function
>> of scipy.spatial.distance to calculate the distance between each pair
>> of points.
>
> I think pdist creates an array that is:
>
> sum(range(num_points)) in size.
>
> That's going to be pretty darn big:
>
> 404999550000 elements
>
> I think that's about 3 terabytes:
>
> In [41]: sum(range(900000)) / 1024. / 1024 / 1024 / 1024 * 8
> Out[41]: 2.946759559563361
>
> (for 64 bit floats)
>
>
>> I think the error has something to do with the default double dtype
>> of the array that pdist allocates.
>
> you *may* be able to get it to use float32 -- but as you can see, that
> probably won't help enough!
>
> You'll need a different approach!
>
> -Chris
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> chris.bar...@noaa.gov


Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance

2012-04-04 Thread Chris Barker
On Wed, Apr 4, 2012 at 4:17 PM, Abhishek Pratap wrote:
> close to 900K points using the DBSCAN algorithm. My input is a list of
> ~900k tuples, each holding the (x, y) coordinates of one point. I am
> converting them to a NumPy array and passing it to the pdist function
> of scipy.spatial.distance to calculate the distance between each pair
> of points.

I think pdist creates an array that is:

sum(range(num_points)) in size.

That's going to be pretty darn big:

404999550000 elements

I think that's about 3 terabytes:

In [41]: sum(range(900000)) / 1024. / 1024 / 1024 / 1024 * 8
Out[41]: 2.946759559563361

(for 64 bit floats)


> I think the error has something to do with the default double dtype
> of the array that pdist allocates.

you *may* be able to get it to use float32 -- but as you can see, that
probably won't help enough!
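
The estimate above can be reproduced in plain Python. This is only a
sketch: the helper names condensed_length and size_in_tib are made up
for illustration, and n_points = 900000 is the round figure used in the
estimate, not the exact size of the real input. The closed form
n * (n - 1) / 2 gives the same count as sum(range(n)).

```python
def condensed_length(n_points):
    """Number of pairwise distances stored for n_points observations."""
    return n_points * (n_points - 1) // 2


def size_in_tib(n_elements, bytes_per_element):
    """Size of n_elements items in tebibytes (1024**4 bytes)."""
    return n_elements * bytes_per_element / 1024.0 ** 4


n_points = 900000
n_pairs = condensed_length(n_points)

tib_float64 = size_in_tib(n_pairs, 8)  # 64-bit floats, about 2.95 TiB
tib_float32 = size_in_tib(n_pairs, 4)  # 32-bit floats, about 1.47 TiB

print("%d pairwise distances" % n_pairs)
print("float64: %.2f TiB" % tib_float64)
print("float32: %.2f TiB" % tib_float32)
```

Halving the element size only halves the total, so float32 still needs
well over a terabyte for the distances alone.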

You'll need a different approach!

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


[Numpy-discussion] MemoryError : with scipy.spatial.distance

2012-04-04 Thread Abhishek Pratap
Hey Guys

I am new to Python, and even more so to NumPy. I am trying to cluster
close to 900K points using the DBSCAN algorithm. My input is a list of
~900k tuples, each holding the (x, y) coordinates of one point. I am
converting them to a NumPy array and passing it to the pdist function of
scipy.spatial.distance to calculate the distance between each pair of
points.

Here is some size info on my NumPy array:
shape of input array  : (828575, 2)
Size :  6872000 bytes

I think the error has something to do with the default double dtype
of the array that pdist allocates. I would appreciate it if you could
help me debug this. I am sure I am overlooking something naive here.
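
A rough sketch of the sizes involved, assuming NumPy is available. The
zero-filled array is only a placeholder with the same shape as the real
input, and the variable names are made up for this example. It compares
the memory held by the input array with what a condensed array of all
pairwise distances would need in double precision.

```python
import numpy as np

# Placeholder with the same shape and dtype as the real input array.
shape = (828575, 2)
points_array = np.zeros(shape, dtype=np.float64)

# Memory held by the array's data buffer: 828575 * 2 * 8 bytes.
input_bytes = points_array.nbytes

# One distance per unordered pair of rows, m * (m - 1) / 2 in total.
m = shape[0]
n_distances = m * (m - 1) // 2
output_bytes = n_distances * np.dtype(np.double).itemsize

print("input : %.1f MiB" % (input_bytes / 1024.0 ** 2))
print("output: %.2f TiB" % (output_bytes / 1024.0 ** 4))
```

The input is only about 13 MB, while the pairwise distances would need
roughly 2.5 TiB, so the input array itself is not what exhausts memory.
Note that ndarray.nbytes reports the data buffer size; sys.getsizeof on
a list or array measures something different, which may explain the
smaller byte count quoted above (that reading is an assumption).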

See the traceback below.


MemoryError   Traceback (most recent call last)
/house/homedirs/a/apratap/Dropbox/dev/ipython/
in ()
 36
 37 print cleaned_senseBam
---> 38 cluster_pet_points_per_chromosome(sense_bamFile)

/house/homedirs/a/apratap/Dropbox/dev/ipython/
in cluster_pet_points_per_chromosome(bamFile)
 30 print 'Size of list points is %d' % sys.getsizeof(points)
 31 print 'Size of numpy array is %d' % sys.getsizeof(points_array)
---> 32 cluster_points_DBSCAN(points_array)
 33 #print points_array
 34

/house/homedirs/a/apratap/Dropbox/dev/ipython/
in cluster_points_DBSCAN(data_numpy_array)
  9 def cluster_points_DBSCAN(data_numpy_array):
 10 #eucledian distance calculation
---> 11 D = distance.pdist(data_numpy_array)
 12 S = distance.squareform(D)
 13 H = 1 - S/np.max(S)

/house/homedirs/a/apratap/playground/software/epd-7.2-2-rh5-x86_64/lib/python2.7/site-packages/scipy/spatial/distance.pyc
in pdist(X, metric, p, w, V, VI)
   1155
   1156 m, n = s
-> 1157 dm = np.zeros((m * (m - 1) / 2,), dtype=np.double)
   1158
   1159 wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']