Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-07 Thread Olivier Grisel
2013/2/6 Vinay B, vybe3...@gmail.com: Hi, Almost there (I hope), but not quite. I put my code up at https://gist.github.com/balamuru/4726232 for readability. Reading a directory of text files in chunks of 5, and returning them in a dictionary (key = filename, value = text contents) I wanted

Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-07 Thread Vinay B,
So I tried your recommendations. The partial fit seems to work up to a point. Then BOOM! It looks very similar to the example in http://scikit-learn.org/dev/auto_examples/document_clustering.html#example-document-clustering-py. I wonder what I'm doing wrong this time? Relevant code

Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-07 Thread Vinay B,
I updated scikit-learn to the latest version. The bug I reported earlier no longer exists, and the mini-batch k-means now completes. Now I have an error printing out the docs per cluster. Complete code at https://gist.github.com/balamuru/4734765. Thanks in advance. Output . . ## counts: (10, 1) ##

Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-07 Thread Olivier Grisel
There is probably an issue when accessing the `labels_` attribute if you call `partial_fit` instead of `fit`. Instead of using `labels_`, you should probably do another pass over the data and call `predict` to compute the cluster membership info for each sample.
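
The two-pass pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ToyOnlineKMeans` is a hypothetical pure-Python stand-in written for this note, not scikit-learn's MiniBatchKMeans, and the file names, feature vectors, and chunk size are made up. The point is the shape of the loop: one pass calling `partial_fit` on each chunk, then a second pass calling `predict` on each chunk of feature vectors (not on file names or indices).

```python
from collections import defaultdict


class ToyOnlineKMeans:
    """Toy incremental clusterer; a stand-in, NOT scikit-learn's implementation."""

    def __init__(self, initial_centers):
        self.centers = [list(c) for c in initial_centers]
        self.counts = [0] * len(self.centers)

    def _nearest(self, x):
        # Index of the center with the smallest squared Euclidean distance.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in self.centers]
        return dists.index(min(dists))

    def partial_fit(self, batch):
        # Move the nearest center toward each sample with a 1/count step size.
        for x in batch:
            k = self._nearest(x)
            self.counts[k] += 1
            rate = 1.0 / self.counts[k]
            self.centers[k] = [c + rate * (a - c) for a, c in zip(x, self.centers[k])]
        return self

    def predict(self, batch):
        # Cluster index for each feature vector in the batch.
        return [self._nearest(x) for x in batch]


def chunks(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical data: file name -> already-extracted feature vector.
docs = {
    "a.txt": [0.0, 0.1], "b.txt": [0.2, 0.0], "c.txt": [5.0, 5.1],
    "d.txt": [4.9, 5.2], "e.txt": [0.1, 0.2], "f.txt": [5.1, 4.8],
}
names = sorted(docs)

km = ToyOnlineKMeans(initial_centers=[[0.0, 0.0], [5.0, 5.0]])

# Pass 1: update the model incrementally, one chunk at a time.
for chunk in chunks(names, 2):
    km.partial_fit([docs[name] for name in chunk])

# Pass 2: read the chunks again and ask the fitted model for memberships.
clusters = defaultdict(list)
for chunk in chunks(names, 2):
    labels = km.predict([docs[name] for name in chunk])
    for name, label in zip(chunk, labels):
        clusters[label].append(name)

for label in sorted(clusters):
    print(label, clusters[label])
```

With a real estimator the structure would presumably stay the same, with the vectorized chunk passed to `partial_fit` in the first loop and to `predict` in the second.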

Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-07 Thread Vinay B,
Hi, I tried again. I feel I'm still doing something wrong in my code. In any case, the print loop I added was doc_idx = 0 for cluster_doc_filename in file_names: #predicted_cluster = km.predict(cluster_doc_filename) predicted_cluster = km.predict(doc_idx)

[Scikit-learn-general] Weighted and Balanced Random Forests

2013-02-07 Thread Manish Amde
Fellow sklearners, I am working on a classification problem with an unbalanced data set and have been successful using SVM classifiers with the class_weight option. I have also tried Random Forests and am getting a decent ROC performance but I am hoping to get a performance improvement by using

Re: [Scikit-learn-general] Weighted and Balanced Random Forests

2013-02-07 Thread Gilles Louppe
Hello, You might achieve what you want by using sample weights when fitting your forest (See the 'sample_weight' parameter). There is also a 'balance_weights' method from the preprocessing module that basically generates sample weights for you, such that classes become balanced.
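
As a rough illustration of what "sample weights such that classes become balanced" can mean, here is a minimal pure-Python sketch. The helper name `balanced_sample_weights` and the label list are hypothetical, and the formula (n_samples / (n_classes * class_count)) is one common balancing heuristic assumed for this example; it is not taken from the source of `balance_weights`.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Return one weight per sample so every class has the same total weight.

    Hypothetical helper for illustration; assumes the heuristic
    n_samples / (n_classes * count_of_that_class).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


# Made-up unbalanced labels: six samples of class 0, two of class 1.
y = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_sample_weights(y)

# Total weight per class, to check that the classes now carry equal mass.
totals = Counter()
for label, w in zip(y, weights):
    totals[label] += w
print(dict(totals))
```

A list like `weights` is the kind of per-sample array one would then pass through the `sample_weight` parameter mentioned above when fitting the forest.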

Re: [Scikit-learn-general] Weighted and Balanced Random Forests

2013-02-07 Thread Manish Amde
Thanks Gilles. This definitely helps. I am glad I asked. :-) -Manish On Feb 7, 2013, at 11:33 PM, Gilles Louppe g.lou...@gmail.com wrote: Hello, You might achieve what you want by using sample weights when fitting your forest (See the 'sample_weight' parameter). There is also a