Re: [Scikit-learn-general] random forest question
So the short answer is no. All features are considered when building a decision tree, as they should be.

Tommy, I know the speaker at PyData today claimed that the features are partitioned, but I don't believe that is how random forests were originally specified.

Best,
Joseph

-- WINDOWS 8 is here. Millions of people. Your app in 30 days. Visit The Windows 8 Center at Sourceforge for all your go to resources. http://windows8center.sourceforge.net/ join-generation-app-and-make-money-coding-fast/ ___ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Re: [Scikit-learn-general] random forest question
Hi,

> I know the speaker at pydata today claimed that the features are partitioned,

Can you elaborate?

If you pick your features prior to the construction of the tree and then build it on that subset only, then indeed, this is not random forest. That algorithm is called Random Subspaces.

Best,
Gilles
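The distinction Gilles draws can be illustrated with a small sketch. It is not scikit-learn code: the function names are hypothetical, and it only models which features are candidates at each split, assuming 10 features, 3 candidates per draw and 50 splits per tree.

```python
import random


def random_subspace_candidates(n_features, k, n_splits, rng):
    """Random Subspaces: draw ONE feature subset per tree and reuse it at every split."""
    subset = sorted(rng.sample(range(n_features), k))
    return [subset for _ in range(n_splits)]


def random_forest_candidates(n_features, k, n_splits, rng):
    """Random forest: draw a FRESH feature subset at every split of the tree."""
    return [sorted(rng.sample(range(n_features), k)) for _ in range(n_splits)]


rng = random.Random(0)  # fixed seed so the run is reproducible
subspace = random_subspace_candidates(10, 3, 50, rng)
forest = random_forest_candidates(10, 3, 50, rng)

# Features that a single tree can ever split on under each scheme.
seen_by_subspace_tree = set().union(*subspace)
seen_by_forest_tree = set().union(*forest)

print(len(seen_by_subspace_tree), len(seen_by_forest_tree))
```

A Random Subspaces tree never sees more than its 3 pre-selected features, while a random forest tree can end up using any feature, because the draw is repeated at each split. In scikit-learn, the per-split draw is what the `max_features` parameter of `RandomForestClassifier` controls.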
Re: [Scikit-learn-general] random forest question
Gilles,

I met Tommy Guy at the PyData conference today. If I remember correctly, Brian Eoff (I don't have his email address) erroneously said that random forests partition/sample the features before creating each tree. I didn't want to correct him in front of the audience, and it slipped my mind to mention it to him later. But I remembered when Tommy Guy asked the question.

Joseph

On Sat, Oct 27, 2012 at 5:16 AM, Gilles Louppe g.lou...@gmail.com wrote:
> Hi,
>
>> I know the speaker at pydata today claimed that the features are partitioned,
>
> Can you elaborate?
>
> If you pick your features prior to the construction of the tree and then build it on that subset only, then indeed, this is not random forest. That algorithm is called Random Subspaces.
>
> Best,
> Gilles

--
Joseph Turian, Ph.D. | President, MetaOptimize
Optimize Profits. Optimize Engagement. http://metaoptimize.com
855-ALL-DATA
The web's most active forum for data scientists: http://metaoptimize.com/qa/
Re: [Scikit-learn-general] Jython and Scikit-Learn
All,

It looks like the ERP system that we want to implement already has an API in C++, so this is good news for Python and scikit-learn. It will just be a matter of creating a wrapper in Python to access the system through its C++ API. Does that look sensible?

Regards,
Didier

From: robert.k...@gmail.com
Date: Fri, 26 Oct 2012 21:52:03 +0100
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] Jython and Scikit-Learn

> On Fri, Oct 26, 2012 at 4:52 PM, Didier Vila dv...@capquestco.com wrote:
>> Mathieu and Olivier,
>> Thanks for your emails. My interest in Python and scikit-learn grows each day, so I will try a solution for the new system through Jepp or JPype. I will let you know what happens.
>
> You may also want to consider jnius: http://pypi.python.org/pypi/jnius/
>
> --
> Robert Kern
Re: [Scikit-learn-general] ANN: astroML version 0.1
Thanks Gael,

Yes, I've been thinking a lot about density estimation, and I've designed all the astroML code to be fairly easy to move upstream if desired.

I have a bit of a vision for density estimation: in the future I'd love to create an sklearn.density submodule which has things like KDE (built on an improved ball tree), KNN density, Extreme Deconvolution, etc. They'd have an interface similar to the current GMM (most of that code, as you saw, is already in astroML).

When that is in place, we could create a very general Bayesian generative classifier, which would learn a density representation for each class using any of these estimators, allow for user-specifiable priors, and then perform probabilistic classification of new points based on the per-class densities. This would supersede GaussianNB, KNeighborsClassifier, and RadiusNeighborsClassifier (and maybe others), in the sense that they could be easily implemented as specializations of the new routine.

I think this could be a really powerful addition to scikit-learn. Just my thoughts for the morning... back to PyData!

Jake

On 10/27/2012 01:17 AM, Gael Varoquaux wrote:
> It looks really awesome! The examples are superb.
>
> It looks like you have some really cool density estimation code. I would personally love to see such functionality in the scikit. Do you think that some of it could be moved upstream?
>
> Thanks a lot for being our astrophysics figurehead! I feel that astroML and the scikit will have an impact there.
>
> Gael
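The generative classifier Jake describes (one density estimate per class, user-specifiable priors, classification by Bayes' rule) can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `KDEGenerativeClassifier` is hypothetical, the data are one-dimensional, and a plain Gaussian kernel density estimate stands in for whichever estimator would actually be plugged in. It is not astroML or scikit-learn code.

```python
import math


class KDEGenerativeClassifier:
    """Hypothetical sketch: one 1-D Gaussian KDE per class plus class priors."""

    def __init__(self, bandwidth=1.0, priors=None):
        self.bandwidth = bandwidth
        self.priors = priors  # optional {label: prior}; defaults to class frequencies

    def fit(self, X, y):
        # Group the training points by class label.
        self.points_ = {}
        for x, label in zip(X, y):
            self.points_.setdefault(label, []).append(x)
        if self.priors is None:
            n = len(y)
            self.priors_ = {c: len(p) / n for c, p in self.points_.items()}
        else:
            self.priors_ = dict(self.priors)
        return self

    def _density(self, x, points):
        # Gaussian kernel density estimate of p(x | class) at a 1-D point x.
        h = self.bandwidth
        norm = len(points) * h * math.sqrt(2.0 * math.pi)
        return sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points) / norm

    def predict_proba(self, x):
        # Bayes' rule: posterior is proportional to prior * class-conditional density.
        joint = {c: self.priors_[c] * self._density(x, pts)
                 for c, pts in self.points_.items()}
        total = sum(joint.values())
        return {c: v / total for c, v in joint.items()}

    def predict(self, x):
        proba = self.predict_proba(x)
        return max(proba, key=proba.get)


X = [0.0, 0.2, -0.1, 5.0, 5.3, 4.8]
y = ["a", "a", "a", "b", "b", "b"]
clf = KDEGenerativeClassifier(bandwidth=0.5).fit(X, y)
near_a = clf.predict(0.1)
near_b = clf.predict(5.1)
proba = clf.predict_proba(2.5)
print(near_a, near_b, proba)
```

Swapping `_density` for another estimator (KNN density, a mixture model, and so on) changes the classifier without touching the Bayes-rule step, which is the design point of the proposal. The sketch does not guard against a query so far from all training points that every density underflows to zero.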
Re: [Scikit-learn-general] Precision-recall now requires probas_pred to be in [0, 1]
On Fri, Oct 26, 2012 at 06:24:28PM +0100, Andreas Mueller wrote:
>> Which PR was that? That is bad :-( I suggest changing it back to working with any non-bounded test statistic. Any reason not to? I am proposing to do the work.
>
> +1

Done in 90c007981f54

G
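Why an unbounded test statistic is enough can be seen from a small sketch: precision and recall depend only on how the scores rank the samples, not on their scale. The helper below is hypothetical and is not scikit-learn's implementation; the labels and margin values are made up for illustration.

```python
import math


def precision_recall_points(y_true, scores):
    """(precision, recall) at each distinct threshold, from highest score to lowest."""
    n_pos = sum(y_true)
    points = []
    for t in sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y)
        n_pred = sum(predicted)  # at least 1, since t is one of the scores
        points.append((tp / n_pred, tp / n_pos))
    return points


y_true = [0, 0, 1, 1]
margins = [-2.3, 0.4, -0.1, 7.5]  # unbounded scores, e.g. signed distances to a hyperplane
squashed = [1.0 / (1.0 + math.exp(-m)) for m in margins]  # monotone map into (0, 1)

curve_from_margins = precision_recall_points(y_true, margins)
curve_from_squashed = precision_recall_points(y_true, squashed)
print(curve_from_margins == curve_from_squashed)
```

Because the logistic map is strictly increasing, both score vectors order the samples identically and yield the same precision-recall points, so requiring the input to lie in [0, 1] adds nothing.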
Re: [Scikit-learn-general] Jython and Scikit-Learn
How does jnius compare with jpype?

On Fri, Oct 26, 2012 at 4:52 PM, Robert Kern robert.k...@gmail.com wrote:
> On Fri, Oct 26, 2012 at 4:52 PM, Didier Vila dv...@capquestco.com wrote:
>> Mathieu and Olivier,
>> Thanks for your emails. My interest in Python and scikit-learn grows each day, so I will try a solution for the new system through Jepp or JPype. I will let you know what happens.
>
> You may also want to consider jnius: http://pypi.python.org/pypi/jnius/
>
> --
> Robert Kern
Re: [Scikit-learn-general] Jython and Scikit-Learn
On Sat, Oct 27, 2012 at 10:39 PM, Joseph Turian jos...@metaoptimize.com wrote:
> How does jnius compare with jpype?

It isn't dead, mostly.

More seriously, with active developers and Cython underpinnings, they might accept some PRs to add efficient numpy support.

--
Robert Kern
Re: [Scikit-learn-general] All-pairs-similarity calculation
On 27.10.2012 23:43, Joseph Turian wrote:
> If you only care about near matches and not the full n^2 matrix: +1 to OG's suggestion to use pylucene. You can use pylucene to generate candidates, and then compute the exact tf*idf cosine distance on the shortlist.

Yes, exactly. I only need the most similar matches. The problem with the Lucene solution is that I do not need tf-idf. I really have to compute plain cosine similarity on the vectors I have. For example, my matrix (of row vectors) looks like this:

[[1 2 5]
 [3 1 0]]

Now I want the cosine similarity between rows one and two or, in this case, the most similar row to row one, using cosine similarity without any further variations. As already mentioned, I have the data in sparse form.

> I assume this will be n log n. Another option for fast all-pairs is to use locality sensitive hashing. (I didn't read the papers or see if that's what they do.) It is not clear what the accuracy will be, but it will probably be the fastest.

Yeah, some kind of dimension reduction is another option, but this would be very hard for me because I have already run all my previous experiments on the complete representations. So if I could find a faster solution for my problem, that would be awesome.

Regards,
Philipp

On Fri, Oct 26, 2012 at 3:31 PM, Philipp Singer kill...@gmail.com wrote:
> On 26.10.2012 15:35, Olivier Grisel wrote:
>> BTW, in the meantime you could encode your co-occurrences as text identifiers and use either Lucene/Solr in Java (via the sunburnt Python client) or Whoosh [1] in Python to do efficient sparse lookups in such a sparse matrix, so that you can quickly compute the non-zero cosine similarities between all pairs.
>>
>> Solr also has MoreLikeThis queries that can be used to truncate the search to the most similar samples in the set, in case you have some very frequent non-zero features that would mostly break the sparsity of the cosine similarity matrix.
>>
>> As Trey Grainger says in his talk Building a real time, solr-powered recommendation engine: A Lucene index is a multi-dimensional sparse matrix… with very fast and powerful lookup capabilities.
>>
>> [1] http://packages.python.org/Whoosh/quickstart.html
>> [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine
>
> Thanks, this looks promising. What exactly do you mean by encoding co-occurrences as text identifiers? How would I handle my sparse vectors then?
>
> I know the MoreLikeThis functionality, but does it do exactly cosine similarity? The thing is that I need this relatedness measure for my studies.
>
> Philipp
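For reference, the plain cosine similarity Philipp asks for can be written directly on sparse rows. This is a minimal sketch assuming each row is stored as a `{column: value}` dict; the helper names are hypothetical and the brute-force search over rows is only meant to show the computation, not to scale.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {column: value} dicts."""
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector when accumulating the dot product
    dot = sum(val * v.get(col, 0.0) for col, val in u.items())
    return dot / (norm_u * norm_v)


def most_similar(rows, query_index):
    """Index and score of the row closest to rows[query_index], excluding itself."""
    query = rows[query_index]
    scored = [(cosine(query, row), i) for i, row in enumerate(rows) if i != query_index]
    score, index = max(scored)
    return index, score


rows = [
    {0: 1.0, 1: 2.0, 2: 5.0},  # [1 2 5]
    {0: 3.0, 1: 1.0},          # [3 1 0]
]
sim = cosine(rows[0], rows[1])  # 5 / (sqrt(30) * sqrt(10))
best_index, best_score = most_similar(rows, 0)
print(sim, best_index)
```

For the two rows in the message, the dot product is 1*3 + 2*1 + 5*0 = 5 and the norms are sqrt(30) and sqrt(10), giving a similarity of about 0.2887.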
Re: [Scikit-learn-general] All-pairs-similarity calculation
2012/10/26 Philipp Singer kill...@gmail.com:
> On 26.10.2012 15:35, Olivier Grisel wrote:
>> BTW, in the meantime you could encode your co-occurrences as text identifiers and use either Lucene/Solr in Java (via the sunburnt Python client) or Whoosh [1] in Python to do efficient sparse lookups in such a sparse matrix, so that you can quickly compute the non-zero cosine similarities between all pairs.
>>
>> Solr also has MoreLikeThis queries that can be used to truncate the search to the most similar samples in the set, in case you have some very frequent non-zero features that would mostly break the sparsity of the cosine similarity matrix.
>>
>> As Trey Grainger says in his talk Building a real time, solr-powered recommendation engine: A Lucene index is a multi-dimensional sparse matrix… with very fast and powerful lookup capabilities.
>>
>> [1] http://packages.python.org/Whoosh/quickstart.html
>> [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine
>
> Thanks, this looks promising. What exactly do you mean by encoding co-occurrences as text identifiers? How would I handle my sparse vectors then?

It's just that the Solr API takes text documents as input rather than precomputed (integer feature index, float feature value) pairs: unfortunately, you cannot bypass Solr's text feature extraction layer (the analyzers).

> I know the MoreLikeThis functionality, but does it do exactly cosine similarity? The thing is that I need this relatedness measure for my studies.

No, it's a truncated approximation (a lower bound), but it keeps many zeros in your similarity matrix in case you have terms that occur in every single document.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
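The "efficient sparse lookups" idea can be sketched without Lucene by building an inverted index by hand, so that only pairs of rows sharing at least one non-zero feature are ever touched. This is an illustration of the principle only, not of how Lucene, Solr or Whoosh work internally; the function name is hypothetical, and rows are assumed to be `{column: value}` dicts with no explicitly stored zeros.

```python
import math
from collections import defaultdict


def all_pairs_cosine(rows):
    """Non-zero cosine similarities between all pairs of sparse rows.

    rows: list of {column: value} dicts. Returns {(i, j): similarity} with i < j.
    """
    norms = [math.sqrt(sum(v * v for v in row.values())) for row in rows]

    # Inverted index: column -> list of (row index, value) for rows with that column.
    index = defaultdict(list)
    for i, row in enumerate(rows):
        for col, val in row.items():
            index[col].append((i, val))

    # Only rows that share a column ever meet, so all-zero pairs cost nothing.
    dots = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            i, vi = postings[a]
            for b in range(a + 1, len(postings)):
                j, vj = postings[b]
                dots[(i, j)] += vi * vj

    return {(i, j): dot / (norms[i] * norms[j]) for (i, j), dot in dots.items()}


rows = [
    {0: 1.0, 1: 2.0, 2: 5.0},
    {0: 3.0, 1: 1.0},
    {3: 4.0},  # shares no column with the other rows
]
sims = all_pairs_cosine(rows)
print(sims)
```

The cost is driven by the length of each postings list, which is the caveat raised in the thread: a feature that is non-zero in almost every row produces a postings list of nearly n entries and therefore close to n^2 pair updates, which is where truncating to the top matches becomes attractive.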