Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Joseph Turian
 So the short answer is no. All features will be considered when
 building a decision tree, as they should be.

Tommy,

I know the speaker at pydata today claimed that the features are
partitioned, but I don't believe this to be the case in how random
forests were originally specified.

Best,
   Joseph

--
WINDOWS 8 is here. 
Millions of people.  Your app in 30 days.
Visit The Windows 8 Center at Sourceforge for all your go to resources.
http://windows8center.sourceforge.net/
join-generation-app-and-make-money-coding-fast/
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Gilles Louppe
Hi,

 I know the speaker at pydata today claimed that the features are
 partitioned,

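Can you elaborate? If you pick your features prior to the construction
of the tree and then build it on that subset only, then indeed, this
is not a random forest: a random forest draws a fresh random subset of
candidate features at every split. Picking one subset per tree is the
algorithm called Random Subspaces.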
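
A minimal sketch of the distinction, not scikit-learn's implementation: the helper names below are hypothetical and only the feature-sampling step is modelled. Random Subspaces draws one feature subset per tree and reuses it at every node, whereas a random forest draws a new candidate subset at each split, so over a whole tree many more features can be considered.

```python
import random


def random_subspace_features(n_features, k, n_nodes, rng):
    """Random Subspaces: one feature subset per tree, reused at every node."""
    subset = sorted(rng.sample(range(n_features), k))
    return [subset for _ in range(n_nodes)]


def random_forest_features(n_features, k, n_nodes, rng):
    """Random forest: a fresh candidate subset is drawn at every split node."""
    return [sorted(rng.sample(range(n_features), k)) for _ in range(n_nodes)]


rng = random.Random(0)
# Hypothetical sizes: 10 features, 3 candidates per draw, 50 split nodes.
subspace = random_subspace_features(10, 3, 50, rng)
forest = random_forest_features(10, 3, 50, rng)

# Features that were ever available to the tree under each scheme.
seen_subspace = {f for node in subspace for f in node}
seen_forest = {f for node in forest for f in node}
```

Here `seen_subspace` holds exactly 3 features, while `seen_forest` grows beyond 3 as more nodes are split.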

Best,

Gilles



Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Joseph Turian
Gilles,

I met Tommy Guy at the pydata conference today.
If I remember correctly, Brian Eoff (I don't have his email address)
mistakenly said that random forests partition/sample the features
before creating each tree. I didn't want to correct him in front of
the audience, and it slipped my mind to mention it to him later.

But I remembered when Tommy Guy asked the question.

   Joseph

On Sat, Oct 27, 2012 at 5:16 AM, Gilles Louppe g.lou...@gmail.com wrote:
 Hi,

 I know the speaker at pydata today claimed that the features are
 partitioned,

 Can you elaborate? If you pick your features prior to the construction
 of the tree and then build it on that subset only, then indeed, this
 is not a random forest. That algorithm is called Random Subspaces.

 Best,

 Gilles




-- 
Joseph Turian, Ph.D. | President, MetaOptimize
Optimize Profits. Optimize Engagement.
http://metaoptimize.com
855-ALL-DATA

The web's most active forum for data scientists: http://metaoptimize.com/qa/



Re: [Scikit-learn-general] Jython and Scikit-Learn

2012-10-27 Thread didier vila

All,
It looks like the ERP system that we want to implement already has an API in
C++.
So this is good news for Python and scikit-learn. It will just be a matter of
creating a wrapper in Python to access the system through its C++ API. Does
that look sensible?
Regards
Didier

 From: robert.k...@gmail.com
 Date: Fri, 26 Oct 2012 21:52:03 +0100
 To: scikit-learn-general@lists.sourceforge.net
 Subject: Re: [Scikit-learn-general] Jython and Scikit-Learn
 
 On Fri, Oct 26, 2012 at 4:52 PM, Didier Vila dv...@capquestco.com wrote:
  Mathieu and Olivier,
 
  Thanks for your emails.
 
  My interest in Python and scikit-learn grows each day, so I will try a
  solution for the new system through Jepp or JPype. I will let you know
  what happens.
 
 You may also want to consider jnius:
 
 http://pypi.python.org/pypi/jnius/
 
 -- 
 Robert Kern
 


Re: [Scikit-learn-general] ANN: astroML version 0.1

2012-10-27 Thread Jake Vanderplas
Thanks Gael,

Yes, I've been thinking a lot about density estimation, and I've 
designed all the astroML code to be fairly easy to move upstream if 
desired.  I have a bit of a vision for density estimation: I'd love in 
the future to create an sklearn.density submodule which has things like 
KDE (built on an improved ball tree), KNN density, Extreme 
Deconvolution, etc.  They'd have an interface similar to the current GMM 
(most of that code, as you saw, is already in astroML).

When that is in place, we could create a very general Bayesian 
generative classifier, which would learn a density representation for 
each class using any of these estimators, allow for user-specifiable 
priors, and then perform probabilistic classification of new points 
based on the per-class densities.  This would supersede GaussianNB, 
KNeighborsClassifier, and RadiusNeighborsClassifier (and maybe others), 
in the sense that they could be easily implemented as specializations of 
the new routine. I think this could be a really powerful addition to 
scikit-learn.
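
A minimal sketch of that idea, under stated assumptions: the class name `GenerativeClassifier`, its parameters, and the restriction to 1-D points with a Gaussian kernel density estimate are all hypothetical, and none of this is an existing scikit-learn or astroML API. It fits one density per class, combines it with a prior, and normalizes to get posterior class probabilities.

```python
import math
from collections import defaultdict


class GenerativeClassifier:
    """Hypothetical sketch: per-class density model plus class priors."""

    def __init__(self, bandwidth=1.0, priors=None):
        self.bandwidth = bandwidth
        self.priors = priors  # optional {class: prior}; defaults to class frequencies

    def fit(self, X, y):
        # Store the training points of each class; the KDE is evaluated lazily.
        self.points_ = defaultdict(list)
        for x, label in zip(X, y):
            self.points_[label].append(x)
        self.classes_ = sorted(self.points_)
        if self.priors is None:
            self.priors_ = {c: len(self.points_[c]) / len(y) for c in self.classes_}
        else:
            self.priors_ = dict(self.priors)
        return self

    def _density(self, x, c):
        # 1-D Gaussian kernel density estimate for class c.
        h = self.bandwidth
        pts = self.points_[c]
        norm = len(pts) * h * math.sqrt(2.0 * math.pi)
        return sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in pts) / norm

    def predict_proba(self, x):
        # Bayes rule: posterior is proportional to prior times class density.
        joint = {c: self.priors_[c] * self._density(x, c) for c in self.classes_}
        total = sum(joint.values())
        return {c: value / total for c, value in joint.items()}

    def predict(self, x):
        proba = self.predict_proba(x)
        return max(proba, key=proba.get)


clf = GenerativeClassifier(bandwidth=0.5).fit(
    [0.0, 0.2, 0.4, 5.0, 5.2, 5.4], [0, 0, 0, 1, 1, 1]
)
```

Swapping `_density` for a different estimator (a single Gaussian, a KNN density) changes the classifier without touching the rest, which is the specialization argument made above.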

Just my thoughts for the morning... back to PyData!
Jake

On 10/27/2012 01:17 AM, Gael Varoquaux wrote:
 It looks really awesome! The examples are superb.

 It looks like you have some really cool density estimation code. I would
 personally love to see such functionality in the scikit. Do you think
 that some of it could be moved upstream?

 Thanks a lot for being our astrophysics figure-head! I feel that the
 astroML and the scikit will have an impact there.

 Gael



Re: [Scikit-learn-general] Precision-recall now requires probas_pred to be in [0, 1]

2012-10-27 Thread Gael Varoquaux
On Fri, Oct 26, 2012 at 06:24:28PM +0100, Andreas Mueller wrote:
 Which PR was that? That is bad :-(
  I suggest changing it back to work with any unbounded test
  statistic. Any reason not to? I am proposing to do the work.
 +1

Done in 90c007981f54
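
To illustrate why an unbounded test statistic is enough, here is a minimal sketch (a hypothetical helper, not the code from the commit above, and assuming binary 0/1 labels with at least one positive): precision and recall only depend on how the scores rank the samples, so any real-valued decision value can be thresholded.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for every distinct score.

    Scores may be any real numbers; they are never assumed to lie in [0, 1].
    """
    n_positive = sum(y_true)
    points = []
    for threshold in sorted(set(scores)):
        # A sample is predicted positive when its score reaches the threshold.
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((threshold, tp / (tp + fp), tp / n_positive))
    return points


# Decision values such as SVM margins: negative and larger than 1 are fine.
points = precision_recall_points([0, 0, 1, 1], [-2.0, 0.5, -0.3, 3.1])
```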

G



Re: [Scikit-learn-general] Jython and Scikit-Learn

2012-10-27 Thread Joseph Turian
How does jnius compare with jpype?

On Fri, Oct 26, 2012 at 4:52 PM, Robert Kern robert.k...@gmail.com wrote:
 On Fri, Oct 26, 2012 at 4:52 PM, Didier Vila dv...@capquestco.com wrote:
 Mathieu and Olivier,

 Thanks for your emails.

 My interest in Python and scikit-learn grows each day, so I will try a
 solution for the new system through Jepp or JPype. I will let you know
 what happens.

 You may also want to consider jnius:

 http://pypi.python.org/pypi/jnius/

 --
 Robert Kern







Re: [Scikit-learn-general] Jython and Scikit-Learn

2012-10-27 Thread Robert Kern
On Sat, Oct 27, 2012 at 10:39 PM, Joseph Turian jos...@metaoptimize.com wrote:
 How does jnius compare with jpype?

It isn't dead, mostly.

More seriously, with active developers and Cython underpinnings, they
might accept some PRs to add efficient numpy support.

-- 
Robert Kern



Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-27 Thread Philipp Singer
On 27.10.2012 23:43, Joseph Turian wrote:
 If you only care about near matches and not the full n^2 matrix:

 +1 to OG's suggestion to use pylucene.

 You can use pylucene to generate candidates, and then compute the
 exact tf*idf cosine distance on the shortlist.

Yes exactly. I would only need the most similar matches.

The problem with the Lucene solution is that I do not need tf-idf. I
really have to do simple cosine similarity on my available vectors.

For example, my matrix of vectors looks like this:

[[1 2 5]
   [3 1 0]]

I want the cosine similarity between rows one and two or, in this case,
the row most similar to row one under plain cosine similarity, without
any further variations. As already mentioned, I have the data in sparse form.
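
As a worked example of the computation described above, here is a minimal sketch that stores each row sparsely as a `{feature_index: value}` dict (an assumed representation; the helper names are hypothetical). For the two rows shown, the dot product is 1*3 + 2*1 + 5*0 = 5 and the norms are sqrt(30) and sqrt(10).

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two sparse {index: value} rows."""
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero rows
    return dot / (norm_u * norm_v)


def most_similar(query_id, rows):
    """Index of the row most similar to rows[query_id], excluding itself."""
    others = (i for i in range(len(rows)) if i != query_id)
    return max(others, key=lambda i: cosine(rows[query_id], rows[i]))


# The matrix [[1 2 5], [3 1 0]] with zeros left out.
rows = [{0: 1.0, 1: 2.0, 2: 5.0}, {0: 3.0, 1: 1.0}]
similarity = cosine(rows[0], rows[1])  # 5 / (sqrt(30) * sqrt(10)), about 0.2887
```

This brute-force version compares the query against every row, so it shows the definition rather than a fast all-pairs method.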

 I assume this will be n log n.

 Another option for fast all-pairs is to use locality sensitive
 hashing. (I didn't read the papers or see if that's what they do.)
 It is not clear what the accuracy will be, but it will probably be the 
 fastest.

Yes, some kind of dimensionality reduction is another option, but it
would be very hard for me: I have already run all my previous
experiments on the complete representations. So if I could find a
faster solution that keeps them, that would be awesome.
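
For reference, the locality sensitive hashing idea mentioned above can be sketched with random hyperplanes. This is an illustrative assumption about how it could be done, not something taken from the papers discussed, and the helper names are hypothetical. Rows that share a signature become candidate pairs, which can then be re-scored with exact cosine similarity on the complete vectors.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_features, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in feature space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_bits)]


def signature(row, hyperplanes):
    """One bit per hyperplane: which side the sparse {index: value} row falls on."""
    return tuple(
        sum(value * plane[index] for index, value in row.items()) >= 0.0
        for plane in hyperplanes
    )


def candidate_buckets(rows, hyperplanes):
    """Group row ids by signature; rows in one bucket are candidate matches."""
    buckets = defaultdict(list)
    for row_id, row in enumerate(rows):
        buckets[signature(row, hyperplanes)].append(row_id)
    return buckets
```

Rows pointing in similar directions tend to agree on most bits, so more bits give fewer but more precise candidates; some true neighbours can still be missed, which is the accuracy trade-off raised above.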

Regards,
Philipp

 On Fri, Oct 26, 2012 at 3:31 PM, Philipp Singer kill...@gmail.com wrote:
 On 26.10.2012 15:35, Olivier Grisel wrote:
 BTW, in the meantime you could encode your co-occurrences as text
 identifiers and use either Lucene/Solr in Java (via the sunburnt Python
 client) or Whoosh [1] in Python as a way to do efficient sparse lookups
 in such a sparse matrix, to be able to quickly compute the nonzero
 cosine similarities between all pairs. Solr also has MoreLikeThis
 queries that can be used to truncate the search to the most
 similar samples in the set, in case you have some very
 frequent nonzero features that would mostly break the sparsity of the
 cosine similarity matrix. As Trey Grainger says in his talk "Building
 a real time, Solr-powered recommendation engine" [2]: "A Lucene index is a
 multi-dimensional sparse matrix… with very fast and powerful lookup
 capabilities."

 [1] http://packages.python.org/Whoosh/quickstart.html
 [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine

 Thanks, this looks promising. What exactly do you mean by encoding
 co-occurrences as text identifiers? How would I handle my sparse vectors
 then?

 I know the MoreLikeThis functionality, but does it compute exact cosine
 similarity? The thing is that I need this relatedness measure for my
 studies.

 Philipp




Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-27 Thread Olivier Grisel
2012/10/26 Philipp Singer kill...@gmail.com:
 On 26.10.2012 15:35, Olivier Grisel wrote:
 BTW, in the meantime you could encode your co-occurrences as text
 identifiers and use either Lucene/Solr in Java (via the sunburnt Python
 client) or Whoosh [1] in Python as a way to do efficient sparse lookups
 in such a sparse matrix, to be able to quickly compute the nonzero
 cosine similarities between all pairs. Solr also has MoreLikeThis
 queries that can be used to truncate the search to the most
 similar samples in the set, in case you have some very
 frequent nonzero features that would mostly break the sparsity of the
 cosine similarity matrix. As Trey Grainger says in his talk "Building
 a real time, Solr-powered recommendation engine" [2]: "A Lucene index is a
 multi-dimensional sparse matrix… with very fast and powerful lookup
 capabilities."

 [1] http://packages.python.org/Whoosh/quickstart.html
 [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine

 Thanks, this looks promising. What exactly do you mean by encoding
 co-occurrences as text identifiers? How would I handle my sparse vectors
 then?

It's just that the Solr API deals with text documents as inputs rather
than precomputed (integer feature index, float feature value) pairs:
unfortunately, you cannot bypass the text feature extraction layer of
Solr (the analyzers).
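
Outside Solr, the same sparse-lookup idea can be sketched directly on (feature index, value) pairs with an inverted index. This is a minimal illustration with a hypothetical helper name, not how Lucene is implemented: only pairs of rows that share at least one nonzero feature are ever touched, so all other similarities stay implicitly zero.

```python
import math
from collections import defaultdict


def nonzero_cosine_pairs(rows):
    """Cosine similarity for all pairs (i, j), i < j, sharing a nonzero feature.

    rows is a list of sparse {feature_index: value} dicts.
    """
    norms = [math.sqrt(sum(v * v for v in row.values())) for row in rows]

    # Inverted index: feature -> list of (row id, value), in row order.
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        for feature, value in row.items():
            if value != 0:
                index[feature].append((row_id, value))

    # Accumulate dot products feature by feature, only for co-occurring rows.
    dots = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            i, vi = postings[a]
            for b in range(a + 1, len(postings)):
                j, vj = postings[b]
                dots[(i, j)] += vi * vj

    return {(i, j): dot / (norms[i] * norms[j]) for (i, j), dot in dots.items()}
```

The cost is driven by the length of the posting lists, which is why a feature present in nearly every row breaks the sparsity of the result.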

 I know the MoreLikeThis functionality, but does it compute exact cosine
 similarity? The thing is that I need this relatedness measure for my
 studies.

No, it's a truncated approximation (a lower bound), but it keeps many
zeros in your similarity matrix even when you have terms that occur in
every single document.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
