Re: [Scikit-learn-general] Text Classification with more than 10 classes

2012-06-15 Thread Robert Layton
Hi there. Can you please post the code you are using? Thanks, Robert On Jun 16, 2012 10:35 AM, "Fahd S. Alotaibi" wrote: > Hi everybody, > > I'm using this brilliant framework in text classification. I spotted that > when the number of classes is > 10, sklearn just works with the > first

[Scikit-learn-general] Text Classification with more than 10 classes

2012-06-15 Thread Fahd S. Alotaibi
Hi everybody, I'm using this brilliant framework in text classification. I spotted that when the number of classes is > 10, sklearn just works with the first 10 classes only and ignores the remaining classes. This seems a bit strange. I went quickly through the sklearn files to see if I cou

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread josef . pktd
On Fri, Jun 15, 2012 at 4:50 PM, Yaroslav Halchenko wrote: > > On Fri, 15 Jun 2012, [email protected] wrote: >> https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/misc/dcov.py#L160 >> looks like a double sum, but wikipedia only has one sum, elementwise product. > > sorry -- I might be slow -- w

Re: [Scikit-learn-general] LogisticRegression versus SGDClassifier(loss="log")?

2012-06-15 Thread Lars Buitinck
2012/6/15 Peter Prettenhofer : > Both are not proper multinomial logistic regression models; > LogisticRegression does not care and simply computes the probability > estimates of each OVR classifier and normalizes them to make sure they sum > to one. You could do the same for SGDClassifier(loss='log') b
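The normalization described in the quoted message can be sketched in a few lines. This is an illustration only, not scikit-learn's code: the function name `ovr_probabilities` is hypothetical, and it assumes each one-vs-rest classifier's decision score is mapped through a logistic sigmoid before the results are rescaled to sum to one.

```python
import math


def ovr_probabilities(decision_scores):
    """Normalize per-class one-vs-rest scores for a single sample.

    Assumption: each score comes from an independent binary classifier
    trained with a log loss, so a sigmoid gives its own probability.
    """
    per_class = [1.0 / (1.0 + math.exp(-s)) for s in decision_scores]
    total = sum(per_class)
    # Rescale so the per-class estimates sum to one.
    return [p / total for p in per_class]


# Made-up decision scores for a three-class problem.
scores = [2.0, -1.0, 0.5]
probs = ovr_probabilities(scores)
```

The result is a set of rescaled binary estimates rather than a true multinomial model, which is the caveat the message raises.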

Re: [Scikit-learn-general] LogisticRegression versus SGDClassifier(loss="log")?

2012-06-15 Thread Fred Mailhot
Thanks for the prompt reply, Peter. I may be in a situation that will call for SGDClassifier, so I have two follow-up questions: 1) I'd like to compute the class probs; are the probs for the individual OvR classifiers (easily) accessible? My intuition is that I can compute these from the returned

Re: [Scikit-learn-general] Possibility to do a sprint in Paris, 13-14 September

2012-06-15 Thread Nelle Varoquaux
I've added the sprint on pyconfr's website: http://www.pycon.fr/2012/sprints/ and I've updated the upcoming event on the GitHub wiki: https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events I've transferred the information on the Granada sprint to the "previous sprint" section. Thanks,

Re: [Scikit-learn-general] LogisticRegression versus SGDClassifier(loss="log")?

2012-06-15 Thread Peter Prettenhofer
Hi Fred, the major difference is the optimization algorithm: Liblinear/Coordinate Descent vs. Stochastic Gradient Descent. If your problem is high dimensional (10K or more) and you have a large number of examples (100K or more) you should choose the latter - otherwise, LogisticRegression should b

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread Yaroslav Halchenko
On Fri, 15 Jun 2012, [email protected] wrote: > https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/misc/dcov.py#L160 > looks like a double sum, but wikipedia only has one sum, elementwise product. sorry -- I might be slow -- what sum? there is only an outer product in 160:Axy = Ax[:, None

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread josef . pktd
On Fri, Jun 15, 2012 at 4:20 PM, Yaroslav Halchenko wrote: > Here is a comparison to output of my code (marked with >): > >  0.00458652660079 0.788017364828 0.00700027844478 0.00483928213727 >> 0.145564526722 0.480124905375 0.422482399359 0.217567496918 > 6.50616752373e-07 7.99461373461e-05 0.0070

[Scikit-learn-general] LogisticRegression versus SGDClassifier(loss="log")?

2012-06-15 Thread Fred Mailhot
Dear all, What are the advantages of choosing one of the Subject line classifiers over the other? At a quick glance, I see the following: - LogisticRegression implements predict_proba for the multiclass case, while SGDClassifier doesn't - SGDClassifier(loss="log") lets you specify multiple CPUs f

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread Yaroslav Halchenko
Here is a comparison to output of my code (marked with >): 0.00458652660079 0.788017364828 0.00700027844478 0.00483928213727 > 0.145564526722 0.480124905375 0.422482399359 0.217567496918 6.50616752373e-07 7.99461373461e-05 0.00700027844478 0.0094610687282 > 0.120884106118 0.249205123601 0.4224823

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread josef . pktd
On Fri, Jun 15, 2012 at 3:50 PM, wrote: > On Fri, Jun 15, 2012 at 10:45 AM, Yaroslav Halchenko > wrote: >> >> On Fri, 15 Jun 2012, Satrajit Ghosh wrote: >>>    hi yarik, >>>    here is my attempt: >>>     >>> [1]https://github.com/satra/scikit-learn/blob/enh/covariance/sklearn/covariance/distan

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread josef . pktd
On Fri, Jun 15, 2012 at 10:45 AM, Yaroslav Halchenko wrote: > > On Fri, 15 Jun 2012, Satrajit Ghosh wrote: >>    hi yarik, >>    here is my attempt: >>     >> [1]https://github.com/satra/scikit-learn/blob/enh/covariance/sklearn/covariance/distance_covariance.py >>    i'll look at your code in det

Re: [Scikit-learn-general] Customizing the vectorizer classes

2012-06-15 Thread Lars Buitinck
2012/6/15 Dinesh B Vadhia : > The class CharNGramAnalyzer is documented at > http://scikit-learn.org/0.8/modules/generated/scikits.learn.feature_extraction.text.CharNGramAnalyzer.html#scikits.learn.feature_extraction.text.CharNGramAnalyzer. That's the 0.8 documentation. The latest release is 0.1

Re: [Scikit-learn-general] Customizing the vectorizer classes

2012-06-15 Thread Dinesh B Vadhia
Olivier I tried to run https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py and got the error: from sklearn.feature_extraction.text import CharNGramAnalyzer ImportError: cannot import name CharNGramAnalyzer The class CharNGramAnalyzer is documented at ht

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread Yaroslav Halchenko
On Fri, 15 Jun 2012, Satrajit Ghosh wrote: >hi yarik, >here is my attempt: > > [1]https://github.com/satra/scikit-learn/blob/enh/covariance/sklearn/covariance/distance_covariance.py >i'll look at your code in detail later today to understand the uv=True it is just to compute dCo[v

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread Satrajit Ghosh
hi yarik, here is my attempt: https://github.com/satra/scikit-learn/blob/enh/covariance/sklearn/covariance/distance_covariance.py i'll look at your code in detail later today to understand the uv=True case. cheers, satra On Fri, Jun 15, 2012 at 10:19 AM, Yaroslav Halchenko wrote: > I haven't

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread Yaroslav Halchenko
I haven't had a chance to play with it extensively but I have a basic implementation: https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/misc/dcov.py which still lacks statistical assessment, but provides dCov, dCor values and yes -- it is "inherently multivariate", but since also could be useful

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread Satrajit Ghosh
hi yarik, hm... interesting -- and there is no comparison against "minimizing > independence"? e.g. dCov measure > http://en.wikipedia.org/wiki/Distance_correlation which is really simple > to estimate and as intuitive as a correlation coefficient > thanks for bringing up dCov. have you had a cha

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread xinfan meng
Submitted 5/07; Revised 6/11; Published 5/12 It takes such a long time ... On Fri, Jun 15, 2012 at 8:58 PM, Satrajit Ghosh wrote: > fyi > > -- Forwarded message -- > From: joshua vogelstein > Date: Fri, Jun 15, 2012 at 12:35 AM > > http://jmlr.csail.mit.edu/papers/volume13/song

Re: [Scikit-learn-general] feature selection algo

2012-06-15 Thread Yaroslav Halchenko
hm... interesting -- and there is no comparison against "minimizing independence"? e.g. dCov measure http://en.wikipedia.org/wiki/Distance_correlation which is really simple to estimate and as intuitive as a correlation coefficient On Fri, 15 Jun 2012, Satrajit Ghosh wrote: >fyi >
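For readers unfamiliar with the measure mentioned here, the following is a minimal pure-Python sketch of the sample distance correlation for two one-dimensional samples, following the definition on the linked Wikipedia page (double-centered distance matrices, then the mean of their elementwise product). It is not the PyMVPA or scikit-learn implementation; the function names are hypothetical, and no statistical assessment is included.

```python
import math


def _centered_distances(values):
    """Pairwise absolute distances, double-centered (rows and columns sum to 0)."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[j][k] - row_means[j] - row_means[k] + grand_mean for k in range(n)]
        for j in range(n)
    ]


def distance_covariance_sq(x, y):
    """Squared sample distance covariance: mean of the elementwise product."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation; 0.0 if either sample is constant."""
    dcov2 = distance_covariance_sq(x, y)
    denom = math.sqrt(distance_covariance_sq(x, x) * distance_covariance_sq(y, y))
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
dcor_linear = distance_correlation(x, [2.0 * v + 1.0 for v in x])
dcor_quadratic = distance_correlation(x, [v * v for v in x])
```

In the quadratic case the Pearson correlation is zero while the distance correlation stays positive, which is the kind of nonlinear dependence the measure is meant to detect.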

[Scikit-learn-general] feature selection algo

2012-06-15 Thread Satrajit Ghosh
fyi -- Forwarded message -- From: joshua vogelstein Date: Fri, Jun 15, 2012 at 12:35 AM http://jmlr.csail.mit.edu/papers/volume13/song12a/song12a.pdf these guys define a nice nonlinear/nonparametric measure of correlation that might be of interest to you. ---

Re: [Scikit-learn-general] pickled random forest file size, by design?

2012-06-15 Thread Emanuele Olivetti
On 06/13/2012 10:52 AM, Olivier Grisel wrote: > 2012/6/13 Emanuele Olivetti: >> Hi, >> >> You can use gzip.open() instead of open() to add compression and to >> (possibly) >> decrease the file size a lot - at least it did for me in a similar example: >> >> import gzip >> pickle.dump(clf, gzip.open(
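The gzip suggestion quoted above can be tried with the standard library alone. The sketch below is not the code from the thread, which is cut off in this listing: the `model` dictionary is a stand-in for a fitted estimator, and the file names and temporary directory are assumptions made for the example.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in for a fitted estimator (hypothetical; any picklable object works).
model = {"coef": [0.0] * 50000, "classes": list(range(12))}

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "model.pkl")
gzip_path = os.path.join(tmp_dir, "model.pkl.gz")

# Uncompressed pickle, for size comparison.
with open(plain_path, "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Same pickle written through gzip.open() instead of open().
with gzip.open(gzip_path, "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Reading back also goes through gzip.open().
with gzip.open(gzip_path, "rb") as f:
    restored = pickle.load(f)

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
```

How much space is saved depends on how repetitive the pickled data is; the stand-in object here compresses far better than a real model is likely to.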

Re: [Scikit-learn-general] Customizing the vectorizer classes ... for Asian Languages

2012-06-15 Thread Olivier Grisel
2012/6/15 xinfan meng : > The docs tell you that you can customize and define a preprocessor to first > segment the text if needed, e.g. in Chinese or Japanese. However, sklearn > does not provide one such preprocessor. To see how you can implement one, > the best way is to take a look at the code.

Re: [Scikit-learn-general] fetch_mldata()

2012-06-15 Thread Andreas Mueller
On 15.06.2012 10:48, Olivier Grisel wrote: > 2012/6/15 iBayer: >> Hey Andreas, >> >> I'm in contact with folks at mldata.org; apparently things aren't as >> easy as I was hoping. The hdf5 format description is outdated... >> I already uploaded a couple files but they aren't of any use and

Re: [Scikit-learn-general] fetch_mldata()

2012-06-15 Thread Olivier Grisel
2012/6/15 iBayer : > Hey Andreas, > > I'm in contact with folks at mldata.org; apparently things aren't as > easy as I was hoping. The hdf5 format description is outdated... > I already uploaded a couple files but they aren't of any use and > yes the sparse format is especially problematic.

Re: [Scikit-learn-general] Possibility to do a sprint in Paris, 13-14 September

2012-06-15 Thread Nelle Varoquaux
I'll create a wiki page on the scikit's github wiki, and indicate the sprint on pyconfr's website. Cheers, N On 14 June 2012 18:42, Alexandre Gramfort wrote: > I should be there too > > Alex > > On Thu, Jun 14, 2012 at 6:50 PM, Olivier Grisel > wrote: > > 2012/6/14 Nelle Varoquaux : > >> Hi eve

Re: [Scikit-learn-general] Customizing the vectorizer classes ... for Asian Languages

2012-06-15 Thread xinfan meng
The docs tell you that you can customize and define a preprocessor to first segment the text if needed, e.g. in Chinese or Japanese. However, sklearn does not provide one such preprocessor. To see how you can implement one, the best way is to take a look at the code. I think the text processing pip
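As a rough illustration of a user-supplied segmentation step, the sketch below splits text into overlapping character bigrams. It is a dictionary-free stand-in, not real Chinese or Japanese word segmentation, and the function name `char_ngrams` is hypothetical. A callable like this could be passed wherever the vectorizer accepts a custom callable; the exact parameter name depends on the scikit-learn version, so check the documentation for the release in use.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace.

    Crude substitute for word segmentation in languages written
    without spaces between words.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Too short for a full n-gram: return what is there, if anything.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


tokens = char_ngrams("机器学习 很有趣")
```

A proper segmenter would use a dictionary or a trained model; character n-grams are only a simple baseline that needs no external resources.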

Re: [Scikit-learn-general] fetch_mldata()

2012-06-15 Thread iBayer
Hey Andreas, I'm in contact with folks at mldata.org; apparently things aren't as easy as I was hoping. The hdf5 format description is outdated... I already uploaded a couple files but they aren't of any use and yes the sparse format is especially problematic. I'll keep you posted 2012/6/