Re: [scikit-learn] why the modification in the tf-idf formula?

2024-05-28 Thread Sebastian Raschka
Hi Sole,

It’s been a long time, but I remember helping with drafting the Tf-idf text in 
the documentation as part of a scikit-learn sprint at SciPy a looong time ago 
where I mentioned this difference (since it initially surprised me, because I 
couldn’t get it to match my from-scratch implementation). As far as I remember, 
the sklearn version addressed some instability issues for certain edge cases.
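
For a quick illustration of the formula difference (a rough sketch based on how I 
understand the documented formulas; n is the number of documents and df the 
document frequency of a term):

import numpy as np

# The textbook idf ln(n/df) becomes 0 for a term that occurs in every document
# (and the unsmoothed version can divide by zero for unseen terms), whereas
# sklearn's smoothed variant ln((1+n)/(1+df)) + 1 keeps such terms.
n, df = 3, 3                                   # a term present in all 3 documents
textbook_idf = np.log(n / df)                  # 0.0 -> the term vanishes from tf-idf
sklearn_idf = np.log((1 + n) / (1 + df)) + 1   # 1.0 -> the term is kept
print(textbook_idf, sklearn_idf)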

I am not sure if that helps, but I have briefly compared the textbook vs the 
sklearn tf-idf here: 
https://github.com/rasbt/machine-learning-book/blob/main/ch08/ch08.ipynb

Best,
Sebastian





--
Sebastian Raschka, PhD
Machine learning and AI researcher, https://sebastianraschka.com

Staff Research Engineer at Lightning AI, https://lightning.ai


On May 28, 2024 at 9:43 AM -0500, Sole Galli via scikit-learn 
, wrote:
> Hi guys,
>
> I'd like to understand why sklearn's implementation of tf-idf is different 
> from the standard textbook notation as described in the docs: 
> https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
>
> Do you have any reference that I could take a look at? I didn't manage to 
> find them in the docs, maybe I missed something?
>
> Thank you!
>
> Best wishes
> Sole
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] New core developer: Tim Head

2023-03-08 Thread Sebastian Raschka
Awesome news! Congrats Tim!

Cheers,
Sebastian








On Mar 8, 2023, 8:35 AM -0600, Ruchika Nayyar , wrote:
> Congratulations Tim! Good to see you virtually :)
>
> Thanks,
> Ruchika
>
> 
> Dr. Ruchika Nayyar
> Data Scientist, Greene Tweed & Co.
>
>
> > On Wed, Mar 8, 2023 at 5:09 AM Tim Head  wrote:
> > > Thanks a lot! I look forward to working together with the community and 
> > > other contributors!
> > >
> > > T
> > >
> > > > On Mon, 6 Mar 2023 at 23:51, Christian Lorentzen 
> > > >  wrote:
> > > > > Dear all
> > > > >
> > > > > I'm very excited to announce that Tim Head, 
> > > > > https://github.com/betatim,
> > > > > is joining scikit-learn as core developer.
> > > > > Congratulations and a warm welcome Tim!
> > > > >
> > > > > on behalf of the scikit-learn team
> > > > > Christian
> > > > >
> > > > > ___
> > > > > scikit-learn mailing list
> > > > > scikit-learn@python.org
> > > > > https://mail.python.org/mailman/listinfo/scikit-learn
> > > ___
> > > scikit-learn mailing list
> > > scikit-learn@python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANNOUNCEMENT] scikit-learn 1.0 release

2021-09-24 Thread Sebastian Raschka
A 1.0 release is huge, and this is really awesome news! Very exciting! Congrats 
to the scikit-learn team and everyone who helped make this possible!

Cheers,
Sebastian
On Sep 24, 2021, 11:40 AM -0500, Adrin , wrote:
> Hi everyone,
>
> We're happy to announce the 1.0 release which you can install via pip or 
> conda:
>
>     pip install -U scikit-learn
>
> or
>
>     conda install -c conda-forge scikit-learn
>
> You can read the release highlights under 
> https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_0_0.html
>  and the long list of the changes under 
> https://scikit-learn.org/stable/whats_new/v1.0.html
>
> New major features include: mandatory keyword arguments in many places, 
> Spline Transformers, Quantile Regressor, Feature Names Support, a more 
> flexible plotting API, Online One-Class SVM, and much more!
>
> This version supports Python versions 3.7 to 3.9.
>
> A big thanks to all contributors for making this release possible.
>
> Regards,
> Adrin,
> On the behalf of the scikit-learn maintainer team.
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Regarding negative value of sklearn.metrics.r2_score and sklearn.metrics.explained_variance_score

2021-08-12 Thread Sebastian Raschka
The R2 function in scikit-learn works fine. A negative value means that the 
regression model fits the data worse than a horizontal line at the sample mean. 
E.g., you usually get that if you overfit the training set a lot and then apply 
that model to the test set. The econometrics book probably didn't cover applying 
a model to an independent dataset or test set, hence the [0, 1] suggestion.
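
A quick sketch with made-up numbers: predicting the sample mean for every 
observation gives R^2 = 0, and anything worse than that goes negative.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_mean = np.full_like(y_true, y_true.mean())

print(r2_score(y_true, y_mean))                          # 0.0 (mean baseline)
print(r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0])))  # -3.0 (worse than the mean)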

Cheers,
Sebastian


On Aug 12, 2021, 2:20 PM -0500, Samir K Mahajan , 
wrote:
>
> Dear Christophe Pallier,  Reshama Saikh and Tromek Drabas,
>
> Thank you for your kind response.  Fair enough. I go with you R2 is not a 
> square.  However, if you open any  book of econometrics,  it says R2 is  a 
> ratio that lies between 0  and 1.  This is the constraint. It measures the 
> proportion or percentage of the total variation in  response variable (Y)  
> explained by the regressors (Xs) in the model . Remaining proportion of 
> variation in Y, if any,  is explained by the residual term(u) Now, 
> sklearn.matrics. metrics.r2_score gives me a negative value lying on a linear 
> scale (-5.763335245921777). This negative value breaks the constraint. I just 
> want to highlight that. I think it needs to be corrected. Rest is up to you .
>
> I find that  Reshama Saikh  is hurt by my email. I am really sorry for that. 
> Please note I never undermine your  capabilities and initiatives. You are 
> great people doing great jobs. I realise that I should have been more 
> sensible.
>
> My regards to all of you.
>
> Samir K Mahajan
>
>
>
>
>
>
>
>
> > On Thu, Aug 12, 2021 at 12:02 PM Christophe Pallier 
> >  wrote:
> > > Simple: despite its name R2 is not a square. Look up its definition.
> > >
> > > > On Wed, 11 Aug 2021, 21:17 Samir K Mahajan, 
> > > >  wrote:
> > > > > Dear All,
> > > > > I am amazed to find  negative  values of  sklearn.metrics.r2_score 
> > > > > and sklearn.metrics.explained_variance_score in a model ( cross 
> > > > > validation of OLS regression model)
> > > > > However, what amuses me more  is seeing you justifying   negative  
> > > > > 'sklearn.metrics.r2_score ' in your documentation.  This does not 
> > > > > make sense to me . Please justify to me how squared values are 
> > > > > negative.
> > > > >
> > > > > Regards,
> > > > > Samir K Mahajan.
> > > > >
> > > > > ___
> > > > > scikit-learn mailing list
> > > > > scikit-learn@python.org
> > > > > https://mail.python.org/mailman/listinfo/scikit-learn
> > > ___
> > > scikit-learn mailing list
> > > scikit-learn@python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Presented scikit-learn to the French President

2020-12-05 Thread Sebastian Raschka
This is really awesome news! Thanks a lot to everyone developing scikit-learn. 
I am just wrapping up another successful semester, teaching students ML basics. 
Most coming from an R background, they really loved scikit-learn and 
appreciated its ease of use and well-thought-out API.

Best,
Sebastian

> On Dec 5, 2020, at 9:28 AM, Jitesh Khandelwal  wrote:
> 
> Amazing, inspiring! Kudos to the sklearn team.
> 
> On Sat, Dec 5, 2020, 4:30 AM Gael Varoquaux  
> wrote:
> Hi scikit-learn community,
> 
> Today, I presented some efforts in digital health to the French president
> and part of the government. As these efforts were partly powered by
> scikit-learn (and the whole pydata stack, to be fair), the team in charge
> of the event had printed a huge scikit-learn logo behind me:
> https://twitter.com/GaelVaroquaux/status/1334959438059462659 (terrible
> mobile-phone picture)
> 
> I would have liked to get a picture with the president and the logo, but
> it seems that they are releasing only a handful of pictures :(. Anyhow... 
> 
> 
> Thanks to the community! This is a huge success. For health topics (we
> are talking nationwide electronic health records) the ability to build on
> an independent open-source stack is extremely important. We, as a wider
> community, are building something priceless.
> 
> Cheers,
> 
> Gaël
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] make_classification question

2020-08-12 Thread Sebastian Raschka
Hi Anna,

You can set shuffle=False (it's set to True by default in the 
make_classification function). Then, the resulting features will be sorted as 
follows:  X[:, :n_informative + n_redundant + n_repeated]. I.e., if you set 
“n_features=1000” and “n_informative=20”, the first 20 features will be the 
informative ones.
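
A small sketch of what I mean (I set n_redundant and n_repeated explicitly here 
just to keep the column order obvious):

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=1000, n_informative=20,
                           n_redundant=0, n_repeated=0, shuffle=False,
                           random_state=123)
X_informative = X[:, :20]   # with shuffle=False, the informative columns come first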

Best,
Sebastian

> On Aug 12, 2020, at 8:35 AM, Anna Jenul  wrote:
> 
> Hi!
> I am generating own datasets with sklearn.datasets.make_classification. 
> Unfortunately, I cannot figure out which of the generated features are the 
> informative ones. In my example I generate “n_features=1000” and 
> “n_informative=20”. Is there any possibility to get the informative features 
> after the dataset is generated?
> Thanks,
> Anna
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] The exact formula used to compute the tf-idf

2020-02-01 Thread Sebastian Raschka
Hi there,

unfortunately I currently don't have time to walk through your example, but I 
wrote down how the Tf-idf in sklearn works using some examples here: 
https://github.com/rasbt/pattern_classification/blob/90710922e4f4d7e3f432221b8a4d2ec1dd2d9dc9/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb

(I remember that we used it to write portions of the documentation in sklearn 
later)
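
But in short, with norm=None and the default settings (smooth_idf=True, 
sublinear_tf=False), the tf in sklearn is the raw count -- it is not divided by 
the document length. A quick sketch with the three example documents from your 
email:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["AA BB BB CC CC CC",
         "AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD",
         "AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF"]

counts = CountVectorizer().fit_transform(texts).toarray()   # raw term counts
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                               # document frequencies
idf = np.log((1 + n_docs) / (1 + df)) + 1                   # smoothed idf
manual = counts * idf                                       # tf-idf = count * idf

sklearn_tfidf = TfidfVectorizer(norm=None).fit_transform(texts).toarray()
print(np.allclose(manual, sklearn_tfidf))                   # True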

Best,
Sebastian

> On Feb 1, 2020, at 12:53 PM, Peng Yu  wrote:
> 
> Hi,
> 
> I am trying to understand the exact formula for tf-idf.
> 
> vectorizer = TfidfVectorizer(ngram_range = (1, 1), norm = None)
> wordtfidf = vectorizer.fit_transform(texts)
> 
> Given the following 3 documents (id1, id2, id3 are the IDs of the
> three documents).
> 
> id1   AA BB BB CC CC CC
> id2   AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD
> id3   AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF
> 
> The results are the following.
> 
> id1▸  cc▸ 5.079441541679836¬
> id1▸  bb▸ 2.5753641449035616¬
> id1▸  aa▸ 1.0¬
> id2▸  dd▸ 7.726092434710685¬
> id2▸  bb▸ 6.438410362258904¬
> id2▸  aa▸ 4.0¬
> id3▸  ff▸ 15.238324625039509¬
> id3▸  dd▸ 10.301456579614246¬
> id3▸  aa▸ 7.0¬
> 
> According to "6.2.3.4. Tf–idf term weighting" on the following page.
> 
> https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
> 
> For aa, as n = 3 and df =3, idf(aa) = log((1+n)/(1+df)) + 1 = 1.
> 
> But I don't understand why tf-idf(id1, aa) is 1. This means that
> tf(id1, aa) is 1, which is just the count of aa, shouldn't it be
> divided by the number of terms in the doc id1, which should result in
> 1/6 instead of 1?
> 
> Thanks.
> 
> -- 
> Regards,
> Peng
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Sebastian Raschka
Hi Peng,

check out 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py
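
A quick sketch, in case it's useful: the list is also importable directly, or 
you can ask the vectorizer for it.

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

print(len(ENGLISH_STOP_WORDS))   # frozenset with the built-in English stop words
print(sorted(CountVectorizer(stop_words='english').get_stop_words())[:5])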

Best,
Sebastian

> On Jan 27, 2020, at 2:30 PM, Peng Yu  wrote:
> 
> Hi,
> 
> I don't see what stopwords are used by CountVectorizer with
> stop_words = ‘english’.
> 
> https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
> 
> Is there a way to figure it out? Thanks.
> 
> -- 
> Regards,
> Peng
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] scikit-learn twitter account

2019-11-04 Thread Sebastian Raschka
I think that a twitter account for scikit-learn would be awesome. I could 
envision it for announcements (new features, package releases, etc.), but it 
would be cool to share interesting applications of scikit-learn, upcoming 
events (tutorials, conference talks) as well -- somewhat similar to what they 
are doing with @PyTorch. That would be super nice.

Best,
Sebastian

> On Nov 4, 2019, at 8:04 AM, Guillaume Lemaître  wrote:
> 
> +1 for outreach / -1 for support
> 
> FWIW we have several persons asking us how they could know about future 
> sprints at the Man AHL sprint. The Twitter account could be a nice channel to 
> relay the info about such public event. Communicating on the releases would 
> also be great. 
> 
> 
> 
> 
> Sent from my phone - sorry to be brief and potential misspell.
> 
> 
> 
> Original Message  
> 
> 
> 
> From: gael.varoqu...@normalesup.org
> Sent: 4 November 2019 14:05
> To: scikit-learn@python.org
> Reply to: scikit-learn@python.org
> Subject: Re: [scikit-learn] scikit-learn twitter account
> 
> 
> On Mon, Nov 04, 2019 at 05:41:31PM +0530, Siddharth Gupta wrote:
>> Would be good for the users to have a social media account to reach out to.
> 
> I do not think that the point is to do support, but outreach.
> 
> Gaël
> 
>> On Mon, 4 Nov 2019, 17:38 Nicolas Hug,  wrote:
> 
> 
>>  I like the idea as well
> 
>>  On 11/4/19 5:58 AM, Adrin wrote:
> 
>>  sounds pretty good to me :)
> 
>>  On Mon, Nov 4, 2019 at 10:51 AM Chiara Marmo 
>> 
>>  wrote:
> 
>>  Hello everybody,
> 
>>  I've taken a look to the last meeting minutes: talking about
>>  releases and sprint announcements, it seems that the need for a
>>  centralized communication channel is rising, both from user and 
>> dev
>>  sides.
>>  What about starting to use the scikit-learn twitter account for
>>  that?
>>  This will also help to animate the community, sckit-learn 
>> benefits
>>  of a lot of mentions which are never answered.
>>  I can help with managing the account if needed.
> 
>>  WDYT?
> 
>>  Chiara
>>  ___
>>  scikit-learn mailing list
>>  scikit-learn@python.org
>>  https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
>>  ___
>>  scikit-learn mailing list
>>  scikit-learn@python.org
>>  https://mail.python.org/mailman/listinfo/scikit-learn
> 
>>  ___
>>  scikit-learn mailing list
>>  scikit-learn@python.org
>>  https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> --
> Gael Varoquaux
> Research Director, INRIA   Visiting professor, McGill
> http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Can we say stochastic gradient descent as an ML model?

2019-10-28 Thread Sebastian Raschka
Hi Bulbul,

I would rather say SGD is a method for optimizing the objective (or loss) 
function of certain ML models, i.e., for learning the parameters of those models.
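
A small sketch of that distinction (note that the logistic loss is spelled 
'log_loss' in recent scikit-learn versions and 'log' in older ones): the model 
is logistic regression in both cases; SGD vs. lbfgs is just the optimizer used 
to learn its parameters.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(random_state=0)
sgd_logreg = SGDClassifier(loss='log_loss', max_iter=1000, random_state=0).fit(X, y)
lbfgs_logreg = LogisticRegression(solver='lbfgs').fit(X, y)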

Best,
Sebastian

> On Oct 28, 2019, at 4:00 PM, Bulbul Ahmmed via scikit-learn 
>  wrote:
> 
> Dear Scikit Learn Community!
> 
> Scikit learn puts stochastic gradient descent (SGD) as an ML model under the 
> umbrella of linear model. I know SGD is an optimization algorithm. My 
> question is: can we say SGD is an ML model? Thanks,
> 
> Best Regards,
> Bulbul
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-06 Thread Sebastian Raschka
You are right, changing the figure size would fix the issue (updated the 
notebook). In practice, I think the issue becomes choosing a good aspect ratio 
such that the 

a) general proportions of the plot look ok
b) proportions of the boxes wrt the arrows look ok

It's all possible for a user to do, but for my use cases (e.g., making a quick 
graphic for a presentation / meeting) it was just quicker with graphviz. On the 
other hand, I would prefer/recommend the plot_tree func just because it is 
based on matplotlib ...
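
For reference, this is the kind of figure-size workaround I mean (a quick sketch 
on the iris data; the figsize value is just a guess that worked for a shallow 
tree):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(16, 8))   # a generous width avoids overlapping boxes
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True, ax=ax)
plt.show()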

In any case, I haven't had a chance to look at the plot_tree func, but I guess 
this could potentially be relatively easy to address. It would probably just 
require finding and setting a good default value for the

a) XOR case where a user provides either feature names or class label names. 
b) AND case where a user provides both feature names and class label names.



> On Oct 6, 2019, at 9:55 AM, Andreas Mueller  wrote:
> 
> Thanks!
> I'll double check that issue. Generally you have to set the figure size to 
> get good results.
> We should probably add some code to set the figure size automatically (if we 
> create a figure?).
> 
> 
> On 10/6/19 10:40 AM, Sebastian Raschka wrote:
>> Sure, I just ran an example I made with graphviz via plot_tree, and it looks 
>> like there's an issue with overlapping boxes if you use class (and/or 
>> feature) names. I made a reproducible example here so that you can take a 
>> look:
>> https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb
>> 
>> Happy to add this to the sklearn issue list if there's no issue filed for 
>> that yet.
>> 
>> Best,
>> Sebastian
>> 
>>> On Oct 6, 2019, at 9:10 AM, Andreas Mueller  wrote:
>>> 
>>> 
>>> 
>>> On 10/4/19 11:28 PM, Sebastian Raschka wrote:
>>>> The docs show a way such that you don't need to write it as png file using 
>>>> tree.plot_tree:
>>>> https://scikit-learn.org/stable/modules/tree.html#classification
>>>> 
>>>> I don't remember why, but I think I had problems with that in the past (I 
>>>> think it didn't look so nice visually, but don't remember), which is why I 
>>>> still stick to graphviz.
>>> Can you give me examples that don't look as nice? I would love to improve 
>>> it.
>>> 
>>>>  For my use cases, it's not much hassle -- it used to be a bit of a hassle 
>>>> to get GraphViz working, but now you can do
>>>> 
>>>> conda install pydotplus
>>>> conda install graphviz
>>>> 
>>>> Coincidentally, I just made an example for a lecture I was teaching on 
>>>> Tue: 
>>>> https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> 
>>>>> On Oct 4, 2019, at 10:09 PM, C W  wrote:
>>>>> 
>>>>> On a separate note, what do you use for plotting?
>>>>> 
>>>>> I found graphviz, but you have to first save it as a png on your 
>>>>> computer. That's a lot work for just one plot. Is there something like a 
>>>>> matplotlib?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka 
>>>>>  wrote:
>>>>> Yeah, think of it more as a computational workaround for achieving the 
>>>>> same thing more efficiently (although it looks inelegant/weird)-- 
>>>>> something like that wouldn't be mentioned in textbooks.
>>>>> 
>>>>> Best,
>>>>> Sebastian
>>>>> 
>>>>>> On Oct 4, 2019, at 6:33 PM, C W  wrote:
>>>>>> 
>>>>>> Thanks Sebastian, I think I get it.
>>>>>> 
>>>>>> It's just have never seen it this way. Quite different from what I'm 
>>>>>> used in Elements of Statistical Learning.
>>>>>> 
>>>>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka 
>>>>>>  wrote:
>>>>>> Not sure if there's a website for that. In any case, to explain this 
>>>>>> differently, as discussed earlier sklearn assumes continuous features 
>>>>>> for decision trees. So, it will use a binary threshold for splitting 
>>>>>> along a feature attribute. In other words, it cannot do sth like
>>>>>> 
>>>>>> if x == 1 then right child node
>>>

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-06 Thread Sebastian Raschka
Sure, I just ran an example I made with graphviz via plot_tree, and it looks 
like there's an issue with overlapping boxes if you use class (and/or feature) 
names. I made a reproducible example here so that you can take a look:
https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb

Happy to add this to the sklearn issue list if there's no issue filed for that 
yet.

Best,
Sebastian

> On Oct 6, 2019, at 9:10 AM, Andreas Mueller  wrote:
> 
> 
> 
> On 10/4/19 11:28 PM, Sebastian Raschka wrote:
>> The docs show a way such that you don't need to write it as png file using 
>> tree.plot_tree:
>> https://scikit-learn.org/stable/modules/tree.html#classification
>> 
>> I don't remember why, but I think I had problems with that in the past (I 
>> think it didn't look so nice visually, but don't remember), which is why I 
>> still stick to graphviz.
> Can you give me examples that don't look as nice? I would love to improve it.
> 
>>  For my use cases, it's not much hassle -- it used to be a bit of a hassle 
>> to get GraphViz working, but now you can do
>> 
>> conda install pydotplus
>> conda install graphviz
>> 
>> Coincidentally, I just made an example for a lecture I was teaching on Tue: 
>> https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>> 
>> Best,
>> Sebastian
>> 
>> 
>>> On Oct 4, 2019, at 10:09 PM, C W  wrote:
>>> 
>>> On a separate note, what do you use for plotting?
>>> 
>>> I found graphviz, but you have to first save it as a png on your computer. 
>>> That's a lot work for just one plot. Is there something like a matplotlib?
>>> 
>>> Thanks!
>>> 
>>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka 
>>>  wrote:
>>> Yeah, think of it more as a computational workaround for achieving the same 
>>> thing more efficiently (although it looks inelegant/weird)-- something like 
>>> that wouldn't be mentioned in textbooks.
>>> 
>>> Best,
>>> Sebastian
>>> 
>>>> On Oct 4, 2019, at 6:33 PM, C W  wrote:
>>>> 
>>>> Thanks Sebastian, I think I get it.
>>>> 
>>>> It's just have never seen it this way. Quite different from what I'm used 
>>>> in Elements of Statistical Learning.
>>>> 
>>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka 
>>>>  wrote:
>>>> Not sure if there's a website for that. In any case, to explain this 
>>>> differently, as discussed earlier sklearn assumes continuous features for 
>>>> decision trees. So, it will use a binary threshold for splitting along a 
>>>> feature attribute. In other words, it cannot do sth like
>>>> 
>>>> if x == 1 then right child node
>>>> else left child node
>>>> 
>>>> Instead, what it does is
>>>> 
>>>> if x >= 0.5 then right child node
>>>> else left child node
>>>> 
>>>> These are basically equivalent as you can see when you just plug in values 
>>>> 0 and 1 for x.
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>>> On Oct 4, 2019, at 5:34 PM, C W  wrote:
>>>>> 
>>>>> I don't understand your answer.
>>>>> 
>>>>> Why after one-hot-encoding it still outputs greater than 0.5 or less 
>>>>> than? Does sklearn website have a working example on categorical input?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka 
>>>>>  wrote:
>>>>> Like Nicolas said, the 0.5 is just a workaround but will do the right 
>>>>> thing on the one-hot encoded variables, here. You will find that the 
>>>>> threshold is always at 0.5 for these variables. I.e., what it will do is 
>>>>> to use the following conversion:
>>>>> 
>>>>> treat as car_Audi=1 if car_Audi >= 0.5
>>>>> treat as car_Audi=0 if car_Audi < 0.5
>>>>> 
>>>>> or, it may be
>>>>> 
>>>>> treat as car_Audi=1 if car_Audi > 0.5
>>>>> treat as car_Audi=0 if car_Audi <= 0.5
>>>>> 
>>>>> (Forgot which one sklearn is using, but either way. it will be fine.)
>>>>> 
>>>>> Best,
>>>>> Sebastian
>>>>> 
>>>>> 
>>>>>> O

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
The docs show a way such that you don't need to write it as png file using 
tree.plot_tree:
https://scikit-learn.org/stable/modules/tree.html#classification

I don't remember why, but I think I had problems with that in the past (I think 
it didn't look so nice visually, but don't remember), which is why I still 
stick to graphviz. For my use cases, it's not much hassle -- it used to be a 
bit of a hassle to get GraphViz working, but now you can do

conda install pydotplus
conda install graphviz

Coincidentally, I just made an example for a lecture I was teaching on Tue: 
https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb

Best,
Sebastian


> On Oct 4, 2019, at 10:09 PM, C W  wrote:
> 
> On a separate note, what do you use for plotting? 
> 
> I found graphviz, but you have to first save it as a png on your computer. 
> That's a lot work for just one plot. Is there something like a matplotlib?
> 
> Thanks!
> 
> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka  
> wrote:
> Yeah, think of it more as a computational workaround for achieving the same 
> thing more efficiently (although it looks inelegant/weird)-- something like 
> that wouldn't be mentioned in textbooks. 
> 
> Best,
> Sebastian
> 
> > On Oct 4, 2019, at 6:33 PM, C W  wrote:
> > 
> > Thanks Sebastian, I think I get it.
> > 
> > It's just have never seen it this way. Quite different from what I'm used 
> > in Elements of Statistical Learning.
> > 
> > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka 
> >  wrote:
> > Not sure if there's a website for that. In any case, to explain this 
> > differently, as discussed earlier sklearn assumes continuous features for 
> > decision trees. So, it will use a binary threshold for splitting along a 
> > feature attribute. In other words, it cannot do sth like
> > 
> > if x == 1 then right child node
> > else left child node
> > 
> > Instead, what it does is
> > 
> > if x >= 0.5 then right child node
> > else left child node
> > 
> > These are basically equivalent as you can see when you just plug in values 
> > 0 and 1 for x.
> > 
> > Best,
> > Sebastian
> > 
> > > On Oct 4, 2019, at 5:34 PM, C W  wrote:
> > > 
> > > I don't understand your answer.
> > > 
> > > Why after one-hot-encoding it still outputs greater than 0.5 or less 
> > > than? Does sklearn website have a working example on categorical input?
> > > 
> > > Thanks!
> > > 
> > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka 
> > >  wrote:
> > > Like Nicolas said, the 0.5 is just a workaround but will do the right 
> > > thing on the one-hot encoded variables, here. You will find that the 
> > > threshold is always at 0.5 for these variables. I.e., what it will do is 
> > > to use the following conversion:
> > > 
> > > treat as car_Audi=1 if car_Audi >= 0.5
> > > treat as car_Audi=0 if car_Audi < 0.5
> > > 
> > > or, it may be
> > > 
> > > treat as car_Audi=1 if car_Audi > 0.5
> > > treat as car_Audi=0 if car_Audi <= 0.5
> > > 
> > > (Forgot which one sklearn is using, but either way. it will be fine.)
> > > 
> > > Best,
> > > Sebastian
> > > 
> > > 
> > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug  wrote:
> > >> 
> > >> 
> > >>> But, decision tree is still mistaking one-hot-encoding as numerical 
> > >>> input and split at 0.5. This is not right. Perhaps, I'm doing something 
> > >>> wrong?
> > >> 
> > >> You're not doing anything wrong, and neither is the tree. Trees don't 
> > >> support categorical variables in sklearn, so everything is treated as 
> > >> numerical.
> > >> 
> > >> This is why we do one-hot-encoding: so that a set of numerical (one hot 
> > >> encoded) features can be treated as if they were just one categorical 
> > >> feature.
> > >> 
> > >> 
> > >> 
> > >> Nicolas
> > >> 
> > >> On 10/4/19 2:01 PM, C W wrote:
> > >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on 
> > >>> my part.
> > >>> 
> > >>> Looks like I did one-hot-encoding correctly. My new variable names are: 
> > >>> car_Audi, car_BMW, etc.
> > >>> 
> > >>> But, decision tree is still mistaking one-hot-encoding as numerical 
>

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
Yeah, think of it more as a computational workaround for achieving the same 
thing more efficiently (although it looks inelegant/weird)-- something like 
that wouldn't be mentioned in textbooks. 

Best,
Sebastian

> On Oct 4, 2019, at 6:33 PM, C W  wrote:
> 
> Thanks Sebastian, I think I get it.
> 
> It's just have never seen it this way. Quite different from what I'm used in 
> Elements of Statistical Learning.
> 
> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka  
> wrote:
> Not sure if there's a website for that. In any case, to explain this 
> differently, as discussed earlier sklearn assumes continuous features for 
> decision trees. So, it will use a binary threshold for splitting along a 
> feature attribute. In other words, it cannot do sth like
> 
> if x == 1 then right child node
> else left child node
> 
> Instead, what it does is
> 
> if x >= 0.5 then right child node
> else left child node
> 
> These are basically equivalent as you can see when you just plug in values 0 
> and 1 for x.
> 
> Best,
> Sebastian
> 
> > On Oct 4, 2019, at 5:34 PM, C W  wrote:
> > 
> > I don't understand your answer.
> > 
> > Why after one-hot-encoding it still outputs greater than 0.5 or less than? 
> > Does sklearn website have a working example on categorical input?
> > 
> > Thanks!
> > 
> > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka 
> >  wrote:
> > Like Nicolas said, the 0.5 is just a workaround but will do the right thing 
> > on the one-hot encoded variables, here. You will find that the threshold is 
> > always at 0.5 for these variables. I.e., what it will do is to use the 
> > following conversion:
> > 
> > treat as car_Audi=1 if car_Audi >= 0.5
> > treat as car_Audi=0 if car_Audi < 0.5
> > 
> > or, it may be
> > 
> > treat as car_Audi=1 if car_Audi > 0.5
> > treat as car_Audi=0 if car_Audi <= 0.5
> > 
> > (Forgot which one sklearn is using, but either way. it will be fine.)
> > 
> > Best,
> > Sebastian
> > 
> > 
> >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug  wrote:
> >> 
> >> 
> >>> But, decision tree is still mistaking one-hot-encoding as numerical input 
> >>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
> >> 
> >> You're not doing anything wrong, and neither is the tree. Trees don't 
> >> support categorical variables in sklearn, so everything is treated as 
> >> numerical.
> >> 
> >> This is why we do one-hot-encoding: so that a set of numerical (one hot 
> >> encoded) features can be treated as if they were just one categorical 
> >> feature.
> >> 
> >> 
> >> 
> >> Nicolas
> >> 
> >> On 10/4/19 2:01 PM, C W wrote:
> >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my 
> >>> part.
> >>> 
> >>> Looks like I did one-hot-encoding correctly. My new variable names are: 
> >>> car_Audi, car_BMW, etc.
> >>> 
> >>> But, decision tree is still mistaking one-hot-encoding as numerical input 
> >>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
> >>> 
> >>> Is there a good toy example on the sklearn website? I am only see this: 
> >>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.
> >>> 
> >>> Thanks!
> >>> 
> >>> 
> >>> 
> >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka 
> >>>  wrote:
> >>> Hi,
> >>> 
> >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
> >>>> Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5
> >>> 
> >>> that's not a onehot encoding then.
> >>> 
> >>> For an Audi datapoint, it should be
> >>> 
> >>> BMW=0
> >>> Toyota=0
> >>> Audi=1
> >>> 
> >>> for BMW
> >>> 
> >>> BMW=1
> >>> Toyota=0
> >>> Audi=0
> >>> 
> >>> and for Toyota
> >>> 
> >>> BMW=0
> >>> Toyota=1
> >>> Audi=0
> >>> 
> >>> The split threshold should then be at 0.5 for any of these features.
> >>> 
> >>> Based on your email, I think you were assuming that the DT does the 
> >>> one-hot encoding internally, which it doesn't. In practice, it is hard to 
>

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
Not sure if there's a website for that. In any case, to explain this 
differently, as discussed earlier sklearn assumes continuous features for 
decision trees. So, it will use a binary threshold for splitting along a 
feature attribute. In other words, it cannot do sth like

if x == 1 then right child node
else left child node

Instead, what it does is

if x >= 0.5 then right child node
else left child node

These are basically equivalent as you can see when you just plug in values 0 
and 1 for x.
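
A tiny sketch of what that looks like in practice (a single 0/1 column, e.g., 
car_Audi; the printed rules are abbreviated in the comment):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[0.], [0.], [1.], [1.]])   # one one-hot encoded column
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=['car_Audi']))
# |--- car_Audi <= 0.50  -> class 0
# |--- car_Audi >  0.50  -> class 1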

Best,
Sebastian

> On Oct 4, 2019, at 5:34 PM, C W  wrote:
> 
> I don't understand your answer.
> 
> Why after one-hot-encoding it still outputs greater than 0.5 or less than? 
> Does sklearn website have a working example on categorical input?
> 
> Thanks!
> 
> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka  
> wrote:
> Like Nicolas said, the 0.5 is just a workaround but will do the right thing 
> on the one-hot encoded variables, here. You will find that the threshold is 
> always at 0.5 for these variables. I.e., what it will do is to use the 
> following conversion:
> 
> treat as car_Audi=1 if car_Audi >= 0.5
> treat as car_Audi=0 if car_Audi < 0.5
> 
> or, it may be
> 
> treat as car_Audi=1 if car_Audi > 0.5
> treat as car_Audi=0 if car_Audi <= 0.5
> 
> (Forgot which one sklearn is using, but either way. it will be fine.)
> 
> Best,
> Sebastian
> 
> 
>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug  wrote:
>> 
>> 
>>> But, decision tree is still mistaking one-hot-encoding as numerical input 
>>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
>> 
>> You're not doing anything wrong, and neither is the tree. Trees don't 
>> support categorical variables in sklearn, so everything is treated as 
>> numerical.
>> 
>> This is why we do one-hot-encoding: so that a set of numerical (one hot 
>> encoded) features can be treated as if they were just one categorical 
>> feature.
>> 
>> 
>> 
>> Nicolas
>> 
>> On 10/4/19 2:01 PM, C W wrote:
>>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my 
>>> part.
>>> 
>>> Looks like I did one-hot-encoding correctly. My new variable names are: 
>>> car_Audi, car_BMW, etc.
>>> 
>>> But, decision tree is still mistaking one-hot-encoding as numerical input 
>>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
>>> 
>>> Is there a good toy example on the sklearn website? I am only see this: 
>>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.
>>> 
>>> Thanks!
>>> 
>>> 
>>> 
>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka 
>>>  wrote:
>>> Hi,
>>> 
>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
>>>> Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5
>>> 
>>> that's not a onehot encoding then.
>>> 
>>> For an Audi datapoint, it should be
>>> 
>>> BMW=0
>>> Toyota=0
>>> Audi=1
>>> 
>>> for BMW
>>> 
>>> BMW=1
>>> Toyota=0
>>> Audi=0
>>> 
>>> and for Toyota
>>> 
>>> BMW=0
>>> Toyota=1
>>> Audi=0
>>> 
>>> The split threshold should then be at 0.5 for any of these features.
>>> 
>>> Based on your email, I think you were assuming that the DT does the one-hot 
>>> encoding internally, which it doesn't. In practice, it is hard to guess 
>>> what is a nominal and what is a ordinal variable, so you have to do the 
>>> onehot encoding before you give the data to the decision tree.
>>> 
>>> Best,
>>> Sebastian
>>> 
>>>> On Oct 4, 2019, at 11:48 AM, C W  wrote:
>>>> 
>>>> I'm getting some funny results. I am doing a regression decision tree, the 
>>>> response variables are assigned to levels.
>>>> 
>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
>>>> Audi=2) as numerical values, not category.
>>>> 
>>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How 
>>>> does the sklearn know internally 0 vs. 1 is categorical, not numerical? 
>>>> 
>>>> In R for instance, you do as.factor(), which explicitly states the data 
>>>> type.
>>>> 
>>>> Thank you!
>>>> 
>>>> 
>>>> On Wed, Sep 18, 20

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
Like Nicolas said, the 0.5 is just a workaround but will do the right thing on 
the one-hot encoded variables, here. You will find that the threshold is always 
at 0.5 for these variables. I.e., what it will do is to use the following 
conversion:

treat as car_Audi=1 if car_Audi >= 0.5
treat as car_Audi=0 if car_Audi < 0.5

or, it may be

treat as car_Audi=1 if car_Audi > 0.5
treat as car_Audi=0 if car_Audi <= 0.5

(Forgot which one sklearn is using, but either way. it will be fine.)

Best,
Sebastian


> On Oct 4, 2019, at 1:44 PM, Nicolas Hug  wrote:
> 
> 
>> But, decision tree is still mistaking one-hot-encoding as numerical input 
>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
> 
> You're not doing anything wrong, and neither is the tree. Trees don't support 
> categorical variables in sklearn, so everything is treated as numerical.
> 
> This is why we do one-hot-encoding: so that a set of numerical (one hot 
> encoded) features can be treated as if they were just one categorical feature.
> 
> 
> 
> Nicolas
> 
> On 10/4/19 2:01 PM, C W wrote:
>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my 
>> part.
>> 
>> Looks like I did one-hot-encoding correctly. My new variable names are: 
>> car_Audi, car_BMW, etc.
>> 
>> But, decision tree is still mistaking one-hot-encoding as numerical input 
>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong?
>> 
>> Is there a good toy example on the sklearn website? I am only see this: 
>> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>> 
>> Thanks!
>> 
>> 
>> 
>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka > <mailto:m...@sebastianraschka.com>> wrote:
>> Hi,
>> 
>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
>>> Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5
>> 
>> that's not a onehot encoding then.
>> 
>> For an Audi datapoint, it should be
>> 
>> BMW=0
>> Toyota=0
>> Audi=1
>> 
>> for BMW
>> 
>> BMW=1
>> Toyota=0
>> Audi=0
>> 
>> and for Toyota
>> 
>> BMW=0
>> Toyota=1
>> Audi=0
>> 
>> The split threshold should then be at 0.5 for any of these features.
>> 
>> Based on your email, I think you were assuming that the DT does the one-hot 
>> encoding internally, which it doesn't. In practice, it is hard to guess what 
>> is a nominal and what is a ordinal variable, so you have to do the onehot 
>> encoding before you give the data to the decision tree.
>> 
>> Best,
>> Sebastian
>> 
>>> On Oct 4, 2019, at 11:48 AM, C W >> <mailto:tmrs...@gmail.com>> wrote:
>>> 
>>> I'm getting some funny results. I am doing a regression decision tree, the 
>>> response variables are assigned to levels.
>>> 
>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
>>> Audi=2) as numerical values, not category.
>>> 
>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does 
>>> the sklearn know internally 0 vs. 1 is categorical, not numerical? 
>>> 
>>> In R for instance, you do as.factor(), which explicitly states the data 
>>> type.
>>> 
>>> Thank you!
>>> 
>>> 
>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller >> <mailto:t3k...@gmail.com>> wrote:
>>> 
>>> 
>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>> 
>>>> 
>>>> On Sat, 14 Sep 2019 at 20:59, C W >>> <mailto:tmrs...@gmail.com>> wrote:
>>>> Thanks, Guillaume. 
>>>> Column transformer looks pretty neat. I've also heard though, this 
>>>> pipeline can be tedious to set up? Specifying what you want for every 
>>>> feature is a pain.
>>>> 
>>>> It would be interesting for us which part of the pipeline is tedious to 
>>>> set up to know if we can improve something there.
>>>> Do you mean, that you would like to automatically detect of which type of 
>>>> feature (categorical/numerical) and apply a
>>>> default encoder/scaling such as discuss there: 
>>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>>  
>>>> <https://github.com/scikit-learn/scikit-learn/issues/10603#issueco

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
Hi,

> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
> Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5

that's not a onehot encoding then.

For an Audi datapoint, it should be

BMW=0
Toyota=0
Audi=1

for BMW

BMW=1
Toyota=0
Audi=0

and for Toyota

BMW=0
Toyota=1
Audi=0

The split threshold should then be at 0.5 for any of these features.
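
A quick sketch of that encoding with pandas (the column names are just 
illustrative; dtype=int only so the output shows 0/1 rather than True/False on 
newer pandas versions):

import pandas as pd

df = pd.DataFrame({'car': ['Audi', 'BMW', 'Toyota']})
print(pd.get_dummies(df, columns=['car'], dtype=int))
#    car_Audi  car_BMW  car_Toyota
# 0         1        0           0
# 1         0        1           0
# 2         0        0           1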

Based on your email, I think you were assuming that the DT does the one-hot 
encoding internally, which it doesn't. In practice, it is hard to guess what is 
a nominal and what is an ordinal variable, so you have to do the onehot encoding 
before you give the data to the decision tree.

Best,
Sebastian

> On Oct 4, 2019, at 11:48 AM, C W  wrote:
> 
> I'm getting some funny results. I am doing a regression decision tree, the 
> response variables are assigned to levels.
> 
> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, 
> Audi=2) as numerical values, not category.
> 
> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does 
> the sklearn know internally 0 vs. 1 is categorical, not numerical? 
> 
> In R for instance, you do as.factor(), which explicitly states the data type.
> 
> Thank you!
> 
> 
> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller  > wrote:
> 
> 
> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>> 
>> 
>> On Sat, 14 Sep 2019 at 20:59, C W > > wrote:
>> Thanks, Guillaume. 
>> Column transformer looks pretty neat. I've also heard though, this pipeline 
>> can be tedious to set up? Specifying what you want for every feature is a 
>> pain.
>> 
>> It would be interesting for us which part of the pipeline is tedious to set 
>> up to know if we can improve something there.
>> Do you mean, that you would like to automatically detect of which type of 
>> feature (categorical/numerical) and apply a
>> default encoder/scaling such as discuss there: 
>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>  
>> 
>> 
>> IMO, one a user perspective, it would be cleaner in some cases at the cost 
>> of applying blindly a black box
>> which might be dangerous.
> Also see 
> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>  
> 
> Which basically does that.
> 
> 
>>  
>> 
>> Jaiver,
>> Actually, you guessed right. My real data has only one numerical variable, 
>> looks more like this:
>> 
>> Gender DateIncome  Car   Attendance
>> Male 2019/3/01   1   BMW  Yes
>> Female 2019/5/029000   Toyota  No
>> Male 2019/7/15   12000Audi   Yes
>> 
>> I am predicting income using all other categorical variables. Maybe it is 
>> catboost!
>> 
>> Thanks,
>> 
>> M
>> 
>> 
>> 
>> 
>> 
>> 
>> On Sat, Sep 14, 2019 at 9:25 AM Javier López  
>>  wrote:
>> If you have datasets with many categorical features, and perhaps many 
>> categories, the tools in sklearn are quite limited, 
>> but there are alternative implementations of boosted trees that are designed 
>> with categorical features in mind. Take a look
>> at catboost [1], which has an sklearn-compatible API.
>> 
>> J
>> 
>> [1] https://catboost.ai/ 
>> On Sat, Sep 14, 2019 at 3:40 AM C W > > wrote:
>> Hello all,
>> I'm very confused. Can the decision tree module handle both continuous and 
>> categorical features in the dataset? In this case, it's just CART 
>> (Classification and Regression Trees).
>> 
>> For example,
>> Gender Age Income  Car   Attendance
>> Male 30   1   BMW  Yes
>> Female 35 9000  Toyota  No
>> Male 50   12000Audi   Yes
>> 
>> According to the documentation 
>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
>>  
>> ,
>>  it can not! 
>> 
>> It says: "scikit-learn implementation does not support categorical variables 
>> for now". 
>> 
>> Is this true? If not, can someone point me to an example? If yes, what do 
>> people do?
>> 
>> Thank you very much!
>> 
>> 
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org 
>> https://mail.python.org/mailman/listinfo/scikit-learn 
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org 
>> https://mail.python.org/mailman/listinfo/scikit-learn 
>> 
>> ___
>> 

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-09-13 Thread Sebastian Raschka
Hi Mike,

just to make sure we are on the same page,

> I have mixed data type (continuous and categorical). Should I 
> tree.DecisionTreeClassifier() or tree.DecisionTreeRegressor()?

that's independent from the previous email. The comment 

> > "scikit-learn implementation does not support categorical variables for 
> > now". 

we discussed via the previous email was referring to feature variables. Whether 
you choose the DT regressor or classifier depends on the format of your target 
variable.

Best,
Sebastian

> On Sep 13, 2019, at 11:41 PM, C W  wrote:
> 
> Thanks, Sebastian. It's great to know that it works, just need to do 
> one-hot-encoding first.
> 
> I have mixed data type (continuous and categorical). Should I 
> tree.DecisionTreeClassifier() or tree.DecisionTreeRegressor()?
> 
> I'm guessing tree.DecisionTreeClassifier()?
> 
> Best,
> 
> Mike
> 
> On Fri, Sep 13, 2019 at 11:59 PM Sebastian Raschka 
>  wrote:
> Hi,
> 
> if you have the category "car" as shown in your example, this would 
> effectively be something like
> 
> BMW=0
> Toyota=1
> Audi=2
> 
> Sure, the algorithm will execute just fine on the feature column with values 
> in {0, 1, 2}. However, the problem is that it will come up with binary rules 
like x_i >= 0.5, x_i >= 1.5, and x_i >= 2.5. I.e., it will treat it as a 
continuous variable. 
> 
> What you can do is to encode this feature via one-hot encoding -- basically 
extend it into 2 (or 3) binary variables. This has its own problems (if you 
> have a feature with many possible values, you will end up with a large number 
> of binary variables, and they may dominate in the resulting tree over other 
> feature variables).
> 
> In any case, I guess this is what 
> 
> > "scikit-learn implementation does not support categorical variables for 
> > now". 
> 
> 
> means ;).
> 
> Best,
> Sebastian
> 
> > On Sep 13, 2019, at 9:38 PM, C W  wrote:
> > 
> > Hello all,
> > I'm very confused. Can the decision tree module handle both continuous and 
> > categorical features in the dataset? In this case, it's just CART 
> > (Classification and Regression Trees).
> > 
> > For example,
> > Gender Age Income  Car   Attendance
> > Male 30   1   BMW  Yes
> > Female 35 9000  Toyota  No
> > Male 50   12000Audi   Yes
> > 
> > According to the documentation 
> > https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
> >  it can not! 
> > 
> > It says: "scikit-learn implementation does not support categorical 
> > variables for now". 
> > 
> > Is this true? If not, can someone point me to an example? If yes, what do 
> > people do?
> > 
> > Thank you very much!
> > 
> > 
> > 
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] No convergence warning in logistic regression

2019-08-30 Thread Sebastian Raschka
Hi Ben,

I can recall seeing convergence warnings for scikit-learn's logistic regression 
model on datasets in the past as well. Which solver did you use for 
LogisticRegression in sklearn? If you haven't done so, have used the lbfgs 
solver? I.e.,

LogisticRegression(..., solver='lbfgs')?
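
Also, a quick sketch for double-checking this (penalty='none' as in your setup; 
it is spelled penalty=None in scikit-learn >= 1.2): lbfgs does emit a 
ConvergenceWarning when it stops early, but it is a Python warning rather than 
part of the verbose output, so it can be surfaced explicitly like this.

import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ConvergenceWarning)
    LogisticRegression(solver='lbfgs', penalty='none', max_iter=5).fit(X, y)

print([str(w.message) for w in caught])   # typically reports that lbfgs did not converge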

Best,
Sebastian

> On Aug 30, 2019, at 9:52 AM, Benoît Presles  
> wrote:
> 
> Dear all,
> 
> I compared the logistic regression of statsmodels (Logit) with the logistic 
> regression of sklearn (LogisticRegression). As I do not do regularization, I 
> use the fit method with statsmodels and set penalty='none' in sklearn. Most 
> of the time, I have got the same results between the two packages.
> 
> However, when data are correlated, it is not the case. In fact, I have got a 
> very useful convergence warning with statsmodel (ConvergenceWarning: Maximum 
> Likelihood optimization failed to converge) that I do not have with sklearn? 
> Is it normal that I do not have any convergence warning with sklearn even if 
> I put verbose=1? I guess sklearn did not converge either.
> 
> 
> Thanks for your help,
> Best regards,
> Ben
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

2019-04-10 Thread Sebastian Raschka
Hm, weird that their platform seems to be so picky about it. Have you tried to 
just make the output of the pipeline dense? I.e., 

(model.predict(X)).toarray()

Best,
Sebastian

> On Apr 10, 2019, at 1:10 PM, Liam Geron  wrote:
> 
> Hi Sebastian,
> 
> Thanks for the advice! The model actually works on it's own in python fine 
> luckily, so I don't think that that is the issue exactly. I have tried 
> rolling my own estimator to wrap the pipeline to have it call the 
> predict_proba method to return a dense array, however I then came across the 
> problem that I would have to have that custom estimator defined on the Cloud 
> ML end, which I'm unsure how to do.
> 
> Thanks,
> Liam
> 
> On Wed, Apr 10, 2019 at 2:06 PM Sebastian Raschka  
> wrote:
> Hi Liam,
> 
> not sure what your exact error message is, but it may also be that the 
> XGBClassifier only accepts dense arrays? I think the TfidfVectorizer returns 
> sparse arrays. You could probably fix your issues by inserting a 
> "DenseTransformer" into your pipelone (a simple class that just transforms an 
> array from a sparse to a dense format). I've implemented sth like that that 
> you can import or copy it from here:
> 
> https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py
> 
> The usage would then basically be
> 
> model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense', 
> DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
> 
> Best,
> Sebastian
> 
> 
> 
> 
> > On Apr 10, 2019, at 12:25 PM, Liam Geron  wrote:
> > 
> > Hi all,
> > 
> > I was hoping to get some guidance re: changing the result of the predict 
> > method of the OneVsRestClassifier to return a dense array rather than a 
> > sparse array, given that Google Cloud ML only accepts dense numpy arrays as 
> > a result of a given models predict method. Right now my model architecture 
> > looks like:
> > 
> > model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', 
> > OneVsRestClassifier(XGBClassifier()))])
> > 
> > Which returns a sparse array with the predict method. I saw the Stack 
> > Overflow post here: 
> > https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
> > 
> > which recommends overwriting the predict method with the predict_proba 
> > method, however I found that I can't serialize the model after doing so. I 
> > also have a stack overflow post here: 
> > https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a
> >  which details the specific pickling error.
> > 
> > Is this a known issue? Is there an accepted way to convert this into a 
> > dense array?
> > 
> > Thanks,
> > Liam Geron
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

2019-04-10 Thread Sebastian Raschka
Hi Liam,

not sure what your exact error message is, but it may also be that the 
XGBClassifier only accepts dense arrays? I think the TfidfVectorizer returns 
sparse arrays. You could probably fix your issues by inserting a 
"DenseTransformer" into your pipelone (a simple class that just transforms an 
array from a sparse to a dense format). I've implemented sth like that that you 
can import or copy it from here:

https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_transformer.py

The usage would then basically be

model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense', 
DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])
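
And in case you'd rather not add mlxtend as a dependency, a minimal sketch of 
such a transformer (nothing beyond the standard estimator API) would be:

from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a sparse matrix to a dense array inside a Pipeline."""

    def fit(self, X, y=None):
        return self   # stateless: nothing to learn

    def transform(self, X):
        return X.toarray() if hasattr(X, "toarray") else X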

Best,
Sebastian




> On Apr 10, 2019, at 12:25 PM, Liam Geron  wrote:
> 
> Hi all,
> 
> I was hoping to get some guidance re: changing the result of the predict 
> method of the OneVsRestClassifier to return a dense array rather than a 
> sparse array, given that Google Cloud ML only accepts dense numpy arrays as a 
> result of a given models predict method. Right now my model architecture 
> looks like:
> 
> model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', 
> OneVsRestClassifier(XGBClassifier()))])
> 
> Which returns a sparse array with the predict method. I saw the Stack 
> Overflow post here: 
> https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-learn-prediction-probability-predict-proba
> 
> which recommends overwriting the predict method with the predict_proba 
> method, however I found that I can't serialize the model after doing so. I 
> also have a stack overflow post here: 
> https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-onevsrestclassifier-predict-method-output-to-dense-a
>  which details the specific pickling error.
> 
> Is this a known issue? Is there an accepted way to convert this into a dense 
> array?
> 
> Thanks,
> Liam Geron
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] GridsearchCV returns worse scoring the broader parameter space gets

2019-03-31 Thread Sebastian Raschka
Hi Andreas,

the best score is determined by computing the test fold performance (I think 
R^2 by default) and then averaging over them. Since you chose cv=10, you have 
10 test folds, and the performance is the average performance over those for 
choosing the best hyper parameter setting. 

Then, it looks like you are computing the performance manually:

> simple_tree.fit(x_tr,y_tr).score(x_tr,y_tr)

on the whole training set. Instead, I would take a look at the 
simple_tree.best_score_ attribute after fitting. If you do that, you get the average 
score over the held-out test folds for the best parameter setting, which is a much 
less optimistic (and more meaningful) number than the training-set score.
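
As a rough sketch (reusing the variable names from your snippet, which are assumed 
to be defined):

from sklearn import tree
from sklearn.model_selection import GridSearchCV

simple_tree = GridSearchCV(tree.DecisionTreeRegressor(random_state=42),
                           param_grid={'min_samples_split': range(2, 10)},
                           scoring='neg_mean_squared_error',
                           cv=10)
simple_tree.fit(x_tr, y_tr)

# average of the 10 held-out test-fold scores for the best parameter setting
print(simple_tree.best_score_)

# resubstitution R^2 on the training data -- a different metric and typically
# overly optimistic, so not comparable across different search spaces
print(simple_tree.best_estimator_.score(x_tr, y_tr))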

Best,
Sebastian

> On Mar 31, 2019, at 5:15 AM, Andreas Tosstorff  wrote:
> 
> Dear all,
> I am new to scikit learn so please excuse my ignorance. Using GridsearchCV I 
> am trying to optimize a DecisionTreeRegressor. The broader I make the 
> parameter space, the worse the scoring gets.
> Setting min_samples_split to range(2,10) gives me a neg_mean_squared_error of 
> -0.04. When setting it to range(2,5) The score is -0.004.
> simple_tree =GridSearchCV(tree.DecisionTreeRegressor(random_state=42), 
> n_jobs=4, param_grid={'min_samples_split': range(2, 10)}, 
> scoring='neg_mean_squared_error', cv=10, refit='neg_mean_squared_error')
> 
> simple_tree.fit(x_tr,y_tr).score(x_tr,y_tr)
> 
> I expect an equal or more positive score for a more extensive grid search 
> compared to the less extensive one.
> 
> I would really appreciate your help!
> 
> Kind regards,
> Andreas
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] What theory cause SGDRegressor can partial_fit but RandomForestRegressor can't?

2019-03-13 Thread Sebastian Raschka
It's not necessarily unique to stochastic gradient descent, it's more that some 
other algorithms are generally not well suited for "partial_fit". For SGD, 
partial fit is a more natural thing to do since you estimate the training loss 
from minibatches anyway -- i.e., you do SGD step by step anyway.

Also, think about it this way: models trained via SGD are typically parametric, 
so the number of parameters is fixed, and you simply just adjust their values 
iteratively during training. For nonparametric models, such as RF, the number 
of parameters (e.g., if you think about each node in the decision tree as a 
parameter) depends on the examples present in the training set. I.e., how deep 
each individual decision tree eventually becomes depends on the training set. 
So, it doesn't make sense to build a decision tree on a few training examples 
and then update it later by feeding it more training examples. Either way, you 
would probably end up throwing away the decision tree and build a new one if 
you get additional data. I am sure solutions for "updating" decision trees 
exist, which produce somewhat reasonable results efficiently, but it's less 
natural and not a common thing to do, which is why it's probably not 
implemented in scikit-learn.
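
Just to illustrate the SGD side, a small sketch of incremental training with 
partial_fit (synthetic data only for illustration):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = X @ np.array([1., 2., 0., -1., 3.]) + 0.1 * rng.randn(1000)

sgd = SGDRegressor(random_state=0)

# feed the data in mini-batches; every call updates the same fixed set of weights
for start in range(0, len(X), 100):
    sgd.partial_fit(X[start:start + 100], y[start:start + 100])

print(sgd.coef_)  # the number of parameters is fixed by the number of features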

Best,
Sebastian


> On Mar 13, 2019, at 10:45 PM, lampahome  wrote:
> 
> As title, I'm confused why some algo can partial_fit and some algo can't.
> 
> For regression model, I found SGD can but RF can't.
> 
> Is about the difference of algo? I thought it's able to partial_fit because 
> gradient descent, or just another reason?
> 
> thx

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] AUCROC/MAP confidence intervals in scikit

2019-02-07 Thread Sebastian Raschka
Still haven't had a chance to read it, but isn't ROC for binary classification 
anyway? Also, i.i.d. assumptions are typical for the learning algorithms as 
well.

Best,
Sebastian

> On Feb 7, 2019, at 10:15 AM, josef.p...@gmail.com wrote:
> 
> Just a skeptical comment from a bystander.
> 
> I only skimmed parts of the article. My impression is that this does not 
> apply (directly) to the regression setting.
> AFAIU, they assume that all observations have the same propability.
> 
> To me it looks more like the literature on testing of or confidence intervals 
> for a single proportion.
> 
> I might be wrong.
> 
> Josef
> 
> On Thu, Feb 7, 2019 at 11:00 AM Andreas Mueller  wrote:
> The paper definitely looks interesting and the authors are certainly 
> some giants in the field.
> But it is actually not widely cited (139 citations since 2005), and I've 
> never seen it used.
> 
> I don't know why that is, and looking at the citations there doesn't 
> seem to be a lot of follow-up work.
> I think this would need more validation before getting into sklearn.
> 
> Sebastian: This paper is distribution independent and doesn't need 
> bootstrapping, so it looks indeed quite nice.
> 
> 
> On 2/6/19 1:19 PM, Sebastian Raschka wrote:
> > Hi Stuart,
> >
> > I don't think so because there is no standard way to compute CI's. That 
> > goes for all performance measures (accuracy, precision, recall, etc.). Some 
> > people use simple binomial approximation intervals, some people prefer 
> > bootstrapping etc. And it also depends on the data you have. In large 
> > datasets, binomial approximation intervals may be sufficient and 
> > bootstrapping too expensive etc.
> >
> > Thanks for sharing that paper btw, will have a look.
> >
> > Best,
> > Sebastian
> >
> >
> >> On Feb 6, 2019, at 11:28 AM, Stuart Reynolds  
> >> wrote:
> >>
> >> https://papers.nips.cc/paper/2645-confidence-intervals-for-the-area-under-the-roc-curve.pdf
> >> Does scikit (or other Python libraries) provide functions to measure the 
> >> confidence interval of AUROC scores? Same question also for mean average 
> >> precision.
> >>
> >> It seems like this should be a standard results reporting practice if a 
> >> method is available.
> >>
> >> - Stuart
> >> ___
> >> scikit-learn mailing list
> >> scikit-learn@python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] AUCROC/MAP confidence intervals in scikit

2019-02-06 Thread Sebastian Raschka
Hi Stuart,

I don't think so because there is no standard way to compute CI's. That goes 
for all performance measures (accuracy, precision, recall, etc.). Some people 
use simple binomial approximation intervals, some people prefer bootstrapping 
etc. And it also depends on the data you have. In large datasets, binomial 
approximation intervals may be sufficient and bootstrapping too expensive etc.

Thanks for sharing that paper btw, will have a look.

Best,
Sebastian


> On Feb 6, 2019, at 11:28 AM, Stuart Reynolds  
> wrote:
> 
> https://papers.nips.cc/paper/2645-confidence-intervals-for-the-area-under-the-roc-curve.pdf
> Does scikit (or other Python libraries) provide functions to measure the 
> confidence interval of AUROC scores? Same question also for mean average 
> precision.
> 
> It seems like this should be a standard results reporting practice if a 
> method is available.
> 
> - Stuart
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread Sebastian Raschka

> So if I want to reach like "continue training", I should choose model with 
> partial_fit, right?

Yes.

> but I saw nothing have partial_fit function in ensemble methods,


Hm, technically, if the models in the ensemble support partial_fit the ensemble 
method itself should also be able to use partial_fit. My guess is that it is 
not implemented because it cannot be guaranteed that the individual models 
support partial_fit. However, if you are using the voting classifier, you could 
probably just train the individual models of the ensemble, because the voting 
classifier's decision rule is fixed.

I think the following could work if the estimators_ support partial_fit:

voter = VotingClassifier(...)
voter.fit(...)

For further training:

for i in range(len(voter.estimators_)):
    voter.estimators_[i].partial_fit(...)
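
A more complete, runnable sketch of the same idea (SGDClassifier members chosen only 
because they support partial_fit; this is an illustration, not an official API):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.5,
                                              stratify=y, random_state=0)

voter = VotingClassifier(estimators=[('sgd1', SGDClassifier(random_state=1)),
                                     ('sgd2', SGDClassifier(random_state=2))])
voter.fit(X_old, y_old)

# later: update the already-fitted ensemble members with new data;
# the fixed hard-voting rule itself does not need refitting
for est in voter.estimators_:
    est.partial_fit(X_new, y_new, classes=np.unique(y))

print(voter.score(X_new, y_new))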



Best,
Sebastian

> On Feb 1, 2019, at 12:52 AM, lampahome  wrote:
> 
> 
> 
> Sebastian Raschka  於 2019年2月1日 週五 下午1:48寫道:
> Hi there,
> 
> if you call the "fit" method, the learning will essentially start from 
> scratch. So no, it doesn't consider previous training results.  
> However, certain algorithms are implemented with an additional partial_fit 
> method that would consider previous training rounds.
> 
> So if I want to reach like "continue training", I should choose model with 
> partial_fit, right?
> 
> What I want is regression, but I saw nothing have partial_fit function in 
> ensemble methods,
> 
> Can found in other places?
> 
> thx 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread Sebastian Raschka
Hi there,

if you call the "fit" method, the learning will essentially start from scratch. 
So no, it doesn't consider previous training results. 
However, certain algorithms are implemented with an additional partial_fit 
method that would consider previous training rounds.

Best,
Sebastian

> On Jan 31, 2019, at 11:19 PM, lampahome  wrote:
> 
> As title, I'm confused.
> 
> If I reload model and train with new data, what happened?
> 
> 1st train old data -> save model -> reload -> train with new data
> 
> Does the 2nd training will consider about previous training results?
> Or just a new result with new data?
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] LogisticRegression coef_ greater than n_features?

2019-01-08 Thread Sebastian Raschka
It seems like it's determined by the order in which they occur in the training 
set. E.g.,

from sklearn.preprocessing import OneHotEncoder
import numpy as np

x = np.array([['b'],
  ['a'], 
  ['b']])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[0., 1.],
[1., 0.],
[0., 1.]])


and

x = np.array([['a'],
  ['b'], 
  ['a']])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[1., 0.],
[0., 1.],
[1., 0.]])

Not sure how you used the OHE, but you also want to make sure that you only use 
it on those columns that are indeed categorical, e.g., note the following 
behavior: 

x = np.array([['a', 1.1],
  ['b', 1.2], 
  ['a', 1.3]])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[1., 0., 1., 0., 0.],
[0., 1., 0., 1., 0.],
[1., 0., 0., 0., 1.]])
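
If you are on scikit-learn >= 0.20, one way to restrict the encoder to the 
categorical column(s) is a ColumnTransformer; a small sketch (using a pandas 
DataFrame here so the numeric column keeps its dtype):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'cat': ['a', 'b', 'a'],
                   'num': [1.1, 1.2, 1.3]})

# one-hot encode only the categorical column, pass the numeric one through
ct = ColumnTransformer([('onehot', OneHotEncoder(), ['cat'])],
                       remainder='passthrough')
print(ct.fit_transform(df))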


Best,
Sebastian

> On Jan 8, 2019, at 9:33 AM, pisymbol  wrote:
> 
> Also Sebastian, I have binary classes but they are strings:
> 
> clf.classes_:
> array(['American', 'Southwest'], dtype=object)
> 
> 
> 
> On Tue, Jan 8, 2019 at 9:51 AM pisymbol  wrote:
> If that is the case, what order are the coefficients in then?
> 
> -aps
> 
> On Tue, Jan 8, 2019 at 12:48 AM Sebastian Raschka  
> wrote:
> E.g, if you have a feature with values 'a' , 'b', 'c', then applying the one 
> hot encoder will transform this into 3 features.
> 
> Best,
> Sebastian
> 
> > On Jan 7, 2019, at 11:02 PM, pisymbol  wrote:
> > 
> > 
> > 
> > On Mon, Jan 7, 2019 at 11:50 PM pisymbol  wrote:
> > According to the doc (0.20.2) the coef_ variables are suppose to be shape 
> > (1, n_features) for binary classification. Well I created a Pipeline and 
> > performed a GridSearchCV to create a LogisticRegresion model that does 
> > fairly well. However, when I want to rank feature importance I noticed that 
> > my coefs_ for my best_estimator_ has 24 entries while my training data has 
> > 22.
> > 
> > What am I missing? How could coef_ > n_features?
> > 
> > 
> > Just a follow-up, I am using a OneHotEncoder to encode two categoricals as 
> > part of my pipeline (I am also using an imputer/standard scaler too but I 
> > don't see how that could add features).
> > 
> > Could my pipeline actually add two more features during fitting?
> > 
> > -aps
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] LogisticRegression coef_ greater than n_features?

2019-01-07 Thread Sebastian Raschka
Maybe check 

a) if the actual labels of the training examples don't start at 0
b) if you have gaps, e.g,. if your unique training labels are 0, 1, 4, ..., 23

Best,
Sebastian

> On Jan 7, 2019, at 10:50 PM, pisymbol  wrote:
> 
> According to the doc (0.20.2) the coef_ variables are suppose to be shape (1, 
> n_features) for binary classification. Well I created a Pipeline and 
> performed a GridSearchCV to create a LogisticRegresion model that does fairly 
> well. However, when I want to rank feature importance I noticed that my 
> coefs_ for my best_estimator_ has 24 entries while my training data has 22.
> 
> What am I missing? How could coef_ > n_features?
> 
> -aps
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] How GridSearchCV to get best_params?

2019-01-03 Thread Sebastian Raschka
I think it refers to the test folds via the k-fold cross-validation that is 
internally used via the `cv` parameter of GridSearchCV (or the test folds of an 
alternative cross validation scheme that you may pass as an iterator to cv)
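
In other words, a small self-contained sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 3, 4]},
                    cv=KFold(n_splits=5, shuffle=True, random_state=0))
grid.fit(X, y)

# mean score over the 5 held-out test folds, one entry per candidate setting
print(grid.cv_results_['mean_test_score'])

# best_params_ is the candidate with the best of those held-out means
print(grid.best_params_)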

Best,
Sebastian

> On Jan 3, 2019, at 9:44 PM, lampahome  wrote:
> 
> as title
> 
> In the doc it says:
> 
> best_params_ : dict
> Parameter setting that gave the best results on the hold out data.
> 
> My question is what is the hold out data?
> It's score of train data or test data, or mean of train and test score?
> 
> thx
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Any way to tune the parameters better than GridSearchCV?

2018-12-24 Thread Sebastian Raschka
I would like to make a related suggestion, but instead of focusing on the upper 
bound for the number of trees, rather on choosing the lower bound. From a 
theoretical perspective, it doesn't make sense to me how fewer trees can result 
in a better performing random forest model in terms of generalization 
performance. If you observe a better performance on the same independent test 
set with fewer trees, I would say that this is likely not a good indicator of 
better generalization performance. It could be due to overfitting and 
train/test set resampling and/or picking up artifacts in the dataset. 

As a general suggestion, I would choose a reasonable number of trees that seems 
computationally feasible given the size of the dataset and the number of 
hyperparameters to compare via model selection. Then, after tuning, I would use 
the best hyperparameter setting with 10x more trees and see if you notice any 
significant difference in the cross-validation performance. Next, I would fit the 
model to the whole training set with those best hyperparameters and evaluate the 
performance on the independent test set.
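
A sketch of that last sanity check (dataset and "best" parameters are only 
placeholders here):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
best_params = {'max_features': 'sqrt', 'min_samples_leaf': 1}  # e.g. from the grid search

for n_estimators in (100, 1000):  # tuned size vs. 10x more trees
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=0,
                                **best_params)
    scores = cross_val_score(rf, X, y, cv=10)
    print(n_estimators, scores.mean().round(4), scores.std().round(4))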

Best,
Sebastian


> On Dec 24, 2018, at 9:27 PM, Brown J.B. via scikit-learn 
>  wrote:
> 
> Take random forest as example, if I give estimator from 10 to 1(10, 100, 
> 1000, 1) into grid search.
> Based on the result, I found estimator=100 is the best, but I don't know 
> lower or greater than 100 is better.
> How should I decide? brute force or any tools better than GridSearchCV?
> 
> A simple but nonetheless practical solution is to 
>   (1) start with an upper bound on the number of trees you are willing to 
> accept in the model, 
>   (2) obtain its performance (ACC, MCC, F1, etc) as the starting reference 
> point,
>   (3) systematically lower the number of trees (log2 scale down, fixed size 
> decrement, etc)
>   (4) obtain the reduced forest size performance,
>   (5) Repeat (3)-(4) until [performance(reference) - performance(current 
> forest size)] > tolerance
> 
> You can encapsulate that in a function which then returns the final model you 
> obtain.
> From the model object, the number of trees can be obtained.
> 
> J.B.
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] plan to add the association rule classification algorithm in scikit learn

2018-12-16 Thread Sebastian Raschka
Hi Rui,

I agree with Joel that association rule mining could be a bit tricky to fit 
nicely within the scikit-learn API. Maybe this could be some transformer class? 
I thought about that a few years ago but remember that I couldn't come up with 
a good solution at that point.

In any case, I have an association rule implementation in mlxtend 
(http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/),
 which is based on the apriori algorithm. Some users were asking about Eclat 
and FP-Growth algorithms, instead of apriori. If you are interested in such a 
contribution, i.e., implementing Eclat or FP-Growth such that instead of 

frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

one could use

frequent_itemsets = eclat(df, min_support=0.6, use_colnames=True)

or

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

I would be very happy about such a contribution (see issue tracker at 
https://github.com/rasbt/mlxtend/issues/248)

If you have an alternative algorithm for frequent itemset generation in mind (I 
am not sure if others exist, to be honest), I would be happy about that one, too.

Best,
Sebastian

> On Dec 17, 2018, at 12:26 AM, Joel Nothman  wrote:
> 
> Hi Rui,
> 
> This has been discussed several times on the mailing list and issue tracker. 
> We are not interested in association rule mining in Scikit-learn for its own 
> purposes. We would be interested in association rule mining only as part of a 
> classification algorithm. Are there such algorithms which are mature and 
> popular enough to meet our inclusion criteria (see our FAQ)?
> 
> Cheers,
> 
> Joel
> 
> On Mon, 17 Dec 2018 at 09:24, rui min  wrote:
> Dear scikit-learn developers,
> 
>I am Rui from Spain, Granada University. Currently I am planning to write 
> an association rule algorithm in scikit-learn.
>I don’t know if anyone is working on that. So avoid duplication of the 
> work, I would like to ask here.
> 
> Hope to hear from you soon.
> 
> 
> Best Regards
> 
> 
> Rui
> 
> 
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] make all new parameters keyword-only?

2018-11-15 Thread Sebastian Raschka
Also want to say that I really welcome this decision/change. Personally, as far 
as I am aware, I've been trying to use keyword arguments consistently for years, 
except for cases where it is really obvious, like .fit(X_train, y_train), and I 
believe that it has really helped me write less error-prone code/analyses.

Thinking back of the times where I was using MATLAB, it was really clunky and 
error-prone to import functions and being careful about the argument order. 

Besides, keyword arguments definitely make code and documentation much more 
readable (within and especially across different package versions) despite (or 
maybe because of) being more verbose.
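
For reference, the Python syntax that enforces this is the bare * in the signature; 
a toy sketch (not an actual scikit-learn signature):

class MyEstimator:
    def __init__(self, *, alpha=1.0, fit_intercept=True):
        # everything after the bare * can only be passed by keyword
        self.alpha = alpha
        self.fit_intercept = fit_intercept

MyEstimator(alpha=0.5)   # fine
# MyEstimator(0.5)       # TypeError: takes 1 positional argument but 2 were given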

Best,
Sebastian



> On Nov 15, 2018, at 10:18 PM, Brown J.B. via scikit-learn 
>  wrote:
> 
> As an end-user, I would strongly support the idea of future enforcement of 
> keyword arguments for new parameters.
> In my group, we hold a standard that we develop APIs where _all_ arguments 
> must be given by keyword (slightly pedantic style, but has shown to have 
> benefits).
> Initialization/call-time state checks are done by a class' internal methods.
> 
> As Andy said, one could consider leaving prototypical X,y as positional, but 
> one benefit my group has seen with full keyword parameterization is the 
> ability to write code for small investigations where we are more concerned 
> with effects from parameters rather than the data (e.g., a fixed problem to 
> model, and one wants to first see on the code line what the estimators and 
> their parameterizations were). 
> If one could shift the sklearn X,y to the back of a function call, it would 
> enable all participants in a face-to-face code review session to quickly see 
> the emphasis/context of the discussion and move to the conclusion faster.
> 
> To satisfy keyword X,y as well, I would presume that the BaseEstimator would 
> need to have a sanity check for error-raising default X,y values -- though 
> does it not have many checks on X,y already?
> 
> Not sure if everyone else agrees about keyword X and y, but just a thought 
> for consideration.
> 
> Kind regards,
> J.B.
> 
> 2018年11月15日(木) 18:34 Gael Varoquaux :
> I am really in favor of the general idea: it is much better to use named
> arguments for everybody (for readability, and to be less depend on
> parameter ordering).
> 
> However, I would maintain that we need to move slowly with backward
> compatibility: changing in a backward-incompatible way a library brings
> much more loss than benefit to our users.
> 
> So +1 for enforcing the change on all new arguments, but -1 for changing
> orders in the existing arguments any time soon.
> 
> I agree that it would be good to push this change in existing models. We
> should probably announce it strongly well in advance, make sure that all
> our examples are changed (people copy-paste), wait a lot, and find a
> moment to squeeze this in.
> 
> Gaël
> 
> On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote:
> > We could just announce that we will be making this a syntactic constraint 
> > from
> > version X and make the change wholesale then. It would be less formal 
> > backwards
> > compatibility than we usually hold by, but we already are loose with 
> > parameter
> > ordering when adding new ones.
> 
> > It would be great if after this change we could then reorder parameters to 
> > make
> > some sense!
> 
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> -- 
> Gael Varoquaux
> Senior Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone:  ++ 33-1-69-08-79-68
> http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Sebastian Raschka
That's nice to know, thanks a lot for the reference!

Best,
Sebastian

> On Oct 28, 2018, at 3:34 AM, Guillaume Lemaître  
> wrote:
> 
> FYI: https://github.com/scikit-learn/scikit-learn/pull/12364
> 
> On Sun, 28 Oct 2018 at 09:32, Guillaume Lemaître  
> wrote:
> There is always a shuffling when iteration over the features (even when going 
> to all features).
> So in the case of a tie the split will be done on the first feature encounter 
> which will be different due to the shuffling.
> 
> There is a PR which was intending to make the algorithm deterministic to 
> always select the same feature in the case of tie.
> 
> On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann 
>  wrote:
> The random_state is used in the splitters:
> 
> SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS
> 
> splitter = self.splitter
> if not isinstance(self.splitter, Splitter):
> splitter = SPLITTERS[self.splitter](criterion,
> self.max_features_,
> min_samples_leaf,
> min_weight_leaf,
> random_state,
> self.presort)
> 
> Which is defined as:
> 
> DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
>"random": _splitter.RandomSplitter}
> 
> SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
> "random": _splitter.RandomSparseSplitter}
> 
> Both 'best' and 'random' uses random states. The DecisionTreeClassifier uses 
> 'best' as default `splitter` parameter. I am not sure how this 'best' 
> strategy was defined. The docs define as "Supported strategies are “best”. 
> 
> 
> 
> 
> On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański  wrote:
> Just a small side note that I've come across with Random Forests which in the 
> end form an ensemble of Decision Trees. I ran a thousand iterations of RFs on 
> multi-label data and managed to get a 4-10 percentage points difference in 
> subset accuracy, depending on the data set, just as a random effect, while 
> I've seen papers report differences of just a couple pp as statistically 
> significant after a non-parametric rank test. 
> 
> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka  
> wrote:
> Good suggestion. The trees look different. I.e., there seems to be a tie at 
> some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65
> 
> So, I suspect that the features are shuffled, let's call it X_shuffled. Then 
> at some point the max_features are selected, which is by default 
> X_shuffled[:, :n_features]. Based on that, if there's a tie between 
> impurities for the different features, it's probably selecting the first 
> feature in the array among these ties.
> 
> If this is true (have to look into the code more deeply then) I wonder if it 
> would be worthwhile to change the implementation such that the shuffling only 
> occurs if  max_features < n_feature, because this way we could have 
> deterministic behavior for the trees by default, which I'd find more 
> intuitive for plain decision trees tbh.
> 
> Let me know what you all think.
> 
> Best,
> Sebastian
> 
> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente 
> >  wrote:
> > 
> > Hmmm that’s weird...
> > 
> > Have you tried to plot the trees (the decision rules) for the tree with 
> > different seeds, and see if the gain for the first split is the same even 
> > if the split itself is different?
> > 
> > I’d at least try that before diving into the source code...
> > 
> > Cheers,
> > 
> > --
> > Julio
> > 
> >> El 28 oct 2018, a las 2:24, Sebastian Raschka  
> >> escribió:
> >> 
> >> Thanks, Javier,
> >> 
> >> however, the max_features is n_features by default. But if you execute sth 
> >> like
> >> 
> >> import numpy as np
> >> from sklearn.datasets import load_iris
> >> from sklearn.model_selection import train_test_split
> >> from sklearn.tree import DecisionTreeClassifier
> >> 
> >> iris = load_iris()
> >> X, y = iris.data, iris.target
> >> X_train, X_test, y_train, y_test = train_test_split(X, y,
> >>   test_size=0.3,
> >>   random_state=123,
> >>   shuffle=True,
> >> 

Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Sebastian Raschka
Good suggestion. The trees look different. I.e., there seems to be a tie at 
some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65

So, I suspect that the features are shuffled, let's call it X_shuffled. Then at 
some point the max_features are selected, which is by default X_shuffled[:, 
:n_features]. Based on that, if there's a tie between impurities for the 
different features, it's probably selecting the first feature in the array 
among these ties.

If this is true (I have to look into the code more deeply then), I wonder if it 
would be worthwhile to change the implementation such that the shuffling only 
occurs if max_features < n_features, because this way we could have 
deterministic behavior for the trees by default, which I'd find more intuitive 
for plain decision trees tbh.

Let me know what you all think.

Best,
Sebastian

> On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente  
> wrote:
> 
> Hmmm that’s weird...
> 
> Have you tried to plot the trees (the decision rules) for the tree with 
> different seeds, and see if the gain for the first split is the same even if 
> the split itself is different?
> 
> I’d at least try that before diving into the source code...
> 
> Cheers,
> 
> --
> Julio
> 
>> El 28 oct 2018, a las 2:24, Sebastian Raschka  
>> escribió:
>> 
>> Thanks, Javier,
>> 
>> however, the max_features is n_features by default. But if you execute sth 
>> like
>> 
>> import numpy as np
>> from sklearn.datasets import load_iris
>> from sklearn.model_selection import train_test_split
>> from sklearn.tree import DecisionTreeClassifier
>> 
>> iris = load_iris()
>> X, y = iris.data, iris.target
>> X_train, X_test, y_train, y_test = train_test_split(X, y,
>>   test_size=0.3,
>>   random_state=123,
>>   shuffle=True,
>>   stratify=y)
>> 
>> for i in range(20):
>>   tree = DecisionTreeClassifier()
>>   tree.fit(X_train, y_train)
>>   print(tree.score(X_test, y_test))
>> 
>> 
>> 
>> You will find that the tree will produce different results if you don't fix 
>> the random seed. I suspect, related to what you said about the random 
>> feature selection if max_features is not n_features, that there is generally 
>> some sorting of the features going on, and the different trees are then due 
>> to tie-breaking if two features have the same information gain?
>> 
>> Best,
>> Sebastian
>> 
>> 
>> 
>>> On Oct 27, 2018, at 6:16 PM, Javier López  wrote:
>>> 
>>> Hi Sebastian,
>>> 
>>> I think the random state is used to select the features that go into each 
>>> split (look at the `max_features` parameter)
>>> 
>>> Cheers,
>>> Javier
>>> 
>>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka 
>>>  wrote:
>>> Hi all,
>>> 
>>> when I was implementing a bagging classifier based on scikit-learn's 
>>> DecisionTreeClassifier, I noticed that the results were not deterministic 
>>> and found that this was due to the random_state in the 
>>> DescisionTreeClassifier (which is set to None by default).
>>> 
>>> I am wondering what exactly this random state is used for? I can imaging it 
>>> being used for resolving ties if the information gain for multiple features 
>>> is the same, or it could be that the feature splits of continuous features 
>>> is different? (I thought the heuristic is to sort the features and to 
>>> consider those feature values next to each associated with examples that 
>>> have different class labels -- but is there maybe some random subselection 
>>> involved?)
>>> 
>>> If someone knows more about this, where the random_state is used, I'd be 
>>> happy to hear it :)
>>> 
>>> Also, we could then maybe add the info to the DecisionTreeClassifier's 
>>> docstring, which is currently a bit too generic to be useful, I think:
>>> 
>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>>> 
>>> 
>>>   random_state : int, RandomState instance or None, optional (default=None)
>>>   If int, random_state is the seed used by the random number generator;
>>>   If RandomState instance, random_state is the random number generator;
>>>   If None, the random number generator is the RandomState instance used
>&

Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-27 Thread Sebastian Raschka
Thanks, Javier,

however, the max_features is n_features by default. But if you execute sth like

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=123,
                                                     shuffle=True,
                                                     stratify=y)

for i in range(20):
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))



You will find that the tree will produce different results if you don't fix the 
random seed. I suspect, related to what you said about the random feature 
selection if max_features is not n_features, that there is generally some 
sorting of the features going on, and the different trees are then due to 
tie-breaking if two features have the same information gain?

Best,
Sebastian



> On Oct 27, 2018, at 6:16 PM, Javier López  wrote:
> 
> Hi Sebastian,
> 
> I think the random state is used to select the features that go into each 
> split (look at the `max_features` parameter)
> 
> Cheers,
> Javier
> 
> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka 
>  wrote:
> Hi all,
> 
> when I was implementing a bagging classifier based on scikit-learn's 
> DecisionTreeClassifier, I noticed that the results were not deterministic and 
> found that this was due to the random_state in the DescisionTreeClassifier 
> (which is set to None by default).
> 
> I am wondering what exactly this random state is used for? I can imaging it 
> being used for resolving ties if the information gain for multiple features 
> is the same, or it could be that the feature splits of continuous features is 
> different? (I thought the heuristic is to sort the features and to consider 
> those feature values next to each associated with examples that have 
> different class labels -- but is there maybe some random subselection 
> involved?)
> 
> If someone knows more about this, where the random_state is used, I'd be 
> happy to hear it :)
> 
> Also, we could then maybe add the info to the DecisionTreeClassifier's 
> docstring, which is currently a bit too generic to be useful, I think:
> 
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
> 
> 
> random_state : int, RandomState instance or None, optional (default=None)
> If int, random_state is the seed used by the random number generator;
> If RandomState instance, random_state is the random number generator;
> If None, the random number generator is the RandomState instance used
> by `np.random`.
> 
> 
> Best,
> Sebastian
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] How does the random state influence the decision tree splits?

2018-10-27 Thread Sebastian Raschka
Hi all,

when I was implementing a bagging classifier based on scikit-learn's 
DecisionTreeClassifier, I noticed that the results were not deterministic and 
found that this was due to the random_state in the DescisionTreeClassifier 
(which is set to None by default).

I am wondering what exactly this random state is used for? I can imaging it 
being used for resolving ties if the information gain for multiple features is 
the same, or it could be that the feature splits of continuous features is 
different? (I thought the heuristic is to sort the features and to consider 
those feature values next to each associated with examples that have different 
class labels -- but is there maybe some random subselection involved?)

If someone knows more about this, where the random_state is used, I'd be happy 
to hear it :)

Also, we could then maybe add the info to the DecisionTreeClassifier's 
docstring, which is currently a bit too generic to be useful, I think:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py


random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.


Best,
Sebastian
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-10-03 Thread Sebastian Raschka
The ONNX approach sounds most promising, especially because it will also allow 
library interoperability, but I wonder if this is for parametric models only and 
not for the nonparametric ones like KNN, tree-based classifiers, etc.
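
For reference, a rough sketch of what such an export looks like with onnxmltools; 
treat the exact import path and helper names as assumptions taken from the 
project's README rather than something I have verified here:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType  # assumed import path

X, y = load_iris(return_X_y=True)
clf = LogisticRegression().fit(X, y)

# declare the input signature and convert the fitted model into an ONNX graph
onnx_model = onnxmltools.convert_sklearn(
    clf, initial_types=[('float_input', FloatTensorType([1, X.shape[1]]))])
onnxmltools.utils.save_model(onnx_model, 'logreg_iris.onnx')  # assumed helper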

All-in-all I can definitely see the appeal for having a way to export sklearn 
estimators in a text-based format (e.g., via JSON), since it would make sharing 
code easier. This doesn't even have to be compatible with multiple sklearn 
versions. A typical use case would be to include these JSON exports as e.g., 
supplemental files of a research paper for other people to run the models etc. 
(here, one can just specify which sklearn version it would require; of course, 
one could also share pickle files, by I am personally always hesitant reg. 
running/trusting other people's pickle files).

Unfortunately though, as Gael pointed out, this "feature" would be a huge 
burden for the devs, and it would probably also negatively impact the 
development of scikit-learn itself because it imposes another design constraint.

However, I do think this sounds like an excellent case for a contrib project. 
Like scikit-export, scikit-serialize or sth like that.

Best,
Sebastian



> On Oct 3, 2018, at 5:49 AM, Javier López  wrote:
> 
> 
> On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux  
> wrote:
> The reason that pickles are brittle and that sharing pickles is a bad
> practice is that pickle use an implicitly defined data model, which is
> defined via the internals of objects.
> 
> Plus the fact that loading a pickle can execute arbitrary code, and there is 
> no way to know
> if any malicious code is in there in advance because the contents of the 
> pickle cannot
> be easily inspected without loading/executing it.
>  
> So, the problems of pickle are not specific to pickle, but rather
> intrinsic to any generic persistence code [*]. Writing persistence code that
> does not fall in these problems is very costly in terms of developer time
> and makes it harder to add new methods or improve existing one. I am not
> excited about it.
> 
> My "text-based serialization" suggestion was nowhere near as ambitious as 
> that,
> as I have already explained, and wasn't aiming at solving the versioning 
> issues, but
> rather at having something which is "about as good" as pickle but in a 
> human-readable
> format. I am not asking for a Turing-complete language to reproduce the 
> prediction
> function, but rather something simple in the spirit of the output produced by 
> the gist code I linked above, just for the model families where it is 
> reasonable:
> 
> https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31
> 
> The code I posted mostly works (specific cases of nested models need to be 
> addressed 
> separately, as well as pipelines), and we have been using (a version of) it 
> in production
> for quite some time. But there are hackish aspects to it that we are not 
> happy with,
> such as the manual separation of init and fitted parameters by checking if 
> the name ends with "_", having to infer class name and location using 
> "model.__class__.__name__" and "model.__module__", and the wacky use of 
> "__import__".
> 
> My suggestion was more along the lines of adding some metadata to sklearn 
> estimators so
> that a code in a similar style would be nicer to write; little things like 
> having a `init_parameters` and `fit_parameters` properties that would return 
> the lists of named parameters, 
> or a `model_info` method that would return data like sklearn version, class 
> name and location, or a package level dictionary pointing at the estimator 
> classes by a string name, like
> 
> from sklearn.linear_models import LogisticRegression
> estimator_classes = {"LogisticRegression": LogisticRegression, ...}
> 
> so that one can load the appropriate class from the string description 
> without calling __import__ or eval; that sort of stuff.
> 
> I am aware this would not address the common complain of "prefect prediction 
> reproducibility"
> across versions, but I think we can all agree that this utopia of perfect 
> reproducibility is not 
> feasible.
> 
> And in the long, long run, I agree that PFA/onnx or whichever similar format 
> that emerges, is
> the way to go.
> 
> J
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Splitting Method on RandomForestClassifier

2018-10-02 Thread Sebastian Raschka
This is explained here

http://scikit-learn.org/stable/modules/ensemble.html#random-forests:

"In addition, when splitting a node during the construction of the tree, the 
split that is chosen is no longer the best split among all features. Instead, 
the split that is picked is the best split among a random subset of the 
features."

and the "best split" (in the decision trees) among the random feature subset is 
based on maximizing information gain or equivalently minimizing child node 
impurity as described here: 
http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation
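
For concreteness, the quantity being maximized at each node is roughly the 
following weighted impurity decrease (a plain-Python sketch, not the actual Cython 
implementation):

import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_left, y_right):
    """Parent impurity minus the weighted average of the child impurities."""
    n = len(y_parent)
    weighted_children = (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)
    return gini(y_parent) - weighted_children

# toy check: a perfect split of a 50/50 parent node yields a gain of 0.5
print(information_gain(np.array([0, 0, 1, 1]), np.array([0, 0]), np.array([1, 1])))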



Looking at this, I have a question though ...

In the docs 
(http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation) it 
says

"Select the parameters that minimises the impurity"

and

"Recurse for subsets Q_left and Q_right until the maximum allowable depth is 
reached"

But this is basically not the whole definition, right? There should be a 
condition that if the weighted average of the child node impurities for any 
given feature is not smaller than the parent node impurity, the tree-growing 
algorithm would terminate, right?

Best,
Sebastian

> On Oct 2, 2018, at 10:49 AM, Guillaume Lemaître  
> wrote:
> 
> In Random Forest, the best split for each feature is selected. The
> Extra Randomized Trees will make a random split instead.
> On Tue, 2 Oct 2018 at 17:43, Michael Reupold
>  wrote:
>> 
>> Hello all,
>> I currently struggle to find information what or which specific split 
>> Methods are used on the RandomForestClassifier. Is it a random selection? A 
>> median? The best of a set of methods?
>> 
>> Kind regards
>> 
>> Michael Reupold
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> -- 
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Sebastian Raschka
> 
> > I think model serialization should be a priority.
> 
> There is also the ONNX specification that is gaining industrial adoption and 
> that already includes open source exporters for several families of 
> scikit-learn models:
> 
> https://github.com/onnx/onnxmltools


Didn't know about that. This is really nice! What do you think about referring 
to it under http://scikit-learn.org/stable/modules/model_persistence.html to 
make people aware that this option exists?
Would be happy to add a PR.

Best,
Sebastian



> On Sep 28, 2018, at 9:30 AM, Olivier Grisel  wrote:
> 
> 
> > I think model serialization should be a priority.
> 
> There is also the ONNX specification that is gaining industrial adoption and 
> that already includes open source exporters for several families of 
> scikit-learn models:
> 
> https://github.com/onnx/onnxmltools
> 
> -- 
> Olivier
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-27 Thread Sebastian Raschka
Congrats everyone, this is awesome!!! I just started teaching an ML course this 
semester and introduced scikit-learn this week -- it was great timing to 
demonstrate how well maintained the library is and praise all the efforts that 
go into it :). 

> I think model serialization should be a priority.


While this could potentially be a bit inefficient for large non-parametric models, 
I think the serialization into a text-readable format has some advantages for 
real-world use cases. E.g., sharing models (pickle is a bit problematic because 
of security issues) in applications but also as supplementary material in 
archives for accompanying research articles, etc (esp in cases where datasets 
cannot be shared in their original form due to some copyright or other 
concerns).

Chris Emmery, Chris Wagner and I toyed around with JSON a while back 
(https://cmry.github.io/notes/serialize), and it could be feasible -- but yeah, 
it will involve some work, especially with testing things thoroughly for all 
kinds of estimators. Maybe this could somehow be automated though in a 
grid-search kind of way with a build matrix for estimators and parameters once 
a general framework has been developed. 
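
The basic idea is quite simple; a stripped-down sketch along those lines (glossing 
over dtype handling, nested estimators, and version checks):

import json
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# init parameters plus the fitted attributes (conventionally ending in "_")
state = {'params': model.get_params(),
         'coef_': model.coef_.tolist(),
         'intercept_': model.intercept_.tolist(),
         'classes_': model.classes_.tolist()}
text = json.dumps(state)  # human-readable, diff-able, no arbitrary code execution

# rebuild: set the init parameters, then restore the fitted attributes
restored = json.loads(text)
new_model = LogisticRegression(**restored['params'])
for attr in ('coef_', 'intercept_', 'classes_'):
    setattr(new_model, attr, np.array(restored[attr]))
print(new_model.predict(X[:3]))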


> On Sep 27, 2018, at 6:22 PM, Javier López  wrote:
> 
> First of all, congratulations on the release, great work, everyone!
> 
> I think model serialization should be a priority. Particularly, 
> I think that (whenever practical) there should be a way of 
> serializing estimators (either unfitted or fitted) in a text-readable format,
> prefereably JSON or PMML/PFA (or several others).
> 
> Obviously for some models it is not practical (eg random forests with 
> thousands of trees), but for simpler situations I believe it would
> provide a great tool for model sharing without the dangers of pickling
> and the versioning hell.
> 
> I am (painfully) aware that when rebuilding a model on a different setup,
> it might yield different results; in my company we address that by saving
> together with the serialized model a reasonably small validation dataset
> together with its predictions, upon unserializing we check that the rebuilt
> model reproduces the predictions within some acceptable range. 
> 
> About the new release, I am particularly happy about the joblib update,
> as it has been a major source of pain for me over the last year. On that
> note, I think it would be a good idea to stop vendoring joblib and list it as
> a dependency instead; wheels, pip and conda are mature enough to 
> handle the situation nowadays.
> 
> Last, but not least, it would be great to relax the checks concerning nans 
> at prediction time, and allow, for instance, that an estimator yields nans if
> any features are nan's; we face that situation when working with ensembles,
> where a few of the submodels might not get enough features available, but
> the rest do.  
> 
> Of the top of my head, that's all, keep up the fantastic work!
> J
> 
> On Thu, Sep 27, 2018 at 6:31 PM Andreas Mueller  wrote:
> I think we should work on the formatting, make sure it's complete, link it to 
> issues /PRs and
> then make this into a public document on the website and request feedback.
> 
> Right now it's a bit in a format that is understandable for core-developers 
> but some of the things are not clear
> to the average audience. Linking the issues / PRs will help that a bit, but 
> also we might want to add a sentence
> to each point in the roadmap.
> 
> I had some issues with the formatting, I'll try to fix that later.
> Any volunteers for adding the frozen estimator (or has someone added that 
> already?).
> 
> Cheers,
> Andy
> 
> 
> On 09/27/2018 04:29 AM, Olivier Grisel wrote:
>> Le mer. 26 sept. 2018 à 23:02, Joel Nothman  a écrit 
>> :
>> And for those interested in what's in the pipeline, we are trying to draft a 
>> roadmap... 
>> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018
>> 
>> But there are no doubt many features that are absent there too.
>> 
>> Indeed, it would be great to get some feedback on this roadmap from heavy 
>> scikit-learn users: which points do you think are the most important? What 
>> is missing from this roadmap?
>> 
>> Feel free to reply to this thread.
>> 
>> -- 
>> Olivier
>> 
>> 
>> ___
>> scikit-learn mailing list
>> 
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] ANN Scikit-learn 0.20rc1 release candidate available

2018-08-31 Thread Sebastian Raschka
That's awesome! Congrats and thanks everyone for all the work that went into 
this!

Just finished reading through the What's New docs... Wow, that took a while -- 
here, in a positive sense ;). It's a huge release with lots of important fixes. 
It's great to see that you prioritized the maintenance and improvement of 
scikit-learn as a fundamental ML library, rather than adding useful yet "niche" 
features.

Cheers,
Sebastian 


> On Aug 31, 2018, at 8:26 PM, Andreas Mueller  wrote:
> 
> Hey Folks!
> 
> I'm happy to announce that the scikit-learn 0.20 release candidate 1 is now 
> available via conda-forge and pip.
> Please help us by testing this release candidate so we can make sure the 
> final release will go seamlessly!
> 
> You can install the release candidate from conda-forge using
> 
> conda install scikit-learn=0.20rc1 -c conda-forge/label/rc -c conda-forge
> 
> (please take into account that if you're using the default conda channel 
> otherwise, this will pull in some other
> dependencies from conda-forge).
> 
> You can install the release candidate via pip using
> 
> pip install --pre scikit-learn
> 
> The documentation for 0.20 is available at
> 
> http://scikit-learn.org/0.20/
> 
> and will move to http://scikit-learn.org/ upon final release.
> 
> You can find the release note with all new features and changes here:
> 
> http://scikit-learn.org/0.20/whats_new.html#version-0-20
> 
> Thank you for your help in testing the RC and thank you to everybody that 
> made the release possible!
> 
> All the best,
> 
> Andy
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Unable to connect HDInsight hive to python

2018-08-12 Thread Sebastian Raschka
Hi Debu,

since Azure HDInsights is a commercial service, their customer support should 
handle questions like this.

> On Aug 12, 2018, at 7:16 AM, Debabrata Ghosh  wrote:
> 
> Hi All,
>Greetings ! Wish you are doing good ! I am just 
> reaching out to you in case if you have any answer or help me direct to the 
> right forum please:
> 
> We are facing with hive connectivity from the python on Azure HDinsights, We 
> have installed required SASL,thrift_sasl(0.2.1) and Thirft (0.9.3) packages 
> on Ubuntu , but some how when we are trying to connect Hive using following 
> packages we are getting errors , It would be really great help if you could 
> provide some pointers based on your experience
> 
> Example 1: from impala.dbapi import connect conn=connect(host="localhost", 
> port=10001 , auth_mechanism="PLAIN", user="admin", password="PWD") (tried 
> both 127.0.0.1:1/10001)
> 
> Example 2:
> 
> import pyhs2 conn = pyhs2.connect(host='localhost ', 
> port=1,authMechanism="PLAIN", user='admin', password=,database='default')
> 
> Example 3:
> 
> from pyhive import hive conn = hive.Connection(host="localhost", port=10001, 
> username="admin", password=None, auth='NONE')
> 
> Across all of the above examples we are getting the error message: 
> thrift.transport.TTransport.TTransportException: Tsocket read 0 bytes
> 
> Thanks,
> Debu
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Using GPU in scikit learn

2018-08-08 Thread Sebastian Raschka
Hi,

scikit-learn doesn't support computations on the GPU, unfortunately. 
Specifically for random forests, there's CudaTree, which implements a GPU 
version of scikit-learn's random forests. It doesn't look like the library is 
actively developed (hard to tell whether that's a good thing or a bad thing -- 
whether it's stable enough that it didn't need any updates). Anyway, maybe 
worth a try: https://github.com/EasonLiao/CudaTree

Otherwise, I can imagine there are probably alternative implementations out 
there?

Best,
Sebastian

> On Aug 8, 2018, at 7:50 PM, hoang trung Ta  wrote:
> 
> Dear all members,
> 
> I am using Random forest for classification satellite images. I have a bunch 
> of images, thus the processing is quite slow. I searched on the Internet and 
> they said that GPU can accelerate the process. 
> 
> I have GPU NDVIA Geforce GTX 1080 Ti installed in the computer
> 
> Do you know how to use GPU in Scikit learn, I mean the packages to use and 
> sample code that used GPU in random forest classification?
> 
> Thank you very much
> 
> -- 
> Ta Hoang Trung (Mr)
> 
> Master student
> Graduate School of Life and Environmental Sciences
> University of Tsukuba, Japan
> 
> Mobile:  +81 70 3846 2993
> Email :  ta.hoang-trung...@alumni.tsukuba.ac.jp
>  tahoangtr...@gmail.com
>  s1626...@u.tsukuba.ac.jp
> 
> Mapping Technician
> Department of Surveying and Mapping Vietnam
> No 2, Dang Thuy Tram street, Hanoi, Viet Nam
> 
> Mobile: +84 1255151344
> Email : tahoangtr...@gmail.com
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Help with Pull Request( Checks failing)

2018-07-24 Thread Sebastian Raschka
I am not a core dev, but I think I can see what's wrong there (mostly Flake8 
issues). Let me comment about that over there.

> On Jul 24, 2018, at 7:34 PM, Prathusha Jonnagaddla Subramanyam Naidu 
>  wrote:
> 
> This is the link to the PR - 
> https://github.com/scikit-learn/scikit-learn/pull/11670
> 
> On Tue, Jul 24, 2018 at 8:33 PM, Prathusha Jonnagaddla Subramanyam Naidu 
>  wrote:
> Hi everyone,
>   I submitted my first PR few hours back and I see that two tests failed. 
> Would really appreciate if anyone can help me with how to fix these/ what I 
> am doing wrong. 
> 
> Thank you !
> 
> 
> 
> -- 
> Regards,
> Prathusha JS Naidu
> Graduate Student
> Department of CSEE
> UMBC
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] RFE with logistic regression

2018-07-24 Thread Sebastian Raschka
In addition to checking n_iter_ and fixing the random seed as I suggested, maybe 
also try normalizing the features (e.g., z-scores via the StandardScaler) to 
see if that stabilizes the training.
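
Roughly what I mean (sketch with made-up data; scale first, then run the RFE 
with a fixed seed):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_std = StandardScaler().fit_transform(X)  # z-score normalization
clf = LogisticRegression(C=1e9, max_iter=10000, random_state=0)
rfe = RFE(estimator=clf, n_features_to_select=1, step=1).fit(X_std, y)
print(rfe.ranking_)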

Sent from my iPhone

> On Jul 24, 2018, at 1:07 PM, Benoît Presles  
> wrote:
> 
> I did the same tests as before adding fit_intercept=False and:
> 
> 1. I have got the same problem as before, i.e. when I execute the RFE 
> multiple times I don't get the same ranking each time.
> 
> 2. When I change the solver to 'sag' 
> (classifier_RFE=LogisticRegression(C=1e9, verbose=1, max_iter=1, 
> fit_intercept=False, solver='sag')), it seems that I get the same ranking at 
> each run. This is not the case with the 'saga' solver.
> The ranking is not the same between the solvers.
> 
> 3. With C=1, it seems that I have the same results at each run for all 
> solvers (liblinear, sag and saga), however the ranking is not the same 
> between the solvers.
> 
> 
> How can I get reproducible and consistent results?
> 
> 
> Thanks for your help,
> Best regards,
> Ben
> 
> 
> 
>> Le 24/07/2018 à 18:16, Stuart Reynolds a écrit :
>> liblinear regularizes the intercept (which is a questionable thing to
>> do and a poor choice of default in sklearn).
>> The other solvers do not.
>> 
>> On Tue, Jul 24, 2018 at 4:07 AM, Benoît Presles
>>  wrote:
>>> Dear scikit-learn users,
>>> 
>>> I am using the recursive feature elimination (RFE) tool from sklearn to rank
>>> my features:
>>> 
>>> from sklearn.linear_model import LogisticRegression
>>> classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=1)
>>> from sklearn.feature_selection import RFE
>>> rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1)
>>> rfe.fit(X, y)
>>> ranking = rfe.ranking_
>>> print(ranking)
>>> 
>>> 1. The first problem I have is when I execute the above code multiple times,
>>> I don't get the same results.
>>> 
>>> 2. When I change the solver to 'sag' or 'saga' (classifier_RFE =
>>> LogisticRegression(C=1e9, verbose=1, max_iter=1), solver='sag'), it
>>> seems that I get the same results at each run but the ranking is not the
>>> same between these two solvers.
>>> 
>>> 3. With C=1, it seems I have the same results at each run for the
>>> solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't
>>> get the same results between the different solvers.
>>> 
>>> 
>>> Thanks for your help,
>>> Best regards,
>>> Ben
>>> 
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] RFE with logistic regression

2018-07-24 Thread Sebastian Raschka
Agreed. But then the setting is c=1e9 in this context (where C is the inverse 
regularization strength), so the regularization effect should be very small. 

Probably shouldn't matter much for convex optimization, but I would still try 
to 

a) set the random_state to some fixed value
b) make sure that .n_iter_ < .max_iter

to see if that results in more consistency.

Best,
Sebastian

> On Jul 24, 2018, at 11:16 AM, Stuart Reynolds  
> wrote:
> 
> liblinear regularizes the intercept (which is a questionable thing to
> do and a poor choice of default in sklearn).
> The other solvers do not.
> 
> On Tue, Jul 24, 2018 at 4:07 AM, Benoît Presles
>  wrote:
>> Dear scikit-learn users,
>> 
>> I am using the recursive feature elimination (RFE) tool from sklearn to rank
>> my features:
>> 
>> from sklearn.linear_model import LogisticRegression
>> classifier_RFE = LogisticRegression(C=1e9, verbose=1, max_iter=1)
>> from sklearn.feature_selection import RFE
>> rfe = RFE(estimator=classifier_RFE, n_features_to_select=1, step=1)
>> rfe.fit(X, y)
>> ranking = rfe.ranking_
>> print(ranking)
>> 
>> 1. The first problem I have is when I execute the above code multiple times,
>> I don't get the same results.
>> 
>> 2. When I change the solver to 'sag' or 'saga' (classifier_RFE =
>> LogisticRegression(C=1e9, verbose=1, max_iter=1), solver='sag'), it
>> seems that I get the same results at each run but the ranking is not the
>> same between these two solvers.
>> 
>> 3. With C=1, it seems I have the same results at each run for the
>> solver='liblinear', but not for the solvers 'sag' and 'saga'. I still don't
>> get the same results between the different solvers.
>> 
>> 
>> Thanks for your help,
>> Best regards,
>> Ben
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] New core dev: Joris Van den Bossche

2018-06-23 Thread Sebastian Raschka
That's great news! I am glad to hear that you joined the project, Joris Van den 
Bossche!  I am a scikit-learn user (and sometimes contributor) and really 
appreciate all the time and effort that the core developers and contributors 
spend on maintaining and extending the library. 

Best regards,
Sebastian


> On Jun 23, 2018, at 6:42 AM, Olivier Grisel  wrote:
> 
> Hi everyone!
> 
> Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a 
> scikit-learn core developer!
> 
> Joris is one of the maintainers of the pandas project and recently 
> contributed many new great PRs to scikit-learn (notably the ColumnTransformer 
> and a refactoring of the categorical variable preprocessing tools). 
> 
> Cheers!
> 
> -- 
> Olivier
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Jeff Levesque: association rules

2018-06-11 Thread Sebastian Raschka
Hi Jeff,

I had a similar question 1-2 years ago and ended up using Chris Borgelt's C 
command-line tools, but for convenience, I also implemented basic association 
rule & frequent pattern mining in Python here:
http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/
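
A rough sketch of how that looks in practice (toy transactions; the thresholds 
are placeholders and the API assumes a reasonably recent mlxtend version):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [['milk', 'bread', 'butter'],
                ['milk', 'bread'],
                ['bread', 'butter'],
                ['milk', 'diapers', 'beer']]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules.head())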

Best,
Sebastian

> On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn 
>  wrote:
> 
> Hi guys,
> What are some good approaches for association rules. Is there something built 
> in, or do people sometimes use alternate packages, maybe apache spark?
> 
> Thank you,
> 
> Jeff Levesque
> https://github.com/jeff1evesque
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Supervised prediction of multiple scores for a document

2018-06-03 Thread Sebastian Raschka
Hi,

> I quickly read about multinomal regression, is it something do you recommend 
> I use? Maybe you think about something else? 

Multinomial regression (or Softmax Regression) should give you results somewhat 
similar to a linear SVC (or logistic regression with OvO or OvR). The 
theoretical difference is that Softmax regression assumes that the classes are 
mutually exclusive, which is probably not the case in your setting since e.g., 
an article could be both "Art" and "Science" to some extend or so. Here a quick 
summary of softmax regression if useful: 
https://sebastianraschka.com/faq/docs/softmax_regression.html. In scikit-learn, 
you can use it via LogisticRegression(..., multi_class='ovr').

However, spontaneously, I would say that Latent Dirichlet Allocation could be 
a better choice in your case. I.e., fit the model on the corpus for a specified 
number of topics (e.g., 10, but depends on your dataset, I would experiment a 
bit here), look at the top words in each topic and then assign a topic label to 
each topic. Then, for a given article, you can assign e.g., the top X labeled 
topics.
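
A rough sketch of that workflow (tiny made-up corpus; n_components and the 
vectorizer settings are placeholders). A nice side effect is that the 
document-topic rows returned by fit_transform sum to 1, which is the kind of 
per-category breakdown you were after:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the painting and the sculpture in the museum",
        "the experiment measured the reaction of the molecules",
        "the election changed the policy of the government"]

vect = CountVectorizer(stop_words='english')
X = vect.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)  # shape (n_docs, n_topics); each row sums to 1

words = vect.get_feature_names()  # get_feature_names_out() in newer sklearn versions
for k, topic in enumerate(lda.components_):
    print("topic", k, [words[i] for i in topic.argsort()[::-1][:5]])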

Best,
Sebastian




> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki  
> wrote:
> 
> Héllo,
> 
> I started a natural language processing project a few weeks ago called 
> wikimark (the code is all in wikimark.py)
> 
> Given a text it wants to return a dictionary scoring the input against vital 
> articles categories, e.g.:
> 
> out = wikimark("""Peter Hintjens wrote about the relation between technology 
> and culture. Without using a scientifical tone of state-of-the-art review of 
> the anthroposcene antropology, he gives a fair amount of food for thought. 
> According to Hintjens, technology is doomed to become cheap. As matter of 
> fact, intelligence tools will become more and more accessible which will 
> trigger a revolution to rebalance forces in society.""") 
> 
> for category, score in out: 
> print('{} ~ {}'.format(category, score))
> 
> The above program would output something like that:
> 
> Art ~ 0.1 
> Science ~ 0.5 
> Society ~ 0.4
> 
> Except not everything went as planned. Mind the fact that in the above 
> example the total is equal to 1, but I could not achieve that at all.
> 
> I am using gensim to compute vectors of paragraphs (doc2vev) and then submit 
> those vectors to svm.SVR in a one-vs-all strategy ie. a document is scored 1 
> if it's in that subcategory and zero otherwise. At prediction time, it goes 
> though the same doc2vec pipeline. The computer will score each paragraph 
> against the SVR models of wikipedia vital article subcategories and get a 
> value between 0 and 1 for each paragraph. I compute the sum and group by 
> subcategory and then I have a score per category for the input document
> 
> It somewhat works. I made a web ui online you can find it at 
> https://sensimark.com where you can test it. You can directly access the
> full api e.g. 
> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html=1
> 
> The output JSON document is a list of category dictionary where the 
> prediction key is associated with the average of the "prediction" of the 
> subcategories. If you replace =1 by =5 you might get something else 
> as top categories e.g. 
> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html=10
> 
> or 
> 
> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html=5
> 
> I wrote "prediction" with double quotes because the value you see, is the 
> result of some formula. Since, the predictions I get are rather small between 
> 0 and 0.015 I apply the following formula:
> value = math.exp(prediction)
> magic = ((value * 100) - 110) * 100
> 
> In order to have values to spread between -200 and 200. Maybe this is the 
> symptom that my model doesn't work at all. 
> 
> Still, the top 10 results are almost always near each other (try with BBC 
> articles on https://sensimark.com . It is only when a regression model is 
> disqualified with a score of 0 that the results are simple to understand. 
> Sadly, I don't have an example at hand to support that claim. You have to 
> believe me.
> 
> I just figured looking at the machine learning map that my problem might be 
> classification problem, except I don't really want to know what is the class 
> of new documents, I want to how what are the different subjects that are 
> dealt in the document based on a hiearchical corpus;
> I don't want to guess a hiearchy! I want to now how the document content 
> spread over the different categories or subcategories.
> 
> I quickly read about multinomal regression, is it something do you recommend 
> I use? Maybe you think about something else? 
> 
> Also, it seems I should benchmark / evaluate my model against LDA.
> 
> I am rather noob in terms of datascience and my math skills are not so fresh. 
> I more likely looking for ideas on what algorithm, fine tuning and some 
> practice of datascience I must follow 

Re: [scikit-learn] Supervised prediction of multiple scores for a document

2018-06-03 Thread Sebastian Raschka
sorry, I had a copy & paste error, I meant "LogisticRegression(..., 
multi_class='multinomial')" and not "LogisticRegression(..., 
multi_class='ovr')" 

> On Jun 3, 2018, at 5:19 PM, Sebastian Raschka  
> wrote:
> 
> Hi,
> 
>> I quickly read about multinomal regression, is it something do you recommend 
>> I use? Maybe you think about something else? 
> 
> Multinomial regression (or Softmax Regression) should give you results 
> somewhat similar to a linear SVC (or logistic regression with OvO or OvR). 
> The theoretical difference is that Softmax regression assumes that the 
> classes are mutually exclusive, which is probably not the case in your 
> setting since e.g., an article could be both "Art" and "Science" to some 
> extend or so. Here a quick summary of softmax regression if useful: 
> https://sebastianraschka.com/faq/docs/softmax_regression.html. In 
> scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').
> 
> Howeever, spontaneously, I would say that Latent Dirichlet Allocation could 
> be a better choice in your case. I.e., fit the model on the corpus for a 
> specified number of topics (e.g., 10, but depends on your dataset, I would 
> experiment a bit here), look at the top words in each topic and then assign a 
> topic label to each topic. Then, for a given article, you can assign e.g., 
> the top X labeled topics.
> 
> Best,
> Sebastian
> 
> 
> 
> 
>> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki 
>>  wrote:
>> 
>> Héllo,
>> 
>> I started a natural language processing project a few weeks ago called 
>> wikimark (the code is all in wikimark.py)
>> 
>> Given a text it wants to return a dictionary scoring the input against vital 
>> articles categories, e.g.:
>> 
>> out = wikimark("""Peter Hintjens wrote about the relation between technology 
>> and culture. Without using a scientifical tone of state-of-the-art review of 
>> the anthroposcene antropology, he gives a fair amount of food for thought. 
>> According to Hintjens, technology is doomed to become cheap. As matter of 
>> fact, intelligence tools will become more and more accessible which will 
>> trigger a revolution to rebalance forces in society.""") 
>> 
>> for category, score in out: 
>>print('{} ~ {}'.format(category, score))
>> 
>> The above program would output something like that:
>> 
>> Art ~ 0.1 
>> Science ~ 0.5 
>> Society ~ 0.4
>> 
>> Except not everything went as planned. Mind the fact that in the above 
>> example the total is equal to 1, but I could not achieve that at all.
>> 
>> I am using gensim to compute vectors of paragraphs (doc2vev) and then submit 
>> those vectors to svm.SVR in a one-vs-all strategy ie. a document is scored 1 
>> if it's in that subcategory and zero otherwise. At prediction time, it goes 
>> though the same doc2vec pipeline. The computer will score each paragraph 
>> against the SVR models of wikipedia vital article subcategories and get a 
>> value between 0 and 1 for each paragraph. I compute the sum and group by 
>> subcategory and then I have a score per category for the input document
>> 
>> It somewhat works. I made a web ui online you can find it at 
>> https://sensimark.com where you can test it. You can directly access the
>> full api e.g. 
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html=1
>> 
>> The output JSON document is a list of category dictionary where the 
>> prediction key is associated with the average of the "prediction" of the 
>> subcategories. If you replace =1 by =5 you might get something else 
>> as top categories e.g. 
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html=10
>> 
>> or 
>> 
>> https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html=5
>> 
>> I wrote "prediction" with double quotes because the value you see, is the 
>> result of some formula. Since, the predictions I get are rather small 
>> between 0 and 0.015 I apply the following formula:
>> value = math.exp(prediction)
>> magic = ((value * 100) - 110) * 100
>> 
>> In order to have values to spread between -200 and 200. Maybe this is the 
>> symptom that my model doesn't work at all. 
>> 
>> Still, the top 10 results are almost always near each other (try with BBC 
>> articles on https://sensimark.com . It is only when a regression model is 
>> disqualified with a score of 0 that the results are simple to understa

Re: [scikit-learn] DBScan freezes my computer !!!

2018-05-13 Thread Sebastian Raschka
> So I suggest that there is a test version that shows a proper message when an 
> error occurs.

I think the freezing that happens in your case is operating system specific and 
it would require some weird workarounds to detect at which RAM usage the 
combination of machine and operating system might freeze (i.e., I never 
observed my system freezing when I run out of RAM, since it has a pretty swift 
SSD, but the sklearn process may take a very long time to finish). Plus, 
scikit-learn would require to know and constantly check how much memory would 
be used and currently available (due to the use of other apps and the OS 
kernel), which wouldn't be feasible. 

I am not sure if this helps (depending on where the memory-usage bottleneck is), 
but it could maybe help to provide a sparse (CSR) array instead of a dense one 
to the .fit() method. Another thing to try would be to pre-compute the 
distances and give those to the .fit() method after initializing the DBSCAN 
object with metric='precomputed'.
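
A rough sketch of that 'precomputed' route (toy data; eps and min_samples are 
placeholders). If I remember correctly, DBSCAN also accepts a sparse precomputed 
neighborhood graph, which only stores the distances within eps and is therefore 
much lighter on memory than the full dense matrix:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import radius_neighbors_graph

X = np.random.RandomState(0).rand(1000, 5)  # stand-in for your normalized data
eps = 0.3

# dense variant: precompute all pairwise distances once
D_dense = euclidean_distances(X)
labels = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(D_dense)

# sparse variant: keep only the distances that are <= eps
D_sparse = radius_neighbors_graph(X, radius=eps, mode='distance', include_self=False)
labels_sparse = DBSCAN(eps=eps, min_samples=5, metric='precomputed').fit_predict(D_sparse)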

Best,
Sebastian

> On May 13, 2018, at 7:23 PM, Mauricio Reis  wrote:
> 
> I think the problem is due to the size of my database, which has 44,000 
> records. When I ran a database test with reduced sizes (10,000 and 20,000 
> first records), the routine ran normally.
> 
> You ask me to check the memory while running the DBScan routine, but I do not 
> know how to do that (if I did, I would have done that already).
> 
> I think the routine is not ready to work with too much data. The problem is 
> that my computer freezes and I can not analyze the case. I've tried to figure 
> out if any changes work (like changing routine parameters), but all 
> alternatives with lots of data (about 40,000 records) generate error.
> 
> I believe that package routines have no exception handling to improve 
> performance. So I suggest that there is a test version that shows a proper 
> message when an error occurs.
> 
> To summarize: 1) How to check the memory of the computer during the execution 
> of the routine? 2) I suggest developing test versions of routines that may 
> have a memory error.
> 
> Att.,
> Mauricio Reis
> 
> 2018-05-13 5:34 GMT-03:00 Roman Yurchak :
> Could you please check memory usage while running DBSCAN to make sure 
> freezing is due to running out of memory and not to something else?
> Which parameters do you run DBSCAN with? Changing algorithm, leaf_size 
> parameters and ensuring n_jobs=1 could help.
> 
> Assuming eps is reasonable, I think it shouldn't be an issue to run DBSCAN on 
> L2 normalized data: using the default euclidean metric, this should produce 
> somewhat similar results to clustering not normalized data with 
> metric='cosine'.
> 
> On 13/05/18 00:20, Andrew Nystrom wrote:
> If you’re l2 norming your data, you’re making it live on the surface of a 
> hypershere. That surface will have a high density of points and may not have 
> areas of low density, in which case the entire surface could be recognized as 
> a single cluster if epsilon is high enough and min neighbors is low enough. 
> I’d suggest not using l2 norm with DBSCAN.
> On Sat, May 12, 2018 at 7:27 AM Mauricio Reis  > wrote:
> 
> The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
> computer without any warning message!
> 
> I am using WinPython 3.6.5 64 bit.
> 
> The method works normally with the original data, but freezes when I
> use the normalized data (between 0 and 1).
> 
> What should I do?
> 
> Att.,
> Mauricio Reis
> ___
> scikit-learn mailing list
> scikit-learn@python.org 
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Retracting model from the 'blackbox' SVM

2018-05-04 Thread Sebastian Raschka
Dear Wouter,

for the SVM, scikit-learn wraps LIBSVM and LIBLINEAR. I think the 
scikit-learn class SVC uses LIBSVM for every kernel. Since you are using the 
linear kernel, you could use the more efficient LinearSVC scikit-learn class to 
get similar results. I guess this in turn is easier to handle in terms of

>  Is there a way to get the underlying formula for the model out of scikit 
> instead of having it as a 'blackbox' in my svm function.

More specifically, LinearSVC uses the _fit_liblinear code available here: 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/base.py

And more info on the LIBLINEAR library it is using can be found here: 
https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (they have links to technical 
reports and implementation details there)
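
If it helps, here is a rough sketch of how to pull the "formula" out of a fitted 
linear SVM (made-up data; with your matrix the weight vector w would have 100 
entries, and its largest absolute values point to the most influential variables):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LinearSVC(C=1.0).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]  # the underlying model: f(x) = w . x + b
print(np.allclose(X @ w + b, clf.decision_function(X)))  # True

top_features = np.argsort(np.abs(w))[::-1][:20]  # e.g., keep the 20 largest weights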

Best,
Sebastian

> On May 4, 2018, at 5:12 AM, Wouter Verduin  wrote:
> 
> Dear developers of Scikit,
> 
> I am working on a scientific paper on a predictionmodel predicting 
> complications in major abdominal resections. I have been using scikit to 
> create that model and got good results (score of 0.94). This makes us want to 
> see what the model is like that is made by scikit.
> 
> As for now we got 100 input variables but logically these arent all as 
> usefull as the others and we want to reduce this number to about 20 and see 
> what the effects on the score are.
> 
> My question: Is there a way to get the underlying formula for the model out 
> of scikit instead of having it as a 'blackbox' in my svm function.
> 
> At this moment i am predicting a dichtomous variable with 100 variables, 
> (continuous, ordinal and binair).
> 
> My code:
> 
> import numpy as np
> from numpy import *
> import pandas as pd
> from sklearn import tree, svm, linear_model, metrics, preprocessing
> import datetime
> from sklearn.model_selection import KFold, cross_val_score, ShuffleSplit, GridSearchCV
> from time import gmtime, strftime
> 
> #database openen en voorbereiden
> file = "/home/wouter/scikit/DB_SCIKIT.csv"
> DB = pd.read_csv(file, sep=";", header=0, decimal=',').as_matrix()
> DBT = DB
> print "Vorm van de DB: ", DB.shape
> target = []
> for i in range(len(DB[:,-1])):
>     target.append(DB[i,-1])
> DB = delete(DB,s_[-1],1) #Laatste kolom verwijderen
> AantalOutcome = target.count(1)
> print "Aantal outcome:", AantalOutcome
> print "Aantal patienten:", len(target)
> 
> A = DB
> b = target
> 
> print len(DBT)
> 
> svc = svm.SVC(kernel='linear', cache_size=500, probability=True)
> indices = np.random.permutation(len(DBT))
> 
> rs = ShuffleSplit(n_splits=5, test_size=.15, random_state=None)
> scores = cross_val_score(svc, A, b, cv=rs)
> A = ("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
> print A
> 
> X_train = DBT[indices[:-302]]
> y_train = []
> for i in range(len(X_train[:,-1])):
>     y_train.append(X_train[i,-1])
> X_train = delete(X_train,s_[-1],1) #Laatste kolom verwijderen
> 
> X_test = DBT[indices[-302:]]
> y_test = []
> for i in range(len(X_test[:,-1])):
>     y_test.append(X_test[i,-1])
> X_test = delete(X_test,s_[-1],1) #Laatste kolom verwijderen
> 
> model = svc.fit(X_train,y_train)
> print model
> uitkomst = model.score(X_test, y_test)
> print uitkomst
> voorspel = model.predict(X_test)
> print voorspel
> 
> And output:
> 
> Vorm van de DB:  (2011, 101)
> Aantal outcome: 128
> Aantal patienten: 2011
> 2011
> Accuracy: 0.94 (+/- 0.01)
> 
> SVC(C=1.0, cache_size=500, class_weight=None, coef0=0.0,
>   decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
>   max_iter=-1, probability=True, random_state=None, shrinking=True,
>   tol=0.001, verbose=False)
> 0.927152317881
> [0. 0. 0. ... 0. 0. 0.]  (the printed prediction array: every shown entry is 0.)
> Thanks in advance!
> 
> with kind regards,
> 
> Wouter Verduin
> 
> 

Re: [scikit-learn] MLPClassifier - Softmax activation function

2018-04-18 Thread Sebastian Raschka
That's a good question since the outputs would be differently scaled if the 
logistic sigmoid vs the softmax is used in the output layer. I think you don't 
need to worry about setting anything though, since the "activation" only 
applies to the hidden layers, and the softmax is, regardless of "activation," 
automatically used in the output layer.
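
A quick way to convince yourself (toy sketch on Iris): the predicted class 
probabilities sum to 1 for each sample, which is the softmax output layer at 
work regardless of the hidden-layer activation:

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(20,), activation='logistic',
                    max_iter=2000, random_state=1).fit(X, y)
print(mlp.predict_proba(X[:3]).sum(axis=1))  # -> [1. 1. 1.]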

Best,
Sebastian

> On Apr 18, 2018, at 3:15 PM, Daniel Baláček  wrote:
> 
> Hello everyone
> 
> I have a question regarding MLPClassifier in sklearn. In the documentation in 
> section 1.17. Neural network models (supervised) - 1.17.2 Classification it 
> is stated that  "MLPClassifier supports multi-class classification by 
> applying Softmax as the output function."
> However it is not clear how to apply the Softmax function.
> 
> The way I think (or hope) this works is that if a parameter activation is set 
> to activation = 'logistic' Softmax function should be automatically applied 
> whenever there are more than two classes. Is this right or does one have to 
> explicitly specify the use of Softmax function in some way?
> 
> I am sorry if this is a nonsense question. I am new to scikit-learn and 
> machine learning in general and I was not sure about this one. Thank you for 
> any answers in advance.
> 
> With regards,
> D. B.
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Using KMeans cluster labels in KNN

2018-03-12 Thread Sebastian Raschka
Hi,
If you want to predict the Kmeans cluster membership, you can use Kmeans' 
predict method instead of training a KNN model on the cluster assignments. This 
will be computationally more efficient and give you the correct assignment at 
the borders between clusters.
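
Rough sketch (random data as a stand-in for your X):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 4)  # stand-in for your training data
km = KMeans(n_clusters=3, random_state=0).fit(X)

X_new = np.random.RandomState(1).rand(5, 4)  # new, unseen points
print(km.predict(X_new))  # index of the nearest centroid for each new point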

Best,
Sebastian

> On Mar 12, 2018, at 2:55 AM, prince gosavi  wrote:
> 
> Hi,
> I have generated clusters using the KMeans algorithm and would like to use 
> the labels of the model in the KNN.
> 
> I don't have the implementation idea but I can visualize it as
> 
> KNNmodel = KNN.fit(X, KMeansModel.labels_)
> 
> Such that the KNN will predict the cluster the new point belong to.
> 
> -- 
> Regards
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Need help in dealing with large dataset

2018-03-05 Thread Sebastian Raschka
Like Guillaume suggested, you don't want to load the whole array into memory if 
it's that large. There are many different ways for how to deal with this. The 
most naive way would be to break up your NumPy array into smaller NumPy arrays 
and load them iteratively with a running accuracy calculation. My suggestion 
would be to create a HDF5 file from the NumPy array where each entry is an 
image. If it's just the test images, you can also save a batch of them as one 
entry because you don't need to shuffle them anyway.
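
A minimal sketch of that naive chunking idea (the file names, batch size, and 
`model` are placeholders for your own data and fitted classifier; mmap_mode 
keeps the array on disk so only the current chunk is materialized in RAM):

import numpy as np

X = np.load("X_test.npy", mmap_mode="r")  # placeholder path; array stays on disk
y = np.load("y_test.npy", mmap_mode="r")  # placeholder path

batch_size = 256
n_correct = 0
for start in range(0, X.shape[0], batch_size):
    xb = np.asarray(X[start:start + batch_size], dtype=np.float32)  # only this chunk in RAM
    yb = np.asarray(y[start:start + batch_size])
    n_correct += int((model.predict(xb) == yb).sum())  # `model` = your fitted classifier
print("accuracy:", n_correct / X.shape[0])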

Ultimately, the recommendation based on the sweet spot between performance and 
convenience depends on what DL framework you use. Since this is a scikit-learn 
forum, I suppose you are using sklearn objects (although, I am not aware that 
sklearn has CNNs). The DataLoader in PyTorch is universally useful though and 
can come in handy no matter what CNN implementation you use. I have some 
examples here if that helps:

- 
https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-celeba.ipynb
- 
https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-csv.ipynb

Best,
Sebastian


> On Mar 5, 2018, at 12:13 PM, Guillaume Lemaître  
> wrote:
> 
> If you work with deep net you need to check the utils from the deep net 
> library.
> For instance in keras, you should create a batch generator if you need to 
> deal with large dataset.
> In patch torch you can use the data loader which and the ImageFolder from 
> torchvision which manage
> the loading for you.
> 
> On 5 March 2018 at 17:19, CHETHAN MURALI  wrote:
> Dear All,
> 
> I am working on building a CNN model for image classification problem.
> As par of it I have converted all my test images to numpy array.
> 
> Now when I am trying to split the array into training and test set I am 
> getting memory error.
> Details are as below:
> 
> X = np.load("./data/X_train.npy", mmap_mode='r')
> 
> train_pct_index = int(0.8 * len(X))
> X_train, X_test = X[:train_pct_index], X[train_pct_index:]
> X_train = X_train.reshape(X_train.shape[0], 256, 256, 3)
> X_train = X_train.astype('float32')
> 
> -
> MemoryError   Traceback (most recent call last)
>  in ()
>       2 print("Normalizing Data")
>       3
> > 4 X_train = X_train.astype('float32')
> More information:
> 
> 1. my python version is
> 
> python --version
> Python 3.6.4 :: Anaconda custom (64-bit)
> 
> 2. I am running the code on Ubuntu 16.04.
> 
> 3. I have 32GB RAM
> 
> 4. X_train.npy file that I have loaded to np.array is of size 20GB
> 
> print("X_train Shape: ", X_train.shape)
> X_train Shape:  (85108, 256, 256, 3)
> I would be really glad if you can help me to overcome this problem.
> 
> Regards,
> -
> Chethan
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> 
> -- 
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
Unfortunately (or maybe fortunately :)) no, maximizing variance reduction & 
minimizing MSE are just special cases :)

Best,
Sebastian

> On Mar 1, 2018, at 9:59 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> 
> Does this generalize to any loss function? For example I also want to 
> implement Kendall's tau correlation coefficient and a combination of R, tau 
> and RMSE. :) 
> 
> On Mar 1, 2018 15:49, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:
> Hi, Thomas,
> 
> as far as I know, it's all the same and doesn't matter, and you would get the 
> same splits, since R^2 is just a rescaled MSE.
> 
> Best,
> Sebastian
> 
> > On Mar 1, 2018, at 9:39 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> >
> > Hi Sebastian,
> >
> > Going back to Pearson's R loss function, does this imply that I must add an 
> > abstract "init2" method to RegressionCriterion (that's where MSE class 
> > inherits from) where I will add the target values X as extra argument? And 
> > then the node impurity will be 1-R (the lowest the best)? What about the 
> > impurities of the left and right split? In MSE class they are (sum_i^n 
> > y_i)**2 where n is the number of samples in the respective split. It is not 
> > clear how this is related to variance in order to adapt it for my purpose.
> >
> > Best,
> > Thomas
> >
> >
> > On Mar 1, 2018 14:56, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:
> > Hi, Thomas,
> >
> > in regression trees, minimizing the variance among the target values is 
> > equivalent to minimizing the MSE between targets and predicted values. This 
> > is also called variance reduction: 
> > https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction
> >
> > Best,
> > Sebastian
> >
> > > On Mar 1, 2018, at 8:27 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> > >
> > >
> > > Hi again,
> > >
> > > I am currently revisiting this problem after familiarizing myself with 
> > > Cython and Scikit-Learn's code and I have a very important query:
> > >
> > > Looking at the class MSE(RegressionCriterion), the node impurity is 
> > > defined as the variance of the target values Y on that node. The 
> > > predictions X are nowhere involved in the computations. This contradicts 
> > > my notion of "loss function", which quantifies the discrepancy between 
> > > predicted and target values. Am I looking at the wrong class or what I 
> > > want to do is just not feasible with Random Forests? For example, I would 
> > > like to modify the RandomForestRegressor code to minimize the Pearson's R 
> > > between predicted and target values.
> > >
> > > I thank you in advance for any clarification.
> > > Thomas
> > >
> > >
> > >
> > >
> > > On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
> > >> Yes you are right pxd are the header and pyx the definition. You need to 
> > >> write a class as MSE. Criterion is an abstract class or base class (I 
> > >> don't have it under the eye)
> > >>
> > >> @Andy: if I recall the PR, we made the classes public to enable such 
> > >> custom criterion. However, ‎it is not documented since we were not 
> > >> officially supporting it. So this is an hidden feature. We could always 
> > >> discuss to make this feature more visible and document it.
> > >
> > >
> > >
> > >
> > >
> > > --
> > > ==
> > > Dr Thomas Evangelidis
> > > Post-doctoral Researcher
> > > CEITEC - Central European Institute of Technology
> > > Masaryk University
> > > Kamenice 5/A35/2S049,
> > > 62500 Brno, Czech Republic
> > >
> > > email: tev...@pharm.uoa.gr
> > >   teva...@gmail.com
> > >
> > > website: https://sites.google.com/site/thomasevangelidishomepage/
> > >
> > >
> > > ___
> > > scikit-learn mailing list
> > > scikit-learn@python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
Hi, Thomas,

as far as I know, it's all the same and doesn't matter, and you would get the 
same splits, since R^2 is just a rescaled MSE. 

Best,
Sebastian

> On Mar 1, 2018, at 9:39 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> 
> Hi Sebastian, 
> 
> Going back to Pearson's R loss function, does this imply that I must add an 
> abstract "init2" method to RegressionCriterion (that's where MSE class 
> inherits from) where I will add the target values X as extra argument? And 
> then the node impurity will be 1-R (the lowest the best)? What about the 
> impurities of the left and right split? In MSE class they are (sum_i^n 
> y_i)**2 where n is the number of samples in the respective split. It is not 
> clear how this is related to variance in order to adapt it for my purpose. 
> 
> Best, 
> Thomas
> 
> 
> On Mar 1, 2018 14:56, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:
> Hi, Thomas,
> 
> in regression trees, minimizing the variance among the target values is 
> equivalent to minimizing the MSE between targets and predicted values. This 
> is also called variance reduction: 
> https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction
> 
> Best,
> Sebastian
> 
> > On Mar 1, 2018, at 8:27 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> >
> >
> > Hi again,
> >
> > I am currently revisiting this problem after familiarizing myself with 
> > Cython and Scikit-Learn's code and I have a very important query:
> >
> > Looking at the class MSE(RegressionCriterion), the node impurity is defined 
> > as the variance of the target values Y on that node. The predictions X are 
> > nowhere involved in the computations. This contradicts my notion of "loss 
> > function", which quantifies the discrepancy between predicted and target 
> > values. Am I looking at the wrong class or what I want to do is just not 
> > feasible with Random Forests? For example, I would like to modify the 
> > RandomForestRegressor code to minimize the Pearson's R between predicted 
> > and target values.
> >
> > I thank you in advance for any clarification.
> > Thomas
> >
> >
> >
> >
> > On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
> >> Yes you are right pxd are the header and pyx the definition. You need to 
> >> write a class as MSE. Criterion is an abstract class or base class (I 
> >> don't have it under the eye)
> >>
> >> @Andy: if I recall the PR, we made the classes public to enable such 
> >> custom criterion. However, ‎it is not documented since we were not 
> >> officially supporting it. So this is an hidden feature. We could always 
> >> discuss to make this feature more visible and document it.
> >
> >
> >
> >
> >
> > --
> > ==
> > Dr Thomas Evangelidis
> > Post-doctoral Researcher
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/2S049,
> > 62500 Brno, Czech Republic
> >
> > email: tev...@pharm.uoa.gr
> >   teva...@gmail.com
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
Hi, Thomas,

in regression trees, minimizing the variance among the target values is 
equivalent to minimizing the MSE between targets and predicted values. This is 
also called variance reduction: 
https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction
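
A tiny numerical illustration of why the two are the same inside a node (a 
regression tree predicts the mean of the targets that fall into the node):

import numpy as np

y_node = np.array([3.0, 5.0, 8.0, 10.0])  # hypothetical targets in one node
pred = y_node.mean()                      # the tree's prediction for that node
print(np.mean((y_node - pred) ** 2))      # MSE of that prediction
print(y_node.var())                       # variance of the targets -> same number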

Best,
Sebastian

> On Mar 1, 2018, at 8:27 AM, Thomas Evangelidis  wrote:
> 
> 
> Hi again,
> 
> I am currently revisiting this problem after familiarizing myself with Cython 
> and Scikit-Learn's code and I have a very important query:
> 
> Looking at the class MSE(RegressionCriterion), the node impurity is defined 
> as the variance of the target values Y on that node. The predictions X are 
> nowhere involved in the computations. This contradicts my notion of "loss 
> function", which quantifies the discrepancy between predicted and target 
> values. Am I looking at the wrong class or what I want to do is just not 
> feasible with Random Forests? For example, I would like to modify the 
> RandomForestRegressor code to minimize the Pearson's R between predicted and 
> target values.
> 
> I thank you in advance for any clarification.
> Thomas
> 
> 
> 
> 
> On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
>> Yes you are right pxd are the header and pyx the definition. You need to 
>> write a class as MSE. Criterion is an abstract class or base class (I don't 
>> have it under the eye)
>> 
>> @Andy: if I recall the PR, we made the classes public to enable such custom 
>> criterion. However, ‎it is not documented since we were not officially 
>> supporting it. So this is an hidden feature. We could always discuss to make 
>> this feature more visible and document it. 
> 
> 
> 
> 
> 
> -- 
> ==
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049, 
> 62500 Brno, Czech Republic 
> 
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Applying clustering to cosine distance matrix

2018-02-12 Thread Sebastian Raschka
Hi,

by default, the clustering classes from sklearn, (e.g., DBSCAN), take an 
[num_examples, num_features] array as input, but you can also provide the 
distance matrix directly, e.g., by instantiating it with metric='precomputed'

my_dbscan = DBSCAN(..., metric='precomputed')
my_dbscan.fit(my_distance_matrix)

Not sure if it helps in that particular case (depending on how many zero 
elements you have), you can also use a sparse matrix in CSR format 
(https://docs.scipy.org/doc/scipy-1.0.0/reference/generated/scipy.sparse.csr_matrix.html).
 

Also, you don't need to for-loop through the rows if you want to compute the 
pair-wise distances, you can simply do that on the complete array. E.g.,

from sklearn.metrics.pairwise import cosine_distances
from scipy import sparse

distance_matrix = cosine_distances(sparse.csr_matrix(X))

where X is your "[num_examples, num_features]" array.

Best,
Sebastian


> On Feb 12, 2018, at 1:10 PM, prince gosavi  wrote:
> 
> I have generated a cosine distance matrix and would like to apply clustering 
> algorithm to the given matrix.
> np.shape(distance_matrix)==(14000,14000)
> 
> I would like to know which clustering suits better and is there any need to 
> process the data further to get it in the form so that a model can be applied.
> Also any performance tip as the matrix takes around 3-4 hrs of processing.
> You can find my code here 
> https://github.com/maxyodedara5/BE_Project/blob/master/main.ipynb
> Code for READ ONLY PURPOSE.
> -- 
> Regards
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] How to get centroids from SciPy's hierarchical agglomerative clustering?

2017-10-20 Thread Sebastian Raschka
Independently of the implementation, and unless you use the 'centroid' or 
'average' linkage method, cluster centroids don't need to be computed when 
performing agglomerative hierarchical clustering. But you can always 
compute them manually by simply averaging all samples of a cluster (for each 
feature).
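
For example, continuing from your snippet (D is the 1000 x 22 feature matrix and 
`assignments` holds the flat cluster labels from fcluster), a rough sketch:

import numpy as np

for label in np.unique(assignments):
    members = D[assignments == label]
    centroid = members.mean(axis=0)             # the 1 x 22 centroid of this cluster
    dists = np.sqrt(((members - centroid) ** 2).sum(axis=1))
    representative = members[np.argmin(dists)]  # member closest to the centroid
    print(label, dists.min())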

Best.
Sebastian

> On Oct 20, 2017, at 9:13 AM, Sema Atasever  wrote:
> 
> Dear scikit-learn members,
> 
> I am using SciPy's hierarchical agglomerative clustering methods to cluster a 
> 1000 x 22 matrix of features, after clustering my data set with 
> scipy.cluster.hierarchy.linkage and and assigning each sample to a cluster,
> I can't seem to figure out how to get the centroid from the resulting 
> clusters. 
> I would like to extract one element or a few out of each cluster, which is 
> the closest to that cluster's centroid.
> 
> Below follows my code:
> 
> D=np.loadtxt(open("C:\dataset.txt", "rb"), delimiter=";")
> Y = hierarchy.linkage(D, 'ward')
> assignments = hierarchy.fcluster(Y, 5, criterion="maxclust")
> 
> I am taking my matrix of features, computing the euclidean distance between 
> them, and then passing them onto the hierarchical clustering method. From 
> there, I am creating flat clusters, with a maximum of 5 clusters
> 
> Now, based on the flat clusters assignments, how do I get the 1 x 22 centroid 
> that represents each flat cluster?
> 
> Best.
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Sebastian Raschka
Oh, never mind my previous email, because while the components should be the 
same, the projection of the data points onto those components would still be 
affected by centering vs non-centering I guess.

Best,
Sebastian

> On Oct 16, 2017, at 3:25 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> 
> Hi,
> 
> if you compute the principal components (i.e., eigendecomposition) from the 
> covariance matrix, it shouldn't matter whether the data is centered or not, 
> since the covariance matrix is computed as 
> 
> CovMat = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (x_i - \bar{x})^T
> 
> where \bar{x} = vector of feature means
> 
> So, if you center the data prior to computing the covariance matrix, \bar{x} 
> is simply 0.
> 
> Best,
> Sebastian
> 
>> On Oct 16, 2017, at 2:27 PM, Ismael Lemhadri <lemha...@stanford.edu> wrote:
>> 
>> @Andreas Muller: 
>> My references do not assume centering, e.g. 
>> http://ufldl.stanford.edu/wiki/index.php/PCA
>> any reference?
>> 
>> 
>> 
>> On Mon, Oct 16, 2017 at 10:20 AM, <scikit-learn-requ...@python.org> wrote:
>> Send scikit-learn mailing list submissions to
>> scikit-learn@python.org
>> 
>> To subscribe or unsubscribe via the World Wide Web, visit
>> https://mail.python.org/mailman/listinfo/scikit-learn
>> or, via email, send a message with subject or body 'help' to
>> scikit-learn-requ...@python.org
>> 
>> You can reach the person managing the list at
>> scikit-learn-ow...@python.org
>> 
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of scikit-learn digest..."
>> 
>> 
>> Today's Topics:
>> 
>>    1. Re: unclear help file for sklearn.decomposition.pca
>>       (Andreas Mueller)
>> 
>> 
>> --
>> 
>> Message: 1
>> Date: Mon, 16 Oct 2017 13:19:57 -0400
>> From: Andreas Mueller <t3k...@gmail.com>
>> To: scikit-learn@python.org
>> Subject: Re: [scikit-learn] unclear help file for
>> sklearn.decomposition.pca
>> Message-ID: <04fc445c-d8f3-a3a9-4ab2-0535826a2...@gmail.com>
>> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>> 
>> The definition of PCA has a centering step, but no scaling step.
>> 
>> On 10/16/2017 11:16 AM, Ismael Lemhadri wrote:
>> > Dear Roman,
>> > My concern is actually not about not mentioning the scaling but about
>> > not mentioning the centering.
>> > That is, the sklearn PCA removes the mean but it does not mention it
>> > in the help file.
>> > This was quite messy for me to debug as I expected it to either: 1/
>> > center and scale simultaneously or / not scale and not center either.
>> > It would be beneficial to explicit the behavior in the help file in my
>> > opinion.
>> > Ismael
>> >
>> > On Mon, Oct 16, 2017 at 8:02 AM, <scikit-learn-requ...@python.org> wrote:
>> >
>> > Send scikit-learn mailing list submissions to
>> > scikit-learn@python.org
>> >
>> > To subscribe or unsubscribe via the World Wide Web, visit
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> > or, via email, send a message with subject or body 'help' to
>> > scikit-learn-requ...@python.org

Re: [scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Sebastian Raschka
Hi,

if you compute the principal components (i.e., eigendecomposition) from the 
covariance matrix, it shouldn't matter whether the data is centered or not, 
since the covariance matrix is computed as 

CovMat = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (x_i - \bar{x})^T

where \bar{x} = vector of feature means

So, if you center the data prior to computing the covariance matrix, \bar{x} is 
simply 0.
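
A two-line sanity check (np.cov subtracts the column means internally, so 
centering beforehand changes nothing):

import numpy as np

X = np.random.RandomState(0).rand(100, 3)
Xc = X - X.mean(axis=0)
print(np.allclose(np.cov(X.T), np.cov(Xc.T)))  # True -> identical covariance matrix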

Best,
Sebastian

> On Oct 16, 2017, at 2:27 PM, Ismael Lemhadri <lemha...@stanford.edu> wrote:
> 
> @Andreas Muller: 
> My references do not assume centering, e.g. 
> http://ufldl.stanford.edu/wiki/index.php/PCA
> any reference?
> 
> 
> 
> On Mon, Oct 16, 2017 at 10:20 AM, <scikit-learn-requ...@python.org> wrote:
> Send scikit-learn mailing list submissions to
> scikit-learn@python.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> https://mail.python.org/mailman/listinfo/scikit-learn
> or, via email, send a message with subject or body 'help' to
> scikit-learn-requ...@python.org
> 
> You can reach the person managing the list at
> scikit-learn-ow...@python.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of scikit-learn digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: unclear help file for sklearn.decomposition.pca
>       (Andreas Mueller)
> 
> 
> --
> 
> Message: 1
> Date: Mon, 16 Oct 2017 13:19:57 -0400
> From: Andreas Mueller <t3k...@gmail.com>
> To: scikit-learn@python.org
> Subject: Re: [scikit-learn] unclear help file for
> sklearn.decomposition.pca
> Message-ID: <04fc445c-d8f3-a3a9-4ab2-0535826a2...@gmail.com>
> Content-Type: text/plain; charset="utf-8"; Format="flowed"
> 
> The definition of PCA has a centering step, but no scaling step.
> 
> On 10/16/2017 11:16 AM, Ismael Lemhadri wrote:
> > Dear Roman,
> > My concern is actually not about not mentioning the scaling but about
> > not mentioning the centering.
> > That is, the sklearn PCA removes the mean but it does not mention it
> > in the help file.
> > This was quite messy for me to debug as I expected it to either: 1/
> > center and scale simultaneously or / not scale and not center either.
> > It would be beneficial to explicit the behavior in the help file in my
> > opinion.
> > Ismael
> >
> > On Mon, Oct 16, 2017 at 8:02 AM, <scikit-learn-requ...@python.org> wrote:
> >
> > Send scikit-learn mailing list submissions to
> > scikit-learn@python.org
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> > https://mail.python.org/mailman/listinfo/scikit-learn
> > or, via email, send a message with subject or body 'help' to
> > scikit-learn-requ...@python.org
> >
> > You can reach the person managing the list at
> > scikit-learn-ow...@python.org
> >
> > When replying, please edit your Subject line so it is more specific
> > than "Re: Contents of scikit-learn digest..."
> >
> >
> > Today's Topics:
> >
> >    1. unclear help file for sklearn.decomposition.pca (Ismael Lemhadri)
> >    2. Re: unclear help file for sklearn.decomposition.pca
> >       (Roman Yurchak)
> >    3. Question about LDA's coef_ attribute (Serafeim Loukas)
> >    4. Re: Question about LDA's coef_ attribute (Alexandre Gramfort)
> >    5. Re: Question about LDA's coef_ attribute (Serafeim Loukas)
> >
> >
> > --
> >
> > Message: 1
> > Date: Sun, 15 Oct 2017 18:42:56 -0700
> > From: Ismael Lemhadri <lemha...@stanford.edu>
> > To: scikit-learn@python.org
> > Subject: 

Re: [scikit-learn] Combine already fitted models

2017-10-07 Thread Sebastian Raschka
I agree. I had added something like that to the original version in mlxtend (not 
sure if it was before or after we ported it to sklearn). In any case, I'd be 
happy to open a PR about that later today :)

Best,
Sebastian


> On Oct 7, 2017, at 10:53 AM, Andreas Mueller <t3k...@gmail.com> wrote:
> 
> For some reason I thought we had a "prefit" parameter.
> 
> I think we should.
> 
> 
>> On 10/01/2017 07:39 PM, Sebastian Raschka wrote:
>> Hi, Rares,
>> 
>>> vc = VotingClassifier(...)
>>> vc.estimators_ = [e1, e2, ...]
>>> vc.le_ = ...
>>> vc.predict(...)
>>> 
>>> But I am not sure it is recommended to modify the "private" estimators_ and 
>>> le_ attributes.
>> 
>> I think that this may work if you don't call the fit method of the 
>> VotingClassifier after that due to
>> https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/ensemble/voting_classifier.py#L186
>> 
>> Also, I see that we have only added one check in predict(), 
>> "check_is_fitted(self, 'estimators_')", for checking that the 
>> VotingClassifier was fit, so your proposed method could/should work as a 
>> workaround ;)
>> 
>> Best,
>> Sebastian
>> 
>>> On Oct 1, 2017, at 7:22 PM, Rares Vernica <rvern...@gmail.com> wrote:
>>> 
>>>>> I am looking at VotingClassifier but it seems that it is expected that 
>>>>> the estimators are fitted when VotingClassifier.fit() is called. I don't 
>>>>> see how I can have already fitted classifiers combined under a 
>>>>> VotingClassifier.
>>>> I think the opposite is true: The classifiers provided via an `estimators` 
>>>> argument upon initialization will be cloned and fitted if you call 
>>>> VotingClassifier's  fit(). Based on your follow-up question, I think you 
>>>> meant "it is expected that the estimators are *not* fitted when 
>>>> VotingClassifier.fit() is called," right?!
>>> Yes, you are right. Sorry for the confusion. Thanks for the pointer!
>>> 
>>> I am also exploring something like:
>>> 
>>> vc = VotingClassifier(...)
>>> vc.estimators_ = [e1, e2, ...]
>>> vc.le_ = ...
>>> vc.predict(...)
>>> 
>>> But I am not sure it is recommended to modify the "private" estimators_ and 
>>> le_ attributes.
>>> 
>>> --
>>> Rares
>>> 
>>> 
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Combine already fitted models

2017-10-01 Thread Sebastian Raschka
Hi, Rares,

> vc = VotingClassifier(...)
> vc.estimators_ = [e1, e2, ...]
> vc.le_ = ...
> vc.predict(...)
> 
> But I am not sure it is recommended to modify the "private" estimators_ and 
> le_ attributes.


I think that this may work if you don't call the fit method of the 
VotingClassifier after that due to 
https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/ensemble/voting_classifier.py#L186

Also, I see that we have only added one check in predict(), 
"check_is_fitted(self, 'estimators_')", for checking that the VotingClassifier 
was fit, so your proposed method could/should work as a workaround ;)
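
Just for illustration, here is a rough (untested) sketch of that workaround. Note 
that it pokes at private attributes (estimators_, le_) that are not part of the 
public API and may change between versions, and it only behaves sensibly if the 
pre-fitted estimators were trained on integer-encoded labels; the iris data and 
the particular estimators below are just placeholders:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# pretend these were fitted somewhere else (note: on integer labels)
e1 = LogisticRegression().fit(X, y)
e2 = DecisionTreeClassifier().fit(X, y)

vc = VotingClassifier(estimators=[('lr', LogisticRegression()),
                                  ('dt', DecisionTreeClassifier())],
                      voting='hard')

# skip vc.fit() and inject the pre-fitted estimators instead
vc.estimators_ = [e1, e2]
vc.le_ = LabelEncoder().fit(y)    # predict() maps the majority votes back through le_
vc.classes_ = vc.le_.classes_     # not needed by predict(), but keeps the object consistent

print(vc.predict(X[:5]))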

Best,
Sebastian

> On Oct 1, 2017, at 7:22 PM, Rares Vernica  wrote:
> 
> > > I am looking at VotingClassifier but it seems that it is expected that 
> > > the estimators are fitted when VotingClassifier.fit() is called. I don't 
> > > see how I can have already fitted classifiers combined under a 
> > > VotingClassifier.
> >
> > I think the opposite is true: The classifiers provided via an `estimators` 
> > argument upon initialization will be cloned and fitted if you call 
> > VotingClassifier's  fit(). Based on your follow-up question, I think you 
> > meant "it is expected that the estimators are *not* fitted when 
> > VotingClassifier.fit() is called," right?!
> 
> Yes, you are right. Sorry for the confusion. Thanks for the pointer!
> 
> I am also exploring something like:
> 
> vc = VotingClassifier(...)
> vc.estimators_ = [e1, e2, ...]
> vc.le_ = ...
> vc.predict(...)
> 
> But I am not sure it is recommended to modify the "private" estimators_ and 
> le_ attributes.
> 
> --
> Rares
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Combine already fitted models

2017-10-01 Thread Sebastian Raschka
Hi, Rares,

> I am looking at VotingClassifier but it seems that it is expected that the 
> estimators are fitted when VotingClassifier.fit() is called. I don't see how 
> I can have already fitted classifiers combined under a VotingClassifier.

I think the opposite is true: The classifiers provided via an `estimators` 
argument upon initialization will be cloned and fitted if you call 
VotingClassifier's  fit(). Based on your follow-up question, I think you meant 
"it is expected that the estimators are *not* fitted when 
VotingClassifier.fit() is called," right?!

>  I don't see how I can have already fitted classifiers combined under a 
> VotingClassifier.


The VotingClassifier in scikit-learn is based on the EnsembleVoteClassifier I 
had implemented in mlxtend 
(http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/#api).
 While I generally recommend using the VotingClassifier in scikit-learn, the 
code base of EnsembleVoteClassifier should be quite similar, and I have added a 
`refit` param which can be set to True or False. If refit=True, it's the same 
behavior as in sklearn. If refit=False, however, it will not refit the 
estimators and will allow you to use pre-fit classifiers, which is what you are 
asking for, I think?
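
For example, sth along these lines (a rough sketch only; please double-check the 
exact parameter names against the mlxtend docs linked above -- the iris data and 
the two classifiers are just placeholders for your pre-fitted models):

from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# pretend these were fitted earlier, e.g., on different hosts
clf1 = LogisticRegression().fit(X, y)
clf2 = GaussianNB().fit(X, y)

# refit=False keeps the pre-fitted classifiers as they are
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2], voting='soft', refit=False)
eclf.fit(X, y)   # sets up the ensemble (e.g., the label encoder) but does not re-fit clf1/clf2
print(eclf.predict(X[:5]))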

@scikit-learn devs:
Not sure if such a parameter should be added to scikit-learn's VotingClassifier 
as it may cause some weird behavior in GridSearch etc? Otherwise, I am happy to 
add an issue or submit a PR to discuss/work on this further :)

Best,
Sebastian


> On Oct 1, 2017, at 6:53 PM, Rares Vernica  wrote:
> 
> Hello,
> 
> I have a distributed setup where subsets of the data are available at 
> different hosts. I plan to have each host fit a model with the subset of the 
> data it owns. Once these individual models are fitted, how can I go about 
> combining them under one model?
> 
> I don't have a preference on a specific algorithm, but I am looking into a 
> classification problem.
> 
> I am looking at VotingClassifier but it seems that it is expected that the 
> estimators are fitted when VotingClassifier.fit() is called. I don't see how 
> I can have already fitted classifiers combined under a VotingClassifier.
> 
> Thanks!
> Rares
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Commercial use of ML algorithms and scikit-learn

2017-09-30 Thread Sebastian Raschka
Hi, Paul,

I think there should be no issue with that as scikit-learn is distributed under 
a BSD v3 license as long as you uphold the terms of that license. It's a bit 
tricky to find that license note as it's not called "LICENSE" in the GitHub 
repo like it is usually done for open source projects, but it is there in a 
file called "COPYING" 
(https://github.com/scikit-learn/scikit-learn/blob/master/COPYING):

> New BSD License
> 
> Copyright (c) 2007–2017 The scikit-learn developers.
> All rights reserved.
> 
> 
> Redistribution and use in source and binary forms, with or without
> modification, are permitted provided that the following conditions are met:
> 
>   a. Redistributions of source code must retain the above copyright notice,
>  this list of conditions and the following disclaimer.
>   b. Redistributions in binary form must reproduce the above copyright
>  notice, this list of conditions and the following disclaimer in the
>  documentation and/or other materials provided with the distribution.
>   c. Neither the name of the Scikit-learn Developers  nor the names of
>  its contributors may be used to endorse or promote products
>  derived from this software without specific prior written
>  permission. 
> 
> 
> THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR
> ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
> CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> DAMAGE.
> 


In a nutshell, it would mean that you can do anything with scikit-learn except 
that you can't use the names of sklearn devs or sklearn itself to promote your 
products, and you have to include the license if you redistribute any parts of 
sklearn. However, I'd still suggest to consult someone in your legal department 
regarding the license to make sure that you don't run into any troubles later 
on.

Best,
Sebastian



> On Oct 1, 2017, at 12:58 AM, Paul Smith  wrote:
> 
> Dear Scikit-learn users:
> 
> My name is Paul and I am working at a large electronics company. Sorry that I 
> cannot reveal the name of the company. 
> 
> My boss asked me to improve our business using ML algorithms. However, I 
> recently found that many ML algorithms are patented.
> 
> Are there any legal problems if I use ML algorithms like SVM, decision trees, 
> clustering methods, and feature extraction for my company without 
> permission?
> 
> If there are no problems, can I use scikit-learn for implementation?
> 
> Could anyone advise me on this issue please?
> 
> Thank you a lot and have a nice weekend.
> 
> Best regards,
> Paul
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] anti-correlated predictions by SVR

2017-09-26 Thread Sebastian Raschka
I'd agree with Gael that a potential explanation could be the distribution 
shift upon splitting (usually the smaller the dataset, the more this is of an 
issue). As potential solutions/workarounds, you could try

a) stratified sampling for regression, if you'd like to stick with the 2-way 
holdout method
b) use leave-one-out cross validation for evaluation (your model will likely 
benefit from the additional training samples; see the short sketch below)
c) use leave-one-out bootstrap (at each round, draw a bootstrap sample from the 
dataset for training, then use the points not in the training dataset for 
testing)
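
A minimal sketch of (b), just to illustrate -- the SVR and the random toy data 
are placeholders for your model and your 16-observation training set:

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.randn(16, 5)            # placeholder for your training features
y = rng.randn(16)               # placeholder for your training targets

# each sample is predicted by a model trained on the remaining 15 samples
y_pred = cross_val_predict(SVR(), X, y, cv=LeaveOneOut())
print(np.corrcoef(y, y_pred)[0, 1])   # Pearson's R over the held-out predictions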

Best,
Sebastian

> On Sep 26, 2017, at 12:48 PM, Thomas Evangelidis  wrote:
> 
> I have very small training sets (10-50 observations). Currently, I am working 
> with 16 observations for training and 25 for validation (external test set). 
> And I am doing Regression, not Classification (hence the SVR instead of SVC).
> 
> 
> On 26 September 2017 at 18:21, Gael Varoquaux  
> wrote:
> Hypothesis: you have a very small dataset and when you leave out data,
> you create a distribution shift between the train and the test. A
> simplified example: 20 samples, 10 class a, 10 class b. A leave-one-out
> cross-validation will create a training set of 10 samples of one class, 9
> samples of the other, and the test set is composed of the class that is
> minority on the train set.
> 
> G
> 
> On Tue, Sep 26, 2017 at 06:10:39PM +0200, Thomas Evangelidis wrote:
> > Greetings,
> 
> > I don't know if anyone encountered this before, but sometimes I get
> > anti-correlated predictions by the SVR that I am training. Namely, the 
> > Pearson's R and Kendall's tau are negative when I compare the predictions on
> > the external test set with the true values. However, the SVR predictions on 
> > the
> > training set have positive correlations with the experimental values and 
> > hence
> > I can't think of a way to know in advance if the trained SVR will produce
> > anti-correlated predictions in order to change their sign and avoid the
> > disaster. Here is an example of what I mean:
> 
> > Training set predictions: R=0.452422, tau=0.33
> > External test set predictions: R=-0.537420, tau=-0.30
> 
> > Obviously, in a real case scenario where I wouldn't have the external test 
> > set
> > I would have used the worst observation instead of the best ones. Has 
> > anybody
> > any idea about how I could prevent this?
> 
> > thanks in advance
> > Thomas
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone:  ++ 33-1-69-08-79-68
> http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> -- 
> ==
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049, 
> 62500 Brno, Czech Republic 
> 
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] batch_size for small training sets

2017-09-24 Thread Sebastian Raschka
Small batch sizes are typically used to speed up the training (more iterations) 
and to avoid the issue that training sets usually don’t fit into memory. Okay, 
the additional noise from the stochastic approach may also be helpful to escape 
local minima and/or help with generalization performance (e.g., as discussed in 
the recent paper where the authors compared SGD to other optimizers). In any 
case, since batch size is effectively a hyperparameter, I would just experiment 
with a few values and compare. Also, since you have a small dataset, I would 
maybe also try to just go with batch gradient descent (i.e., batch size = n 
training samples).
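
E.g., a quick sketch of what I mean by experimenting with a few values (the toy 
data and the hyperparameters below are just placeholders):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.randn(30, 5)                  # placeholder for your 10-50 observations
y = X[:, 0] + 0.1 * rng.randn(30)

for batch_size in (4, 8, 30):         # 30 = n_samples, i.e., full-batch updates
    mlp = MLPRegressor(hidden_layer_sizes=(10,), solver='adam',
                       batch_size=batch_size, max_iter=2000, random_state=0)
    mlp.fit(X, y)
    # R^2 on the training data, just a rough comparison; ideally compare on held-out data
    print(batch_size, mlp.score(X, y))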

Best,
Sebastian 

Sent from my iPhone

> On Sep 24, 2017, at 4:35 PM, Thomas Evangelidis  wrote:
> 
> Greetings,
> 
> I train MLPRegressors using small datasets, usually with 10-50 observations. 
> The default batch_size=min(200, n_samples) for the adam optimizer, and 
> because my n_samples is always < 200, it is eventually batch_size=n_samples. 
> According to the theory, stochastic gradient-based optimizers like adam 
> perform better in the small batch regime. Considering the above, what would 
> be a good batch_size value in my case (e.g. 4)? Is there any rule of thumb to 
> select the batch_size when the n_samples is small or must the choice be based 
> on trial and error?
> 
> 
> -- 
> ==
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049, 
> 62500 Brno, Czech Republic 
> 
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Help needed

2017-09-14 Thread Sebastian Raschka
Honestly not sure what the core devs' preference is, but maybe just submit it 
as a PR and take the discussion of the additional plotting features (for a 
potential removal, inclusion, or a move of these features to the documentation) 
from there.

Best,
Sebastian

> On Sep 14, 2017, at 9:42 PM, L Ali <nj.yua...@gmail.com> wrote:
> 
> Hi Sebastian,
>  
> Thanks for your quick response, there are two functions in my code that will 
> output a chart using matplotlib. Do you know how I can discuss the PR via 
> an issue with the main devs? Sorry for such stupid questions.  
>  
> Thanks again for your advice.
>  
> Li Yuan
>  
> From: Sebastian Raschka
> Sent: Thursday, September 14, 2017 9:36 PM
> To: Scikit-learn mailing list
> Subject: Re: [scikit-learn] Help needed
>  
> Hi, Li,
>  
> to me, it looks like you are importing matplotlib in your code, but 
> matplotlib is not being installed on the CI instances that are running the 
> scikit-learn unit tests. Or in other words, the Travis instance is trying to 
> execute an "import matplotlib..." and fails because matplotlib is not 
> installed there. Except for the docs, I think matplotlib code is not being 
> tested in scikit-learn's unit tests (and hence, it's not being installed). 
> Does your code/contribution require matplotlib or is it just imported "by 
> accident"? If the latter is true, simply removing matplotlib imports will 
> prob. solve the issue; otherwise, I guess discussing the PR via an issue with 
> the main devs might be the way to go.
>  
> Best,
> Sebastian
>  
> > On Sep 14, 2017, at 9:24 PM, L Ali <nj.yua...@gmail.com> wrote:
> >
> > Hi guys,
> >  
> > I am totally new to the scikit-learn, I am going to submit a pull request 
> > to the repository, but always got following error message, I could not find 
> > any usefully information from Google, my last hope is our community.
> >  
> > Is there anyone can give me some advise about this error: 
> > ModuleNotFoundError: No module named 'matplotlib'
> >  
> > Thanks so much!
> >  
> > 
> >  
> > Li Yuan
> >  
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>  
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>  
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Help needed

2017-09-14 Thread Sebastian Raschka
Hi, Li,

to me, it looks like you are importing matplotlib in your code, but matplotlib 
is not being installed on the CI instances that are running the scikit-learn 
unit tests. Or in other words, the Travis instance is trying to execute an 
"import matplotlib..." and fails because matplotlib is not installed there. 
Except for the docs, I think matplotlib code is not being tested in 
scikit-learn's unit tests (and hence, it's not being installed). Does your 
code/contribution require matplotlib or is it just imported "by accident"? If 
the latter is true, simply removing matplotlib imports will prob. solve the 
issue; otherwise, I guess discussing the PR via an issue with the main devs 
might be the way to go.

Best,
Sebastian

> On Sep 14, 2017, at 9:24 PM, L Ali  wrote:
> 
> Hi guys,
>  
> I am totally new to scikit-learn. I am going to submit a pull request to 
> the repository, but I always get the following error message, and I could not 
> find any useful information from Google; my last hope is our community.
>  
> Is there anyone who can give me some advice about this error: 
> ModuleNotFoundError: No module named 'matplotlib'
>  
> Thanks so much!
>  
> 
>  
> Li Yuan
>  
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function

2017-09-13 Thread Sebastian Raschka
> Is it possible to change the loss function in KerasRegressor? I don't have 
> time right now to experiment with hyperparameters of new ANN architectures. I 
> am in urgent need to reproduce in Keras the results obtained with 
> MLPRegressor and the set of hyperparameters that I have optimized for my 
> problem and later change the loss function

Honestly, I don't have much experience with Keras. It may be easy to do that, I 
don't know.

Alternatively, defining an MLP regressor in TensorFlow is not that hard and 
only few lines of code. E.g., you could copy the mlp classifier from (cell 4) 
here:

https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/multilayer-perceptron-lowlevel.ipynb

just delete the two last ops in the output layer

out_act = tf.nn.softmax(out_z, name='predicted_probabilities')
out_labels = tf.argmax(out_z, axis=1, name='predicted_labels'))

and replace the loss/cost by 

tf.losses.mean_squared_error

and you should have a MLP regressor running in a few lines of code. Then, you 
could experiment with your loss function by doing your own function. E.g., the 
usage is quite similar to what you do in NumPy, the mean_squared_error above 
can be manually defined as e.g.,

cost = tf.reduce_sum(tf.pow(pred - y, 2)) / (2 * n_samples)
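
And, if I understood your centered RMSE correctly, a rough, untested sketch in 
the same TF1-style notation -- assuming "centered" means subtracting the 
respective means of the predictions and targets before taking the usual RMSE of 
the difference (please double-check against your exact definition):

import tensorflow as tf

pred = tf.placeholder(tf.float32, shape=[None])   # network outputs
y = tf.placeholder(tf.float32, shape=[None])      # targets

pred_c = pred - tf.reduce_mean(pred)
y_c = y - tf.reduce_mean(y)
cost = tf.sqrt(tf.reduce_mean(tf.square(pred_c - y_c)))   # centered RMSE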

Best,
Sebastian

> On Sep 13, 2017, at 1:18 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
> 
> ​​
> Thanks again for the clarifications Sebastian!
> 
> Keras has a Scikit-learn API with the KeraRegressor which implements the 
> Scikit-Learn MLPRegressor interface:
> 
> https://keras.io/scikit-learn-api/
> 
> Is it possible to change the loss function in KerasRegressor? I don't have 
> time right now to experiment with hyperparameters of new ANN architectures. I 
> am in urgent need to reproduce in Keras the results obtained with 
> MLPRegressor and the set of hyperparameters that I have optimized for my 
> problem and later change the loss function.
> 
> 
> 
> On 13 September 2017 at 18:14, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > What about the SVR? Is it possible to change the loss function there?
> 
> Here you would have the same problem; SVR is a constrained optimization 
> problem and you would have to change the calculation of the loss gradient 
> then. Since SVR is a "1-layer" neural net, if you change the cost function to 
> something else, it's not really a SVR anymore.
> 
> 
> > Could you please clarify what the "x" and "x'" parameters in the default 
> > Kernel functions mean? Is "x" a NxM array, where N is the number of 
> > observations and M the number of features?
> 
> Both x and x' should be denoting training examples. The kernel matrix is 
> symmetric (N x N).
> 
> 
> 
> Best,
> Sebastian
> 
> > On Sep 13, 2017, at 5:25 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> >
> > Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list, 
> > but now it's in my immediate plans.
> > What about the SVR? Is it possible to change the loss function there? Could 
> > you please clarify what the "x" and "x'" parameters in the default Kernel 
> > functions mean? Is "x" a NxM array, where N is the number of observations 
> > and M the number of features?
> >
> > http://scikit-learn.org/stable/modules/svm.html#kernel-functions
> >
> >
> >
> > On 12 September 2017 at 00:37, Sebastian Raschka <se.rasc...@gmail.com> 
> > wrote:
> > Hi Thomas,
> >
> > > For the MLPRegressor case so far my conclusion was that it is not 
> > > possible unless you modify the source code.
> >
> > Also, I suspect that this would be non-trivial. I haven't looked to closely 
> > at how the MLPClassifier/MLPRegressor are implemented but since you perform 
> > the weight updates based on the gradient of the cost function wrt the 
> > weights, the modification would be non-trivial if the partial derivatives 
> > are not computed based on some autodiff implementation -- you would have to 
> > edit all the partial d's along the backpropagation up to the first hidden 
> > layer. While I think that scikit-learn is by far the best library out there 
> > for machine learning, I think if you want an easy solution, you probably 
> > won't get around TensorFlow or PyTorch or equivalent, here, for your 
> > specific MLP problem unless you want to make your life extra hard :P 
> > (seriously, you can pick up any of the two in about an hour and have your 
> > MLPRegressor up and running so that you can then experiment with your cost 
> > function).
> >
> > Best,
> > Sebastian
> >

Re: [scikit-learn] custom loss function

2017-09-13 Thread Sebastian Raschka
> What about the SVR? Is it possible to change the loss function there?

Here you would have the same problem; SVR is a constrained optimization problem 
and you would have to change the calculation of the loss gradient then. Since 
SVR is a "1-layer" neural net, if you change the cost function to something 
else, it's not really a SVR anymore.


> Could you please clarify what the "x" and "x'" parameters in the default 
> Kernel functions mean? Is "x" a NxM array, where N is the number of 
> observations and M the number of features?

Both x and x' should be denoting training examples. The kernel matrix is 
symmetric (N x N).



Best,
Sebastian

> On Sep 13, 2017, at 5:25 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> 
> Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list, but 
> now it's in my immediate plans.
> What about the SVR? Is it possible to change the loss function there? Could 
> you please clarify what the "x" and "x'" parameters in the default Kernel 
> functions mean? Is "x" a NxM array, where N is the number of observations and 
> M the number of features?
> 
> http://scikit-learn.org/stable/modules/svm.html#kernel-functions
> 
> 
> 
> On 12 September 2017 at 00:37, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Hi Thomas,
> 
> > For the MLPRegressor case so far my conclusion was that it is not possible 
> > unless you modify the source code.
> 
> Also, I suspect that this would be non-trivial. I haven't looked to closely 
> at how the MLPClassifier/MLPRegressor are implemented but since you perform 
> the weight updates based on the gradient of the cost function wrt the 
> weights, the modification would be non-trivial if the partial derivatives are 
> not computed based on some autodiff implementation -- you would have to edit 
> all the partial d's along the backpropagation up to the first hidden layer. 
> While I think that scikit-learn is by far the best library out there for 
> machine learning, I think if you want an easy solution, you probably won't 
> get around TensorFlow or PyTorch or equivalent, here, for your specific MLP 
> problem unless you want to make your life extra hard :P (seriously, you can 
> pick up any of the two in about an hour and have your MLPRegressor up and 
> running so that you can then experiment with your cost function).
> 
> Best,
> Sebastian
> 
> > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
> >
> > Greetings,
> >
> > I know this is a recurrent question, but I would like to use my own loss 
> > function either in a MLPRegressor or in an SVR. For the MLPRegressor case 
> > so far my conclusion was that it is not possible unless you modify the 
> > source code. On the other hand, for the SVR I was looking at setting custom 
> > kernel functions. But I am not sure if this is the same thing. Could 
> > someone please clarify this to me? Finally, I read about the "scoring" 
> > parameter is cross-validation, but this is just to select a Regressor that 
> > has been trained already with the default loss function, so it would be 
> > harder to find one that minimizes my own loss function.
> >
> > For the record, my loss function is the centered root mean square error.
> >
> > Thanks in advance for any advice.
> >
> >
> >
> > --
> > ==
> > Dr Thomas Evangelidis
> > Post-doctoral Researcher
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/2S049,
> > 62500 Brno, Czech Republic
> >
> > email: tev...@pharm.uoa.gr
> >   teva...@gmail.com
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> -- 
> ==
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049, 
> 62500 Brno, Czech Republic 
> 
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function

2017-09-11 Thread Sebastian Raschka
Hi Thomas,

> For the MLPRegressor case so far my conclusion was that it is not possible 
> unless you modify the source code.

Also, I suspect that this would be non-trivial. I haven't looked too closely at 
how the MLPClassifier/MLPRegressor are implemented but since you perform the 
weight updates based on the gradient of the cost function wrt the weights, the 
modification would be non-trivial if the partial derivatives are not computed 
based on some autodiff implementation -- you would have to edit all the partial 
d's along the backpropagation up to the first hidden layer. While I think that 
scikit-learn is by far the best library out there for machine learning, I think 
if you want an easy solution, you probably won't get around TensorFlow or 
PyTorch or equivalent, here, for your specific MLP problem unless you want to 
make your life extra hard :P (seriously, you can pick up any of the two in 
about an hour and have your MLPRegressor up and running so that you can then 
experiment with your cost function).

Best,
Sebastian

> On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis  wrote:
> 
> Greetings,
> 
> I know this is a recurrent question, but I would like to use my own loss 
> function either in a MLPRegressor or in an SVR. For the MLPRegressor case so 
> far my conclusion was that it is not possible unless you modify the source 
> code. On the other hand, for the SVR I was looking at setting custom kernel 
> functions. But I am not sure if this is the same thing. Could someone please 
> clarify this to me? Finally, I read about the "scoring" parameter is 
> cross-validation, but this is just to select a Regressor that has been 
> trained already with the default loss function, so it would be harder to find 
> one that minimizes my own loss function.
> 
> For the record, my loss function is the centered root mean square error. 
> 
> Thanks in advance for any advice.
> 
> 
> 
> -- 
> ==
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049, 
> 62500 Brno, Czech Republic 
> 
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] control value range of MLPRegressor predictions

2017-09-10 Thread Sebastian Raschka
You could normalize the outputs (e.g., via min-max scaling). However, I think 
the more intuitive way would be to clip the predictions. E.g., say you are 
predicting house prices, it probably makes no sense to have a negative 
prediction, so you would clip the output at some value > 0.
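
E.g., a minimal sketch of the clipping idea (the toy data is just a placeholder; 
the bounds would come from your training data or from domain knowledge):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X_train = rng.randn(30, 4)
y_train = rng.uniform(-9, -5, size=30)     # placeholder targets in the -9..-5 range
X_test = rng.randn(10, 4)

reg = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                   random_state=0).fit(X_train, y_train)

# clip the predictions to the value range seen during training
y_pred = np.clip(reg.predict(X_test), y_train.min(), y_train.max())
print(y_pred)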

PS: -820 and -800 sounds a bit extreme if your training data is in a -5 to -9 
range. Is your training data from a different population than the one you use 
for testing/making predictions? Or maybe it's just an extreme case of 
overfitting.

Best,
Sebastian


> On Sep 10, 2017, at 3:13 PM, Thomas Evangelidis  wrote:
> 
> Greetings,
> 
> Is there any way to force the MLPRegressor to make predictions in the same 
> value range as the training data? For example, if the training data range 
> between -5 and -9, I don't want the predictions to range between -820 and 
> -800. In fact, some times I get anti-correlated predictions, for example 
> between 800 and 820 and I have to change the sign in order to calculate 
> correlations with experimental values. Is there a way to control the value 
> range explicitly or implicitly (by post-processing the predictions)?
> 
> thanks
> Thomas
> 
> 
> -- 
> ==
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049, 
> 62500 Brno, Czech Republic 
> 
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] combining datasets from different sources

2017-09-05 Thread Sebastian Raschka
Another approach would be to pose this as a "ranking" problem to predict 
relative affinities rather than absolute affinities. E.g., if you have data 
from one (or more) molecules that has/have been tested under 2 or more 
experimental conditions, you can rank the other molecules accordingly or 
normalize. E.g., if you observe that the binding affinity of molecule A is -7 
kcal/mol in assay 1 and -9 kcal/mol in assay 2, and say the binding affinities 
of molecule B are -10 and -12 kcal/mol, respectively, that should give you some 
information for normalizing the values from assay 2 (e.g., by adding 2 
kcal/mol). Of course this is not a perfect solution and might be error prone, 
but so are experimental assays ... (when I sometimes look at the std error/CI 
of the data I get from collaborators ... well, it seems that absolute binding 
affinities should always be taken with a grain of salt anyway)
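
Just to make the offset idea concrete, a tiny sketch with made-up numbers, 
assuming a few molecules were measured in both assays:

import numpy as np

# made-up affinities (kcal/mol) of the same molecules measured in two assays
assay1 = np.array([-7.0, -10.0, -6.5])
assay2 = np.array([-9.0, -12.0, -8.4])

offset = np.mean(assay1 - assay2)     # here roughly +2 kcal/mol
assay2_shifted = assay2 + offset      # shift assay 2 values onto the assay 1 scale
print(offset, assay2_shifted)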

Best,
Sebastian

> On Sep 5, 2017, at 1:02 PM, Jason Rudy  wrote:
> 
> Thomas,
> 
> This is sort of related to the problem I did my M.S. thesis on years ago: 
> cross-platform normalization of gene expression data.  If you google that 
> term you'll find some papers.  The situation is somewhat different, though, 
> because with microarrays or RNA-seq you get thousands of data points for each 
> experiment, which makes it easier to estimate the batch effect.  The 
> principle is similar, however.  
> 
> If I were in your situation, I would consider whether I have any of the 
> following advantages:
> 
> 1. Some molecules that appear in multiple data sets
> 2. Detailed information about the different experimental conditions
> 3. Physical/chemical models of how experimental conditions influence binding 
> affinity
> 
> If you have any of the above, you can potentially use them to improve your 
> estimates.  You could also consider using experiment ID as a categorical 
> predictor in a sufficiently general regression method.
> 
> Lastly, you may already know this, but the term "meta-analysis" is relevant 
> here, and you can google for specific techniques.  Most of these would be 
> more limited than what you are envisioning, I think.
> 
> Best,
> 
> Jason
> 
> On Tue, Sep 5, 2017 at 6:39 AM, Thomas Evangelidis  wrote:
> Greetings,
> 
> I am working on a problem that involves predicting the binding affinity of 
> small molecules on a receptor structure (a regression problem, not 
> classification). I have multiple small datasets of molecules with measured 
> binding affinities on a receptor, but each dataset was measured in different 
> experimental conditions and therefore I cannot use them all together as a 
> training set. So, instead of using them individually, I was wondering 
> whether there is a method to combine them all into a super training set. The 
> first way I could think of is to convert the binding affinities to Z-scores 
> and then combine all the small datasets of molecules. But this would be 
> inaccurate because, firstly, the datasets are very small (10-50 molecules 
> each), and secondly, the range of binding affinities differs in each 
> experiment (some datasets contain really strong binders, while others do not, 
> etc.). Is there any other approach to combine datasets with values coming 
> from different sources? Maybe if someone points me to the right reference I 
> could read and understand if it is applicable to my case.
> 
> Thanks,
> Thomas
> 
> -- 
> ==
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049, 
> 62500 Brno, Czech Republic 
> 
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
> 
> website: https://sites.google.com/site/thomasevangelidishomepage/
> 
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Problem found when testing DecisionTreeClassifier within the source folder

2017-09-04 Thread Sebastian Raschka
Hi, Hanna,

I think Joel is right and the renaming is probably causing the issues. Instead 
of renaming the package to sklearn1, consider modifying, compiling, and 
installing sklearn in a virtual environment. I am not sure if you are using 
conda; if so, creating a new virtual env for development would be really 
straightforward:

conda create -n 'my-sklearn-dev'
source activate my-sklearn-dev

There are also a bunch of Python packages out there that do essentially the 
same thing (https://docs.python.org/3/tutorial/venv.html); I am not sure which 
one people generally recommend/prefer. 

Anyway, to use venv that should be available in Python already, you could do 
e.g.,

python -m venv my-sklearn-dev
source my-sklearn-dev/bin/activate

Best,
Sebastian

> On Sep 4, 2017, at 11:21 PM, Joel Nothman  wrote:
> 
> I suspect this is due to an intricacy of Cython. Despite using relative 
> imports, Cython expects the Criterion instance to come from a package called 
> sklearn, not called sklearn1.
> 
> On 5 September 2017 at 12:42, hanzi mao  wrote:
> 
> Hi,
> 
> I have been researching the source code of DecisionTree recently. Here are the 
> things I tried.
> 
>   • Downloaded source code from github. 
>   • run "python setup.py build_ext --inplace" to compile the sources in 
> the unzipped source folder.
>   • Try the following codes to see whether it works. Here I changed the 
> name of the sklearn folder to sklearn1 to differentiate it from the one 
> installed. 
> 
> 
> 
> >>> from sklearn1 import tree
> >>> from sklearn.datasets import load_iris
> >>> iris = load_iris()
> >>> clf = tree.DecisionTreeClassifier()
> >>> clf = clf.fit(iris.data, iris.target)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "sklearn1\tree\tree.py", line 790, in fit
>     X_idx_sorted=X_idx_sorted)
>   File "sklearn1\tree\tree.py", line 341, in fit
>     self.presort)
> TypeError: Argument 'criterion' has incorrect type (expected 
> sklearn.tree._criterion.Criterion, got sklearn.tree._criterion.Gini)
> 
> 
> Then a weird error happened. Actually I also tried the newest stable version 
> of scikit-learn earlier today. It had the same error. So I was thinking that 
> trying the newest version on GitHub might help. Unluckily, it didn't.
> 
> I have limited knowledge about the source code of scikit-learn. I am 
> wondering if anyone could help me with this.
> 
> Thanks!
> 
> Best,
> Hanna
> 
> 
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0

2017-08-25 Thread Sebastian Raschka
Just read through the summary of the new features and browsed through the user 
guide. The guide is really well structured and easy to navigate, thanks for 
putting all the work into it. Overall, thanks for this great contribution and 
new version :)

Best,
Sebastian

> On Aug 24, 2017, at 8:14 PM, Guillaume Lemaître  
> wrote:
> 
> We are excited to announce the new release of the scikit-learn-contrib 
> imbalanced-learn, already available through conda and pip (cf. the 
> installation page https://tinyurl.com/y92flbab for more info)
> 
> Notable add-ons are:
> 
> * Support of sparse matrices
> * Support of multi-class resampling for all methods
> * A new BalancedBaggingClassifier using random under-sampling chained with 
> the scikit-learn BaggingClassifier
> * Creation of a didactic user guide
> * New API of the ratio parameter to fit the needs of multi-class resampling
> * Migration from nosetests to pytest
> 
> You can check the full changelog at:
> http://contrib.scikit-learn.org/imbalanced-learn/stable/whats_new.html#version-0-3
> 
> A big thank you to contributors to use, raise issues, and submit PRs to 
> imblearn.
> -- 
> Guillaume Lemaitre
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] scikit-learn 0.19.0 is out!

2017-08-11 Thread Sebastian Raschka
Yay, as an avid user, thanks to all the developers! This is a great release 
indeed -- no breaking changes (at least for my code base) and so many 
improvements and additions (that I need to check out in detail) :)


> On Aug 12, 2017, at 1:14 AM, Gael Varoquaux  
> wrote:
> 
> Hurray, thank you everybody. This is a good one! (as always).
> 
> Gaël
> 
> On Sat, Aug 12, 2017 at 12:16:07AM +0200, Guillaume Lemaître wrote:
>> Congrats guys
> 
>> On 11 August 2017 at 23:57, Andreas Mueller  wrote:
> 
>>Thank you everybody for making the release possible, in particular Olivier
>>and Joel :)
> 
>>Wohoo!
> 
>>___
>>scikit-learn mailing list
>>scikit-learn@python.org
>>https://mail.python.org/mailman/listinfo/scikit-learn
> -- 
>Gael Varoquaux
>Researcher, INRIA Parietal
>NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
>Phone:  ++ 33-1-69-08-79-68
>http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] transform categorical data to numerical representation

2017-08-06 Thread Sebastian Raschka
> performance of prediction is pretty lame when there are around 100-150 
> columns used as the input.

you are talking about computational performance when you are calling the 
"transform" method? Have you done some profiling to find out where your 
bottleneck (in the for loop) is? Just from a very quick look, I think this

data.loc[~data[column].isin(fittedLabels), column] = str(replacementForUnseen)

is already very slow because fittedLabels is an array where you have O(n) 
lookup instead of an average O(1) by using a hash table. Or is the isin 
function converting it to a hashtable/set/dict?

In general, would it maybe help to use pandas' factorize? 
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html
For predict time, say you have only 1 example for prediction that needs to be 
converted, you could append prototypes of all possible values that could occur, 
do the transformation, and then only pass the 1 transformed sample to the 
classifier. I guess that could even be slow though ... 
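
To make the hash-lookup and factorize suggestions a bit more concrete, here is a 
tiny, rough sketch -- the column name, labels, and the 'UNSEEN' placeholder are 
made up for illustration, and fitted_labels/replacement_for_unseen are meant to 
correspond (if I read your gist correctly) to fittedLabels/replacementForUnseen:

import pandas as pd

data = pd.DataFrame({'color': ['red', 'green', 'purple', 'red']})
fitted_labels = {'red', 'green', 'blue'}     # a set gives O(1) average membership tests
replacement_for_unseen = 'UNSEEN'

# replace categories that were not seen during fit
data.loc[~data['color'].isin(fitted_labels), 'color'] = replacement_for_unseen
print(data)

# pandas' factorize as an alternative for the integer encoding itself
codes, uniques = pd.factorize(data['color'])
print(codes, uniques)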

Best,
Sebastian

> On Aug 6, 2017, at 6:30 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
> 
> @sebastian: thanks. Indeed, I am aware of this problem.
> 
> I developed something here: 
> https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but realized 
> that the performance of prediction is pretty lame when there are around 
> 100-150 columns used as the input.
> Do you have some ideas how to speed this up?
> 
> Regards,
> Georg
> 
> Joel Nothman <joel.noth...@gmail.com> schrieb am So., 6. Aug. 2017 um 00:49 
> Uhr:
> We are working on CategoricalEncoder in 
> https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more 
> with this kind of thing. Feedback and testing is welcome.
> 
> On 6 August 2017 at 02:13, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Hi, Georg,
> 
> I bring this up every time here on the mailing list :), and you probably 
> aware of this issue, but it makes a difference whether your categorical data 
> is nominal or ordinal. For instance if you have an ordinal variable like with 
> values like {small, medium, large} you probably want to encode it as {1, 2, 
> 3} or {1, 20, 100} or whatever is appropriate based on your domain knowledge 
> regarding the variable. If you have sth like {blue, red, green} it may make 
> more sense to do a one-hot encoding so that the classifier doesn't assume  a 
> relationship between the variables like blue > red > green or sth like that.
> 
> Now, the DictVectorizer and OneHotEncoder are both doing one hot encoding. 
> The LabelEncoder does convert a variable to integer values, but if you have 
> sth like {small, medium, large}, it wouldn't know the order (if that's an 
> ordinal variable) and it would just assign arbitrary integers in increasing 
> order. Thus, if you are dealing ordinal variables, there's no way around 
> doing this manually; for example you could create mapping dictionaries for 
> that (most conveniently done in pandas).
> 
> Best,
> Sebastian
> 
> > On Aug 5, 2017, at 5:10 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
> >
> > Hi,
> >
> > the LabelEncooder is only meant for a single column i.e. target variable. 
> > Is the DictVectorizeer or a manual chaining of multiple LabelEncoders (one 
> > per categorical column) the desired way to get values which can be fed into 
> > a subsequent classifier?
> >
> > Is there some way I have overlooked which works better and possibly also 
> > can handle unseen values by applying most frequent imputation?
> >
> > regards,
> > Georg
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] transform categorical data to numerical representation

2017-08-05 Thread Sebastian Raschka
Hi, Georg,

I bring this up every time here on the mailing list :), and you are probably 
aware of this issue, but it makes a difference whether your categorical data is 
nominal or ordinal. For instance, if you have an ordinal variable with values 
like {small, medium, large} you probably want to encode it as {1, 2, 3} 
or {1, 20, 100} or whatever is appropriate based on your domain knowledge 
regarding the variable. If you have sth like {blue, red, green} it may make 
more sense to do a one-hot encoding so that the classifier doesn't assume  a 
relationship between the variables like blue > red > green or sth like that.

Now, the DictVectorizer and OneHotEncoder are both doing one hot encoding. The 
LabelEncoder does convert a variable to integer values, but if you have sth 
like {small, medium, large}, it wouldn't know the order (if that's an ordinal 
variable) and it would just assign arbitrary integers in increasing order. 
Thus, if you are dealing with ordinal variables, there's no way around doing this 
manually; for example you could create mapping dictionaries for that (most 
conveniently done in pandas).
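
E.g., a quick sketch of what I mean (the column names and the mapping are of 
course just made-up examples):

import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium'],
                   'color': ['blue', 'red', 'green']})

# ordinal: make the order explicit via a mapping dictionary
size_map = {'small': 1, 'medium': 2, 'large': 3}
df['size'] = df['size'].map(size_map)

# nominal: one-hot encode, e.g., via get_dummies
df = pd.get_dummies(df, columns=['color'])
print(df)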

Best,
Sebastian

> On Aug 5, 2017, at 5:10 AM, Georg Heiler  wrote:
> 
> Hi,
> 
> the LabelEncooder is only meant for a single column i.e. target variable. Is 
> the DictVectorizeer or a manual chaining of multiple LabelEncoders (one per 
> categorical column) the desired way to get values which can be fed into a 
> subsequent classifier?
> 
> Is there some way I have overlooked which works better and possibly also can 
> handle unseen values by applying most frequent imputation?
> 
> regards,
> Georg
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
Maybe because they are genetic algorithms, which are -- for some reason -- not 
very popular in the ML field in general :P. (People in bioinformatics seem to 
use them a lot though.) Also, the name "Learning Classifier Systems" is a 
bit weird I must say: I remember that when Ryan introduced me to those, I was 
like "ah yeah, sure, I know machine learning classifiers" ;)



> On Jul 21, 2017, at 3:01 PM, Stuart Reynolds <stu...@stuartreynolds.net> 
> wrote:
> 
> +1
> LCS and its many many variants seem very practical and adaptable. I'm
> not sure why they haven't gotten traction.
> Overshadowed by GBM & random forests?
> 
> 
> On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka
> <se.rasc...@gmail.com> wrote:
>> Just to throw some additional ideas in here. Based on a conversation with a 
>> colleague some time ago, I think learning classifier systems 
>> (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly 
>> useful when working with large, sparse binary vectors (like from a one-hot 
>> encoding). I am really not into LCS's, and only know the basics (read 
>> through the first chapters of the Intro to Learning Classifier Systems 
>> draft; the print version will be out later this year).
>> Also, I saw an interesting poster on a Set Covering Machine algorithm once, 
>> which they benchmarked against SVMs, random forests and the like for 
>> categorical (genomics data). Looked promising.
>> 
>> Best,
>> Sebastian
>> 
>> 
>>> On Jul 21, 2017, at 2:37 PM, Raga Markely <raga.mark...@gmail.com> wrote:
>>> 
>>> Thank you, Jacob. Appreciate it.
>>> 
>>> Regarding 'perform better', I was referring to better accuracy, precision, 
>>> recall, F1 score, etc.
>>> 
>>> Thanks,
>>> Raga
>>> 
>>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber <jmschreibe...@gmail.com> 
>>> wrote:
>>> Traditionally tree based methods are very good when it comes to categorical 
>>> variables and can handle them appropriately. There is a current WIP PR to 
>>> add this support to sklearn. I'm not exactly sure what you mean that 
>>> "perform better" though. Estimators that ignore the categorical aspect of 
>>> these variables and treat them as discrete will likely perform worse than 
>>> those that treat them appropriately.
>>> 
>>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely <raga.mark...@gmail.com> 
>>> wrote:
>>> Hello,
>>> 
>>> I am wondering if there are some classifiers that perform better for 
>>> datasets with categorical features (converted into sparse input matrix with 
>>> pd.get_dummies())? The data for the categorical features are nominal (order 
>>> doesn't matter, e.g. country, occupation, etc).
>>> 
>>> If you could provide me some references (papers, books, website, etc), that 
>>> would be great.
>>> 
>>> Thank you very much!
>>> Raga
>>> 
>>> 
>>> 
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>> 
>>> 
>>> 
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>> 
>>> 
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
> Traditionally tree based methods are very good when it comes to categorical 
> variables and can handle them appropriately. There is a current WIP PR to add 
> this support to sklearn.

I think it's also important to distinguish between nominal and ordinal; it can 
make a huge difference imho. I.e., treating ordinal variables like continuous 
variable probably makes more sense than one-hot encoding them. Looking forward 
to the PR  :)

> On Jul 21, 2017, at 2:52 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> 
> Just to throw some additional ideas in here. Based on a conversation with a 
> colleague some time ago, I think learning classifier systems 
> (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly 
> useful when working with large, sparse binary vectors (like from a one-hot 
> encoding). I am really not into LCS's, and only know the basics (read through 
> the first chapters of the Intro to Learning Classifier Systems draft; the 
> print version will be out later this year). 
> Also, I saw an interesting poster on a Set Covering Machine algorithm once, 
> which they benchmarked against SVMs, random forests and the like for 
> categorical (genomics data). Looked promising.
> 
> Best,
> Sebastian
> 
> 
>> On Jul 21, 2017, at 2:37 PM, Raga Markely <raga.mark...@gmail.com> wrote:
>> 
>> Thank you, Jacob. Appreciate it.
>> 
>> Regarding 'perform better', I was referring to better accuracy, precision, 
>> recall, F1 score, etc.
>> 
>> Thanks,
>> Raga
>> 
>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber <jmschreibe...@gmail.com> 
>> wrote:
>> Traditionally tree based methods are very good when it comes to categorical 
>> variables and can handle them appropriately. There is a current WIP PR to 
>> add this support to sklearn. I'm not exactly sure what you mean that 
>> "perform better" though. Estimators that ignore the categorical aspect of 
>> these variables and treat them as discrete will likely perform worse than 
>> those that treat them appropriately.
>> 
>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely <raga.mark...@gmail.com> wrote:
>> Hello,
>> 
>> I am wondering if there are some classifiers that perform better for 
>> datasets with categorical features (converted into sparse input matrix with 
>> pd.get_dummies())? The data for the categorical features are nominal (order 
>> doesn't matter, e.g. country, occupation, etc).
>> 
>> If you could provide me some references (papers, books, website, etc), that 
>> would be great.
>> 
>> Thank you very much!
>> Raga
>> 
>> 
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>> 
>> 
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>> 
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
Just to throw some additional ideas in here. Based on a conversation with a 
colleague some time ago, I think learning classifier systems 
(https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly 
useful when working with large, sparse binary vectors (like from a one-hot 
encoding). I am really not into LCS's, and only know the basics (read through 
the first chapters of the Intro to Learning Classifier Systems draft; the print 
version will be out later this year). 
Also, I saw an interesting poster on a Set Covering Machine algorithm once, 
which they benchmarked against SVMs, random forests and the like for 
categorical (genomics) data. Looked promising.

Best,
Sebastian


> On Jul 21, 2017, at 2:37 PM, Raga Markely  wrote:
> 
> Thank you, Jacob. Appreciate it.
> 
> Regarding 'perform better', I was referring to better accuracy, precision, 
> recall, F1 score, etc.
> 
> Thanks,
> Raga
> 
> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber  
> wrote:
> Traditionally tree based methods are very good when it comes to categorical 
> variables and can handle them appropriately. There is a current WIP PR to add 
> this support to sklearn. I'm not exactly sure what you mean that "perform 
> better" though. Estimators that ignore the categorical aspect of these 
> variables and treat them as discrete will likely perform worse than those 
> that treat them appropriately.
> 
> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely  wrote:
> Hello,
> 
> I am wondering if there are some classifiers that perform better for datasets 
> with categorical features (converted into sparse input matrix with 
> pd.get_dummies())? The data for the categorical features are nominal (order 
> doesn't matter, e.g. country, occupation, etc).
> 
> If you could provide me some references (papers, books, website, etc), that 
> would be great.
> 
> Thank you very much!
> Raga
> 
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Replacing the Boston Housing Prices dataset

2017-07-06 Thread Sebastian Raschka
I think there can be some middle ground. I.e., adding a new, simple dataset to 
demonstrate regression (maybe auto-mpg, wine quality, or sth like that) and use 
that for the scikit-learn examples in the main documentation etc., but leave the 
boston dataset in the code base for now. Whether it's a weak argument or not, 
it would be quite destructive to remove the dataset altogether in the next 
version or so, not only because old tutorials use it but many unit tests in 
many different projects depend on it. I think it might be better to phase it 
out by having a good alternative first, and I am sure that the scikit-learn 
maintainers wouldn't have anything against it if someone would update the 
examples/tutorials with the use of different datasets

Best,
Sebastian

> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias  wrote:
> 
> For what it's worth: I'm sympathetic to the argument that you can't fix the 
> problem if you don't measure it, but I agree with Tony that "many tutorials 
> use it" is an extremely weak argument. We removed Lena from scikit-image 
> because it was the right thing to do. I very much doubt that Boston house 
> prices is in more widespread use than Lena was in image processing.
> 
> You can argue about whether or not it's morally right or wrong to include the 
> dataset. I see merit to both arguments. But "too many tutorials use it" is 
> very similar in flavour to "the economy of the South would collapse without 
> slavery."
> 
> Regarding fair uses of the feature, I would hope that all sklearn tutorials 
> using the dataset mention such uses. The potential for abuse and 
> misinterpretation is enormous.
> 
> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , 
> wrote:
>> Hi Tony
>> 
>> As others have pointed out, I think that you may be misunderstanding the 
>> purpose of that "feature." We are in agreement that discrimination against 
>> protected classes is not OK, and that even outside complying with the law 
>> one should avoid discrimination, in model building or elsewhere. However, I 
>> disagree that one does this by eliminating from all datasets any feature 
>> that may allude to these protected classes. As Andreas pointed out, there is 
>> a growing effort to ensure that machine learning models are fair and benefit 
>> the common good (such as FATML, DSSG, etc..), and from my understanding the 
>> general consensus isn't necessarily that simply eliminating the feature is 
>> sufficient. I think we are in agreement that naively learning a model over a 
>> feature set containing questionable features and calling it a day is not 
>> okay, but as others have pointed out, having these features present and 
>> handling them appropriately can help guard against the model implicitly 
>> learning unfair biases (even if they are not explicitly exposed to the feature). 
>> 
>> I would welcome the addition of the Ames dataset to the ones supported by 
>> sklearn, but I'm not convinced that the Boston dataset should be removed. As 
>> Andreas pointed out, there is a benefit to having canonical examples present 
>> so that beginners can easily follow along with the many tutorials that have 
>> been written using them. As Sean points out, the paper itself is trying to 
>> pull out the connection between house price and clean air in the presence of 
>> possible confounding variables. In a more general sense, saying that a 
>> feature shouldn't be there because a simple linear regression is unaffected 
>> by the results is a bit odd because it is very common for datasets to 
>> include irrelevant features, and handling them appropriately is important. 
>> In addition, one could argue that having this type of issue arise in a toy 
>> dataset has a benefit because it exposes these types of issues to those 
>> learning data science earlier on and allows them to keep these issues in 
>> mind in the future when the data is more serious.
>> 
>> It is important for us all to keep issues of fairness in mind when it comes 
>> to data science. I'm glad that you're speaking out in favor of fairness and 
>> trying to bring attention to it. 
>> 
>> Jacob
>> 
>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante  
>> wrote:
>> G Reina 
>> you make a bizarre argument. You argue that you should not even check racism 
>> as a possible factor in house prices? 
>> 
>> But then you yourself check whether its relevant 
>> Then you say 
>> 
>> "but I'd argue that it's more due to the location (near water, near 
>> businesses, near restaurants, near parks and recreation) than to the ethnic 
>> makeup" 
>> 
>> Which  was basically what  the original authors wanted to show too,
>> 
>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean 
>> air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
>> 
>>  but unless you measure ethnic make-up you cannot show that it is not a 
>> confounder. 
>> 
>> The term "white flight" 

Re: [scikit-learn] [Feature] drop_one in one hot encoder

2017-06-25 Thread Sebastian Raschka
Hi,

hm, I think that dropping a column from one-hot encoded features is quite uncommon 
in machine learning practice -- based on the applications and implementations 
I've seen, the one-hot encoded features are typically left multicollinear 
anyway. There may be certain algorithms that benefit from dropping a column, 
though (e.g., linear regression as a simple example). For instance, pandas' 
get_dummies has a "drop_first" parameter for exactly this ...
I think it would make sense to have such a parameter in the OneHotEncoder as 
well, e.g., for working with pipelines.
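E.g., a quick sketch of the pandas behavior (the 'color' column is just a 
made-up example):

import pandas as pd

df = pd.DataFrame({'color': ['blue', 'green', 'red', 'green']})

# drop_first=True drops the first dummy column ('color_blue' here),
# so 'blue' becomes the implicit reference category
print(pd.get_dummies(df, drop_first=True))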

Best,
Sebastian


> On Jun 25, 2017, at 7:48 AM, Parminder Singh  wrote:
> 
> Hy Sci-kittens! :-)
> 
> I was doing machine learning a-z course on Udemy, there they told that every 
> time one-hot encoding is done, one of the columns should be dropped as it is 
> like doubling same category twice and redundant to model. I thought if 
> instead of having user find the index and drop it after preprocessing, 
> OneHotEncoder had a drop_one variable, and it automatically removed the last 
> column. What are your thoughts about this? I am new to this community, would 
> like to contribute this myself if it is possible addition.
> 
> Thanks,
> Trion129
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] R user trying to learn Python

2017-06-18 Thread Sebastian Raschka
Hi,

> I am extremely frustrated using this thing. Everything comes after a dot! Why 
> would you type the sam thing at the beginning of every line. It's not 
> efficient.
> 
> code 1:
> y_sin = np.sin(x)
> y_cos = np.cos(x)
> 
> I know you can import the entire package without the "as np", but I see 
> np.something as the standard. Why?

Because it makes it clear where this function is coming from. Sure, you could 
do 

from numpy import *

but this is NOT!!! recommended. The reason is that 
it would clutter up your main namespace. For instance, NumPy has its own sum 
function. If you do from numpy import *, Python's built-in `sum` will be gone 
from your main namespace and replaced (shadowed) by NumPy's sum. This is confusing and 
should be avoided. 
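A quick way to see the shadowing problem:

print(sum(range(5), -1))   # Python's built-in sum; -1 is the start value -> prints 9

from numpy import *

print(sum(range(5), -1))   # now it's numpy.sum; -1 is the axis argument -> prints 10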

> In the code above, sklearn > linear_model > Ridge, one lives inside the 
> other, it feels that there are multiple layer, how deep do I have to dig in?
> 
> Can someone explain the mentality behind this setup?

This is one way to organize your code and package. Sklearn contains many 
things, and organizing it by subpackages (linear_model, svm, ...) only makes 
sense; otherwise, you would end up with code files > 100,000 lines or so, which 
would make life really hard for package developers. 
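In practice, you usually don't have to dig deeper than the subpackage; you can 
import the class directly, e.g.:

from sklearn.linear_model import Ridge

reg = Ridge(alpha=0.5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])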

Here, scikit-learn tries to follow the core principles of good object-oriented 
program design, for instance, abstraction, encapsulation, modularity, 
hierarchy, ...

> What are some good ways and resources to learn Python for data analysis?

I think, based on your questions, a good resource would be an introductory 
programming book or course. I think that sections on object-oriented 
programming would make the rationale/design/API of scikit-learn and Python 
classes as a whole more accessible and address your concerns and questions.

Best,
Sebastian

> On Jun 18, 2017, at 12:02 PM, C W  wrote:
> 
> Dear Scikit-learn,
> 
> What are some good ways and resources to learn Python for data analysis?
> 
> I am extremely frustrated using this thing. Everything comes after a dot! Why 
> would you type the sam thing at the beginning of every line. It's not 
> efficient.
> 
> code 1:
> y_sin = np.sin(x)
> y_cos = np.cos(x)
> 
> I know you can import the entire package without the "as np", but I see 
> np.something as the standard. Why?
> 
> Code 2:
> model = LogisticRegression()
> model.fit(X_train, y_train)
> model.score(X_test, y_test)
> 
> In R, everything is saved to a variable. In the code above, what if I 
> accidentally ran model.fit(), I would not know.
> 
> Code 3:
> from sklearn import linear_model
> reg = linear_model.Ridge (alpha = .5)
> reg.fit ([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
> 
> In the code above, sklearn > linear_model > Ridge, one lives inside the 
> other, it feels that there are multiple layer, how deep do I have to dig in?
> 
> Can someone explain the mentality behind this setup?
> 
> Thank you very much!
> 
> M
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Cross-validation & cross-testing

2017-06-04 Thread Sebastian Raschka
> Is it possible for somebody to take a look and give any feedback?

Just looked over your repo and have some feedback:

- Definitely cite the original research paper that your implementation is based 
on. Right now it just says "The cross-validation & cross-testing method was 
developed by Korjus et al." (year, journal, title, ... are missing).
- Like Joel mentioned, I'd add unit tests and also consider CI services like 
Travis to check that the code indeed works (produces the same results) for 
package versions newer than the ones you listed, since you use ">=".
- Maybe a good, explanatory figure would help -- often, a good figure can make 
things much more clear and intuitive for a user. For new algorithms, it is also 
helpful to explain them in a procedural way using a numbered list of steps. In 
addition to describing the package, also consider stating the problem this 
approach is going to address.


Just a few general comments on the paper (which I only skimmed, I have to 
admit). Not sure what to think of this; it might be an interesting idea, but 
showing empirical results on only 2 datasets and a simulated one does not 
convince me that this is useful in practice, yet. Also, a discussion/analysis 
of bias and variance seems to be missing from the paper. Another thing is that 
in practice, one would also consider LOOCV or bootstrap approaches for 
"very" small datasets, which is not even mentioned in the paper. While I think 
there might be some interesting idea here, I'd say there needs to be additional 
research to judge whether this approach should be used in practice 
or not -- I would say it's a bit too early to include something like this in 
scikit-learn?

Best,
Sebastian


> On Jun 4, 2017, at 9:53 PM, Joel Nothman  wrote:
> 
> And when I mean testing it, I mean writing tests that live with the code so 
> that they can be re-executed, and so that someone else can see what your 
> tests assert about your code's correctness.
> 
> On 5 June 2017 at 11:52, Joel Nothman  wrote:
> Hi Rain,
> 
> I would suggest that you start by documenting what your code is meant to do 
> (the structure of the Korjus et al paper makes it pretty difficult to even 
> determine what this technique is, for you to then not to describe it in your 
> own words in your repository), testing it with diverse inputs and ensuring 
> that it is correct. At a glance I can see at least two sources of bugs, and 
> some API design choices which I think could be improved.
> 
> Cheers,
> 
> Joel
> 
> On 5 June 2017 at 07:04, Rain Vagel  wrote:
> Hey,
> 
> I am a bachelor’s student and for my thesis I implemented a cross-testing 
> function in a scikit-learn compatible way and published it on Github. The 
> paper on which I based my own thesis can be found here: 
> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0161788.
> 
> My project can be found here: 
> https://github.com/RainVagel/cross-val-cross-test.
> 
> Our original plan was to try and get the algorithm into scikit-learn, but it 
> doesn’t meet the requirements yet. So instead we thought about maybe having 
> it listed in the “Related Projects” page. Is it possible for somebody to take 
> a look and give any feedback?
> 
> Sincerely,
> Rain
> 
> 
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Failing check_estimator on py-earth

2017-05-19 Thread Sebastian Raschka
Hm, I am actually not sure; it could be a bug. If I see it correctly, the 
problem is in 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/estimator_checks.py#L1519
 which could be related to the 'astype' calls in 
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/estimator_checks.py#L458
 but maybe the scikit-devs know more about this.



> On May 19, 2017, at 6:33 PM, Jason Rudy <jcr...@gmail.com> wrote:
> 
> Thanks, Sebastian.  I'll consider using that platform check trick to disable 
> the test for 32 bit windows.  It is a small difference, and perhaps not worth 
> all the effort of tracking down.  It's part of check_estimator, so I'd have 
> to disable the entirety of check_estimator I think.  However, testing on 32 
> bit windows is probably not terribly important.
> 
> On Fri, May 19, 2017 at 3:22 PM, Sebastian Raschka <se.rasc...@gmail.com 
> <mailto:se.rasc...@gmail.com>> wrote:
> > I'll probably have to set up a 32 bit environment with a debugger and drill 
> > down to find the bug,
> 
> Must not be a bug but can simply be due to floating point imprecision. If you 
> checked that this is expected behavior, you could you do sth like
> 
> import numpy.distutils.system_info as sysinfo
> if sysinfo.platform_bits == 32:
>     numpy.testing.assert_array_almost_equal(..., decimal=0)
> else:
>     numpy.testing.assert_array_almost_equal(..., decimal=2)
> 
> or sth like that?
> 
> Best,
> Sebastian
> 
> > On May 19, 2017, at 6:10 PM, Jason Rudy <jcr...@gmail.com 
> > <mailto:jcr...@gmail.com>> wrote:
> >
> > I'm pushing to get py-earth ready for a release, but I'm having an issue 
> > with the check_estimator function on 32 bit windows machines.  Here is a 
> > link to the failing build on appveyor:
> >
> > https://ci.appveyor.com/project/jcrudy/py-earth/build/job/21r6838yh1bgwxw4 
> > <https://ci.appveyor.com/project/jcrudy/py-earth/build/job/21r6838yh1bgwxw4>
> >
> > It appears that array conversion is producing some small differences that 
> > make check_estimators_data_not_an_array fail.  I'll probably have to set up 
> > a 32 bit environment with a debugger and drill down to find the bug, but 
> > I'm wondering if anybody here has tips or experience that might help me 
> > guess the problem without doing that.  I am pretty ignorant about numpy 
> > type standards and conversions, so even something that seems obvious to you 
> > might help me.
> >
> > Best,
> >
> > Jason
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org <mailto:scikit-learn@python.org>
> > https://mail.python.org/mailman/listinfo/scikit-learn 
> > <https://mail.python.org/mailman/listinfo/scikit-learn>
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org <mailto:scikit-learn@python.org>
> https://mail.python.org/mailman/listinfo/scikit-learn 
> <https://mail.python.org/mailman/listinfo/scikit-learn>
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Broken c dependencies

2017-05-09 Thread Sebastian Raschka
Hi,
How did you install scikit-learn, from source or via pip? Not sure since it's 
been a long time since I set up my macOS from scratch, but I think you need to 
install the Xcode command line tools at least. Have you checked that they are 
available? E.g., via xcode-select -p
BTW, does NumPy / SciPy work on your install, or is it just sklearn?

Best,
Sebastian



Sent from my iPhone
> On May 9, 2017, at 11:36 AM, Georg Heiler  wrote:
> 
> Hi,
> 
> unfortunately, the c dependencies of my scikit-learn installation broke and I 
> get the following error on osx:
> dlopen(/usr/local/lib/python3.6/site-packages/sklearn/svm/libsvm.cpython-36m-darwin.so,
>  2): Symbol not found: __ZdlPvm
>   Referenced from: 
> /usr/local/lib/python3.6/site-packages/sklearn/svm/libsvm.cpython-36m-darwin.so
>  (which was built for Mac OS X 10.12)
>   Expected in: /usr/lib/libstdc++.6.dylib
>  in 
> /usr/local/lib/python3.6/site-packages/sklearn/svm/libsvm.cpython-36m-darwin.so
> Even removing my python installation and re-installing does not seem to get 
> this library back.
> 
> Regards,
> Georg
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] RFE/RFECV parameter suggestion

2017-04-30 Thread Sebastian Raschka
For RFECV, I think that a min_features parameter could be useful. 

Alternatively, making XGBoost more scikit-learn compatible instead of making 
scikit-learn more XGBoost compatible could be another take on this.

Best,
Sebastian

> On Apr 30, 2017, at 3:13 PM, George Fisher  wrote:
> 
> I found that xgboost generates an exception under RFECV when the number of 
> features remaining falls below 3. I fixed this for myself by adding a 
> 'stop_at' parameter (default=1) that stops the process in RFE when the 
> remaining features falls below this number. I think it might be a useful 
> feature more broadly than simply as a hacked work-around so I offer it as a 
> pull request.
> 
> George Fisher
> geo...@georgefisher.com
> +1 917-514-8204
> https://github.com/grfiv
> 
> Ubuntu 17.04 Desktop
> Python 3.5.3
> IPython 6.0.0
> sklearn 0.18.1
> (xgboost 0.6)
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] How to dump a model to txt file?

2017-04-13 Thread Sebastian Raschka
Hi,

not sure how this could generally work. However, you could at least dump the 
model parameters for, e.g., linear models and compute the prediction via

w_1 * x_1 + w_2 * x_2 + … + w_n * x_n + bias

over the n features. 

To write various model attributes to text files, you could use json, e.g., see 
https://cmry.github.io/notes/serialize
However, I don’t think that this approach will solve the problem of loading the 
model into C++.
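For the linear case sketched above, a rough example of such a dump could look 
like this (using scikit-learn's LinearRegression on random toy data just for 
illustration -- a tree ensemble like GradientBoostingRegressor has no such 
simple coefficient vector):

import json
import numpy as np
from sklearn.linear_model import LinearRegression

X, y = np.random.rand(20, 3), np.random.rand(20)
model = LinearRegression().fit(X, y)

# write the weights and the bias to a plain-text (JSON) file that C++ could parse
with open('model.json', 'w') as f:
    json.dump({'coef': model.coef_.tolist(),
               'intercept': float(model.intercept_)}, f)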

Best,
Sebastian

> On Apr 13, 2017, at 4:58 PM, 老陈 <26743...@qq.com> wrote:
> 
> Hi,
> 
> I am working on GradientBoostingRegressor these days and I am wondering if 
> there is a way to dump the model into txt file, or any other format that can 
> be processed by c++
> 
> My production system is in c++, so I want use the python-trained tree model 
> in c++ for production.
> 
> Has anyone ever done this before?
> 
> thanks
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Random forest prediction probability value is limited to a single decimal point

2017-04-13 Thread Sebastian Raschka
Hi,
Have you tried setting numpy.set_printoptions(precision=8)? Maybe that helps 
already.
Best,
Sebastian 



Sent from my iPhone

> On Apr 13, 2017, at 1:54 PM, Suranga Kasthurirathne  
> wrote:
> 
> 
> Hi all,
> 
> I'm using scikit-learn to build a number of random forrest models using the 
> default number of trees.
> 
> However, when I print out the prediction probability 
> (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba)
>  for each outcome, its presented to me as a single decimal point (0.1, 0.2, 
> 0.5 etc.). Only perhaps 5% of the data has more than a single decimal point.
> 
> Is this normal behavior? is there some way I can increase the number of 
> decimal points in the prediction probability outcomes? why arent I seeing 
> more probabilities such as 0.231, 0.1, 0.462156 etc.?
> 
> 
> -- 
> Thanks and best Regards,
> Suranga
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] urgent help in scikit-learn

2017-04-03 Thread Sebastian Raschka
Don’t get me wrong, but you’d have to either manually label them yourself, 
ask domain experts, or use platforms like Amazon Mechanical Turk (or collect them in 
some other way). 

> On Apr 3, 2017, at 7:38 AM, Shuchi Mala <shuchi...@gmail.com> wrote:
> 
> How can I get  ground truth labels of the training examples in my dataset?
> 
> With Best Regards,
> Shuchi  Mala
> Research Scholar
> Department of Civil Engineering
> MNIT Jaipur
> 
> 
> On Fri, Mar 31, 2017 at 8:17 PM, Sebastian Raschka <se.rasc...@gmail.com> 
> wrote:
> Hi, Shuchi,
> 
> regarding labels_true: you’d only be able to compute the Rand index adjusted 
> for chance if you have the ground truth labels of the training examples in 
> your dataset.
> 
> The second parameter, labels_pred, takes in the predicted cluster labels 
> (indices) that you got from the clustering. E.g,
> 
> dbscn = DBSCAN()
> labels_pred = dbscn.fit_predict(X)
> 
> Best,
> Sebastian
> 
> 
> > On Mar 31, 2017, at 12:02 AM, Shuchi Mala <shuchi...@gmail.com> wrote:
> >
> > Thank you so much for your quick reply. I have one more doubt. The below 
> > statement is used to calculate rand score.
> >
> > metrics.adjusted_rand_score(labels_true, labels_pred)
> >  In my case what will be labels_true and labels_pred and how I will 
> > calculate labels_pred?
> >
> > With Best Regards,
> > Shuchi  Mala
> > Research Scholar
> > Department of Civil Engineering
> > MNIT Jaipur
> >
> >
> > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby <shane.grig...@colorado.edu> 
> > wrote:
> > Since you're using lat / long coords, you'll also want to convert them to 
> > radians and specify 'haversine' as your distance metric; i.e. :
> >
> >coords = np.vstack([lats.ravel(),longs.ravel()]).T
> >coords *= np.pi / 180. # to radians
> >
> > ...and:
> >
> >db = DBSCAN(eps=0.3, min_samples=10, metric='haversine')
> ># replace eps and min_samples as appropriate
> >db.fit(coords)
> >
> > Cheers,
> > Shane
> >
> >
> > On 03/30, Sebastian Raschka wrote:
> > Hi, Shuchi,
> >
> > 1. How can I add data to the data set of the package?
> >
> > You don’t need to add your dataset to the dataset module to run your 
> > analysis. A convenient way to load it into a numpy array would be via 
> > pandas. E.g.,
> >
> > import pandas as pd
> > df = pd.read_csv('your_data.txt', delimiter=r"\s+")
> > X = df.values
> >
> > 2. How I can calculate Rand index for my data?
> >
> > After you ran the clustering, you can use the “adjusted_rand_score” 
> > function, e.g., see
> > http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-score
> >
> > 3. How to use make_blobs command for my data?
> >
> > The make_blobs command is just a utility function to create toydatasets, 
> > you wouldn’t need it in your case since you already have “real” data.
> >
> > Best,
> > Sebastian
> >
> >
> > On Mar 30, 2017, at 4:51 AM, Shuchi Mala <shuchi...@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > I have the data with following attributes: (Latitude, Longitude). Now I am 
> > performing clustering using DBSCAN for my data. I have following doubts:
> >
> > 1. How can I add data to the data set of the package?
> > 2. How I can calculate Rand index for my data?
> > 3. How to use make_blobs command for my data?
> >
> > Sample of my data is :
> > LatitudeLongitude
> > 37.76901-122.429299
> > 37.76904-122.42913
> > 37.76878-122.429092
> > 37.7763 -122.424249
> > 37.77627-122.424657
> >
> >
> > With Best Regards,
> > Shuchi  Mala
> > Research Scholar
> > Department of Civil Engineering
> > MNIT Jaipur
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > --
> > *PhD candidate & Research Assistant*
> > *Cooperative Institute for Research in Environmental Sciences (CIRES)*
> > *University of Colorado at Boulder*
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] urgent help in scikit-learn

2017-03-31 Thread Sebastian Raschka
Hi, Shuchi,

regarding labels_true: you’d only be able to compute the Rand index adjusted 
for chance if you have the ground truth labels of the training examples in 
your dataset. 

The second parameter, labels_pred, takes in the predicted cluster labels 
(indices) that you got from the clustering. E.g, 

dbscn = DBSCAN()
labels_pred = dbscn.fit_predict(X)

Best,
Sebastian


> On Mar 31, 2017, at 12:02 AM, Shuchi Mala <shuchi...@gmail.com> wrote:
> 
> Thank you so much for your quick reply. I have one more doubt. The below 
> statement is used to calculate rand score.
>  
> metrics.adjusted_rand_score(labels_true, labels_pred) 
>  In my case what will be labels_true and labels_pred and how I will calculate 
> labels_pred?
> 
> With Best Regards,
> Shuchi  Mala
> Research Scholar
> Department of Civil Engineering
> MNIT Jaipur
> 
> 
> On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby <shane.grig...@colorado.edu> 
> wrote:
> Since you're using lat / long coords, you'll also want to convert them to 
> radians and specify 'haversine' as your distance metric; i.e. :
> 
>coords = np.vstack([lats.ravel(),longs.ravel()]).T
>coords *= np.pi / 180. # to radians
> 
> ...and:
> 
>db = DBSCAN(eps=0.3, min_samples=10, metric='haversine')
># replace eps and min_samples as appropriate
>db.fit(coords)
> 
> Cheers,
> Shane
> 
> 
> On 03/30, Sebastian Raschka wrote:
> Hi, Shuchi,
> 
> 1. How can I add data to the data set of the package?
> 
> You don’t need to add your dataset to the dataset module to run your 
> analysis. A convenient way to load it into a numpy array would be via pandas. 
> E.g.,
> 
> import pandas as pd
> df = pd.read_csv('your_data.txt', delimiter=r"\s+")
> X = df.values
> 
> 2. How I can calculate Rand index for my data?
> 
> After you ran the clustering, you can use the “adjusted_rand_score” function, 
> e.g., see
> http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-score
> 
> 3. How to use make_blobs command for my data?
> 
> The make_blobs command is just a utility function to create toydatasets, you 
> wouldn’t need it in your case since you already have “real” data.
> 
> Best,
> Sebastian
> 
> 
> On Mar 30, 2017, at 4:51 AM, Shuchi Mala <shuchi...@gmail.com> wrote:
> 
> Hi everyone,
> 
> I have the data with following attributes: (Latitude, Longitude). Now I am 
> performing clustering using DBSCAN for my data. I have following doubts:
> 
> 1. How can I add data to the data set of the package?
> 2. How I can calculate Rand index for my data?
> 3. How to use make_blobs command for my data?
> 
> Sample of my data is :
> LatitudeLongitude
> 37.76901-122.429299
> 37.76904-122.42913
> 37.76878-122.429092
> 37.7763 -122.424249
> 37.77627-122.424657
> 
> 
> With Best Regards,
> Shuchi  Mala
> Research Scholar
> Department of Civil Engineering
> MNIT Jaipur
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> -- 
> *PhD candidate & Research Assistant*
> *Cooperative Institute for Research in Environmental Sciences (CIRES)*
> *University of Colorado at Boulder*
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] urgent help in scikit-learn

2017-03-30 Thread Sebastian Raschka
Hi, Shuchi,

> 1. How can I add data to the data set of the package?

You don’t need to add your dataset to the dataset module to run your analysis. 
A convenient way to load it into a numpy array would be via pandas. E.g.,

import pandas as pd
df = pd.read_csv('your_data.txt', delimiter=r"\s+")
X = df.values

> 2. How I can calculate Rand index for my data?

After you ran the clustering, you can use the “adjusted_rand_score” function, 
e.g., see
http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-score

> 3. How to use make_blobs command for my data?

The make_blobs command is just a utility function to create toy datasets; you 
wouldn’t need it in your case since you already have “real” data.
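That said, if you just want to see how the adjusted Rand index behaves when the 
ground truth is known, make_blobs is handy for a quick toy demo (the numbers 
here are arbitrary):

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

X, labels_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels_pred = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(adjusted_rand_score(labels_true, labels_pred))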

Best,
Sebastian


> On Mar 30, 2017, at 4:51 AM, Shuchi Mala  wrote:
> 
> Hi everyone,
> 
> I have the data with following attributes: (Latitude, Longitude). Now I am 
> performing clustering using DBSCAN for my data. I have following doubts:
> 
> 1. How can I add data to the data set of the package?
> 2. How I can calculate Rand index for my data?
> 3. How to use make_blobs command for my data?
> 
> Sample of my data is :
> Latitude  Longitude
> 37.76901  -122.429299
> 37.76904  -122.42913
> 37.76878  -122.429092
> 37.7763   -122.424249
> 37.77627  -122.424657
> 
> 
> With Best Regards,
> Shuchi  Mala
> Research Scholar
> Department of Civil Engineering
> MNIT Jaipur
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] recommended feature selection method to train an MLPRegressor

2017-03-19 Thread Sebastian Raschka
Hm, that’s tricky. I think the other methods listed on 
http://scikit-learn.org/stable/modules/feature_selection.html could provide a 
computationally cheap solution, but the problem is that they probably wouldn’t 
work that well for an MLP due to their linear assumptions. And 
an exhaustive sampling of all subsets would also be impractical/impossible: for 
subsets of exactly 50 out of your 534 features alone, you already have 
73353053308199416032348518540326808282134507009732998441913227684085760 
combinations :P. A greedy solution like forward or backward selection would be 
more feasible, but still very expensive in combination with an MLP. On top of 
that, you also have to consider that neural networks are generally pretty 
sensitive to hyperparameter settings. So even if you fix the architecture, you 
probably still want to check if the learning rate etc. is appropriate for each 
combination of features (by checking the cost and validation error during 
training).
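If you do want to try the greedy route despite the cost, a rough sketch with 
mlxtend's sequential feature selector could look like the following (toy data 
standing in for your 534-feature set, and the MLP settings are just placeholders 
-- expect this to be slow):

from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = make_regression(n_samples=100, n_features=534, random_state=0)

mlp = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
sfs = SFS(mlp, k_features=50, forward=True, floating=False,
          scoring='neg_mean_squared_error', cv=5)
sfs.fit(X, y)
print(sfs.k_feature_idx_)  # indices of the selected 50 features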

PS: I wouldn’t dismiss dropout, imho. Especially because your training set is 
small, it could even be crucial for reducing overfitting. I mean, it doesn’t remove 
features from your dataset but just keeps the network from relying on particular 
combinations of features always being present during training. Your final 
network will still process all features, and dropout will effectively cause your 
network to “use” more of the features in your ~50 feature subset compared to 
no dropout (because otherwise, it may just learn to rely on a subset of these 
50 features).

> On Mar 19, 2017, at 6:23 PM, Andreas Mueller  wrote:
> 
> 
> 
> On 03/19/2017 03:47 PM, Thomas Evangelidis wrote:
>> Which of the following methods would you recommend to select good features 
>> (<=50) from a set of 534 features in order to train a MLPregressor? Please 
>> take into account that the datasets I use for training are small.
>> 
>> http://scikit-learn.org/stable/modules/feature_selection.html
>> 
>> And please don't tell me to use a neural network that supports the dropout 
>> or any other algorithm for feature elimination. This is not applicable in my 
>> case because I want to know the best 50 features in order to append them to 
>> other types of feature that I am confident that are important.
>> 
> You can always use forward or backward selection as implemented in mlxtend if 
> you're patient. As your dataset is small that might work.
> However, it might be hard tricky to get the MLP to run consistently - though 
> maybe not...
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Is something wrong with this gridsearchCV?

2017-03-16 Thread Sebastian Raschka
I am not using Keras and don’t know how nicely it plays with sklearn objects 
these days, but you are not giving all the data to the grid search object, 
which is why your model doesn’t get to see the whole dataset during grid 
search; i.e., you have `np.asarray(input_train[:-(len(input_train)/1000)]`

> On Mar 16, 2017, at 11:50 AM, Carlton Banks  wrote:
> 
> I am currently using grid search to optimize my keras model… 
> 
> Something seemed  a bit off during the training?
> 
> https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0
> 
> For some reason is the training for each epoch not done for all datapoints?… 
> 
> What could be wrong?
> 
> Here is the code:
> 
> http://pastebin.com/raw/itJFm5a6
> 
> Anything that seems off?
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] GridsearchCV

2017-03-15 Thread Sebastian Raschka
The “-1” means that it will run on all processors that are available.

> On Mar 16, 2017, at 1:01 AM, Carlton Banks <nofl...@gmail.com> wrote:
> 
> Oh… totally forgot about that.. why -1?
>> Den 16. mar. 2017 kl. 05.58 skrev Joel Nothman <joel.noth...@gmail.com>:
>> 
>> If you're using something like n_jobs=-1, that will explode memory usage in 
>> proportion to the number of cores, and particularly so if you're passing the 
>> data as a list rather than array and hence can't take advantage of memmapped 
>> data parallelism.
>> 
>> On 16 March 2017 at 15:46, Carlton Banks <nofl...@gmail.com> wrote:
>> The ndarray (6,3,3) => (row, col,color channels)
>> 
>> I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list)
>> 
>> but this causes a different problem:
>> 
>> is grid use a lot a memory.. I am running on a super computer, and seem to 
>> have problems with memory.. already used 62 gb ram..
>> 
>> > Den 16. mar. 2017 kl. 05.30 skrev Sebastian Raschka <se.rasc...@gmail.com>:
>> >
>> > Sklearn estimators typically assume 2d inputs (as numpy arrays) with 
>> > shape=[n_samples, n_features].
>> >
>> >> list of Np.ndarrays of shape (6,3,3)
>> >
>> > I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, 
>> > n_pixels, n_pixels]? What you could do is to reshape it before you put it 
>> > in, i.e.,
>> >
>> > data_ary = your_ary.reshape(n_samples, -1).shape
>> >
>> > then, you need to add a line at the beginning your CNN class that does the 
>> > reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy’s 
>> > reshape typically returns view objects, so that these additional steps 
>> > shouldn’t be “too” expensive.
>> >
>> > Best,
>> > Sebastian
>> >
>> >
>> >
>> >> On Mar 16, 2017, at 12:00 AM, Carlton Banks <nofl...@gmail.com> wrote:
>> >>
>> >> Hi…
>> >>
>> >> I currently trying to optimize my CNN model using gridsearchCV, but seem 
>> >> to have some problems feading my input data..
>> >>
>> >> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and 
>> >> my output is stored as a list of np.array with one entry.
>> >>
>> >> Why am I having problems parsing my data to it?
>> >>
>> >> best regards
>> >> Carl B.
>> >> ___
>> >> scikit-learn mailing list
>> >> scikit-learn@python.org
>> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> > ___
>> > scikit-learn mailing list
>> > scikit-learn@python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>> 
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] GridsearchCV

2017-03-15 Thread Sebastian Raschka
Sklearn estimators typically assume 2d inputs (as numpy arrays) with 
shape=[n_samples, n_features]. 

> list of Np.ndarrays of shape (6,3,3)

I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, 
n_pixels]? What you could do is to reshape it before you put it in, i.e., 

data_ary = your_ary.reshape(n_samples, -1)

then, you need to add a line at the beginning of your CNN class that does the 
reverse, i.e., data_ary.reshape(-1, n_pixels, n_pixels). NumPy’s 
reshape typically returns view objects, so these additional steps 
shouldn’t be “too” expensive.

Best,
Sebastian



> On Mar 16, 2017, at 12:00 AM, Carlton Banks  wrote:
> 
> Hi… 
> 
> I currently trying to optimize my CNN model using gridsearchCV, but seem to 
> have some problems feading my input data.. 
> 
> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my 
> output is stored as a list of np.array with one entry. 
> 
> Why am I having problems parsing my data to it?
> 
> best regards 
> Carl B. 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

