Re: [scikit-learn] Normalization in ridge regression when there is no intercept

2019-06-07 Thread Roman Yurchak via scikit-learn
On 06/06/2019 14:56, ahmetcik wrote:
> I have just noticed that when using ridge regression without an
> intercept, no normalization is performed even if the argument "normalize"
> is set to True.

It's a known, longstanding issue: 
https://github.com/scikit-learn/scikit-learn/issues/3020
It would indeed be good to find a solution.
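
To illustrate why this matters, here is a minimal sketch in plain Python, not scikit-learn code. It assumes a single feature, no intercept, and that "normalizing" means dividing the column by its L2 norm; the helper names (ridge_1d, l2_normalize, predictions) are made up for this example. Without normalization the ridge penalty depends on the units of the feature, so scaling by hand before fitting is one possible workaround.

```python
import math


def ridge_1d(x, y, alpha):
    """Closed-form ridge coefficient for one feature and no intercept.

    Minimizes sum((y_i - w * x_i)**2) + alpha * w**2.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


def l2_normalize(x):
    """Divide a column by its Euclidean norm (assumed meaning of 'normalize')."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


def predictions(x, y, alpha, normalize):
    """Fit on x (optionally normalized by hand) and return fitted values."""
    if normalize:
        x = l2_normalize(x)
    w = ridge_1d(x, y, alpha)
    return [w * a for a in x]


# The same feature expressed in two different units (hypothetical data).
x_m = [1.0, 2.0, 3.0, 4.0]
x_km = [a / 1000.0 for a in x_m]
y = [2.0, 4.0, 6.0, 8.0]

# Without normalization the fitted values depend strongly on the units.
raw_m = predictions(x_m, y, alpha=1.0, normalize=False)
raw_km = predictions(x_km, y, alpha=1.0, normalize=False)

# With manual normalization the fitted values no longer depend on the units.
norm_m = predictions(x_m, y, alpha=1.0, normalize=True)
norm_km = predictions(x_km, y, alpha=1.0, normalize=True)
```

With several features the same idea applies column by column, for instance by scaling the data in a preprocessing step before a ridge model fitted without an intercept.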

-- 
Roman

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Starting to contribute

2019-04-07 Thread Roman Yurchak via scikit-learn
Hello Heitor,

Yes, you can choose an issue, comment there that you plan to work on it 
(to avoid redundant work by other contributors) and, if no one objects, 
make a PR. If you have any questions you can ask them by commenting on 
that issue (for specific questions) or on the scikit-learn Gitter 
https://gitter.im/scikit-learn/scikit-learn (for general questions about 
how to contribute).

See https://scikit-learn.org/stable/developers/contributing.html for 
more information.

Roman

On 06/04/2019 19:07, Heitor Boschirolli wrote:
> Hello!
> 
> First of all, I apologize if this email list is not for such questions, but 
> I have never contributed to open source code before and I'm not sure how to 
> proceed. Could someone help me with that?
> 
> Should I just pick an issue, solve it following the guidelines described 
> in the website and open a PR?
> If I have any trouble, can I post it on the mailing list?
> 
> Att, Heitor Boschirolli


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Roman Yurchak via scikit-learn
+1 for option 1 and +0.5 for 3. Do we anticipate that many plotting 
functions will be added? If it's just a dozen or fewer, putting them all 
into a single namespace sklearn.plot might be easier.

This also would avoid discussion about where to put some generic 
plotting functions (e.g. 
https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).

Roman

On 03/04/2019 12:06, Trevor Stephens wrote:
> I think #1 if any of these... Plotting functions should hopefully be as 
> general as possible, so tagging with a specific type of estimator will, 
> in some scikit-learn utopia, be unnecessary.
> 
> If a general plotter is built, where does it live in other 
> estimator-specific namespace options? Feels awkward to put it under 
> every estimator's namespace.
> 
> Then again, there might be a #4 where there is no plot module and 
> plotting classes live under groups of utilities like introspection, 
> cross-validation or something?...
> 
> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
> 
> My preference would be for (1). I don't think the sub-namespace in
> (2) is necessary, and don't like (3), as I would prefer the plotting
> functions to be all in the same namespace sklearn.plot.
> 
> Andrew
> 
> <~~~>
> J. Andrew Howe, PhD
> I live to learn, so I can learn to live. - me
> <~~~>
> 
> 
> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
> 
> See https://github.com/scikit-learn/scikit-learn/issues/13448
> 
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to
> decide where to put these functions. Currently, there're 3
> proposals:
> 
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
> 
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
> 
> (3) sklearn.XXX.plot.plot_YYY (e.g.,
> sklearn.tree.plot.plot_tree, note that we won't support from
> sklearn.XXX import plot_YYY)
> 
> Joel Nothman, Gael Varoquaux and I decided to post it on the
> mailing list to invite opinions.
> 
> Thanks
> 
> Hanmin Qin
> 




Re: [scikit-learn] Sprint discussion points?

2019-02-19 Thread Roman Yurchak via scikit-learn
Thanks for putting the draft schedule together!

Personally I will be there 3 days out of 5 and wouldn't want to miss the 
discussion on euclidean distance issues. Maybe we could adjust the 
schedule during the sprint (say on Tuesday) based on people's interest 
and availability? That might be easier than trying to figure it out for 
29 participants over email.

Also, IMO it would make sense to have some discussions (that are not 
that controversial or about high level API but still useful) earlier 
during the week to be able to work on them during the sprint.

-- 
Roman

On 20/02/2019 02:30, Joel Nothman wrote:
> I don't think I'll be able to stay for the Friday 10am discussion, but 
> have a PR open on "efficient grid search" so should probably be involved.
> 
> Perhaps the fit_transform discussion can happen without you, Andy?
> 
> On Wed, 20 Feb 2019 at 10:17, Andreas Mueller wrote:
> 
> I put a draft schedule here:
> 
> https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events#technical-discussions-schedule
> 
> it's obviously somewhat opinionated ;)
> Happy to reprioritize.
> Basically I wouldn't like to miss any of the big API discussions
> because of coming late to the party.
> 
> The two things on (grid?) searches are somewhat related: one is
> about specifying search-spaces, the other about executing a given
> search space efficiently. They probably warrant separate discussions.
> 
> I haven't added plotting or sample props on it, which are maybe two
> other discussion points.
> I tried to cover most controversial things from the roadmap.
> 
> Not sure if discussing the schedule via the mailing list is the best
> way? Don't want to create even more traffic than I already am ;)
> 
> On 2/19/19 5:48 PM, Guillaume Lemaître wrote:
>> > Not sure if Guillaume had ideas about the schedule, given that
>> he seems to be running the show?
>>
>> Mostly running behind the show ...
>>
>> For the moment, we only have a 30-minute introductory presentation
>> planned on Monday.
>> The rest of the week is pretty much open, and I think
>> that we can propose a schedule such that we can be efficient.
>> IMO, two one-hour meetings per day look good to me.
>>
>> Shall we prioritize the list of issues? Maybe some issues
>> could be grouped together.
>> I would not be against having a rough schedule on the wiki
>> directly, and I think that having it before Monday would be better.
>>
>> Let me know how I can help.
>>
>> On Tue, 19 Feb 2019 at 22:23, Andreas Mueller wrote:
>>
>> Yeah, sounds good.
>> I didn't want to unilaterally post a schedule, but doing some
>> google form or similar seems a bit heavy-handed?
>> Not sure if Guillaume had ideas about the schedule, given that
>> he seems to be running the show?
>>
>> On 2/19/19 4:17 PM, Joel Nothman wrote:
>>> I don't think optics requires a large meeting, just a few
>>> people.
>>>
>>> I'm happy with your proposal generally, Andy. Do we schedule
>>> specific topics at this point?
>>>
>>
>>
>>
>>
>> -- 
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>>
> 
> 




Re: [scikit-learn] Next Sprint

2018-12-22 Thread Roman Yurchak via scikit-learn
That works for me as well.

On 21/12/2018 16:00, Olivier Grisel wrote:
> Ok for me. The last 3 weeks of February are fine for me.
> 
> On Thu, Dec 20, 2018 at 21:21, Alexandre Gramfort wrote:
> 
> ok for me
> 
> Alex
> 
> On Thu, Dec 20, 2018 at 8:35 PM Adrin wrote:
>  >
>  > It'll be the least favourable week of February for me, but I can
>  > make do.
>  >
>  > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote:
>  >>
>  >> Works for me!
>  >>
>  >> On 12/19/18 5:33 PM, Gael Varoquaux wrote:
>  >> > I would propose the week of Feb 25th, as I heard people say that they
>  >> > might be available at this time. Is it good for many people, or should we
>  >> > organize a doodle?
>  >> >
>  >> > G
>  >> >
>  >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote:
>  >> >> Can we please nail down dates for a sprint?
>  >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote:
>  >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote:
>  >> >>>> We can also do Paris in April / May or June if that's ok with Joel and
>  >> >>>> better for Andreas.
>  >> >>> Absolutely.
>  >> >>> My thoughts here are that I want to minimize transportation, partly
>  >> >>> because flying has a large carbon footprint. Also, for personal reasons,
>  >> >>> I am not sure that I will be able to make it to Austin in July, but I
>  >> >>> realize that this is a pretty bad argument.
>  >> >>> We're happy to try to host in Paris whenever it's most convenient and to
>  >> >>> try to help with travel for those not in Paris.
>  >> >>> Gaël
>  >>
>  >
> 




Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Roman Yurchak via scikit-learn
Tries are interesting, but it appears that while they use less memory 
than dicts/maps, they are generally slower than dicts for a large number 
of elements. See e.g. 
https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst. 
This is also consistent with the results in the CountVectorizer PR linked 
below that aimed to use tries, I think.
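
To make the comparison concrete, here is a minimal sketch of a character trie used as a token -> index vocabulary, written in plain Python. It is purely illustrative: TokenTrie and its methods are names invented for this example, and nested dicts are not expected to show the memory savings of a compact implementation such as MARISA-Trie. What it does show is that a trie lookup walks one node per character, whereas a dict lookup is a single hash lookup on the whole token.

```python
class TokenTrie:
    """Minimal character trie mapping tokens to integer ids (illustrative only)."""

    _END = ""  # the empty string cannot collide with a single-character key

    def __init__(self):
        self.root = {}

    def insert(self, token, index):
        """Store `index` at the end of the path spelled by `token`."""
        node = self.root
        for ch in token:
            node = node.setdefault(ch, {})
        node[self._END] = index

    def get(self, token):
        """Return the stored index, or None if `token` was never inserted."""
        node = self.root
        for ch in token:  # one step per character of the token
            node = node.get(ch)
            if node is None:
                return None
        return node.get(self._END)


# The same vocabulary stored both ways (hypothetical tokens).
vocabulary = {"cat": 0, "car": 1, "card": 2, "dog": 3}

trie = TokenTrie()
for token, index in vocabulary.items():
    trie.insert(token, index)
```

Shared prefixes ("ca" in "cat", "car", "card") are stored once, which is where compact trie implementations save memory.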

Though maybe MARISA-Trie (and trie libraries available in Python 
generally) did improve significantly in the 5 years since 
https://github.com/scikit-learn/scikit-learn/issues/2639 was done.

The thing is also that even HashingVectorizer, which doesn't need to handle 
the vocabulary, is only moderately faster, so using a better data 
structure for the vocabulary might give us its performance at best.

-- 
Roman

On 26/11/2018 16:28, Andreas Mueller wrote:
> I think tries might be an interesting datastructure, but it really
> depends on where the bottleneck is.
> I'm really surprised they are not used more, but maybe that's just
> because implementations are missing?
> 
> 
> 




Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Roman Yurchak via scikit-learn
Hi Matthieu,

if you are interested in general questions regarding improving 
scikit-learn performance, you might want to have a look at the draft 
roadmap
https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- 
there are a lot of topics where suggestions / PRs on improving performance 
would be very welcome.

For the particular case of TfidfVectorizer, it is a bit different from 
the rest of the scikit-learn code base in the sense that it's not 
limited by the performance of numerical calculation but rather that of 
string processing and counting. TfidfVectorizer is equivalent to 
CountVectorizer + TfidfTransformer and the latter has only a marginal 
computational cost. As to CountVectorizer, last time I checked, its 
profiling was something along the lines of,
  - part regexp for tokenization (see token_pattern.findall)
  - part token counting (see CountVectorizer._count_vocab)
  - and a comparable part for all the rest
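
The first two items can be sketched in a few lines of plain Python. This is a simplified illustration and not the actual CountVectorizer code: count_vocab is a made-up helper, the lowercasing step and the regexp (two or more word characters per token) are assumptions about the default settings, and the output is a list of dicts rather than a sparse matrix. Both hot spots, the regexp findall and the dict updates, are visible in the inner loop.

```python
import re
from collections import defaultdict

# Assumed default-like token pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_vocab(docs):
    """Tokenize each document and count tokens against a growing vocabulary.

    Returns (vocabulary, rows) where vocabulary maps token -> column index
    and rows[i] maps column index -> count for document i.
    """
    vocabulary = {}
    rows = []
    for doc in docs:
        counts = defaultdict(int)
        # Hot spot 1: regexp tokenization.
        for token in TOKEN_PATTERN.findall(doc.lower()):
            # Hot spot 2: dict lookups and updates for every token.
            index = vocabulary.setdefault(token, len(vocabulary))
            counts[index] += 1
        rows.append(dict(counts))
    return vocabulary, rows


vocabulary, rows = count_vocab(["the cat sat", "the cat the dog"])
```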

Because of that, porting it to Cython is not that straightforward, as one is 
still going to use CPython regexp and token counting in a dict. For 
instance, HashingVectorizer implements token counting in Cython -- it's 
faster but not that much faster. Using C++ maps or some less common 
structures has been discussed in 
https://github.com/scikit-learn/scikit-learn/issues/2639
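
For comparison, here is a minimal pure-Python sketch of the hashing approach, where the column index comes from a hash of the token and no vocabulary dict is kept. It is only an illustration under stated assumptions: hashing_vectorize is a made-up name, zlib.crc32 stands in for the hash function used in scikit-learn, n_features is kept tiny for readability, and collisions are simply summed.

```python
import re
import zlib

# Assumed default-like token pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def hashing_vectorize(doc, n_features=16):
    """Return {column index: count} for one document using the hashing trick."""
    counts = {}
    for token in TOKEN_PATTERN.findall(doc.lower()):
        # The column is derived from the token itself, so no vocabulary is stored.
        column = zlib.crc32(token.encode("utf-8")) % n_features
        counts[column] = counts.get(column, 0) + 1
    return counts


row = hashing_vectorize("the cat sat on the mat", n_features=8)
```

The regexp tokenization is still there, which fits the observation that removing the vocabulary alone gives only a moderate speed-up.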

Currently, I think, there are ~3 main ways performance could be improved,
  1. Optimize the current implementation while remaining in Python. 
Possible but IMO would require some effort, because there is not much 
low-hanging fruit left there. Though a new look would definitely be good.

  2. Parallelize computations. There was some earlier discussion about 
this in scikit-learn issues, but at present, the better way would 
probably be to add it in dask-ml (see 
https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already 
supported. Someone would need to implement CountVectorizer.

  3. Rewrite part of the implementation in a lower level language (e.g. 
Cython). The question is how maintainable that would be, and whether the 
performance gains would be worth it.  Now that Python 2 will be dropped, 
at least not having to deal with Py2/3 compatibility for strings in 
Cython might make things a bit easier. Though, if the processing is in 
Cython it might also make using custom tokenizers/analyzers more difficult.
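
As a rough sketch of option 2, token counting can be split into a per-chunk "map" step with no shared state and a final "reduce" step that builds one global vocabulary. Needing that merge step is my guess at why CountVectorizer takes more work to parallelize than the stateless HashingVectorizer. The code below is plain Python, not dask-ml: count_chunk and parallel_count are hypothetical names, and the thread pool only stands in for real workers (processes or a cluster) to keep the example self-contained.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumed default-like token pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_chunk(docs):
    """Map step: per-document token counts for one chunk, no shared state."""
    return [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in docs]


def parallel_count(chunks, max_workers=2):
    """Count tokens chunk by chunk, then merge into one global vocabulary."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the chunks.
        per_chunk = list(pool.map(count_chunk, chunks))

    # Reduce step: the vocabulary can only be fixed once all chunks are seen.
    tokens = sorted({t for chunk in per_chunk for counts in chunk for t in counts})
    vocabulary = {token: index for index, token in enumerate(tokens)}

    rows = [
        {vocabulary[token]: n for token, n in counts.items()}
        for chunk in per_chunk
        for counts in chunk
    ]
    return vocabulary, rows


vocabulary, rows = parallel_count([["the cat sat"], ["the dog", "cat cat"]])
```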

On a related topic, I have been experimenting with implementing part 
of this processing in Rust lately: 
https://github.com/rth/text-vectorize. So far it looks promising. 
Though, of course, it will remain a separate project because of language 
constraints in scikit-learn.

In general if you have thoughts on things that can be improved, don't 
hesitate to open issues.
-- 
Roman


On 25/11/2018 10:59, Matthieu Brucher wrote:
> Hi all,
> 
> I've noticed a few questions online (mainly SO) on TfidfVectorizer 
> speed, and I was wondering about the global effort on speeding up sklearn.
> Is there something I can help with on this topic (Cython?), and could we 
> have a discussion on this tough subject?
> 
> Cheers,
> 
> Matthieu
> -- 
> Quantitative analyst, Ph.D.
> Blog: http://blog.audio-tk.com/
> LinkedIn: http://www.linkedin.com/in/matthieubrucher

