Re: [scikit-learn] Vote on SLEP009: keyword only arguments

2019-09-16 Thread Vlad Niculae
I vote +1

Hopefully keyword-only args become normalized and a future will come where
I won't see `x.sum(0)` anymore
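
For reference, the syntax in question is Python 3's keyword-only marker;
here is a minimal sketch with a hypothetical function (not a real
scikit-learn signature):

    # parameters declared after the bare * must be passed by name
    def normalize(X, *, axis=0, copy=True):
        return X

    normalize([[1, 2]], axis=1)   # OK: keyword
    # normalize([[1, 2]], 1)      # TypeError: too many positional arguments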

VN

On Sat, Sep 14, 2019 at 11:23 PM Thomas J Fan  wrote:

> +1 from me
>
> On Sat, Sep 14, 2019 at 8:12 AM Joel Nothman 
> wrote:
>
>> I am +1 for this change.
>>
>> I agree that users will accommodate the syntax sooner or later.
>>
>> On Fri., 13 Sep. 2019, 7:54 pm Jeremie du Boisberranger, <
>> jeremie.du-boisberran...@inria.fr> wrote:
>>
>>> I don't know what the policy is for a scikit-learn 1.0 w.r.t. API changes.
>>>
>>> If it's meant to be a special release allowing API changes without
>>> deprecation cycles, I think this change is a good candidate for 1.0.
>>>
>>>
>>> Otherwise I'm +1 and agree with Guillaume, people will get used to it by
>>> using it.
>>>
>>> Jérémie
>>>
>>>
>>>
>>> On 12/09/2019 10:06, Guillaume Lemaître wrote:
>>>
>>> To the question: do we want to utilise Python 3's force-keyword-argument
>>> syntax and to change existing APIs which support arguments positionally
>>> to use this syntax, via a deprecation period?
>>>
>>> I am +1.
>>>
>>> IMO, even if the syntax might be unfamiliar, it will remain unfamiliar
>>> until projects in the ecosystem start using it.
>>>
>>> To the question: which methods should be impacted?
>>>
>>> I think we should be as gentle as possible at first. I am a little
>>> concerned about breaking some code that was working fine before.
>>>
>>> On Thu, 12 Sep 2019 at 04:43, Joel Nothman 
>>> wrote:
>>>
 There are details of specific API changes still to be decided.

 The question being put, as per the SLEP, is:
 do we want to utilise Python 3's force-keyword-argument syntax
 and to change existing APIs which support arguments positionally to use
 this syntax, via a deprecation period?

>>>
>>>
>>> --
>>> Guillaume Lemaitre
>>> INRIA Saclay - Parietal team
>>> Center for Data Science Paris-Saclay
>>> https://glemaitre.github.io/
>>>


Re: [scikit-learn] Fit and predict method

2019-02-24 Thread Vlad Niculae
Hi,

The `classifier` object in your code _is_ the model. In other words, after
`fit`, the classifier object will have some new attributes (for instance
`classifier.coef_` in the case of linear models), which are used to make
predictions when you call `predict`.
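
A minimal sketch (LogisticRegression is an arbitrary choice here):

    from sklearn.linear_model import LogisticRegression

    X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    y = [0, 0, 1, 1]

    classifier = LogisticRegression()
    classifier.fit(X, y)          # learned state is stored on the object
    print(classifier.coef_)       # the fitted coefficients now exist
    print(classifier.predict([[1.6, 1.6]]))  # uses the stored attributes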

Hope this helps,
Vlad

On Sun, Feb 24, 2019, 05:34 Venkataraman B 
wrote:

> Hi, I had a question on the predict and fit methods.
>
> The fit method is used to build the model, i.e. classifier.fit(X, y). But
> when the predict method is called, the model that was built is never
> passed. You only pass the test set. So what model does the predict
> function use to predict the output?
>
> I am picking up Python after working in R, and the predict function in R
> made more sense because the model that was built is passed along with the
> test set that has to be predicted.
>
> Any response would be greatly appreciated
> --
> Regards, Venkataraman B


Re: [scikit-learn] VOTE: scikit-learn governance document

2019-02-10 Thread Vlad Niculae
+1

Thank you for the effort to formalize this!

Best,
Vlad

On Mon, Feb 11, 2019, 02:47 Noel Dawe  wrote:

> Hi Andy,
>
> +1 from me as well :)
>
> On Sun, Feb 10, 2019 at 8:54 PM Jacob Schreiber 
> wrote:
>
>> +1 from me as well. Thanks for putting in the time to write this all out.
>>
>> On Sun, Feb 10, 2019 at 4:54 PM Hanmin Qin 
>> wrote:
>>
>>> +1 (personally I still think it's better to keep the flow chart, it
>>> seems useful for beginners)
>>>
>>> Hanmin Qin
>>>
>>> - Original Message -
>>> From: Alexandre Gramfort 
>>> To: Scikit-learn mailing list 
>>> Subject: Re: [scikit-learn] VOTE: scikit-learn governance document
>>> Date: 2019-02-11 01:29
>>>
>>> +1 for me too
>>>
>>> Alex
>>>
>>>
>>> On Sat, Feb 9, 2019 at 10:06 PM Gilles Louppe 
>>> wrote:
>>>
>>> Hi Andy,
>>>
>>> I read through the document. Even though I have not been really active
>>> these past months/years, I think it summarizes well our governance
>>> model.
>>>
>>> +1.
>>>
>>> Gilles
>>>
>>> On Sat, 9 Feb 2019 at 12:01, Adrin  wrote:
>>> >
>>> > +1
>>> >
>>> > Thanks for the work you've put in it!
>>> >
>>> > On Sat, Feb 9, 2019, 03:00 Andreas Mueller  wrote:
>>> >>
>>> >> Hey all.
>>> >>
>>> >> I want to call a vote on the final version on the scikit-learn
>>> >> governance document, which can be found in this PR:
>>> >>
>>> >> https://github.com/scikit-learn/scikit-learn/pull/12878
>>> >>
>>> >> It underwent some significant changes in the last couple of weeks.
>>> >>
>>> >> The two-sentence summary is: conflicts are resolved by vote among core
>>> >> devs, with a technical committee resolving anything that can not be
>>> >> decided by at least a 2/3 majority. The initial technical committee is
>>> >> Alexandre Gramfort, Olivier Grisel, Joel Nothman, Hanmin Qin, Gaël
>>> >> Varoquaux and myself (Andreas Müller).
>>> >>
>>> >> I would ask all of the *core developers* to either vote +1 for the
>>> >> governance doc, -1 against it, or to explicitly abstain here on the
>>> >> public mailing list (which is the way any vote will be conducted
>>> >> according to the new governance document).
>>> >>
>>> >> I suggest we leave the vote open for two weeks, so that the decision is
>>> >> made before the sprint and we can take action.
>>> >>
>>> >> Anyone can still comment on the PR or here, though I would rather not
>>> >> make more changes as this has already been discussed to some length.
>>> >>
>>> >> Thank you for participating,
>>> >>
>>> >> Andy
>>> >>


Re: [scikit-learn] New core dev: Joris Van den Bossche

2018-06-23 Thread Vlad Niculae
Congratulations Joris, very well deserved!

Vlad

On Sat, Jun 23, 2018, 11:15 Sebastian Raschka 
wrote:

> That's great news! I am glad to hear that you joined the project, Joris
> Van den Bossche!  I am a scikit-learn user (and sometimes contributor) and
> really appreciate all the time and effort that the core developers and
> contributors spend on maintaining and extending the library.
>
> Best regards,
> Sebastian
>
>
> > On Jun 23, 2018, at 6:42 AM, Olivier Grisel 
> wrote:
> >
> > Hi everyone!
> >
> > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a
> scikit-learn core developer!
> >
> > Joris is one of the maintainers of the pandas project and recently
> contributed many new great PRs to scikit-learn (notably the
> ColumnTransformer and a refactoring of the categorical variable
> preprocessing tools).
> >
> > Cheers!
> >
> > --
> > Olivier


Re: [scikit-learn] Contribution

2017-07-10 Thread Vlad Niculae
On Mon, Jul 10, 2017 at 04:10:09PM +, federico vaggi wrote:
> There is a fantastic library called lightning where the optimization
> routines are first class citizens:
> http://contrib.scikit-learn.org/lightning/ - you can take a look there.
> However, lightning focuses on convex optimization, so most algorithms have
> provable convergence rates.

Hi,

I fully agree that lightning is fantastic :) but it might not be what Gürhan
wants.

It's true that lightning's API is designed around optimizers rather
than around models. So where in scikit-learn we usually have, e.g.,

  LogisticRegression(solver='sag')

in lightning you would have

  SAGClassifier(loss='log')

to achieve something close. But neither library has the OO-style
separation between freeform models and optimizers that you might
find in deep learning frameworks.  So, for instance, it's relatively
easy to add a new loss function to the lightning SAGClassifier, but
you would still only be able to use it with a linear model.

This is by design in both scikit-learn and lightning, at least at the
moment: by making these kinds of assumptions about the models,
implementations can be much more efficient in terms of computation and
storage, especially when sparse data is involved.
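
To make the contrast concrete, here is a sketch of the two styles side
by side (assuming lightning is installed; exact constructor arguments
may vary across versions, so treat this as illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from lightning.classification import SAGClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    # scikit-learn: the model is the object, the optimizer is a parameter
    clf_model = LogisticRegression(solver='sag').fit(X, y)

    # lightning: the optimizer is the object, the loss is a parameter
    clf_optim = SAGClassifier(loss='log').fit(X, y)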

Yours,
Vlad

> 
> Good luck!
> 
> On Mon, 10 Jul 2017 at 09:05 Jacob Schreiber 
> wrote:
> 
> > Howdy
> >
> > This question and the one right after in the FAQ are probably relevant re:
> > inclusion of new algorithms:
> > http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms.
> > The gist is that we only include well-established algorithms, and there is
> > no end to those. I think it is unlikely that a PR will get merged with a
> > cutting-edge new algorithm, as the scope of scikit-learn isn't necessarily
> > "the latest" as opposed to "the classics." You may also consider writing a
> > scikit-contrib package that basically creates what you're interested in in
> > scikit-learn format, but external to the project. We'd be more than happy
> > to link to it. If the algorithm becomes a smashing success over time, we'd
> > reconsider adding it to the main code base.
> >
> > As to your first question, you should check out how the current optimizers
> > are written for the algorithm you're interested in. I don't think there's a
> > plug and play way to drop in your own optimizer like many deep learning
> > packages support, unfortunately. You'd probably have to modify the code
> > directly to support your own.
> >
> > Let me know if you have any other questions.
> >
> > Jacob
> >
> > On Mon, Jul 10, 2017 at 7:58 AM, Gürhan Ceylan 
> > wrote:
> >
> >> Hi everyone,
> >>
> >> I am wondering: how can I use external optimization algorithms with
> >> scikit-learn, for instance with neural networks, instead of the built-in
> >> algorithms (Stochastic Gradient Descent, Adam, or L-BFGS)?
> >>
> >> Furthermore, I want to introduce a new unconstrained optimization
> >> algorithm to scikit-learn; the implementation of the algorithm and the
> >> related paper can be found here.
> >>
> >> I couldn't find any explanation about this situation. Do you have a
> >> defined procedure for making such kinds of contributions? If not, how
> >> should I start to make such a proposal/contribution?
> >>
> >>
> >> Kind regards,
> >>
> >> Gürhan C.
> >>
> >>


Re: [scikit-learn] OMP ended prematurely due to linear dependence in the dictionary

2017-02-14 Thread Vlad Niculae
Hi Ben,

This actually sounds like a bug in this case! At a glance, the code
should use the correct BLAS calls for the data type you provide. Can
you reproduce this with a simple small example that gets different
results if the data is 32 vs 64 bit? Would you mind filing an issue?
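
Something along these lines would do (a sketch with arbitrary shapes and
seed; the dictionary here is just random data standing in for yours):

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.RandomState(0)
    D = rng.randn(128, 500)   # stand-in for the learned dictionary
    y = rng.randn(128)

    for dtype in (np.float32, np.float64):
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5)
        omp.fit(D.astype(dtype), y.astype(dtype))
        print(dtype.__name__, np.flatnonzero(omp.coef_))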

Thanks,
Vlad


On Tue, Feb 14, 2017 at 8:19 PM, Benjamin Merkt
<benjamin.me...@bcf.uni-freiburg.de> wrote:
> OK, the issue is resolved. My dictionary was still in 32-bit float from
> saving. When I convert it to 64-bit float before calling fit, it works fine.
>
> Sorry to bother.
>
>
>
> On 14.02.2017 11:00, Benjamin Merkt wrote:
>>
>> Hi,
>>
>> I tried that with no effect. The fit still breaks after two iterations.
>>
>> If I set precompute=True I get three coefficients instead of only two.
>> My dictionary is fairly large (currently 128x42000). Is it even feasible
>> to use OMP with such a big matrix (even with ~120 GB RAM)?
>>
>> -Ben
>>
>>
>>
>> On 13.02.2017 23:31, Vlad Niculae wrote:
>>>
>>> Hi,
>>>
>>> Are the columns of your matrix normalized? Try setting `normalize=True`.
>>>
>>> Yours,
>>> Vlad
>>>
>>> On Mon, Feb 13, 2017 at 6:55 PM, Benjamin Merkt
>>> <benjamin.me...@bcf.uni-freiburg.de> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm using OrthogonalMatchingPursuit to get a sparse coding of a
>>>> signal using
>>>> a dictionary learned by a KSVD algorithm (pyksvd). However, during
>>>> the fit I
>>>> get the following RuntimeWarning:
>>>>
>>>> /usr/local/lib/python2.7/dist-packages/sklearn/linear_model/omp.py:391:
>>>> RuntimeWarning:  Orthogonal matching pursuit ended prematurely due to
>>>> linear
>>>> dependence in the dictionary. The requested precision might not have
>>>> been
>>>> met.
>>>>
>>>>   copy_X=copy_X, return_path=return_path)
>>>>
>>>> In those cases the results are indeed not satisfactory. I don't get the
>>>> point of this warning as it is common in sparse coding to have an
>>>> overcomplete dictionary and thus also linear dependency within it. That
>>>> should not be an issue for OMP. In fact, the warning is also raised
>>>> if the
>>>> dictionary is a square matrix.
>>>>
>>>> Might this Warning also point to other issues in the application?
>>>>
>>>>
>>>> Thanks, Ben
>>>>


Re: [scikit-learn] OMP ended prematurely due to linear dependence in the dictionary

2017-02-13 Thread Vlad Niculae
Hi,

Are the columns of your matrix normalized? Try setting `normalize=True`.

Yours,
Vlad

On Mon, Feb 13, 2017 at 6:55 PM, Benjamin Merkt
 wrote:
> Hi everyone,
>
> I'm using OrthogonalMatchingPursuit to get a sparse coding of a signal using
> a dictionary learned by a KSVD algorithm (pyksvd). However, during the fit I
> get the following RuntimeWarning:
>
> /usr/local/lib/python2.7/dist-packages/sklearn/linear_model/omp.py:391:
> RuntimeWarning:  Orthogonal matching pursuit ended prematurely due to linear
> dependence in the dictionary. The requested precision might not have been
> met.
>
>   copy_X=copy_X, return_path=return_path)
>
> In those cases the results are indeed not satisfactory. I don't get the
> point of this warning as it is common in sparse coding to have an
> overcomplete dictionary and thus also linear dependency within it. That
> should not be an issue for OMP. In fact, the warning is also raised if the
> dictionary is a square matrix.
>
> Might this Warning also point to other issues in the application?
>
>
> Thanks, Ben
>


Re: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs?

2016-12-13 Thread Vlad Niculae
I agree; if you're not actually doing daisy-chaining, the stateful and
more concise form `clf.fit(X, y)` looks more pythonic in my opinion.

Also it seems that the "fit returns self" convention is not documented
here [1]; maybe we should briefly mention it?

[1] http://scikit-learn.org/stable/tutorial/basic/tutorial.html
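
A quick check of the convention (DecisionTreeClassifier is an arbitrary
choice; per the discussion above, any estimator should behave the same):

    from sklearn.tree import DecisionTreeClassifier

    X, y = [[0, 0], [1, 1]], [0, 1]
    clf = DecisionTreeClassifier()
    assert clf.fit(X, y) is clf                   # fit returns self...
    preds = DecisionTreeClassifier().fit(X, y).predict(X)  # ...so calls chain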

On Tue, Dec 13, 2016 at 3:45 PM, Andreas Mueller <t3k...@gmail.com> wrote:
>
>
> On 12/13/2016 03:38 PM, Vlad Niculae wrote:
>>
>> It is part of the API and enforced with tests, if I'm not mistaken. So you
>> could use either form with all sklearn estimators.
>
>
> It is indeed enforced.
> Though I feel clf = clf.fit(X, y)
> is somewhat ugly and I would rather not have it in the docs.
> Also, this example uses a capital Y, so two reasons to change it ;)
>


Re: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs?

2016-12-13 Thread Vlad Niculae
It is part of the API and enforced with tests, if I'm not mistaken. So you 
could use either form with all sklearn estimators.

Vlad

On December 13, 2016 3:33:48 PM EST, Stuart Reynolds 
 wrote:
>I think he's asking whether returning the model is part of the API
>(i.e. is
>it a bug that SVM and NB don't return self?).
>
>On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber
>
>wrote:
>
>> The fit method returns the object itself, so regardless of which way
>you
>> do it, it will work. The reason the fit method returns itself is so
>that
>> you can chain methods, like "preds = clf.fit(X, y).predict(X)"
>>
>> On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie <
>> graham.arthur.macken...@gmail.com> wrote:
>>
>>> Hello All,
>>>
>>> I hope this is the right way to ask a question about documentation.
>>>
>>> In the doc for Decision Trees
>>> , the fit
>>> statement is assigned back to the classifier:
>>>
>>> clf = clf.fit(X, Y)
>>>
>>> Whereas, for Naive Bayes and Support Vector Machines, it's just:
>>>
>>> clf.fit(X, Y)
>>>
>>> I assumed this was a typo, but thought I should try and verify such
>>> before proceeding under that assumption. I appreciate any feedback
>you can
>>> provide.
>>>
>>> Thank You and Be Well,
>>> Graham
>>>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [scikit-learn] random forests using grouped data

2016-12-01 Thread Vlad Niculae
I don't think there are any such estimators in scikit-learn directly,
but the model selection machinery is there to help.  Check out
GroupKFold [1] so you can do cross-validation after concatenating all
the samples, while ensuring that training and validation groups are
separate.
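
A minimal sketch of that setup, with toy data standing in for the real
groups (the two scores are the features; sizes chosen arbitrarily):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.RandomState(0)
    X = rng.rand(240, 2)                    # score1, score2 per sample
    y = rng.randint(0, 2, size=240)         # 1 = active, 0 = inactive
    groups = np.repeat(np.arange(24), 10)   # group id per sample

    cv = GroupKFold(n_splits=5)
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(forest, X, y, groups=groups, cv=cv).mean())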

The setup of this problem looks a lot like query results reranking in
information retrieval, where you need to find relevant and
non-relevant results among the set of retrieved docs for each search
query. A simple approach you can build using scikit-learn tools is
RankSVM, where you take, within each group, all possible pairs between
a positive and a negative sample, and take the difference of their
features as your input. This is the same as optimizing within-group
AUC. Unfortunately the trick doesn't work in the same way for
nonlinear models, but it's another baseline you could try. Fabian had
an example of this, with some VERY enlightening illustrations, here
[2].
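
Here is a sketch of that pairwise transform (my own illustration, not
Fabian's code): within each group, emit the feature difference of every
(positive, negative) pair, then train any linear classifier on it:

    import numpy as np
    from itertools import product

    def pairwise_transform(X, y, groups):
        # for each group, one example per (positive, negative) pair
        diffs, signs = [], []
        for g in np.unique(groups):
            idx = np.flatnonzero(groups == g)
            pos, neg = idx[y[idx] == 1], idx[y[idx] == 0]
            for i, j in product(pos, neg):
                diffs.append(X[i] - X[j]); signs.append(1)
                diffs.append(X[j] - X[i]); signs.append(-1)  # balance classes
        return np.asarray(diffs), np.asarray(signs)

    # e.g.: LinearSVC(fit_intercept=False).fit(*pairwise_transform(X, y, groups))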

HTH,
Vlad

[1] 
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html
[2] 
https://github.com/fabianp/minirank/blob/master/notebooks/pairwise_transform.ipynb

On Thu, Dec 1, 2016 at 8:16 AM, Brown J.B.  wrote:
> Hello Thomas,
>
> I don't personally know of any algorithm that works on collections of
> groupings, but why not first test a simple control model, meaning
> can you achieve a satisfactory model by simply concatenating all 48 scores
> per sample and building a forest the standard way?
> If not, what context or reasons dictate that the groupings need to stay
> retained as you have presented them?
>
> Hope this helps,
> J.B.
>
> 2016-12-01 22:05 GMT+09:00 Thomas Evangelidis :
>>
>> Sorry, the previous email was incomplete. Below is how the grouped data
>> look like:
>>
>>
>> Group1:
>> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
>> score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0,14, ...]
>> y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"
>>
>> Group2:
>> score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...]
>> score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0,1, ...]
>> y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"
>>
>> ..
>> Group24:
>> score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...]
>> score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0,22, ...]
>> y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"
>>
>>
>> On 1 December 2016 at 14:01, Thomas Evangelidis  wrote:
>>>
>>> Greetings
>>>
>>> I have grouped data which are divided into actives and inactives. The
>>> features are two different types of normalized scores (0-1), where the
>>> higher the score the most probable is an observation to be an "active". My
>>> data look like this:
>>>
>>>
>>> Group1:
>>> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
>>> score2 = [
>>> y=[1,1,1,0,0,0, ...]
>>>
>>> Group2:
>>> score1 = [0
>>> score2 = [
>>> y=[1,1,1,1,1]
>>>
>>> ..
>>> Group24:
>>> score1 = [0
>>> score2 = [
>>> y=[1,1,1,1,1]
>>>
>>>
>>> I searched in the documentation about treatment of grouped data, but the
>>> only thing I found was how do do cross-validation. My question is whether
>>> there is any special algorithm that creates random forests from these type
>>> of grouped data.
>>>
>>> thanks in advance
>>> Thomas
>>>
>>>
>>>
>>> --
>>>
>>> ==
>>>
>>> Thomas Evangelidis
>>>
>>> Research Specialist
>>>
>>> CEITEC - Central European Institute of Technology
>>> Masaryk University
>>> Kamenice 5/A35/1S081,
>>> 62500 Brno, Czech Republic
>>>
>>> email: tev...@pharm.uoa.gr
>>>
>>>   teva...@gmail.com
>>>
>>>
>>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>>
>>>
>>
>>
>>
>> --
>>
>> ==
>>
>> Thomas Evangelidis
>>
>> Research Specialist
>>
>> CEITEC - Central European Institute of Technology
>> Masaryk University
>> Kamenice 5/A35/1S081,
>> 62500 Brno, Czech Republic
>>
>> email: tev...@pharm.uoa.gr
>>
>>   teva...@gmail.com
>>
>>
>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>
>>
>>


Re: [scikit-learn] Bm25

2016-07-01 Thread Vlad Niculae
For the first question, look up the possible ways to construct 
scipy.sparse.csr_matrix objects; one of them will take (data, indices, indptr). 
Just pass a new array for data, and take the latter two from X.

For the second question, you can just do the elementwise operation in place on 
the data array, since they have the same shape in this case.

You can try playing around with these operations in a notebook and benchmarking 
them with %timeit/%memit, to see how to best organize them. I find such 
exercises very rewarding.
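
A small sketch of what I mean (toy matrix, arbitrary k):

    import numpy as np
    import scipy.sparse as sp

    X = sp.csr_matrix(np.array([[1., 0., 2.],
                                [0., 3., 0.]]))
    k = 1.5

    # same sparsity pattern as X, but with data = k everywhere
    K = sp.csr_matrix((np.full_like(X.data, k), X.indices, X.indptr),
                      shape=X.shape)

    # elementwise on the aligned .data arrays (same pattern, same order)
    result = X.copy()
    result.data = X.data * (k + 1) / (K.data + X.data)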

Cheers,
Vlad

On July 1, 2016 6:47:40 PM EDT, Basil Beirouti <basilbeiro...@gmail.com> wrote:
>Oh yes that's exactly what I was looking for. So how do I initialize an
>array with the same sparsity pattern as X? And then how do I do an
>element wise divide of the numerator over the denominator, when both
>are sparse matrices? Like you said it should only do this operation on
>the non zero elements of the numerator.
>
>Sent from my iPhone
>
>> On Jul 1, 2016, at 5:36 PM, Vlad Niculae <zephy...@gmail.com> wrote:
>> 
>> In the denominator you mean? It looks like you only need to add that
>to nonzero elements, since the others would all have a 0 in the
>numerator, right? So the final value would be zero there. Or am I
>missing something?
>> 
>> You can initialize an array with the same sparsity pattern as X, but
>its data is k everywhere. Then use inplace_row_scale to multiply it by
>B, then add this to X to get the denominator.
>> 
>>> On July 1, 2016 6:27:41 PM EDT, Basil Beirouti
><basilbeiro...@gmail.com> wrote:
>>> Hi Vlad,
>>> 
>>> Thanks for the quick reply. Unfortunately there's still the question
>of adding a scalar to every element in sparse matrix, which is not
>allowed for sparse matrices, and which is not possible to avoid in the
>equation.
>>> 
>>> Sincerely,
>>> Basil Beirouti 
>>> 
>>> 
>>>>  On Jul 1, 2016, at 4:36 PM, scikit-learn-requ...@python.org wrote:
>>>>  
>>>>  
>>>>  
>>>>  Today's Topics:
>>>>  
>>>>1. Adding BM25 to scikit-learn.feature_extraction.text
>>>>   (Basil Beirouti)
>>>>2. Re: Adding BM25 to scikit-learn.feature_extraction.text
>>>>   (Vlad Niculae)
>>>>  
>>>>  
>>>> 
>>>>  
>>>>  Message: 1
>>>>  Date: Fri, 1 Jul 2016 16:17:43 -0500
>>>>  From: Basil Beirouti <basilbeiro...@gmail.com>
>>>>  To: scikit-learn@python.org
>>>>  Subject: [scikit-learn] Adding BM25 to
>>>> scikit-learn.feature_extraction.text
>>>>  Message-ID:
>>>>
><cab4mtg8805nndaja5cscf+phrjyq0btc-agzegd8cqb95sv...@mail.gmail.com>
>>>>  Content-Type: text/plain; charset="utf-8"
>>>>  
>>>>  Hi everyone,
>>>>  
>>>>  to put it succinctly, here's the BM25 equation:
>>>>  
>>>>  f(w,D) * (k+1) / (k*B + f(w,D))
>>>>  
>>>>  where w is the word, and D is the
>>>> document (corresponding to rows and
>>>>  columns, respectively). f is a sparse matrix because only a
>fraction of the
>>>>  whole vocabulary of words appears in any given single document.
>>>>  
>>>>  B is a function of only the document, but it doesn't matter, you
>can think
>>>>  of it as a constant if you want.
>>>>  
>>>>  The problem is since f(w,D) is almost always zero, I only need to
>do the
>>>>  calculation (ie. multiply by (k+1) then divide by (k*B + f(w,D)))
>when
>>>>  f(w,D) is not zero. Is there a clever way to do this with masks?
>>>>  
>>>>  You can refactor the above equation to get this:
>>>>  
>>>>  (k+1)/(k*B/f(w,D) + 1) but alas we still have f(w,D) appearing in
>a
>>>>  denominator, which 


Re: [scikit-learn] Bm25

2016-07-01 Thread Vlad Niculae
In the denominator you mean? It looks like you only need to add that to nonzero 
elements, since the others would all have a 0 in the numerator, right? So the 
final value would be zero there. Or am I missing something?

You can initialize an array with the same sparsity pattern as X, but its data 
is k everywhere. Then use inplace_row_scale to multiply it by B, then add this 
to X to get the denominator.
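
Putting it together, a minimal sketch (this version scales by B via
direct indexing on the CSR arrays, which achieves the same per-document
scaling; it assumes words as rows and documents as columns, as in the
original message):

    import numpy as np
    import scipy.sparse as sp

    def bm25_weight(X, B, k=1.5):
        # X: csr_matrix of counts f(w, D); B: array of per-document factors
        X = sp.csr_matrix(X, dtype=np.float64)
        # X.indices holds the column (document) of each stored value, so
        # B[X.indices] lines up k*B with the nonzero entries of f
        denom = k * B[X.indices] + X.data
        out = X.copy()
        out.data = X.data * (k + 1) / denom
        return out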

On July 1, 2016 6:27:41 PM EDT, Basil Beirouti <basilbeiro...@gmail.com> wrote:
>Hi Vlad,
>
>Thanks for the quick reply. Unfortunately there's still the question of
>adding a scalar to every element in sparse matrix, which is not allowed
>for sparse matrices, and which is not possible to avoid in the
>equation.
>
>Sincerely,
>Basil Beirouti 
>
>
>> On Jul 1, 2016, at 4:36 PM, scikit-learn-requ...@python.org wrote:
>> 
>> 
>> 
>> Today's Topics:
>> 
>>   1. Adding BM25 to scikit-learn.feature_extraction.text
>>  (Basil Beirouti)
>>   2. Re: Adding BM25 to scikit-learn.feature_extraction.text
>>  (Vlad Niculae)
>> 
>> 
>>
>> 
>> Message: 1
>> Date: Fri, 1 Jul 2016 16:17:43 -0500
>> From: Basil Beirouti <basilbeiro...@gmail.com>
>> To: scikit-learn@python.org
>> Subject: [scikit-learn] Adding BM25 to
>>scikit-learn.feature_extraction.text
>> Message-ID:
>>   
><cab4mtg8805nndaja5cscf+phrjyq0btc-agzegd8cqb95sv...@mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>> 
>> Hi everyone,
>> 
>> to put it succinctly, here's the BM25 equation:
>> 
>> f(w,D) * (k+1) / (k*B + f(w,D))
>> 
>> where w is the word, and D is the document (corresponding to rows and
>> columns, respectively). f is a sparse matrix because only a fraction
>of the
>> whole vocabulary of words appears in any given single document.
>> 
>> B is a function of only the document, but it doesn't matter, you can
>think
>> of it as a constant if you want.
>> 
>> The problem is since f(w,D) is almost always zero, I only need to do
>the
>> calculation (ie. multiply by (k+1) then divide by (k*B + f(w,D)))
>when
>> f(w,D) is not zero. Is there a clever way to do this with masks?
>> 
>> You can refactor the above equation to get this:
>> 
>> (k+1)/(k*B/f(w,D) + 1) but alas we still have f(w,D) appearing in a
>> denominator, which is bad (because of dividing by zero).
>> 
>> So anyway, currently I am converting to a coo_matrix and iterating through
>through
>> the non-zero values like this:
>> 
>>cx = x.tocoo()
>>for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
>>(i,j,v)
>> 
>> 
>> That iterator is incredibly fast, but unfortunately coo_matrix does
>> not support assignment. So I create a new copy of either a dok sparse
>> matrix or a regular numpy array and assign to that.
>> 
>> I could also deal directly with the .data, .indptr, and indices
>> attributes of csr_matrix, and see if it's possible to create a copy
>of
>> .data attribute and update the values accordingly. I was hoping
>> somebody had encountered this type of issue before.
>> 
>> Sincerely,
>> 
>> Basil Beirouti
>> 
>> --
>> 
>> Message: 2
>> Date: Fri, 01 Jul 2016 17:35:49 -0400
>> From: Vlad Niculae <zephy...@gmail.com>
>> To: Scikit-learn user and developer mailing list
>><scikit-learn@python.org>
>> Subject: Re: [scikit-learn] Adding BM25 to
>>scikit-learn.feature_extraction.text
>> Message-ID: <d4036481-5ac4-44a6-810b-f34733955...@gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>> 
>> Hi Basil,
>> 
>> If B were just a constant, you cou