Re: [scikit-learn] random forests using grouped data
I don't think there are any such estimators in scikit-learn directly, but the model selection machinery is there to help. Check out GroupKFold [1]: you can do cross-validation after concatenating all the samples, while ensuring that training and validation groups stay separate.

The setup of this problem looks a lot like reranking query results in information retrieval, where you need to separate relevant from non-relevant results within the set of retrieved docs for each search query. A simple approach you can build with scikit-learn tools is RankSVM: within each group, take all possible pairs of one positive and one negative sample, and use the difference of their features as your input. This is equivalent to optimizing within-group AUC. Unfortunately the trick doesn't carry over directly to nonlinear models, but it's another baseline you could try. Fabian has an example of this, with some very enlightening illustrations, here [2].

HTH,
Vlad

[1] http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html
[2] https://github.com/fabianp/minirank/blob/master/notebooks/pairwise_transform.ipynb

On Thu, Dec 1, 2016 at 8:16 AM, Brown J.B. wrote:
> Hello Thomas,
> [...]

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
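The pairwise trick described above can be sketched with plain NumPy; everything here (the function name, the tiny two-feature toy group) is illustrative, not code from the thread:

```python
import numpy as np

def pairwise_transform(X, y):
    """Within one group, form the feature difference of every
    (positive, negative) pair; emit both orderings so the labels
    are balanced between +1 and -1."""
    pos, neg = X[y == 1], X[y == 0]
    diffs, labels = [], []
    for p in pos:
        for n in neg:
            diffs.append(p - n)
            labels.append(1)
            diffs.append(n - p)
            labels.append(-1)
    return np.array(diffs), np.array(labels)

# Toy group: rows are samples, columns are (score1, score2).
X = np.array([[0.56, 0.34],
              [0.34, 0.27],
              [0.12, 0.05],
              [0.08, 0.13]])
y = np.array([1, 1, 0, 0])

Xp, yp = pairwise_transform(X, y)
print(Xp.shape)  # 2 positives x 2 negatives x 2 orderings -> (8, 2)
```

Transformed pairs from all groups can then be stacked and fed to a linear classifier such as LinearSVC; the learned weight vector ranks new samples by a dot product, which is what makes this equivalent to optimizing within-group AUC for a linear model.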
Re: [scikit-learn] random forests using grouped data
Hello Thomas,

I don't personally know of an algorithm that works on collections of groupings, but why not first test a simple control model: can you achieve a satisfactory model by simply concatenating all 48 scores per sample and building a forest the standard way? If not, what context or reasons dictate that the groupings need to be retained as you have presented them?

Hope this helps,
J.B.

2016-12-01 22:05 GMT+09:00 Thomas Evangelidis:
> Sorry, the previous email was incomplete.
> [...]
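A control model along these lines, combined with the GroupKFold suggestion from the other reply, might look roughly like this; the data are synthetic stand-ins and the estimator settings are arbitrary choices, not recommendations from the thread:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)

# Pool all groups into one flat dataset: 24 groups x 6 samples,
# two features per sample (score1, score2).
n_groups, per_group = 24, 6
X = rng.rand(n_groups * per_group, 2)
y = (X.mean(axis=1) > 0.5).astype(int)              # toy labels
groups = np.repeat(np.arange(n_groups), per_group)  # row -> group id

clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = GroupKFold(n_splits=4)  # no group is split across train/validation
scores = cross_val_score(clf, X, y, cv=cv, groups=groups)
print(len(scores))  # 4
```

If this pooled baseline already performs acceptably, the group structure may only be needed at evaluation time (via `groups`), not inside the estimator itself.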
Re: [scikit-learn] random forests using grouped data
Sorry, the previous email was incomplete. Below is how the grouped data look:

Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0.14, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

Group2:
score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...]
score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0.1, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

...
Group24:
score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...]
score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0.22, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

On 1 December 2016 at 14:01, Thomas Evangelidis wrote:
> Greetings,
>
> I have grouped data which are divided into actives and inactives. The features are two different types of normalized scores (0-1), where the higher the score, the more likely an observation is to be an "active". My data look like this:
>
> Group1:
> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
> score2 = [
> y = [1, 1, 1, 0, 0, 0, ...]
>
> Group2:
> score1 = [0
> score2 = [
> y = [1, 1, 1, 1, 1]
>
> ...
> Group24:
> score1 = [0
> score2 = [
> y = [1, 1, 1, 1, 1]
>
> I searched the documentation for the treatment of grouped data, but the only thing I found was how to do cross-validation. My question is whether there is any special algorithm that builds random forests from this type of grouped data.
>
> Thanks in advance,
> Thomas

--
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic
email: tev...@pharm.uoa.gr, teva...@gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
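For completeness: scikit-learn estimators take no nested group structure directly, so data like the above are usually flattened into plain X/y arrays plus a parallel `groups` vector. A minimal sketch using the Group1 and Group2 numbers from the thread (only the values shown there; the dict layout is an assumption):

```python
import numpy as np

# Flatten per-group (score1, score2, y) lists into one X/y pair;
# `groups` records which group each row came from.
data = {
    1: ([0.56, 0.34, 0.42, 0.12, 0.08, 0.21],
        [0.34, 0.27, 0.24, 0.05, 0.13, 0.14],
        [1, 1, 1, 0, 0, 0]),
    2: ([0.34, 0.38, 0.48, 0.18, 0.12, 0.19],
        [0.28, 0.41, 0.34, 0.13, 0.09, 0.1],
        [1, 1, 1, 0, 0, 0]),
}

X = np.column_stack([np.concatenate([d[0] for d in data.values()]),
                     np.concatenate([d[1] for d in data.values()])])
y = np.concatenate([d[2] for d in data.values()])
groups = np.concatenate([[g] * len(d[2]) for g, d in data.items()])

print(X.shape)  # (12, 2)
```

The resulting `groups` array is exactly what GroupKFold's `split` (or `cross_val_score`'s `groups` argument) expects.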