Re: [scikit-learn] random forests using grouped data
I don't think there are any such estimators in scikit-learn directly, but the model selection machinery is there to help. Check out GroupKFold [1]: you can do cross-validation after concatenating all the samples, while ensuring that training and validation groups stay separate.

The setup of this problem looks a lot like reranking query results in information retrieval, where you need to separate relevant from non-relevant results within the set of retrieved docs for each search query. A simple approach you can build with scikit-learn tools is RankSVM: within each group, take all possible pairs of one positive and one negative sample, and use the difference of their features as your input. This is equivalent to optimizing within-group AUC. Unfortunately the trick doesn't carry over directly to nonlinear models, but it's another baseline you could try. Fabian has an example of this, with some very enlightening illustrations, here [2].

HTH,
Vlad

[1] http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html
[2] https://github.com/fabianp/minirank/blob/master/notebooks/pairwise_transform.ipynb

On Thu, Dec 1, 2016 at 8:16 AM, Brown J.B. wrote:
> Hello Thomas,
> [...]

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
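The pairwise trick described above can be sketched with plain NumPy; everything here (the function name, the tiny two-feature toy group) is illustrative, not code from the thread:

```python
import numpy as np

def pairwise_transform(X, y):
    """Within one group, form the feature difference of every
    (positive, negative) pair; emit both orderings so the labels
    are balanced between +1 and -1."""
    pos, neg = X[y == 1], X[y == 0]
    diffs, labels = [], []
    for p in pos:
        for n in neg:
            diffs.append(p - n)
            labels.append(1)
            diffs.append(n - p)
            labels.append(-1)
    return np.array(diffs), np.array(labels)

# Toy group: rows are samples, columns are (score1, score2).
X = np.array([[0.56, 0.34],
              [0.34, 0.27],
              [0.12, 0.05],
              [0.08, 0.13]])
y = np.array([1, 1, 0, 0])

Xp, yp = pairwise_transform(X, y)
print(Xp.shape)  # 2 positives x 2 negatives x 2 orderings -> (8, 2)
```

Transformed pairs from all groups can then be stacked and fed to a linear classifier such as LinearSVC; the learned weight vector ranks new samples by a dot product, which is what makes this equivalent to optimizing within-group AUC for a linear model.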
Re: [scikit-learn] random forests using grouped data
Hello Thomas,

I don't personally know of an algorithm that works on collections of groupings, but why not first test a simple control model: can you achieve a satisfactory model by simply concatenating all 48 scores per sample and building a forest the standard way? If not, what context or reasons dictate that the groupings need to be retained as you have presented them?

Hope this helps,
J.B.

2016-12-01 22:05 GMT+09:00 Thomas Evangelidis:
> Sorry, the previous email was incomplete.
> [...]
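A control model along these lines, combined with the GroupKFold suggestion from the other reply, might look roughly like this; the data are synthetic stand-ins and the estimator settings are arbitrary choices, not recommendations from the thread:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)

# Pool all groups into one flat dataset: 24 groups x 6 samples,
# two features per sample (score1, score2).
n_groups, per_group = 24, 6
X = rng.rand(n_groups * per_group, 2)
y = (X.mean(axis=1) > 0.5).astype(int)              # toy labels
groups = np.repeat(np.arange(n_groups), per_group)  # row -> group id

clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = GroupKFold(n_splits=4)  # no group is split across train/validation
scores = cross_val_score(clf, X, y, cv=cv, groups=groups)
print(len(scores))  # 4
```

If this pooled baseline already performs acceptably, the group structure may only be needed at evaluation time (via `groups`), not inside the estimator itself.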
Re: [scikit-learn] random forests using grouped data
Sorry, the previous email was incomplete. Below is how the grouped data look:

Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0.14, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

Group2:
score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...]
score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0.1, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

...
Group24:
score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...]
score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0.22, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

On 1 December 2016 at 14:01, Thomas Evangelidis wrote:
> Greetings,
>
> I have grouped data which are divided into actives and inactives. The features are two different types of normalized scores (0-1), where the higher the score, the more likely an observation is to be an "active". My data look like this:
>
> Group1:
> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
> score2 = [
> y = [1, 1, 1, 0, 0, 0, ...]
>
> Group2:
> score1 = [0
> score2 = [
> y = [1, 1, 1, 1, 1]
>
> ...
> Group24:
> score1 = [0
> score2 = [
> y = [1, 1, 1, 1, 1]
>
> I searched the documentation for the treatment of grouped data, but the only thing I found was how to do cross-validation. My question is whether there is any special algorithm that builds random forests from this type of grouped data.
>
> Thanks in advance,
> Thomas

--
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic
email: tev...@pharm.uoa.gr, teva...@gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
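For completeness: scikit-learn estimators take no nested group structure directly, so data like the above are usually flattened into plain X/y arrays plus a parallel `groups` vector. A minimal sketch using the Group1 and Group2 numbers from the thread (only the values shown there; the dict layout is an assumption):

```python
import numpy as np

# Flatten per-group (score1, score2, y) lists into one X/y pair;
# `groups` records which group each row came from.
data = {
    1: ([0.56, 0.34, 0.42, 0.12, 0.08, 0.21],
        [0.34, 0.27, 0.24, 0.05, 0.13, 0.14],
        [1, 1, 1, 0, 0, 0]),
    2: ([0.34, 0.38, 0.48, 0.18, 0.12, 0.19],
        [0.28, 0.41, 0.34, 0.13, 0.09, 0.1],
        [1, 1, 1, 0, 0, 0]),
}

X = np.column_stack([np.concatenate([d[0] for d in data.values()]),
                     np.concatenate([d[1] for d in data.values()])])
y = np.concatenate([d[2] for d in data.values()])
groups = np.concatenate([[g] * len(d[2]) for g, d in data.items()])

print(X.shape)  # (12, 2)
```

The resulting `groups` array is exactly what GroupKFold's `split` (or `cross_val_score`'s `groups` argument) expects.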