[scikit-learn] Pipegraph is on its way!

2018-02-07 Thread Manuel Castejón Limas
Dear all,

after some playing with the concept, we have developed a module that
implements the functionality of Pipeline in more general contexts, as
first introduced in a former thread
( https://mail.python.org/pipermail/scikit-learn/2018-January/002158.html ).

To extend Pipeline to workflows that are not linearly sequential, we use a
graph-like structure while keeping as much as possible of the already
known syntax we all love and honor:

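import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
# PipeGraph comes from the module announced here; its import path is not given in this message.

X = pd.DataFrame(dict(X=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
y = 2 * X
sc = MinMaxScaler()
lm = LinearRegression()
steps = [('scaler', sc),
         ('linear_model', lm)]
connections = {'scaler': dict(X='X'),
               'linear_model': dict(X=('scaler', 'predict'),
                                    y='y')}
pgraph = PipeGraph(steps=steps,
                   connections=connections,
                   use_for_fit='all',
                   use_for_predict='all')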

As you can see, the biggest difference for the end user is the dictionary
describing the connections.
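
One way to read such a connections dictionary as a graph is sketched below. This is only an illustration, not PipeGraph's implementation: it assumes that a tuple value means (upstream step, output name) and that a plain string names an external input such as 'X' or 'y'. The helper step_dependencies is a hypothetical name, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Same connections dictionary as in the example above.
connections = {
    'scaler': dict(X='X'),
    'linear_model': dict(X=('scaler', 'predict'), y='y'),
}


def step_dependencies(connections):
    """Map each step to the set of steps whose outputs it consumes."""
    deps = {}
    for step, inputs in connections.items():
        # Assumption: a tuple is (upstream_step, output_name);
        # a plain string refers to an external input and adds no edge.
        deps[step] = {src[0] for src in inputs.values() if isinstance(src, tuple)}
    return deps


deps = step_dependencies(connections)
# TopologicalSorter takes a mapping of node -> predecessors.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['scaler', 'linear_model']
```

Under these assumptions, the scaler has no upstream step and therefore runs first, and the linear model runs once the scaler's output is available.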

Another major contribution for developers wanting to extend scikit-learn is
a collection of adapters for scikit-learn models that give them a
common API irrespective of whether they originally implemented predict,
transform, or fit_predict as an atomic operation without predict. These
adapters accept any number of positional or keyword parameters in their fit
and predict methods through *pargs and **kwargs.
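
The adapter idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the adapters shipped with the module: PredictAdapter, Doubler and SignLabeler are hypothetical names, and the toy classes stand in for scikit-learn estimators so that the example has no dependencies.

```python
class PredictAdapter:
    """Hypothetical adapter exposing fit/predict whatever the wrapped API is."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, *pargs, **kwargs):
        # Estimators that only offer fit_predict are handled in predict().
        if hasattr(self.estimator, 'predict') or hasattr(self.estimator, 'transform'):
            self.estimator.fit(*pargs, **kwargs)
        return self

    def predict(self, *pargs, **kwargs):
        # Use the first method the wrapped estimator actually provides.
        for name in ('predict', 'transform', 'fit_predict'):
            method = getattr(self.estimator, name, None)
            if method is not None:
                return method(*pargs, **kwargs)
        raise AttributeError('wrapped estimator has no predict, transform or fit_predict')


class Doubler:
    """Toy transformer-like class: fit plus transform, no predict."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [2 * x for x in X]


class SignLabeler:
    """Toy clusterer-like class: only fit_predict."""

    def fit_predict(self, X, y=None):
        return [int(x >= 0) for x in X]


doubled = PredictAdapter(Doubler()).fit([1, 2]).predict([1, 2])
labels = PredictAdapter(SignLabeler()).fit([-1, 3]).predict([-1, 3])
print(doubled, labels)  # [2, 4] [0, 1]
```

Both wrapped objects can then be driven through the same fit and predict calls, which is what lets a graph treat every step uniformly.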

As general as PipeGraph is, it cannot work under the restrictions that
GridSearchCV imposes on the input parameters, namely X and y, since PipeGraph
can accept as many input signals as needed. Thus, an ad hoc GridSearchCV
version is also needed, and we will provide a basic initial version in a later
release.
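
To make the limitation concrete, here is a minimal sketch of a search loop that forwards any number of named inputs to fit rather than only X and y. It is a hand-rolled, standard-library illustration without cross-validation, not the GridSearchCV variant mentioned above; grid_search, ShiftedLine, neg_sse and the extra input offset are all hypothetical names.

```python
from itertools import product


def grid_search(make_model, param_grid, fit_inputs, score):
    """Try every parameter combination; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        model = make_model(**params)
        model.fit(**fit_inputs)  # forwards every named input, not just X and y
        current = score(model, **fit_inputs)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


class ShiftedLine:
    """Toy model: predicts slope * X + offset, where offset is a third input."""

    def __init__(self, slope=1.0):
        self.slope = slope

    def fit(self, X, y, offset):
        self.offset_ = offset
        return self

    def predict(self, X):
        return [self.slope * x + self.offset_ for x in X]


def neg_sse(model, X, y, offset):
    """Negative sum of squared errors on the same data (no held-out split)."""
    return -sum((p - t) ** 2 for p, t in zip(model.predict(X), y))


inputs = dict(X=[0, 1, 2], y=[1, 3, 5], offset=1)
best_params, best_score = grid_search(
    ShiftedLine, {'slope': [1.0, 2.0, 3.0]}, inputs, neg_sse)
print(best_params)  # {'slope': 2.0}
```

A real replacement would also have to split every input signal consistently across folds, which this sketch deliberately leaves out.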

We still need to write the documentation, and we will propose it as a
contrib-project in a few days.

Best wishes,
Manuel Castejón-Limas
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] clustering on big dataset

2018-02-07 Thread Manuel Castejón Limas
Hope this helps!

Manuel


@Article{Ciampi2008,
author="Ciampi, Antonio
and Lechevallier, Yves
and Limas, Manuel Castej{\'o}n
and Marcos, Ana Gonz{\'a}lez",
title="Hierarchical clustering of subpopulations with a dissimilarity based
on the likelihood ratio statistic: application to clustering massive data
sets",
journal="Pattern Analysis and Applications",
year="2008",
month="Jun",
day="01",
volume="11",
number="2",
pages="199--220",
abstract="The problem of clustering subpopulations on the basis of samples
is considered within a statistical framework: a distribution for the
variables is assumed for each subpopulation and the dissimilarity between
any two populations is defined as the likelihood ratio statistic which
compares the hypothesis that the two subpopulations differ in the parameter
of their distributions to the hypothesis that they do not. A general
algorithm for the construction of a hierarchical classification is
described which has the important property of not having inversions in the
dendrogram. The essential elements of the algorithm are specified for the
case of well-known distributions (normal, multinomial and Poisson) and an
outline of the general parametric case is also discussed. Several
applications are discussed, the main one being a novel approach to dealing
with massive data in the context of a two-step approach. After clustering
the data in a reasonable number of `bins' by a fast algorithm such as
k-Means, we apply a version of our algorithm to the resulting bins.
Multivariate normality for the means calculated on each bin is assumed:
this is justified by the central limit theorem and the assumption that each
bin contains a large number of units, an assumption generally justified
when dealing with truly massive data such as currently found in modern data
analysis. However, no assumption is made about the data generating
distribution.",
issn="1433-755X",
doi="10.1007/s10044-007-0088-4",
url="https://doi.org/10.1007/s10044-007-0088-4"
}





2018-01-04 12:55 GMT+01:00 Joel Nothman:

> Can you use nearest neighbors with a KD tree to build a distance matrix
> that is sparse, in that distances to all but the nearest neighbors of a
> point are (near-)infinite? Yes, this again has an additional parameter
> (neighborhood size), just as BIRCH has its threshold. I suspect you will
> not be able to improve on having another, approximating, parameter. You do
> not need to set n_clusters to a fixed value for BIRCH. You only need to
> provide another clusterer, which has its own parameters, although you
> should be able to experiment with different "global clusterers".
>
> On 4 January 2018 at 11:04, Shiheng Duan wrote:
>
>> Yes, it is an efficient method, but we still need to specify the number of
>> clusters or the threshold. Is there another way to run hierarchical
>> clustering on a big dataset? The main problem is the distance matrix.
>> Thanks.
>>
>> On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel wrote:
>>
>>> Have you had a look at BIRCH?
>>>
>>> http://scikit-learn.org/stable/modules/clustering.html#birch
>>>
>>> --
>>> Olivier


Re: [scikit-learn] Pipegraph is on its way!

2018-02-07 Thread Andreas Mueller

Thanks Manuel, that looks pretty cool.
Do you have a write-up about it? I don't entirely understand the 
connections setup.



Re: [scikit-learn] Pipegraph is on its way!

2018-02-07 Thread Joel Nothman
Cool! We have been talking for a while about how to pass other things
around grid search and other meta-analysis estimators. This injection
approach looks pretty neat as a way to express it. Will need to mull it over.



Re: [scikit-learn] Pipegraph is on its way!

2018-02-07 Thread Andrew Howe
Very cool!  Thanks for all the great work.

Andrew

<~~~>
J. Andrew Howe, PhD
www.andrewhowe.com
http://orcid.org/-0002-3553-1990
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~>
