Re: [Scikit-learn-general] gridsearchCV - overfitting

2016-05-12 Thread Joel Nothman
This would be much clearer if you provided some code, but I think I get
what you're saying.

The final GridSearchCV model is trained on the full training set, so the
fact that it perfectly fits that data with random forests is not altogether
surprising. What you can say about the parameters is that they are also the
best parameters (among those searched) for the RF classifier to predict the
held-out samples under cross-validation.
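
For what it's worth, the usual pattern (a minimal sketch with synthetic data; the
names and sizes are only illustrative, not your setup) is to hold out a test set,
let GridSearchCV choose parameters on the training portion, and report AUC only
on the held-out portion:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later versions
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later versions
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid={'min_samples_leaf': [1, 2, 5]},
                      scoring='roc_auc', cv=10)
search.fit(X_train, y_train)

# The refit model has typically memorised the training set, so training AUC is
# near 1.0; only the held-out AUC estimates generalisation.
print(roc_auc_score(y_train, search.predict_proba(X_train)[:, 1]))
print(roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]))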

On 12 May 2016 at 19:53, A neuman  wrote:

> Hello everyone,
>
> I'm having a bit of trouble with the parameters that I've got from
> GridSearchCV.
>
>
> For example:
>
> If I use the parameters that I got from GridSearchCV, for example
> on RF or k-NN, and test the model on the training set, I get an
> AUC value of about 1.00 or 0.99 every time.
> The dataset has 1200 samples.
>
> Does that mean that I can't use the parameters that I got from
> GridSearchCV? Because that happened in practically every case. I already
> tried nested CV to compare the algorithms.
>
>
> Example for RF with the values I got from GridSearchCV (10-fold):
>
> RandomForestClassifier(n_estimators=200, oob_score=True, max_features=None,
> random_state=1, min_samples_leaf=2, class_weight='balanced_subsample')
>
>
> and then I'm just using clf.predict(X_train) and testing it against
> y_train.

> The AUC value from clf.predict(X_test) is about 0.73, so there is a
> big difference between the train and test sets.
>
> best,
>
>


Re: [Scikit-learn-general] Suggestions for the model selection module

2016-05-07 Thread Joel Nothman
On 7 May 2016 at 19:12, Matthias Feurer 
wrote:


> 1. Return the fit and predict time in `grid_scores_`
>

This has been proposed for many years as part of an overhaul of
grid_scores_. The latest attempt is currently underway at
https://github.com/scikit-learn/scikit-learn/pull/6697, and has a good
chance of being merged.
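
Until that lands, a rough workaround (just a sketch, not the API proposed in
that PR) is to time fit and predict yourself for the settings you care about:

from time import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

start = time()
clf.fit(X, y)              # fit time for this parameter setting
fit_time = time() - start

start = time()
clf.predict(X)             # predict time
predict_time = time() - start

print(fit_time, predict_time)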


> 2. Add distribution objects to scikit-learn which have get_params and
> set_params attributes
>

Your use of get_params to perform serialisation is certainly not what
get_params is designed for, though I understand your use of it that way...
as long as all your parameters are either primitives or objects supporting
get_params. However, this is not by design. Further, param_distributions is
a dict whose values are scipy.stats rvs; get_params currently does not
traverse dicts, so this is already unfamiliar territory requiring a lot of
design, even once we were convinced that this were a valuable use-case,
which I am not certain of.
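
For context, a typical param_distributions dict looks like this (a sketch; the
estimator and distributions are only examples), and the rvs values are exactly
what get_params cannot descend into:

from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in later versions

param_distributions = {
    'n_estimators': randint(10, 500),    # scipy.stats rvs, not a primitive
    'max_features': uniform(0.1, 0.9),   # scipy.stats rvs, not a primitive
}
search = RandomizedSearchCV(RandomForestClassifier(), param_distributions,
                            n_iter=20)

# get_params returns the dict as a single value; it does not recurse into the
# rvs objects, so they cannot be serialised or tuned via get_params/set_params.
print(search.get_params()['param_distributions'])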


> 3. Add get_params and set_params to CV objects
>

get_params and set_params are intended to allow programmatic search over
those parameter settings. This is not often what one does with the
parameters of CV splitting methods, but I acknowledge that supporting this
would not be difficult. Still, if serialisation is the purpose of this,
it's not really the point.


Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Joel Nothman
I don't think we can deny this is strange, certainly for real-world, IID
data!

On 13 April 2016 at 10:31, Juan Nunez-Iglesias <jni.s...@gmail.com> wrote:

> Yes but would you expect sampling 280K / 3M to be qualitatively different
> from sampling 70K / 3M?
>
> At any rate I'll attempt a more rigorous test later this week and report
> back. Thanks!
>
> Juan.
>
> On Wed, Apr 13, 2016 at 10:21 AM, Joel Nothman <joel.noth...@gmail.com>
> wrote:
>
>> It's hard to believe this is a software problem rather than a data
>> problem. If your data was accidentally a duplicate of the dataset, you
>> could certainly get 100%.
>>
>> On 13 April 2016 at 10:10, Juan Nunez-Iglesias <jni.s...@gmail.com>
>> wrote:
>>
>>> Hallelujah! I'd given up on this thread. Thanks for resurrecting it,
>>> Andy! =)
>>>
>>> However, I don't think data distribution can explain the result, since
>>> GridSearchCV gives the expected result (~0.8 accuracy) with 3K and 70K
>>> random samples but changes to perfect classification for 280K samples. I
>>> don't have the data on this computer so I can't test it right now, though.
>>>
>>> Juan.
>>>
>>> On Wed, Apr 13, 2016 at 8:51 AM, Andreas Mueller <t3k...@gmail.com>
>>> wrote:
>>>
>>>> Have you tried to "score" the grid-search on the non-training set?
>>>> The cross-validation is using stratified k-fold while your confirmation
>>>> used the beginning of the dataset vs the rest.
>>>> Your data is probably not IID.
>>>>
>>>>
>>>>
>>>> On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote:
>>>>
>>>> Hi all,
>>>>
>>>> TL;DR: when I run GridSearchCV with RandomForestClassifier and "many"
>>>> samples (280K), it falsely shows accuracy of 1.0 for full trees
>>>> (max_depth=None). This doesn't happen for fewer samples.
>>>>
>>>> Longer version:
>>>>
>>>> I'm trying to optimise RF hyperparameters using GridSearchCV for the
>>>> first time. I have a lot of data (~3M samples, 140 features), so I
>>>> subsampled it to do this. First I subsampled to 3000 samples, which
>>>> finished in 5min, so I ran 70K samples to see if result would still hold.
>>>> This resulted in completely different parameter choices, so I ran 280K
>>>> samples overnight, to see whether at least the choices would stabilise as n
>>>> -> inf. Then when I printed the top 10 models, I got the following:
>>>>
>>>> In [7]: bests = sorted(random_search.grid_scores_, reverse=True,
>>>> key=lambda x: x[1])
>>>>
>>>> In [8]: bests[:10]
>>>> Out[8]:
>>>> [mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'g

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Joel Nothman
It's hard to believe this is a software problem rather than a data problem.
If your data was accidentally a duplicate of the dataset, you could
certainly get 100%.

On 13 April 2016 at 10:10, Juan Nunez-Iglesias  wrote:

> Hallelujah! I'd given up on this thread. Thanks for resurrecting it, Andy!
> =)
>
> However, I don't think data distribution can explain the result, since
> GridSearchCV gives the expected result (~0.8 accuracy) with 3K and 70K
> random samples but changes to perfect classification for 280K samples. I
> don't have the data on this computer so I can't test it right now, though.
>
> Juan.
>
> On Wed, Apr 13, 2016 at 8:51 AM, Andreas Mueller  wrote:
>
>> Have you tried to "score" the grid-search on the non-training set?
>> The cross-validation is using stratified k-fold while your confirmation
>> used the beginning of the dataset vs the rest.
>> Your data is probably not IID.
>>
>>
>>
>> On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote:
>>
>> Hi all,
>>
>> TL;DR: when I run GridSearchCV with RandomForestClassifier and "many"
>> samples (280K), it falsely shows accuracy of 1.0 for full trees
>> (max_depth=None). This doesn't happen for fewer samples.
>>
>> Longer version:
>>
>> I'm trying to optimise RF hyperparameters using GridSearchCV for the
>> first time. I have a lot of data (~3M samples, 140 features), so I
>> subsampled it to do this. First I subsampled to 3000 samples, which
>> finished in 5min, so I ran 70K samples to see if result would still hold.
>> This resulted in completely different parameter choices, so I ran 280K
>> samples overnight, to see whether at least the choices would stabilise as n
>> -> inf. Then when I printed the top 10 models, I got the following:
>>
>> In [7]: bests = sorted(random_search.grid_scores_, reverse=True,
>> key=lambda x: x[1])
>>
>> In [8]: bests[:10]
>> Out[8]:
>> [mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>  mean: 1.0, std: 0.0, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]
>>
>> Needless to say, perfect accuracy is suspicious, and indeed, in this
>> case, completely spurious:
>>
>> In [16]: rftop = ensemble.RandomForestClassifier(**{'n_estimators': 20, 'bootstrap': False,
>> 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'})
>>
>> In [17]: rftop.fit(X[:20], y[:20])
>>
>> In [20]: np.mean(rftop.predict(X[20:]) == y[20:])
>> Out[20]: 0.826125
>>
>> That's more in line with what's expected for this dataset, and what was
>> found by the search with 72K samples (top model: [mean: 0.82640, std:
>> 0.00341, params: {'n_estimators': 500, 'bootstrap': False, 'max_features':
>> 20, 'max_depth': 20, 'criterion': 'gini'},)
>>
>> Anyway, here's my code, any idea why more samples would cause this
>> overfitting / testing on training data?
>>
>> # [omitted: boilerplate to load full data in X0, y0]
>> import numpy as np
>> idx = np.random.choice(len(y0), size=28, replace=False)
>> X, y = X0[idx], y0[idx]
>> param_dist = {'n_estimators': [20, 100, 200, 500],
>>   'max_depth': [3, 5, 20, None],
>>   'max_features': ['auto', 5, 10, 20],
>>   'bootstrap': [True, False],
>>   'criterion': ['gini', 'entropy']}
>> from sklearn import grid_search as gs
>> from time import time
>> from sklearn import ensemble
>> rf = ensemble.RandomForestClassifier()
>> random_search = gs.GridSearchCV(rf, param_grid=param_dist, refit=False,
>> verbose=2, n_jobs=12)
>> start=time(); random_search.fit(X, y); stop=time()
>>
>> Thank you!
>>
>> Juan.

Re: [Scikit-learn-general] [scikit-learn-general] Why sklearn RandomForest model take a lot of disk space after save?

2016-04-11 Thread Joel Nothman
Yes, there are no doubt more efficient ways to store forests, but it
seems unlikely to be a worthwhile investment.

I think this is a documentation rather than an engineering issue. We
frequently get issues raised that relate to "size": runtime, memory
consumption, model size on disk, (in)effectiveness of parallelism.

We could provide methods on models that estimate these costs (analytically
or, indeed, via a pre-fit GP regressor!), but merely documenting them more
clearly up front in the general case (even just "parameters can affect
model size drastically") would be worthwhile.
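
As a rough illustration of the "parameters can affect model size" point (a
sketch with synthetic data; the exact numbers will vary), one can already
measure this by dumping the model with joblib and checking the file size:

import os

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib  # standalone joblib in later versions

X, y = make_classification(n_samples=20000, n_features=21, random_state=0)

for min_samples_leaf in (1, 10, 100):
    clf = RandomForestClassifier(n_estimators=50,
                                 min_samples_leaf=min_samples_leaf,
                                 random_state=0).fit(X, y)
    joblib.dump(clf, '/tmp/rf.joblib', compress=3)
    # Shallower trees (larger leaves) mean fewer nodes and a smaller file.
    print(min_samples_leaf, os.path.getsize('/tmp/rf.joblib'))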

On 12 April 2016 at 02:47, Sebastian Raschka  wrote:

> Just curious how it could be made more efficient. ~14.9 MB for 50 trees on
> a 20 MB dataset doesn't sound too bad actually, since we are not pruning the
> trees in random forests. Something I could think of would be to summarize
> similar trees in buckets or to build a "fragment" library of shared decision
> rules. However, I am not sure how much effort it would be to implement such a
> thing, plus the computational efficiency may suffer. Hm, I am curious: how
> large would a single, fully grown decision tree be on your dataset?
>
>
> On Apr 11, 2016, at 12:17 PM, Piotr Płoński  wrote:
>
> I am using 0.17.1. Did you consider writing custom save methods for this
> classifier?
>
>
> 2016-04-11 18:11 GMT+02:00 Andreas Mueller :
>
>> Which version of scikit-learn are you using?
>> We recently (0.17) removed storing of data point indices in trees which
>> greatly reduced the size in some cases.
>>
>>
>>
>> On 04/10/2016 09:28 AM, Piotr Płoński wrote:
>>
>> Thanks for the comments! I put more details of my problem here
>> 
>> http://stackoverflow.com/questions/36523989/why-sklearn-randomforest-model-take-a-lot-of-disk-space-after-save
>>
>>
>> Indeed, saving with joblib takes less space but there is still a lot of
>> space used on the disk.
>>
>> Best,
>> Piotr
>>
>> 2016-04-10 15:24 GMT+02:00 Mathieu Blondel :
>>
>>> You may also want to save your model using joblib (possibly with
>>> compression enabled) instead of cPickle.
>>>
>>> Mathieu
>>>
>>> On Sun, Apr 10, 2016 at 9:13 AM, Piotr Płoński < 
>>> pplonsk...@gmail.com> wrote:
>>>
 Hi All,

 I am saving a RandomForestClassifier model from the sklearn library with the
 code below

 with open('/tmp/rf.model', 'wb') as f: cPickle.dump(RF_model, f)

 It takes a lot of space on my hard drive. There are only 50 trees in
 the model, yet it takes over 50 MB on disk (the analyzed dataset is ~20 MB,
 with 21 features). Does anybody have an idea why? I observe similar behavior
 for ExtraTreesClassifier.

 Best,

 Piotr




Re: [Scikit-learn-general] weighted kernel density estimation

2016-04-10 Thread Joel Nothman
I think you should submit these changes as a pull request. Thanks, Jared.

On 8 April 2016 at 21:17, Jared Gabor  wrote:

> I recently modified the kernel density estimation routines in
> sklearn/neighbors to include optional weighting of the training samples (to
> make analogs to weighted histograms).  I'd be interested in contributing
> this to scikit-learn, but it's mostly edits of existing code (as opposed to
> new source files), and I'm not sure what's the policy in that case.
>
> Here's the code, which could use some more testing and validation (and
> documentation):
>
> https://github.com/jaredgabor/scikit-learn/tree/weighted_kde
>
> Thanks for any input,
> Jared Gabor
>
>


Re: [Scikit-learn-general] Binary Classifier Evaluation Metrics

2016-03-26 Thread Joel Nothman
It looks like you should use the
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
to convert y_train into a binary indicator matrix format that scikit-learn
can work with.
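
A minimal sketch of the conversion (using a shortened version of the label
lists from the example below):

from sklearn.preprocessing import MultiLabelBinarizer

y_train = [[0], [0], [1], [1], [0, 1], [0, 1]]
Y = MultiLabelBinarizer().fit_transform(y_train)
print(Y)
# [[1 0]
#  [1 0]
#  [0 1]
#  [0 1]
#  [1 1]
#  [1 1]]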

On 25 March 2016 at 18:42, Enise Basaran <basaranen...@gmail.com> wrote:

> Hi,
>
> I'm working on web page classification and I have 32 categories such as
> 'Adult', 'Business', 'Education', etc.
>
> OneVsRestClassifier example is below :
>
> X_train = np.array(["new york is a hell of a town",
> "new york was originally dutch",
> "the big apple is great",
> "new york is also called the big apple",
> "nyc is nice",
> "people abbreviate new york city as nyc",
> "the capital of great britain is london",
> "london is in the uk",
> "london is in england",
> "london is in great britain",
> "it rains a lot in london",
> "london hosts the british museum",
> "new york is great and so is london",
> "i like london better than new york"])
> y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
>
> But I don't want to label the data as above ([0,1]), because as you know it's
> very difficult to find multilabelled data. So I generated 32 binary datasets,
> one per category. When test content arrives for prediction, it is sent to all
> classifiers and I take into account only the classifiers that return 'Yes'.
> This way I can do multilabel classification with my own dataset.
>
> I can evaluate precision, recall and F-measure values for each classifier (for
> each category), but how can I evaluate my whole dataset (all classifiers)?
> Thanks for your help in advance.
>
>
>
> On Thu, Mar 24, 2016 at 10:26 PM, Joel Nothman <joel.noth...@gmail.com>
> wrote:
>
>> OneVsRestClassifier already implements Binary Relevance. What is unclear
>> about our documentation on model evaluation and metrics?
>>
>> On 25 March 2016 at 00:13, Enise Basaran <basaranen...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I want to learn binary classifier evaluation metrics please. I
>>> implemented "Binary Relevance" method for multilabel classification.
>>> *[1] * My classifiers say "Yes" or "No". How can I calculate accuracy
>>> score of my dataset, what metrics can I use for my binary classifiers?
>>> Thanks in advance.
>>>
>>>
>>> *[1] Binary Relevance (BR)* is one of the most popular approaches as a
>>> transformation method: it creates k datasets (k = |L|, the total number of
>>> classes), one for each class label, and trains a classifier on each of
>>> these datasets. Each of these datasets contains the same number of
>>> instances as the original data, but each dataset D_λj, 1 ≤ j ≤ k,
>>> positively labels instances that belong to class λj and negatively otherwise.
>>>
>>> Sincerely,
>>>
>>>
>
>
> --
> *Enise Başaran*
> *Software Developer*
>
>

Re: [Scikit-learn-general] Binary Classifier Evaluation Metrics

2016-03-24 Thread Joel Nothman
OneVsRestClassifier already implements Binary Relevance. What is unclear
about our documentation on model evaluation and metrics?
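
A minimal sketch of the Binary Relevance setup with OneVsRestClassifier (the
data and the LinearSVC base estimator are only illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

X_train = np.array(["new york is a hell of a town",
                    "london is in the uk",
                    "new york is great and so is london"])
y_train = [[0], [1], [0, 1]]

# One binary classifier is fitted per column of the indicator matrix.
Y = MultiLabelBinarizer().fit_transform(y_train)
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(X_train, Y)
print(clf.predict(np.array(["london is a nice town in the uk"])))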

On 25 March 2016 at 00:13, Enise Basaran  wrote:

> Hi everyone,
>
> I want to learn about binary classifier evaluation metrics, please. I
> implemented the "Binary Relevance" method for multilabel classification [1].
> My classifiers say "Yes" or "No". How can I calculate the accuracy score of
> my dataset, and what metrics can I use for my binary classifiers? Thanks in
> advance.
>
>
> *[1] Binary Relevance (BR)* is one of the most popular approaches as a
> transformation method: it creates k datasets (k = |L|, the total number of
> classes), one for each class label, and trains a classifier on each of
> these datasets. Each of these datasets contains the same number of
> instances as the original data, but each dataset D_λj, 1 ≤ j ≤ k,
> positively labels instances that belong to class λj and negatively otherwise.
>
> Sincerely,
>
>


Re: [Scikit-learn-general] Scikit-learn standards for serializing/saving objects

2016-03-23 Thread Joel Nothman
I think all the scikit-learn devs know that the serialisation available in
scikit-learn is inadequate, and recommend storing training data and model
parameters.

Designing a serialisation format that is robust to future changes is a huge
engineering effort, and is likely to result in one of: (a) a framework that
has all the power and hence faults of pickling; (b) an implementation that
is limited to only some parameter values on some estimators; or (c) a
specialised, over-engineered monolith that we can't afford to maintain.

One approach mooted time and again is supporting export to a
framework-independent model description language, like PMML. For this see
the work begun at https://github.com/alex-pirozhenko/sklearn-pmml. The
intention here, however, is not especially to re-load the models in
scikit-learn, but to perform prediction with scikit-learn-fitted models in
other frameworks.

On 24 March 2016 at 13:04, Chris Hausler  wrote:

> We also have similar issues. It'd be great to hear any cool solutions :-)
>
> On Thu, 24 Mar 2016 at 12:47 Keith Lehman 
> wrote:
>
>> Thanks Sebastian.
>>
>> This is basically what we are doing too. The hard, time-consuming part is
>> determining which attributes of each scikit-learn object need to be saved
>> and how best to extract them.
>>
>> - Keith
>>
>> -Original Message-
>> From: Sebastian Raschka [mailto:se.rasc...@gmail.com]
>> Sent: Wednesday, March 23, 2016 4:05 PM
>> To: scikit-learn-general@lists.sourceforge.net
>> Subject: Re: [Scikit-learn-general] Scikit-learn standards for
>> serializing/saving objects
>>
>> I also had some issues with pickle in the past and have to admit that I
>> actually don't trust pickle files ;). Maybe I am too paranoid, but I am
>> always afraid of corrupting or losing the data.
>> Probably not the most elegant solution, but I typically store estimator
>> settings and model parameters as JSON files (since they are human readable
>> in the worst-case scenario, with "reproducible research" in mind ;)).
>>
>>
>> For example:
>>
>>
>> # Model fitting and saving params to JSON
>>
>> from sklearn.linear_model import LinearRegression
>> from sklearn.datasets import load_diabetes
>>
>> diabetes = load_diabetes()
>> X, y = diabetes.data, diabetes.target
>> regr = LinearRegression()
>> regr.fit(X, y)
>>
>> import json
>>
>> with open('./params.json', 'w', encoding='utf-8') as outfile:
>>     json.dump(regr.get_params(), outfile)
>>
>> with open('./weights.json', 'w', encoding='utf-8') as outfile:
>>     json.dump(regr.coef_.tolist(), outfile, separators=(',', ':'),
>>               sort_keys=True, indent=4)
>>
>> with open('./intercept.json', 'w', encoding='utf-8') as outfile:
>>     json.dump(regr.intercept_, outfile)
>>
>>
>> # In a new session: load the params from the JSON files
>>
>>
>> import json
>> import codecs
>> from sklearn.linear_model import LinearRegression
>> from sklearn.datasets import load_diabetes
>> import numpy as np
>>
>> diabetes = load_diabetes()
>> X, y = diabetes.data, diabetes.target
>>
>> obj_text = codecs.open('./params.json', 'r', encoding='utf-8').read()
>> params = json.loads(obj_text)
>>
>> obj_text = codecs.open('./weights.json', 'r', encoding='utf-8').read()
>> weights = json.loads(obj_text)
>>
>> obj_text = codecs.open('./intercept.json', 'r', encoding='utf-8').read()
>> intercept = json.loads(obj_text)
>>
>> regr = LinearRegression()
>> regr.set_params(**params)
>> regr.intercept_, regr.coef_ = intercept, np.array(weights)
>>
>> regr.predict(X[:10])
>>
>> array([ 206.11706979,   68.07234761,  176.88406035,  166.91796559,
>> 128.45984241,  106.34908972,   73.89417947,  118.85378669,
>> 158.81033076,  213.58408893])
>>
>>
>> In any case, I know that this isn't pretty, and I would also be looking
>> forward to a better solution!
>>
>> Best,
>> Sebastian Raschka
>>
>>
>> > On Mar 23, 2016, at 12:47 PM, Keith Lehman 
>> wrote:
>> >
>> > Hi:
>> >
>> > I’m fairly new to scikit-learn, python, and machine learning. This
>> community has built a great set of libraries though, and is actually a
>> large part of the reason why my company has selected python to experiment
>> with ML.
>> >
>> > As we are developing our product, however, we keep running into trouble
>> saving various objects. When possible, we use pickle to save the objects,
>> but this can cause problems in development – objects saved during a debug
>> session can not be loaded outside of the debugger. The reason appears to be
>> because even when pickling a “pickleable” object (such as a trained
>> LinearRegression), pickle finds and saves more primitive objects that have
>> been instantiated within the debug environment. Dill and cpickle have the
>> same issue. My question is, does the scikit-learn community plan to add
>> standard load/save or dump/dumps and load/loads methods that would not
>> create these dependencies?
>> >
>> > If there is a better forum for 

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
And I lied when I said that none of the scikit-learn estimators define their own
get_params. Of course the following do: VotingClassifier, Kernel (and
subclasses), Pipeline and FeatureUnion.

On 23 March 2016 at 15:04, Joel Nothman <joel.noth...@gmail.com> wrote:

> something like the following may suffice:
>
> def get_params(self, deep=True):
>     out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
>     out['w2v_clusters'] = self.w2v_clusters
>     return out
>
> On 23 March 2016 at 15:01, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> Hi Fred,
>>
>> We use the __init__ signature to get the list of parameters that (a) can
>> be set by grid search; (b) need to be copied to a cloned instance of the
>> estimator (with any fitted model discarded) in constructing ensembles,
>> cross validation, etc. While none of the scikit-learn library of estimators
>> do this, in practice you can overload get_params to define your own
>> parameter listing. See
>> http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
>>
>> On 23 March 2016 at 14:45, Fred Mailhot <fred.mail...@gmail.com> wrote:
>>
>>> Hello list,
>>>
>>> Firstly, thanks for this incredible package; I use it daily at work. Now
>>> on to the meat: I'm trying to subclass TfidfVectorizer and running into
>>> issues. I want to specify an extra param for __init__() that points to a
>>> file that gets used in build_analyzer(). Skipping irrelevant bits, I've got
>>> the following:
>>>
>>> #==
>>> class WordCooccurrenceVectorizer(TfidfVectorizer):
>>>
>>> ### override __init__ to add w2v_clusters arg
>>> # see
>>> http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
>>> # for explanation of syntax
>>> def __init__(self, *args, **kwargs):
>>> try:
>>> self.w2v_cluster_path = kwargs.pop("w2v_clusters")
>>> except KeyError:
>>> pass
>>> super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)
>>>
>>> def build_analyzer(self):
>>> preprocess = self.build_preprocessor()
>>> stopwords = self.get_stop_words()
>>> w2v_clusters = self.load_w2v_clusters()
>>> tokenize = self.build_tokenizer()
>>> return lambda doc:
>>> self._nwise(tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)
>>> [...]
>>> #==
>>>
>>> I can instantiate this, but when I want to inspect it, I get the
>>> following (this is in ipython, in a script it just hangs):
>>>
>>> #==
>>> In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
>>> stop_words="english", max_df=0.5, min_df=1, max_features=1,
>>> w2v_clusters="clusters.20160322_1803.w2v", binary=True)
>>>
>>> In [3]: vec
>>> Out[3]:
>>> ---
>>> RuntimeError  Traceback (most recent call
>>> last)
>>> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc
>>> in __call__(self, obj)
>>> 697 type_pprinters=self.type_printers,
>>> 698 deferred_pprinters=self.deferred_printers)
>>> --> 699 printer.pretty(obj)
>>> 700 printer.flush()
>>> 701 return stream.getvalue()
>>>
>>> [...]
>>>
>>> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc
>>> in _get_param_names(cls)
>>> 193" %s with constructor %s
>>> doesn't "
>>> 194" follow this convention."
>>> --> 195% (cls, init_signature))
>>> 196 # Extract and sort argument names excluding 'self'
>>> 197 return sorted([p.name for p in parameters])
>>>
>>> RuntimeError: scikit-learn estimators should always specify their
>>> parameters in the signature of their __init__ (no varargs). >> 'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (,
>>> *args, **kwargs) doesn't  follow this convention.

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
something like the following may suffice:

def get_params(self, deep=True):
    out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
    out['w2v_clusters'] = self.w2v_clusters
    return out

On 23 March 2016 at 15:01, Joel Nothman <joel.noth...@gmail.com> wrote:

> Hi Fred,
>
> We use the __init__ signature to get the list of parameters that (a) can
> be set by grid search; (b) need to be copied to a cloned instance of the
> estimator (with any fitted model discarded) in constructing ensembles,
> cross validation, etc. While none of the scikit-learn library of estimators
> do this, in practice you can overload get_params to define your own
> parameter listing. See
> http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
>
> On 23 March 2016 at 14:45, Fred Mailhot <fred.mail...@gmail.com> wrote:
>
>> Hello list,
>>
>> Firstly, thanks for this incredible package; I use it daily at work. Now
>> on to the meat: I'm trying to subclass TfidfVectorizer and running into
>> issues. I want to specify an extra param for __init__() that points to a
>> file that gets used in build_analyzer(). Skipping irrelevant bits, I've got
>> the following:
>>
>> #==
>> class WordCooccurrenceVectorizer(TfidfVectorizer):
>>
>> ### override __init__ to add w2v_clusters arg
>> # see
>> http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
>> # for explanation of syntax
>> def __init__(self, *args, **kwargs):
>> try:
>> self.w2v_cluster_path = kwargs.pop("w2v_clusters")
>> except KeyError:
>> pass
>> super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)
>>
>> def build_analyzer(self):
>> preprocess = self.build_preprocessor()
>> stopwords = self.get_stop_words()
>> w2v_clusters = self.load_w2v_clusters()
>> tokenize = self.build_tokenizer()
>> return lambda doc:
>> self._nwise(tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)
>> [...]
>> #==
>>
>> I can instantiate this, but when I want to inspect it, I get the
>> following (this is in ipython, in a script it just hangs):
>>
>> #==
>> In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
>> stop_words="english", max_df=0.5, min_df=1, max_features=1,
>> w2v_clusters="clusters.20160322_1803.w2v", binary=True)
>>
>> In [3]: vec
>> Out[3]:
>> ---
>> RuntimeError  Traceback (most recent call
>> last)
>> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc
>> in __call__(self, obj)
>> 697 type_pprinters=self.type_printers,
>> 698 deferred_pprinters=self.deferred_printers)
>> --> 699 printer.pretty(obj)
>> 700 printer.flush()
>> 701 return stream.getvalue()
>>
>> [...]
>>
>> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc
>> in _get_param_names(cls)
>> 193" %s with constructor %s
>> doesn't "
>> 194" follow this convention."
>> --> 195% (cls, init_signature))
>> 196 # Extract and sort argument names excluding 'self'
>> 197 return sorted([p.name for p in parameters])
>>
>> RuntimeError: scikit-learn estimators should always specify their
>> parameters in the signature of their __init__ (no varargs). > 'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (,
>> *args, **kwargs) doesn't  follow this convention.
>>
>> In [4]:
>> #==
>>
>> The error is clear enough -- I can't use *args and **kwargs in a sklearn
>> estimator's __init__() -- but I'm not sure what the correct way is to do
>> what I need to do. Do I literally need to specify all of the __init__
>> params in my subclass and then pass them on to the __init__ of super()? If
>> so, what's the reason for setting this up this way?
>>
>>
>> Thanks for any pointers/guidance,
>> Fred.
>>
>>
>>

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
Hi Fred,

We use the __init__ signature to get the list of parameters that (a) can be
set by grid search; (b) need to be copied to a cloned instance of the
estimator (with any fitted model discarded) in constructing ensembles,
cross validation, etc. While none of the estimators in the scikit-learn library
do this, in practice you can overload get_params to define your own
parameter listing. See
http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
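
To make that concrete, the other option is to give the subclass an explicit
__init__ signature; the sketch below lists only a handful of TfidfVectorizer's
parameters (a real subclass would have to spell out every parameter it wants
grid search and clone to see):

from sklearn.feature_extraction.text import TfidfVectorizer

class WordCooccurrenceVectorizer(TfidfVectorizer):
    # Every parameter is named explicitly (no *args/**kwargs) and stored under
    # the same name, so _get_param_names, get_params and clone work unchanged.
    def __init__(self, w2v_clusters=None, ngram_range=(1, 1), stop_words=None,
                 max_df=1.0, min_df=1, max_features=None, binary=False):
        super(WordCooccurrenceVectorizer, self).__init__(
            ngram_range=ngram_range, stop_words=stop_words, max_df=max_df,
            min_df=min_df, max_features=max_features, binary=binary)
        self.w2v_clusters = w2v_clusters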

On 23 March 2016 at 14:45, Fred Mailhot  wrote:

> Hello list,
>
> Firstly, thanks for this incredible package; I use it daily at work. Now
> on to the meat: I'm trying to subclass TfidfVectorizer and running into
> issues. I want to specify an extra param for __init__() that points to a
> file that gets used in build_analyzer(). Skipping irrelevant bits, I've got
> the following:
>
> #==
> class WordCooccurrenceVectorizer(TfidfVectorizer):
>
>     ### override __init__ to add w2v_clusters arg
>     # see
>     # http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
>     # for explanation of syntax
>     def __init__(self, *args, **kwargs):
>         try:
>             self.w2v_cluster_path = kwargs.pop("w2v_clusters")
>         except KeyError:
>             pass
>         super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)
>
>     def build_analyzer(self):
>         preprocess = self.build_preprocessor()
>         stopwords = self.get_stop_words()
>         w2v_clusters = self.load_w2v_clusters()
>         tokenize = self.build_tokenizer()
>         return lambda doc: self._nwise(
>             tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)
> [...]
> #==
>
> I can instantiate this, but when I want to inspect it, I get the following
> (this is in ipython, in a script it just hangs):
>
> #==
> In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
> stop_words="english", max_df=0.5, min_df=1, max_features=1,
> w2v_clusters="clusters.20160322_1803.w2v", binary=True)
>
> In [3]: vec
> Out[3]:
> ---
> RuntimeError  Traceback (most recent call last)
> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc
> in __call__(self, obj)
> 697 type_pprinters=self.type_printers,
> 698 deferred_pprinters=self.deferred_printers)
> --> 699 printer.pretty(obj)
> 700 printer.flush()
> 701 return stream.getvalue()
>
> [...]
>
> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc
> in _get_param_names(cls)
> 193" %s with constructor %s
> doesn't "
> 194" follow this convention."
> --> 195% (cls, init_signature))
> 196 # Extract and sort argument names excluding 'self'
> 197 return sorted([p.name for p in parameters])
>
> RuntimeError: scikit-learn estimators should always specify their
> parameters in the signature of their __init__ (no varargs).  'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (,
> *args, **kwargs) doesn't  follow this convention.
>
> In [4]:
> #==
>
> The error is clear enough -- I can't use *args and **kwargs in a sklearn
> estimator's __init__() -- but I'm not sure what the correct way is to do
> what I need to do. Do I literally need to specify all of the __init__
> params in my subclass and then pass them on to the __init__ of super()? If
> so, what's the reason for setting this up this way?
>
>
> Thanks for any pointers/guidance,
> Fred.
>
>
>


Re: [Scikit-learn-general] Feature selection != feature elimination?

2016-03-14 Thread Joel Nothman
Currently there is no automatic mechanism for eliminating the generation of
features that are not selected downstream. It needs to be achieved manually.
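
A manual sketch of what that can look like (assuming a plain SelectKBest step;
the names are illustrative): read the support mask off the fitted selector and
compute only those columns at prediction time:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

selector = SelectKBest(f_classif, k=10).fit(X, y)
keep = selector.get_support(indices=True)  # indices of the surviving features

# At prediction time, generate/collect only the kept columns and skip the
# selector step entirely.
X_new = X[:5]                    # stand-in for newly arriving samples
X_new_reduced = X_new[:, keep]   # equivalent to selector.transform(X_new)
assert np.allclose(X_new_reduced, selector.transform(X_new))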

On 15 March 2016 at 08:05, Philip Tully  wrote:

> Hi,
>
> I'm trying to optimize the time it takes to make a prediction with my
> model(s). I realized that when I perform feature selection during the
> model fit(), that these features are likely still computed when I go
> to predict() or predict_proba(). An optimization would then involve
> actually eliminating those features that aren't selected from my
> Pipeline altogether, instead of just selecting them.
>
> Does sklearn already do this automatically? Or does this readjustment
> need to be done manually before serialization?
>
> thanks,
> Philip
>
>


Re: [Scikit-learn-general] Restrictions on feature names when drawing decision tree

2016-03-13 Thread Joel Nothman
We should probably be escaping feature names internally. It's easy to
forget that graphviz supports HTML-like markup.
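
In the meantime, escaping the names before export (Andreas's suggestion below)
should work; a sketch with made-up data, since with special_characters=True the
labels are HTML-like and '&' must become '&amp;':

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_graphviz

X, y = make_regression(n_samples=100, n_features=2, random_state=0)
reg = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Feature names containing '&' break the HTML-like labels unless escaped.
feature_names = ['Design & Tech. 3D Design=A', 'Design & Tech. Product Design=A']
escaped = [name.replace('&', '&amp;') for name in feature_names]

export_graphviz(reg, out_file='tree.dot', feature_names=escaped,
                filled=True, rounded=True, special_characters=True)
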

On 14 March 2016 at 08:00, Andreas Mueller  wrote:

> Try escaping the &.
>
> On 03/12/2016 02:57 PM, Raphael C wrote:
> > The code snippet should have been
> >
> >
> > reg = DecisionTreeRegressor(max_depth=None,min_samples_split=1)
> > reg.fit(X,Y)
> > scores = cross_val_score(reg, X, Y)
> > print scores
> > dot_data = StringIO()
> > tree.export_graphviz(reg, out_file=dot_data,
> >   feature_names=feature_names,
> >   filled=True, rounded=True,
> >   special_characters=True)
> > graph = pydot.graph_from_dot_data(dot_data.getvalue())
> > Image(graph.create_png()
> >
> > Raphael
> >
> > On 12 March 2016 at 13:56, Raphael C  wrote:
> >> I am attempting to draw a decision tree using:
> >>
> >> reg = DecisionTreeRegressor(max_depth=None,min_samples_split=1)
> >> reg.fit(X,Y)
> >> dot_data = StringIO()
> >> tree.export_graphviz(reg, out_file=dot_data,
> >>   feature_names=feature_names,
> >>   filled=True, rounded=True,
> >>   special_characters=True)
> >> graph = pydot.graph_from_dot
> >>
> >>
> >> This gives me the error message
> >>
> >>
> >>File "/usr/lib/python2.7/dist-packages/pydot.py", line 1802, in
> 
> >>  lambda f=frmt, prog=self.prog : self.create(format=f, prog=prog))
> >>File "/usr/lib/python2.7/dist-packages/pydot.py", line 2023, in
> create
> >>  status, stderr_output) )
> >> pydot.InvocationException: Program terminated with status: 1. stderr
> >> follows: Error: not well-formed (invalid token) in line 1
> >> ... Design & Tech. 3D Design=A  0.5 ...
> >> in label of node 17
> >> Error: not well-formed (invalid token) in line 1
> >> ... Design & Tech. Product Design=A  0.5 ...
> >> in label of node 68
> >>
> >>
> >> Is this because there is some restriction on the types of strings that
> >> are supported as feature names?
> >>
> >> Two of the feature names are:
> >>
> >> 'Design & Tech. 3D Design=A'
> >>
> >> and
> >>
> >> 'Design & Tech. Product Design=A'
> >>
> >> Raphael
> >
> --
> > Transform Data into Opportunity.
> > Accelerate data analysis in your applications with
> > Intel Data Analytics Acceleration Library.
> > Click to learn more.
> > http://pubads.g.doubleclick.net/gampad/clk?id=278785111=/4140
> > ___
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
> --
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785111=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785111=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Average Per-Class Accuracy metric

2016-03-08 Thread Joel Nothman
You mean TP / N, not TP / TN.

And I think the average per-class accuracy does some weird things. Like:

true = [1, 1, 1, 0, 0]
pred = [1, 1, 1, 1, 1]
a.p.c.a = (3 + 3) / 5 / 2

true = [1, 1, 1, 0, 2]
pred = [1, 1, 1, 1, 1]
a.p.c.a = (4 + 4 + 3) / 5 / 3

I don't think that's very useful.

On 9 March 2016 at 13:36, Sebastian Raschka <se.rasc...@gmail.com> wrote:

> > Firstly, balanced accuracy is a different thing, and yes, it should be
> supported.
>
> > Secondly, I am correct in thinking you're talking about multiclass (not
> multilabel).
>
>
> Sorry for the confusion, and yes, you are right. I think have mixed the
> terms “average per-class accuracy” with “balanced accuracy” then.
>
> Maybe to clarify, a corrected example to describe what I meant. Given the
> confusion matrix
>
>              predicted label
>
>          [ 3,  0,  0]
>   true   [ 7, 50, 12]
>   label  [ 0,  0, 18]
>
>
> I’d compute the accuracy as TP / TN =  (3 + 50 + 18) / 90 = 0.79
>
> and the “average per-class accuracy” as
>
> (83/90 + 71/90 + 78/90) / 3 = (83 + 71 + 78) / (3 * 90) = 0.86
>
> (I hope I got it right this time!)
>
> In any case, I am not finding any literature describing this, and I am
> also not proposing to add it to sickit-learn, just wanted to get some info
> whether this is implemented or not. Thanks! :)
>
>
>
> > On Mar 8, 2016, at 8:29 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
> >
> > Firstly, balanced accuracy is a different thing, and yes, it should be
> supported.
> >
> > Secondly, I am correct in thinking you're talking about multiclass (not
> multilabel).
> >
> > However, what you're describing isn't accuracy. It's actually
> micro-averaged recall, except that your dataset is impossible because
> you're allowing there to be fewer predictions than instances. If we assume
> that we're allowed to predict some negative class, that's fine; we can
> nowadays exclude it from micro-averaged recall with the labels parameter to
> recall_score. (If all labels are included in a multiclass problem,
> micro-averaged recall = precision = fscore = accuracy.)
> >
> > I had assumed you meant binarised accuracy, which would add together
> both true positives and true negatives for each class.
> >
> > Either way, if there's no literature on this, I think we'd really best
> not support it.
> >
> > On 9 March 2016 at 11:15, Sebastian Raschka <se.rasc...@gmail.com>
> wrote:
> > I haven’t seen this in practice, yet, either. A colleague was looking
> for this in scikit-learn recently, and he asked me if I know whether this
> is implemented or not. I couldn’t find anything in the docs and was just
> curious about your opinion. However, I just found this entry here on
> wikipedia:
> >
> > https://en.wikipedia.org/wiki/Accuracy_and_precision
> > > Another useful performance measure is the balanced accuracy[10] which
> avoids inflated performance estimates on imbalanced datasets. It is defined
> as the arithmetic mean of sensitivity and specificity, or the average
> accuracy obtained on either class:
> >
> > > Am I right in thinking that in the binary case, this is identical to
> accuracy?
> >
> >
> > I think it would only be equal to the “accuracy” if the class labels are
> uniformly distributed.
> >
> > >  I'm not sure what this metric is getting at.
> >
> > I have to think about this more, but I think it may be useful for
> imbalanced datasets where you want to emphasize the minority class. E.g.,
> let’s say we have a dataset of 120 samples and three class labels 1, 2, 3.
> And the classes are distributed like this
> > 10 x 1
> > 50 x 2
> > 60 x 3
> >
> > Now, let’s assume we have a model that makes the following predictions
> >
> > - it gets 0 out of 10 from class 1 right
> > - 45 out of 50 from class 2
> > - 55 out of 60 from class 3
> >
> > So, the accuracy would then be computed as
> >
> > (0 + 45 + 55) / 120 = 0.833
> >
> > But the “balanced accuracy” would be much lower, because the model did
> really badly on class 1, i.e.,
> >
> > (0/10 + 45/50 + 55/60) / 3 = 0.61
> >
> > Hm, if I see this correctly, this is actually very similar to the F1
> score. But instead of computing the harmonic mean between “precision and
> the true positive rate), we compute the harmonic mean between "precision
> and true negative rate"
> >
> > > On Mar 8, 2016, at 6:40 PM, Joel Nothman <joel.noth...@gmail.com>
> wrote:
> > >
> > > I've not seen this metric used (referen

Re: [Scikit-learn-general] Average Per-Class Accuracy metric

2016-03-08 Thread Joel Nothman
Firstly, balanced accuracy is a different thing, and yes, it should be
supported.

Secondly, I am correct in thinking you're talking about multiclass (not
multilabel).

However, what you're describing isn't accuracy. It's actually
micro-averaged recall, except that your dataset is impossible because
you're allowing there to be fewer predictions than instances. If we assume
that we're allowed to predict some negative class, that's fine; we can
nowadays exclude it from micro-averaged recall with the labels parameter to
recall_score. (If all labels are included in a multiclass problem,
micro-averaged recall = precision = fscore = accuracy.)

I had assumed you meant binarised accuracy, which would add together both
true positives and true negatives for each class.

Either way, if there's no literature on this, I think we'd really best not
support it.
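
Concretely, a small sketch with made-up labels (per-class recall, its macro
average, and micro-averaged recall restricted to a subset of labels):

from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3]
y_pred = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]

print(recall_score(y_true, y_pred, average=None))     # recall for each class
print(recall_score(y_true, y_pred, average='macro'))  # mean of per-class recalls
print(recall_score(y_true, y_pred, labels=[1, 2], average='micro'))  # class 3 excluded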

On 9 March 2016 at 11:15, Sebastian Raschka <se.rasc...@gmail.com> wrote:

> I haven’t seen this in practice, yet, either. A colleague was looking for
> this in scikit-learn recently, and he asked me if I know whether this is
> implemented or not. I couldn’t find anything in the docs and was just
> curious about your opinion. However, I just found this entry here on
> wikipedia:
>
> https://en.wikipedia.org/wiki/Accuracy_and_precision
> > Another useful performance measure is the balanced accuracy[10] which
> avoids inflated performance estimates on imbalanced datasets. It is defined
> as the arithmetic mean of sensitivity and specificity, or the average
> accuracy obtained on either class:
>
> > Am I right in thinking that in the binary case, this is identical to
> accuracy?
>
>
> I think it would only be equal to the “accuracy” if the class labels are
> uniformly distributed.
>
> >  I'm not sure what this metric is getting at.
>
> I have to think about this more, but I think it may be useful for
> imbalanced datasets where you want to emphasize the minority class. E.g.,
> let’s say we have a dataset of 120 samples and three class labels 1, 2, 3.
> And the classes are distributed like this
> 10 x 1
> 50 x 2
> 60 x 3
>
> Now, let’s assume we have a model that makes the following predictions
>
> - it gets 0 out of 10 from class 1 right
> - 45 out of 50 from class 2
> - 55 out of 60 from class 3
>
> So, the accuracy would then be computed as
>
> (0 + 45 + 55) / 120 = 0.833
>
> But the “balanced accuracy” would be much lower, because the model did
> really badly on class 1, i.e.,
>
> (0/10 + 45/50 + 55/60) / 3 = 0.61
>
> Hm, if I see this correctly, this is actually very similar to the F1
> score. But instead of computing the harmonic mean between “precision and
> the true positive rate), we compute the harmonic mean between "precision
> and true negative rate"
>
> > On Mar 8, 2016, at 6:40 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
> >
> > I've not seen this metric used (references?). Am I right in thinking
> that in the binary case, this is identical to accuracy? If I predict all
> elements to be the majority class, then adding more minority classes into
> the problem increases my score. I'm not sure what this metric is getting at.
> >
> > On 8 March 2016 at 11:57, Sebastian Raschka <se.rasc...@gmail.com>
> wrote:
> > Hi,
> >
> > I was just wondering why there’s no support for the average per-class
> accuracy in the scorer functions (if I am not overlooking something).
> > E.g., we have 'f1_macro', 'f1_micro', 'f1_samples', ‘f1_weighted’ but I
> didn’t see a ‘accuracy_macro’, i.e.,
> > (acc.class_1 + acc.class_2 + … + acc.class_n) / n
> >
> > Would you discourage its usage (in favor of other metrics in imbalanced
> class problems) or was it simply not implemented, yet?
> >
> > Best,
> > Sebastian
> >

Re: [Scikit-learn-general] Average Per-Class Accuracy metric

2016-03-08 Thread Joel Nothman
(Although multioutput accuracy is reasonable to support.)

On 9 March 2016 at 12:29, Joel Nothman <joel.noth...@gmail.com> wrote:

> Firstly, balanced accuracy is a different thing, and yes, it should be
> supported.
>
> Secondly, I am correct in thinking you're talking about multiclass (not
> multilabel).
>
> However, what you're describing isn't accuracy. It's actually
> micro-averaged recall, except that your dataset is impossible because
> you're allowing there to be fewer predictions than instances. If we assume
> that we're allowed to predict some negative class, that's fine; we can
> nowadays exclude it from micro-averaged recall with the labels parameter to
> recall_score. (If all labels are included in a multiclass problem,
> micro-averaged recall = precision = fscore = accuracy.)
>
> I had assumed you meant binarised accuracy, which would add together both
> true positives and true negatives for each class.
>
> Either way, if there's no literature on this, I think we'd really best not
> support it.
>
> On 9 March 2016 at 11:15, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
>> I haven’t seen this in practice, yet, either. A colleague was looking for
>> this in scikit-learn recently, and he asked me if I know whether this is
>> implemented or not. I couldn’t find anything in the docs and was just
>> curious about your opinion. However, I just found this entry here on
>> wikipedia:
>>
>> https://en.wikipedia.org/wiki/Accuracy_and_precision
>> > Another useful performance measure is the balanced accuracy[10] which
>> avoids inflated performance estimates on imbalanced datasets. It is defined
>> as the arithmetic mean of sensitivity and specificity, or the average
>> accuracy obtained on either class:
>>
>> > Am I right in thinking that in the binary case, this is identical to
>> accuracy?
>>
>>
>> I think it would only be equal to the “accuracy” if the class labels are
>> uniformly distributed.
>>
>> >  I'm not sure what this metric is getting at.
>>
>> I have to think about this more, but I think it may be useful for
>> imbalanced datasets where you want to emphasize the minority class. E.g.,
>> let’s say we have a dataset of 120 samples and three class labels 1, 2, 3.
>> And the classes are distributed like this
>> 10 x 1
>> 50 x 2
>> 60 x 3
>>
>> Now, let’s assume we have a model that makes the following predictions
>>
>> - it gets 0 out of 10 from class 1 right
>> - 45 out of 50 from class 2
>> - 55 out of 60 from class 3
>>
>> So, the accuracy would then be computed as
>>
>> (0 + 45 + 55) / 120 = 0.833
>>
>> But the “balanced accuracy” would be much lower, because the model did
>> really badly on class 1, i.e.,
>>
>> (0/10 + 45/50 + 55/60) / 3 = 0.61
>>
>> Hm, if I see this correctly, this is actually very similar to the F1
>> score. But instead of computing the harmonic mean between "precision and
>> the true positive rate", we compute the harmonic mean between "precision
>> and true negative rate".
>>
>> > On Mar 8, 2016, at 6:40 PM, Joel Nothman <joel.noth...@gmail.com>
>> wrote:
>> >
>> > I've not seen this metric used (references?). Am I right in thinking
>> that in the binary case, this is identical to accuracy? If I predict all
>> elements to be the majority class, then adding more minority classes into
>> the problem increases my score. I'm not sure what this metric is getting at.
>> >
>> > On 8 March 2016 at 11:57, Sebastian Raschka <se.rasc...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I was just wondering why there’s no support for the average per-class
>> accuracy in the scorer functions (if I am not overlooking something).
>> > E.g., we have 'f1_macro', 'f1_micro', 'f1_samples', ‘f1_weighted’ but I
>> didn’t see a ‘accuracy_macro’, i.e.,
>> > (acc.class_1 + acc.class_2 + … + acc.class_n) / n
>> >
>> > Would you discourage its usage (in favor of other metrics in imbalanced
>> class problems) or was it simply not implemented, yet?
>> >
>> > Best,
>> > Sebastian
>> >
>> --
>> > Transform Data into Opportunity.
>> > Accelerate data analysis in your applications with
>> > Intel Data Analytics Acceleration Library.
>> > Click to learn more.
>> > http://makebettercode.com/inteldaal-eval
>> > ___
>> > Sc

Re: [Scikit-learn-general] Average Per-Class Accuracy metric

2016-03-08 Thread Joel Nothman
I've not seen this metric used (references?). Am I right in thinking that
in the binary case, this is identical to accuracy? If I predict all
elements to be the majority class, then adding more minority classes into
the problem increases my score. I'm not sure what this metric is getting at.

On 8 March 2016 at 11:57, Sebastian Raschka  wrote:

> Hi,
>
> I was just wondering why there’s no support for the average per-class
> accuracy in the scorer functions (if I am not overlooking something).
> E.g., we have 'f1_macro', 'f1_micro', 'f1_samples', ‘f1_weighted’ but I
> didn’t see a ‘accuracy_macro’, i.e.,
> (acc.class_1 + acc.class_2 + … + acc.class_n) / n
>
> Would you discourage its usage (in favor of other metrics in imbalanced
> class problems) or was it simply not implemented, yet?
>
> Best,
> Sebastian
>
> --
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://makebettercode.com/inteldaal-eval
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://makebettercode.com/inteldaal-eval___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Problem with parallel processing in randomSearch

2016-02-23 Thread Joel Nothman
What estimator(s) are you searching over? How big is your data?

On 24 February 2016 at 06:15, Stylianos Kampakis <
stylianos.kampa...@gmail.com> wrote:

> Hi everyone,
>
> Sometimes, when I am using random search with n_jobs>1 the processing
> stops. I am on a Mac. I went through some discussions on Github where
> people said it relates joblib and this problem is more common on Mac.
> However, I couldn't find the answer to two questions I have:
>
> 1) Why the processing stops only some times and not every single time?
>
> 2) Have any people managed to find a workaround?
>
> Thank you all in advance,
>
>
> --
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=272487151=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] reproducible error : memory Error in scikit learn's dbscan

2016-02-18 Thread Joel Nothman
If not stack overflow, the appropriate venue for such questions is the
scikit-learn-general mailing list.

The current dbscan implementation is by default not memory efficient,
constructing a full pairwise similarity matrix in the case where
kd/ball-trees cannot be used (e.g. with sparse matrices). This matrix will
consume n^2 floats, perhaps 40GB in your case.

We provide a couple of mechanisms for getting around this:

   - You can precompute a sparse radius neighborhood graph (where missing
   entries are presumed to be out of eps) in a memory-efficient way, and run
   dbscan over this with metric='precomputed' (see the sketch below).
   - You can compress the dataset, either by removing exact duplicates if
   these occur in your data, or by using BIRCH. Then you only have a
   relatively small number of representatives for a large number of points.
   You can then provide a sample_weight when fitting DBSCAN.
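
A minimal sketch of that first option (the helper below is illustrative, eps
and chunk_size are placeholders, and it assumes the radius is selective
enough that the resulting graph is actually sparse):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import pairwise_distances

    def radius_graph(X, eps, metric='l1', chunk_size=500):
        # build a sparse graph of pairwise distances <= eps one block of rows
        # at a time, so only a (chunk_size, n_samples) dense block exists at once
        rows, cols, vals = [], [], []
        for start in range(0, X.shape[0], chunk_size):
            D = pairwise_distances(X[start:start + chunk_size], X, metric=metric)
            r, c = np.nonzero(D <= eps)
            rows.extend(r + start)
            cols.extend(c)
            vals.extend(D[r, c])
        n = X.shape[0]
        return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

    X = sp.rand(100000, 400, density=0.01, format='csr')  # stand-in for your data
    G = radius_graph(X, eps=1.0)
    labels = DBSCAN(eps=1.0, min_samples=5, metric='precomputed').fit_predict(G)

Entries missing from G are treated as being further than eps apart, so only
the retained distances are ever stored.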

I suspect this could be clearer in the documentation, and a pull request is
welcome.

Perhaps default implementation of radius_neighbors and kneighbors in the
brute force case should be more memory-sensitive; or dbscan should return
to / have an option to search for nearest neighbors when needed rather than
in advance, which is the source of the high memory consumption.

Cheers; but please don't email developers personally, and continue
correspondence through the mailing list.

Joel


On 19 February 2016 at 05:53, Lefevre, Augustin  wrote:

> Dear Joel and Robert,
>
>
>
> Sorry for contacting you directly, there may be a more
> formal way of contacting you about this. Anyway, here is my question.
>
>
>
> I tried using dbscan on scikit learn v0.17 today and got a
> memory Error. After reading about it on stackoverflow, I am still puzzled,
> since I am using a compressed sparse row matrix as input, of size 100,000 x
> 400, with density 0.01, which is far from huge (300 MB on disk).
> Apparently, the reason is that I am using the l1 distance as a metric.
> Please find below a sample of code to reproduce the error, and my
> traceback. If you have any suggestions on working around this problem, I
> would be very thankful.
>
>
>
> You can reproduce the memory Error without having to download my own data,
> with the following code :
>
>
>
>
>
> Y=scipy.sparse.rand(10,400,density=.01)
>
> dbscan(Y,eps=10,min_samples=1,metric='l1')
>
> Also, here is the traceback I obtain after running the code : seems like
> initializing a dense matrix of zeros of size O(n^2) is not such a good idea.
>
>
>
> Traceback (most recent call last):
>
>   File
> "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py",
> line 2885, in run_code
>
> exec(code_obj, self.user_global_ns, self.user_ns)
>
>   File "", line 1, in 
>
>
> sklearn.cluster.dbscan(scipy.sparse.rand(10,400,density=.01),metric='manhattan')
>
>   File
> "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\cluster\dbscan_.py",
> line 146, in dbscan
>
> return_distance=False)
>
>   File
> "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\neighbors\base.py",
> line 609, in radius_neighbors
>
> **self.effective_metric_params_)
>
>   File
> "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py",
> line 1207, in pairwise_distances
>
> return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
>
>   File
> "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py",
> line 1054, in _parallel_pairwise
>
> return func(X, Y, **kwds)
>
>   File
> "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py",
> line 516, in manhattan_distances
>
> D = np.zeros((X.shape[0], Y.shape[0]))
>
> MemoryError
>
>
>
>
>
> *Augustin LEFEVRE *| Consultant Senior | Ykems | -
>
> T : +33 1 44 30 - - | M : +33 7 77 97 94 89 | alefe...@ykems.com |
> www.ykems.com
>
>
>

Re: [Scikit-learn-general] BIRCH: merge subclusters

2016-02-07 Thread Joel Nothman
It's not clear *why* you're doing this. The model will automatically
recluster the subclusters after identifying them, as long as you specify
either a number of clusters or a clustering model to the n_clusters
parameter. Can you fit this post-processing into that "final clustering"
framework?

On 8 February 2016 at 07:12, Dženan Softić  wrote:

> Hi,
>
> I am doing some experiments with BIRCH. When BIRCH finish, I would like to 
> merge subclusters based on some criteria. I am doing this by calling 
> "merge_subcluster" method on subcluster that I want to merge with, passing it 
> subcluster object of the second cluster:
>
> cluster1.merge_subcluster(cluster2, self.threshold)
>
> It seems to work, since it updates correctly N, LS, SS (n_samples, 
> linear_sum, squared_sum). What is left is to remove a merged subcluster 
> (cluster2) from the subclusters list and to update centroids:
>
> ind = leaf.subclusters_.index(cluster1) #getting the index to update the 
> centroid
> ind_remove = leaf.subclusters_.index(cluster2) #getting the index of a 
> cluster that needs to be removed because it is merged
> leaf.init_centroids_[ind] = cluster1.centroid_ #update centroid
> leaf.init_sq_norm_[ind] = cluster1.sq_norm_
> leaf.centroids_ = np.delete(leaf.centroids_, ind_remove, 0) #removing the 
> centroid of a cluster2
> self.root_.init_centroids_ = np.delete(self.root_.init_centroids_, 
> ind_remove, 0) #removing the centroid from the root
> leaf.subclusters_.remove(cluster2) #removing the merged cluster (cluster2) itself
>
> I am not sure I am doing it the right way. Any suggestion/comment would be 
> very much appreciated.
>
> Thanks,
> Dzeno
>
>
> --
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=272487151=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-01-26 Thread Joel Nothman
How many distinct words are in your dataset?

On 27 January 2016 at 00:21, Rockenkamm, Christian <
c.rockenk...@stud.uni-goettingen.de> wrote:

> Hello,
>
>
> I have question concerning the Latent Dirichlet Allocation. The results I
> get from using it are a bit confusing.
>
> At first I use about 3000 documents. In the preparation with the
> CountVectorizrt I use the following parameters : max_df=0.95 and
> min_df=0.05.
>
> For the LDA fit I use the bath learning method. For the other parameters I
> have tried many different values. However regardless of which configuration
> I used, I face one common problem. I get topics that are never used in any
> of the docs and said topics all show the same structure
> (topic-word-distribution). I even tried gensim with the same configuration
> as scikit, yet I still encountered this problem. I also tried lowering the
> number of topics in the model, but this did not lead to the expected
> results either. For 100 topics, 20-27 were still affected by this problem,
> for 50 topics, there were still 2-8 of them being affected, depending on
> the parameter setting.
>
> Does anybody have an idea as to what might be causing this problem and how
> to resolve it?
>
>
> Best regards,
>
> Christian Rockenkamm
>
>
> --
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=267308311=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Use of safe functions such as safe_sqr

2016-01-13 Thread Joel Nothman
safe_sqr applies when its operand may be a sparse matrix. In theory this
could be true of coef_, but I don't think this is tested as often as it
might be.
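
For instance (a toy illustration, not code from the issue itself):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.utils import safe_sqr

    print(safe_sqr(np.array([[1., -2., 3.]])))         # dense in, dense out
    print(safe_sqr(sp.csr_matrix([[1., -2., 3.]])))    # sparse in, sparse out

whereas np.linalg.norm does not handle scipy sparse matrices directly.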

But, in general, you should not take what is done in any particular piece
of code to be indicative of best practice. There are often multiple ways to
do things, and while there may be a best choice, the best choice may change
over time as different parts of the codebase are atomically created or
modified. And we'll generally not be *that *nitpicky in reviewing, more
often focusing on readability, functionality and, where appropriate,
efficiency. In short, I don't think you need to worry about this much.

Supporting an arbitrary norm might be appropriate in RFE too.

HTH

- Joel

On 14 January 2016 at 03:40, WENDLINGER Antoine  wrote:

> Hi everyone,
>
> I'm working on issue 2121,
> and have trouble understanding when to use safe methods like safe_sqr. What
> I would want to do here is use np.linalg.norm on the coeff array when it
> is of dimension 2, but it seems it is not the way to go (since it's not
> what is done for example in the RFE feature selector, where the l2 norm of
> the coeff array is used). Am I missing something here ?
>
>
> Regards,
>
> Antoine
>
>
> --
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=267308311=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] figuring out the steps needed to achieve a result

2016-01-10 Thread Joel Nothman
I think you've misunderstood this one, Sören. This sounds like it is a
structured learning problem, where the steps are the "target" of the
learning task, and the result is the input example.

Take, for instance, the natural language processing task of dependency
parsing.

The "result" of some latent syntactic process is "He quickly ate the cat",
and the latent "steps" deriving that result are the dependency arcs:

"He" is dependent to "ate"
"quickly" is dependent to "ate"
"the" is dependent to "cat"
"cat" is dependent to "ate"

Given a sentence, we want to recover such dependency arcs.

This sort of structured learning problem is common in machine learning, but
is generally beyond the scope of scikit-learn. (The related pystruct
project by core scikit-learn dev Andreas Müller deals with this type of
task.) One of the main challenges of structured learning problems is
representing the target "steps" (i.e. the parse here) as, for example, some
previously solved class of probabilistic graphical model, such as linear
chain Conditional Random Fields, as well as representing features of the
observed instance ("result") in a way that the machine learning model can
learn to associate with the target.
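
To make that concrete, here is a rough sketch of how such a problem is posed
in pystruct (written from memory, so treat the exact names as approximate;
X is a list of per-token feature arrays and Y a list of per-token tag arrays,
one entry per sentence):

    import numpy as np
    from pystruct.models import ChainCRF
    from pystruct.learners import FrankWolfeSSVM

    # two toy "sentences", each token described by 3 features and one tag
    X = [np.random.rand(5, 3), np.random.rand(4, 3)]
    Y = [np.array([0, 1, 1, 0, 2]), np.array([2, 0, 1, 0])]

    learner = FrankWolfeSSVM(model=ChainCRF(), C=0.1, max_iter=10)
    learner.fit(X, Y)
    print(learner.predict(X))

The "target" here is a whole sequence of labels per instance, which is what
distinguishes it from plain multiclass classification in scikit-learn.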

HTH

Joel

On 11 January 2016 at 01:38, Sören Wacker  wrote:

> Sounds like a classification problem. You can try to see the steps as
> features and use classification methods e.g. decision tree to train a
> model. But it depends on what "reconstruct the steps" and "result" means.
>
> Sören
>
>
> On 01/09/2016 06:32 AM, Dominic Laflamme wrote:
>
> First I'd like to apologize in advanced for the "noobiness" nature of my
> question...
> I'd like to get some early guidance into which path I should take to help
> me solve a problem using machine learning.
>
> The goal would be to use machine learning to "reconstruct the steps"
> needed to take in order to get to a particular "result".
> My assumption (perhaps misguided at this point) would be that I could feed
> the system with a large amount of complete "examples"  (steps + result) to
> be eventually able to feed it a result in order to get a series of steps.
>
> I know my question is broad, to say the least. But I feel that knowing the
> right type of machine learning concept to apply to such a problem would
> help me started.
> Any insights on which category better suits my problem, and on whether my
> assumptions are flawed?
>
> Any and all comments greatly appreciated.
>
>
>
>
>
>
>
>
>
> --
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=267308311=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=267308311=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Dropping Python 2.6 compatibility

2016-01-04 Thread Joel Nothman
I have many times committed code and had to fix it for Python 2.6.

FWIW: features that I have had to remove include format strings with
implicit arg numbers, set literals, dict comprehensions, perhaps ordered
dicts / counters. We are already clandestinely using argparse in benchmark
code.
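
For concreteness, the kind of 2.7+ idioms meant here:

    s = {1, 2, 3}                        # set literal
    d = {k: k ** 2 for k in s}           # dict comprehension
    msg = "{} out of {}".format(1, 3)    # format string with implicit arg numbers
    from collections import OrderedDict, Counter   # added to the stdlib in 2.7

all of which need rewriting (set([...]), dict(...), "{0} out of {1}",
backports) to keep 2.6 happy.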

Most of these are fairly infrequently needed in our codebase, but some are
particularly useful for succinct doctests. If we made a concerted effort to
rewrite code more Py2.7+-idiomatically we might touch a substantial
quantity, but we won't do that anyway.

If we take a more conservative approach and do not upgrade the requirements
immediately, what is the latest date we would upgrade to ensure officially
supported RHEL is not left behind?

On 5 January 2016 at 06:31, Gael Varoquaux 
wrote:

> On Mon, Jan 04, 2016 at 01:22:12PM -0500, Andreas Mueller wrote:
> > I'm not sure I'm for dropping 2.6 for the sake of dropping 2.6.
>
> I agree. I find the attitude of the post that I mentionned a bit
> annoying: dropping support for the sake of forcing people to move isn't a
> good thing. It should bring something to the project.
>
> > What would we actually gain? There are two fixes in
> > sklearn/utils/fixes.py that we could remove, I think.
>
> I wrote this mail because of:
> https://github.com/scikit-learn/scikit-learn/pull/5889/files#r48728184
>
> > Also: what does dropping 2.6 mean? Writing in the docs that we don't
> > support it any more?
> > Shutting down the continuous integration? Removing the fixes?
>
> All 3, IMHO.
>
> > If we remove the fixes, we force users to upgrade, with little benefit
> > to us, right?
>
> Well, those fixes are a long-term maintenance burden. That said, it seems
> that we have only 2 so far. I suspect also that in many places people are
> avoiding more idiomatic Python patterns that are not supported in
> Python2.6, but would lead to better code (as in the discussion I link to
> above). Finally, it is a burden for contributors, that have to keep in
> mind Python 2.6 compat (and often fail too).
>
> The benefit to us should be better maintenance and easier development. If
> it's not the case, we shouldn't do it :).
>
> Gaël
>
>
>
> --
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Import error for Robust scaler

2015-12-01 Thread Joel Nothman
But check that the version you are using in the appropriate Python instance
is correct. For example:

python -c 'import sklearn; print(sklearn.__version__)'



On 2 December 2015 at 16:24, Sumedh Arani 
wrote:

> Greetings!!
>
> I've used pip install --upgrade scikit-learn and it says the requirement
> is up to date.
> I remember to have upgraded to  version 0.17. Initially I had installed
> 0.16. Anyways thanks for the help!!
> I'll reinstall it!!
> On 2 Dec 2015 09:52, "Andreas Mueller"  wrote:
>
>> You are likely using an old version of scikit-learn that doesn't include
>> RobustScaler.
>> Update your installation.
>>
>>
>> On 11/28/2015 08:18 PM, Sumedh Arani wrote:
>>
>> Dear developers,
>>
>> While trying to fix a bug posted in the issues section on GitHub, I
>> worked on RobustScaler. I tried importing it several
>> times but to no avail. I even tried running plot_robust_scaling.py on my
>> system, which runs on OS X, and it still gave me an import error. When I
>> checked the data.py file that comes with sklearn.preprocessing, the class
>> and the method both exist. I tried several times and several workarounds,
>> but still end up getting inconclusive results. This in
>> turn prevents me from solving a bug which I proactively decided to work
>> upon.
>>
>> Please help me figure out the same.
>>
>> Thank you.
>>
>> Yours sincerely,
>> Sumedh Arani,
>> PES University.
>>
>>
>>
>>
>


Re: [Scikit-learn-general] "Need Review" tag

2015-12-01 Thread Joel Nothman
Labels weren't available for PRs until relatively recently. I think the
status and its meaning would be clearer with such tags.

On 2 December 2015 at 15:16, Andreas Mueller  wrote:

> Yeah that was the intention of [MRG]. Though it might be easier to
> filter by tag.
> No strong opinion though.
>
> On 12/02/2015 12:44 AM, Gael Varoquaux wrote:
> >> How about adding a "Need Review(s?)(er?)" tag?
> > For me, it's the '[MRG]' in the PR name.
> >
> >
> --
> > Go from Idea to Many App Stores Faster with Intel(R) XDK
> > Give your users amazing mobile app experiences with Intel(R) XDK.
> > Use one codebase in this all-in-one HTML5 development environment.
> > Design, debug & build mobile apps & 2D/3D high-impact games for multiple
> OSs.
> > http://pubads.g.doubleclick.net/gampad/clk?id=254741911=/4140
> > ___
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
> --
> Go from Idea to Many App Stores Faster with Intel(R) XDK
> Give your users amazing mobile app experiences with Intel(R) XDK.
> Use one codebase in this all-in-one HTML5 development environment.
> Design, debug & build mobile apps & 2D/3D high-impact games for multiple
> OSs.
> http://pubads.g.doubleclick.net/gampad/clk?id=254741911=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
--
Go from Idea to Many App Stores Faster with Intel(R) XDK
Give your users amazing mobile app experiences with Intel(R) XDK.
Use one codebase in this all-in-one HTML5 development environment.
Design, debug & build mobile apps & 2D/3D high-impact games for multiple OSs.
http://pubads.g.doubleclick.net/gampad/clk?id=254741911=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] classification metrics understanding

2015-11-28 Thread Joel Nothman
If you are treating your Logistic Regression output as binary (i.e. not
using predict_proba or decision_function), could you please provide the
confusion matrix?
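
i.e. something along these lines (a sketch, reusing the `expected` and
`predi` names from your snippet, and taking class 1.0 as the positive class):

    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(expected, predi, labels=[1.0, 2.0])
    tp, fn = cm[0, 0], cm[0, 1]
    fp, tn = cm[1, 0], cm[1, 1]
    print(cm)
    print('sensitivity (recall of class 1.0):', tp / float(tp + fn))
    print('specificity (recall of class 2.0):', tn / float(tn + fp))

That would make it much easier to see where the manually computed 0.76 / 0.65
and the roc_curve output disagree.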

On 26 November 2015 at 05:06, Herbert Schulz  wrote:

> Hi, I think I have some misunderstanding of the classification
> metrics in scikit-learn.
>
>
>
> I have a 2-class problem; the labels are 1.0 and 2.0.
>
>
>              precision    recall  f1-score   support
>
>         1.0       0.86      0.76      0.81       254
>         2.0       0.49      0.65      0.56        91
>
> avg / total       0.76      0.73      0.74       345
>
>
> Specificity: [ 1.  0.35164835  0.]
> recall,tpr,sensitivity  [ 0.  0.24015748  1.]
>
>
> # this part is manually computed  (precision, sens, spec, balanced
> accuracy)
>
> logistic regression 0.86, 0.76, 0.65, 0.7
>
>
>
> The   part with:
>
> Specificity: [ 1.  0.35164835  0.]
> recall,tpr,sensitivity  [ 0.  0.24015748  1.]
>
> are computed with
>
> fpr, tpr, thresholds = metrics.roc_curve(expected, predi,
> pos_label=1)
> print "Specificity:", 1-fpr
> print "recall,tpr,sensitivity",tpr
>
> Why is the specificity (1-fpr) computed as [ 1.
> 0.35164835  0.]
>
> and not 0.65 ?
>
> Same with recall
>
>
>
>
>
>
>
>
>
>
>
> --
> Go from Idea to Many App Stores Faster with Intel(R) XDK
> Give your users amazing mobile app experiences with Intel(R) XDK.
> Use one codebase in this all-in-one HTML5 development environment.
> Design, debug & build mobile apps & 2D/3D high-impact games for multiple
> OSs.
> http://pubads.g.doubleclick.net/gampad/clk?id=254741551=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Nesting of stratified crossvalidation

2015-10-28 Thread Joel Nothman
Changes to support this case have recently been merged into master, and an
example is on its way:
https://github.com/scikit-learn/scikit-learn/issues/5589

I think you should be able to run your code by importing GridSearchCV,
cross_val_score and StratifiedShuffleSplit from the new
sklearn.model_selection, then the code is identical except you drop the `y`
argument from StratifiedShuffleSplit's constructor (it's a different class,
actually).

Please do try it out!
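
Roughly like this (a sketch only, assuming the new class ends up taking the
number of splits as its first argument; X, y and the grid are as in your code
below):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import (GridSearchCV, StratifiedShuffleSplit,
                                         cross_val_score)

    LogRegOptimalC = GridSearchCV(
        estimator=LogisticRegression(),
        cv=StratifiedShuffleSplit(3, test_size=0.5, random_state=0),  # no y here
        param_grid={'C': np.logspace(-3, 3, 7)},
    )
    print(cross_val_score(LogRegOptimalC, X, y, cv=5).mean())

Because the splitter no longer holds a reference to y, the outer
cross_val_score can hand each inner search only its own training fold's
labels, which is what makes the nesting work.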

On 29 October 2015 at 05:00, Christoph Sawade <
christoph.saw...@googlemail.com> wrote:

> Hey there!
>
> A general purpose in machine learning when training a model is to estimate
> also the performance. This is often done via cross validation. In order to
> tune also hyperparameters one might want to nest the crossvalidation loops
> into another. The sklearn framework makes that very easy. However,
> sometimes it is necessary to stratify the folds to ensure some constraints
> (e.g., roughly some proportion of the target label in each fold). These
> splitters are also provided (e.g., StratifiedShuffleSplit) but do not work
> when they are nested:
>
> import numpy as np
> from sklearn.grid_search import GridSearchCV
> from sklearn.cross_validation import StratifiedShuffleSplit
> from sklearn.linear_model import LogisticRegression
> from sklearn.cross_validation import cross_val_score
>
> # Number of samples per component
> n_samples = 1000
>
> # Generate random sample, two classes
> X = np.r_[
> np.dot(np.random.randn(n_samples, 2), np.array([[0., -0.1], [1.7,
> .4]])),
> np.dot(np.random.randn(n_samples, 2), np.array([[1.0, 0.0], [0.0,
> 1.0]])) + np.array([-2, 2])
> ]
> y = np.concatenate([np.ones(n_samples), -np.ones(n_samples)])
>
> # Fit model
> LogRegOptimalC = GridSearchCV(
> estimator=LogisticRegression(),
> cv = StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0),
> param_grid={
> 'C': np.logspace(-3, 3, 7)
> }
> )
> print cross_val_score(LogRegOptimalC, X, y, cv=5).mean()
>
> The problem seems to be that the array reflecting the splitting criterion
> (here the target y) is not split for the inner folds. Is there some way
> to tackle that or are there already initiatives dealing with it?
>
> Thx Christoph
>
>
> --
>
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] BIRCH algorithm global step

2015-10-14 Thread Joel Nothman
Yes, simply set n_clusters=KMeans(). In fact, it's a pity we don't have an
example of this feature in the examples gallery and contributions are
welcome!
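
A minimal sketch (toy data; the threshold and cluster count are placeholders):

    from sklearn.cluster import Birch, KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)
    brc = Birch(threshold=0.5, n_clusters=KMeans(n_clusters=5, random_state=0))
    labels = brc.fit_predict(X)

The CF-tree's subcluster centroids are handed to whatever estimator you pass
as n_clusters, so you can first inspect the subclusters to estimate K and
then plug a KMeans with that K into the final step.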

On 14 October 2015 at 23:27, Dženan Softić  wrote:

> Hi,
>
> I would like to change the global step of BIRCH algorithm to be performed
> using K-means instead of AgglomerativeClustering. Is something like that
> possible?
>
> My goal is to use BIRCH for a streaming data and try to improve output
> quality. The idea is to use BIRCH subclusters to estimate the number of
> clusters K for K-means (e.g. Gap statistics), and then run K-means as a
> final step.
>
> Thank you.
>
> Best,
> Dzenan
>
>
> --
>
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] How to optimize a random forest for out of sample prediction

2015-10-07 Thread Joel Nothman
RFECV will select features based on scores on a number of validation sets,
as selected by its cv parameter. As opposed to that StackOverflow query,
RFECV should now support RandomForest and its feature_importances_
attribute.
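
If you specifically want the selection scored on your separate validation
set, one sketch (with hypothetical X_train/X_valid/y_train/y_valid names for
your existing splits) is to concatenate the two sets and use PredefinedSplit
so that only the validation portion is ever used for scoring:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV
    from sklearn.cross_validation import PredefinedSplit

    X = np.vstack([X_train, X_valid])
    y = np.concatenate([y_train, y_valid])
    # -1 = always kept in training, 0 = the single held-out "fold"
    test_fold = np.r_[-np.ones(len(y_train), dtype=int),
                      np.zeros(len(y_valid), dtype=int)]

    selector = RFECV(RandomForestClassifier(n_estimators=200, random_state=0),
                     step=1, cv=PredefinedSplit(test_fold), scoring='roc_auc')
    selector.fit(X, y)
    print(selector.support_)   # boolean mask of the retained features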

On 7 October 2015 at 18:16, Raphael C  wrote:

> I have a training set, a validation  set and a test set.  I build a
> random forest using RandomForestClassifier on the training set.
> However, I would like to tune it by scoring on  the validation  set.
> I find that the cross-validation score on  the training set is a lot
> better than the score on the validation set.
>
> To improve this I would like to do [RFE][1] to do feature selection to
> deal with overfitting.  I have tried removing features by hand and in
> some cases it does improve the score on the validation set.  This
> [question and answer][2] show how to use RFE with
> RandomForestClassifier but I don't understand how to do this when  you
> score on a separate validation set.
>
>  Can this sort of feature selection be done using RFE or some other
> scikit learn method?
>
>
>   [1]:
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
>   [2]:
> https://stackoverflow.com/questions/24123498/recursive-feature-elimination-on-random-forest-using-scikit-learn
>
> Raphael
>
>
> --
> Full-scale, agent-less Infrastructure Monitoring from a single dashboard
> Integrate with 40+ ManageEngine ITSM Solutions for complete visibility
> Physical-Virtual-Cloud Infrastructure monitoring from one console
> Real user monitoring with APM Insights and performance trend reports
> Learn More
> http://pubads.g.doubleclick.net/gampad/clk?id=247754911=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
--
Full-scale, agent-less Infrastructure Monitoring from a single dashboard
Integrate with 40+ ManageEngine ITSM Solutions for complete visibility
Physical-Virtual-Cloud Infrastructure monitoring from one console
Real user monitoring with APM Insights and performance trend reports 
Learn More http://pubads.g.doubleclick.net/gampad/clk?id=247754911=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Comparing multiple Machine Learning models by ROC

2015-10-06 Thread Joel Nothman
See http://scikit-learn.org/stable/auto_examples/plot_roc.html
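
In short, something like this (a sketch; y_true is the original label column
and old_scores / new_scores are per-sample scores from each model -- note you
need continuous scores such as decision_function or predict_proba output, not
just the 0/1 predicted labels, for a meaningful curve):

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    for name, scores in [('old model', old_scores), ('new model', new_scores)]:
        fpr, tpr, _ = roc_curve(y_true, scores)
        plt.plot(fpr, tpr, label='%s (AUC = %0.2f)' % (name, auc(fpr, tpr)))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend(loc='lower right')
    plt.show()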

On 6 October 2015 at 17:56, aravind ramesh  wrote:

> Dear All,
>
> I want to compare my new svm model generated with already published model.
>
> I generated required features and got the prediction labels for both
> models.
>
> I have data in the following format:
>
> Data-PointOriginalOld-ModelNew-Model
> A0AV  11 1
> A0AX  10 1
> .
> .
> .
> Z0AV010
>
> So far I compared models using metrics like specificity, MCC, and other
> standard metrics. I want to get a ROC, plot for comparing two models.
>
> --Varavind
>
>
> --
>
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] [New feature] sklearn to PMML

2015-10-01 Thread Joel Nothman
Hi Mira,

I think the community is very interested in this work, but you might
consider collaborating with https://github.com/alex-pirozhenko/sklearn-pmml.
Its support for models is limited to trees and their ensembles, but it also
includes a test harness (
https://github.com/alex-pirozhenko/sklearn-pmml/blob/master/sklearn_pmml/convert/test/jpmml_test.py
).

Thanks

On 1 October 2015 at 19:55, Mira Epheldel  wrote:

> Hello,
>
> I've started working on a project that exports sklearn models to PMML
> format.
> Since I'm new to open source etc, I'm not sure if I should post to the
> mailing list
> or not about the kind of question I have, but anyway here I am.
>
> First of all, I'm not sure if this project is interesting enough.
> Would it be an appreciated
> addition to scikit-learn ? Should I do a pull request ?
>
> Secondly, I'm not sure how to add tests. Until now, I've evaluated my
> generated
> files using jpmml. How could I validate my pmml files without using another
> big tool like jpmml ?
>
> Sorry if I shouldn't post here, and please point out any mistakes :
> English is not
> my primary language.
>
> Thanks in advance for any reply !
>
> Mira
>
>
> --
>
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] GridSearchCV using too many cores?

2015-09-24 Thread Joel Nothman
In terms of memory: I gather joblib.parallel is meant to automatically
memmap large arrays (>100MB). However, then each subprocess will extract a
non-contiguous set of samples from the data for training under a
cross-validation regime. Would I be right in thinking that's where the
memory blowout comes from? When there's risk of such an expensive indexing,
should we be using sample_weight (where the base estimator supports it) to
select portions of the training data without copy?
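
i.e. something like this (a sketch; train_idx would come from the CV splitter
and estimator is the cloned estimator inside the worker):

    import numpy as np

    w = np.zeros(X.shape[0])
    w[train_idx] = 1.0
    # X stays the shared memmap; no fancy-indexing copy is made
    estimator.fit(X, y, sample_weight=w)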

On 24 September 2015 at 23:21, Dale Smith  wrote:

> My experiences with parallel GridSearchCV and RFECV have not been
> pleasant. Memory usage was a huge problem, as apparently each job got a
> copy of the data with an out-of-the box scikit-learn installation using
> Anaconda 3. No matter how I set pre_dispatch, I could not get n_jobs = 2 to
> work, even with no one else using a 100 gb 24 core Windows box.
>
>
>
> I can create some reproducible code if anyone has time to work on it.
>
>
>
>
> *Dale Smith, Ph.D.*
> Data Scientist
>
> * d.* 404.495.7220 x 4008   *f.* 404.795.7221
> Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta,
> GA 30305
>
>
>
>
> *From:* Clyde Fare [mailto:clyde.f...@gmail.com]
> *Sent:* Thursday, September 24, 2015 8:38 AM
> *To:* scikit-learn-general@lists.sourceforge.net
> *Subject:* [Scikit-learn-general] GridSearchCV using too many cores?
>
>
>
> Hi,
>
>
>
> I'm trying to run GridSearchCV on a computational cluster but my jobs keep
> failing with an error from the queuing system claiming I'm using too many
> cores.
>
>
>
> If I set n_jobs equal 1, then the job doesn't fail but if it's more than
> one, no matter what number it is the job fails.
>
>
>
> In the example below I've set n_jobs to 6 and pre_dispatch to 12, and
> asked for 8 processors from the queue. I got the following error after ~10
> minutes: "PBS: job killed: ncpus 19.73 exceeded limit 8 (sum)"
>
>
>
> I've tried playing around with pre_dispatch but it makes no difference. There
> will be other people running calculations on these nodes, so might there be
> some kind of interference between GridSearchCV and the other jobs?
>
>
>
> Anyone come across anything like this before?
>
>
>
> Cheers
>
>
>
> Clyde
>
>
>
>
>
> import dill
>
> import numpy as np
>
>
>
> from sklearn.kernel_ridge import KernelRidge
>
> from sklearn.grid_search import GridSearchCV
>
>
>
> label='test_grdsrch3'
>
> X_train = np.random.rand(971,276)
>
> y_train = np.random.rand(971)
>
>
>
> kr = GridSearchCV(KernelRidge(), cv=10,
>
>   param_grid={"kernel": ['rbf', 'laplacian'],
>
>   "alpha": [2**i for i in
> np.arange(-40,-5,0.5)], #alpha=lambda
>
>   "gamma": [1/(2.**(2*i)) for i in
> np.arange(5,18,0.5)]},   #gamma = 1/sigma^2
>
>   pre_dispatch=12,
>
>   n_jobs=6)
>
>
>
> kr.fit(X_train, y_train)
>
>
>
> with open(label+'.pkl','w') as data_f:
>
> dill.dump(kr, data_f)
>
>
>
>
> --
> Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
> Get real-time metrics from all of your servers, apps and tools
> in one place.
> SourceForge users - Click here to start your Free Trial of Datadog now!
> http://pubads.g.doubleclick.net/gampad/clk?id=241902991=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
--
Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
Get real-time metrics from all of your servers, apps and tools
in one place.
SourceForge users - Click here to start your Free Trial of Datadog now!
http://pubads.g.doubleclick.net/gampad/clk?id=241902991=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Preparing the 0.17 release

2015-09-21 Thread Joel Nothman
And anyone looking for a small contribution to make could take on
https://github.com/scikit-learn/scikit-learn/issues/5281

On 22 September 2015 at 10:24, Andreas Mueller  wrote:

> The list is currently pretty long:
>
> https://github.com/scikit-learn/scikit-learn/issues?q=is%3Aopen+is%3Aissue+milestone%3A0.17
>
> I'd like to clean up the run of the test-suite. There are currently many
> deprecation warnings:
> https://github.com/scikit-learn/scikit-learn/issues/5089
> We should remove the multi-core support on LDA if it usually makes the
> runtime longer: https://github.com/scikit-learn/scikit-learn/issues/5118
> There are random segfaults in sparse PCA:
> https://github.com/scikit-learn/scikit-learn/issues/5013
> We might want to put a warning into PLS if our results are still not
> consistent with reference implementations:
> https://github.com/scikit-learn/scikit-learn/issues/2821
> We use 3.5 deprecated inspection:
> https://github.com/scikit-learn/scikit-learn/issues/5281
> There seem to be bugs in SGD
> https://github.com/scikit-learn/scikit-learn/issues/5246
> and Naive Bayes https://github.com/scikit-learn/scikit-learn/issues/5136
> that could be easily fixed.
>
> There are some near-ready PRs:
>
> https://github.com/scikit-learn/scikit-learn/pulls?q=is%3Aopen+is%3Apr+milestone%3A0.17
> The pipeline inverse_transform 1d thing should be deprecated:
> https://github.com/scikit-learn/scikit-learn/pull/5065 [this needs a
> two-line fix]
> We should value error if someone gives floats to a classifier:
> https://github.com/scikit-learn/scikit-learn/pull/5084
>
> I'll try to get to some of the easy fixes / bugs this week, but I'm
> still catching up on email.
>
> Andy
>
> On 09/21/2015 04:43 AM, Gilles Louppe wrote:
> > Hi Olivier,
> >
> > It seems the 3 PRs you mentioned are now closed/merged. Are there
> > other blocking PRs you need us to look at before freezing for the
> > release?
> >
> > Cheers,
> > Gilles
> >
> > On 4 September 2015 at 12:16, Olivier Grisel 
> wrote:
> >> Hi all,
> >>
> >> It's been a while since we have not made a release. I plan to cut the
> >> 0.17.X branch to prepare a first beta next week. Then 2 weeks after
> >> that we can release either a new beta or the final 0.17.0 based on
> >> feedback and if there is no identified blocker or major regression
> >> from 0.16.1.
> >>
> >> I would like the following to get in for the release to make it easier
> >> for people who experiment the multiprocessing crash under OSX (they
> >> would just have to use Python 3.4 to get rid of the crash):
> >>
> >> https://github.com/scikit-learn/scikit-learn/pull/5199
> >>
> >> Andreas also raised that as 0.17 will be the first release to include
> >> Latent Dirichlet Allocation, we should rename the sklearn.lda package
> >> and models to emphasize they are about Linear Discriminant Analysis to
> >> avoid confusion. This was attempted in
> >> https://github.com/scikit-learn/scikit-learn/pull/4421 but work is
> >> needed to rebase the renaming on the current state of master. I can do
> >> that if nobody does it in the mean time.
> >>
> >> There are other maintenance PRs that I would like to see part of this
> >> release such as:
> >>
> >> https://github.com/scikit-learn/scikit-learn/pull/5152
> >>
> >> Small bug fixes tagged with
> >> https://github.com/scikit-learn/scikit-learn/milestones/0.17 can be
> >> back-ported into the future 0.17.X branch to make it into the final
> >> 0.17.0 release.
> >>
> >> Le me know if you have any comment on this release process.
> >>
> >> Best,
> >>
> >> --
> >> Olivier
> >> http://twitter.com/ogrisel - http://github.com/ogrisel
> >>
> >>
> --
> >> ___
> >> Scikit-learn-general mailing list
> >> Scikit-learn-general@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
> --
> > ___
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
> --
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Common tests for functions vs deprecating functions

2015-09-10 Thread Joel Nothman
A reflective response without a clear opinion:

I'll admit to rarely-if-ever using function versions, and suspect they
frequently have limited utility over the estimator interface. Occasionally
they even wrap the estimator interface, so they're not going to provide the
efficiency advantages Gaël talks about.

While "People writing algorithms are not used to think in terms of
objects.", such people still know how to wrap an object to make it look
like a function. Seeing as there has been no consistent approach to
developing functional learners, I think that there are many functions that
effectively provide (data, estimator parameters) -> model attributes. This
is clearly a nice functional abstraction, but in truth, only those
functions that accept more/different parameters from their estimator
cousins, for instance only solve part of the learning problem, are
distinctively useful.

From an API development perspective, functions that return model parameters
can be frustrating; they land up accumulating return_something flags in
order to fit changing/expanding output needs, while estimators act as a
namespace where diagnostic output can be dumped, usually at very little
cost. As with output, users may expect function input (i.e. argument
ordering) to be more fixed, in comparison to estimators where separating
data from parameters means it is more natural to use kwargs in
construction, or simply use set_params or attribute setting. So from the
perspective of version compatibility the function versions are harder to
maintain, and we've not yet really ascertained their benefit.

Their presence in the public API often duplicates the cost of maintaining
docstrings. But we could fairly disregard this issue, in part because even
when private we'd appreciate clear and explicit parameter/returns
documentation.

@Andy, the documentation implies these are for advanced use by (generally)
not referencing them in the narrative documentation. I think that's a fair
way to keep them only for the sight of those who dig deeper, but this
implicitness leaves some maintenance risks. While I don't think a note in
the docstring of each function version is the right solution, "See Also"
could be used to indicate the relationship. Additionally, or alternatively,
we could split classes.rst into "Estimators", "Low-level learning
functions" and "Utilities".

On 11 September 2015 at 01:21, Andreas Mueller  wrote:

>
>
> On 09/10/2015 10:08 AM, Gael Varoquaux wrote:
> >> >And your statement "they are for advanced users" is not manifested in
> >> >the API or documentation.
> > OK, but that's a bug of the documentation.
> So you suggest adding to the docstring of every function "this is for
> advanced users only"?
> That is kind of like making them private, only that private is much more
> explicit.
> >> >There is no reason a user would expect one to act different from the
> other.
> > Users who don't code aglorithms probably don't have any reason to be
> > using them.
> >
> Well the reason would be they find them in the API docs and they don't
> know whether to use the class or the function.
>
> It is fair to summarize your opinion as
> "functions don't need input validation or a consistent interface, the
> documentation should make clear they
> are for advanced users"?
>
> FWIW many of the functions do input validation at the moment, it is just
> inconsistent.
>
>
> --
> Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
> Get real-time metrics from all of your servers, apps and tools
> in one place.
> SourceForge users - Click here to start your Free Trial of Datadog now!
> http://pubads.g.doubleclick.net/gampad/clk?id=241902991=/4140
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
--
Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
Get real-time metrics from all of your servers, apps and tools
in one place.
SourceForge users - Click here to start your Free Trial of Datadog now!
http://pubads.g.doubleclick.net/gampad/clk?id=241902991=/4140___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Turning on sample weights for linear_model.LogisticRegression

2015-08-29 Thread Joel Nothman
new and small PR is best.

On 29 August 2015 at 03:35, Valentin Stolbunov valentin.stolbu...@gmail.com
 wrote:

 Sounds good. Does anyone happen to know of any PRs that are related and
 close to being accepted? Or do you think a new PR is the best option in
 this case?

 On Thu, Aug 27, 2015 at 6:33 PM, Joel Nothman joel.noth...@gmail.com
 wrote:

 +1

 On 28 August 2015 at 04:23, Andreas Mueller t3k...@gmail.com wrote:

 I think it would be fine to enable it now without support in all solvers.


 On 8/27/2015 11:29 AM, Valentin Stolbunov wrote:

 Joel, I see you've done some work in that PR. Is an additional review
 all that's needed there? Looks like changes in Logistic Regression CV broke
 the original contribution and it has since stalled (over 1 year ago).

 I guess the big question is: what is the best way to get sample weights
 in LR? Would it be to wait for progress in that PR and have weights for all
 solvers, or simply enable them in the other two solvers via the rough
 steps I outlined earlier?

 On Wed, Aug 26, 2015 at 9:59 PM, Andy t3k...@gmail.com wrote:

 On 08/26/2015 09:29 PM, Joel Nothman wrote:
  I agree. I suspect this was an unintentional omission, in fact.
 
  Apart from which, sample_weight support in liblinear could be merged
  from https://github.com/scikit-learn/scikit-learn/pull/2784 which is
  dormant, and merely needs some core contributors to show interest in
  merging it...
 
 merely ;)




Re: [Scikit-learn-general] RFCC: duecredit citations for sklearn (and anything else you like ; ) )

2015-08-29 Thread Joel Nothman
A "Cite me with duecredit" sash on the opposite corner to "Fork me on
github"? ;)

On 30 August 2015 at 14:36, Mathieu Blondel math...@mblondel.org wrote:



 On Sun, Aug 30, 2015 at 7:27 AM, Yaroslav Halchenko s...@onerussian.com
 wrote:


 As long as installation is straightforward, I think it should be a minor
 hurdle. It will be by default (Recommends) installed with scikit-learn,
 pymvpa,
 and any other related package I am maintaining in Debian/Ubuntu.  It is
 already
 available from pypi although installation there could be a bit
 problematic due
 to external depends indeed.  We will look into minimizing possibility for
 issues and will also look into packaging within conda universe.  Happen
 it is a
 no brainer to have it installed -- installation of an external tool,
 especially
 if recommended by the project, should not be a big issue.


 Even if installation is easy, people also have to know that the project
 even exists.

  For this reason, I think the ideal
 solution should be web based. This could for example take the form
 of a
 sphinx plugin for easily integrating with the project's
 documentation. We
 could maintain a BibTeX file and reference BibTeX entries from
 within the
 documentation. The sphinx plugin would make it easier to find
 relevant
 citations from various places in the documentation (class reference,
 user
 guide).

 Although sound idea on its own, even if complementary to duecredit, it
 IMHO would not be as productive.  Sure thing some determined users
 will look up references for pieces they used, but not exhaustively and
 not for core functions which they might not even have known they called
 (indirectly).


 Indeed, both approaches are complementary. Even if duecredit succeeds, I
 think it would still be nice to make it easier to find relevant citations
 from the online documentation. Ideally, the citation annotations would be
 reused by both duecredit and the sphinx plugin.


 That is exactly what duecredit tries to address -- automate that
 collection of references.


 We also need to give an idea to users as to *why* they should cite a
 certain paper. For example, cite paper [...] because it is the solver used
 by LinearSVC(dual=True) for solving the SVM dual objective.



 One difficulty, though, is that the relevant citations in
 scikit-learn
 estimators often depends on constructor options. For example, in
 LinearSVC, the paper to cite is not the same whether we use
 dual=True or
 dual=False, penalty=l1 or penalty=l2, etc.

 That is already partially handled, e.g.


 https://github.com/duecredit/duecredit/blob/master/duecredit/injections/mod_scipy.py#L134
 injector.add('scipy.cluster.hierarchy', 'linkage', BibTeX("""
 @article{ward1963hierarchical,
  title={Hierarchical grouping to optimize an objective function},
  author={Ward Jr, Joe H},
  journal={Journal of the American statistical association},
  volume={58},
  number={301},
  pages={236--244},
  year={1963},
  publisher={Taylor \& Francis}
 }"""),
  conditions={(1, 'method'): {'ward'}},
  description="Ward hierarchical clustering",
  min_version='0.4.3',
  tags=['reference'])

 says to reference that publication only if method='ward' to the linkage
 call.
 Similarly I can decorate __init__. But thus partially -- since I don't
 want to
 cite merely if __init__ was called, I would like to cite only if actual
 computation has happened, so it should also be conditioned on some
 methods of
 the class being called...  We will look  into supporting that.


 Ideally the citation annotations should be as concise as possible. For the
 BibTeX part, I would prefer to reference an external BibTeX file. For
 example, the file could sit next to __ini__.py at the project root.

 Mathieu


 --

 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] issue with pipeline always giving same results

2015-08-27 Thread Joel Nothman
The randomisation only changes the order of the data, not the set of data
points.
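
If the aim is for each run to evaluate a genuinely different subset of
documents, one option (a sketch, not part of the original reply; the
categories and split size are illustrative) is to pool the fetched documents
and re-split them with a fresh random_state:

from sklearn.datasets import fetch_20newsgroups
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in >= 0.18

cats = ['sci.med', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=cats)
test = fetch_20newsgroups(subset='test', categories=cats)

docs = train.data + test.data
targets = list(train.target) + list(test.target)
X_train, X_test, y_train, y_test = train_test_split(
    docs, targets, test_size=0.25, random_state=42)

Varying random_state here changes which documents land in the test set, so
the measured accuracy will differ between runs.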

On 27 August 2015 at 22:44, Andrew Howe ahow...@gmail.com wrote:

 I'm working through the tutorial, and also experimenting kind of on my
 own.  I'm on the text analysis example, and am curious about the relative
 merits of analyzing by word frequency, relative frequency, and adjusted
 relative frequency.  Using the 20 newsgroups data, I've built a set of
 pipelines within a cross_validation loop; the important part of the code is
 here:

 # get the data
 nw = dat.datetime.now()
 rndstat = nw.hour*3600 + nw.minute*60 + nw.second
 twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                   random_state=rndstat, shuffle=True,
                                   download_if_missing=False)
 twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                  random_state=rndstat, shuffle=True,
                                  download_if_missing=False)

 # first with raw counts
 text_clf = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
 text_clf.fit(twenty_train.data, twenty_train.target)
 pred = text_clf.predict(twenty_test.data)
 test_ccrs[mccnt, 0] = sum(pred == twenty_test.target) / len(twenty_test.target)

 The issue is that everytime I run this, though I've confirmed the data
 sampled is different, the value in test_ccrs is *always* the same.  Am I
 missing something?

 Thanks!
 Andrew

 ~~~
 J. Andrew Howe, PhD
 Editor-in-Chief, European Journal of Mathematical Sciences
 Executive Editor, European Journal of Pure and Applied Mathematics
 www.andrewhowe.com
 http://www.linkedin.com/in/ahowe42
 https://www.researchgate.net/profile/John_Howe12/
 I live to learn, so I can learn to live. - me
 ~~~


 --

 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Turning on sample weights for linear_model.LogisticRegression

2015-08-27 Thread Joel Nothman
+1

On 28 August 2015 at 04:23, Andreas Mueller t3k...@gmail.com wrote:

 I think it would be fine to enable it now without support in all solvers.


 On 8/27/2015 11:29 AM, Valentin Stolbunov wrote:

 Joel, I see you've done some work in that PR. Is an additional review all
 that's needed there? Looks like changes in Logistic Regression CV broke the
 original contribution and it has since stalled (over 1 year ago).

 I guess the big question is: what is the best way to get sample weights in
 LR? Would it be to wait for progress in that PR and have weights for all
 solvers, or simply enable them in the other two solvers via the rough
 steps I outlined earlier?

 On Wed, Aug 26, 2015 at 9:59 PM, Andy t3k...@gmail.com wrote:

 On 08/26/2015 09:29 PM, Joel Nothman wrote:
  I agree. I suspect this was an unintentional omission, in fact.
 
  Apart from which, sample_weight support in liblinear could be merged
  from https://github.com/scikit-learn/scikit-learn/pull/2784 which is
  dormant, and merely needs some core contributors to show interest in
  merging it...
 
 merely ;)


 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




 --



 ___
 Scikit-learn-general mailing 
 listScikit-learn-general@lists.sourceforge.nethttps://lists.sourceforge.net/lists/listinfo/scikit-learn-general




 --

 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Turning on sample weights for linear_model.LogisticRegression

2015-08-26 Thread Joel Nothman
I agree. I suspect this was an unintentional omission, in fact.

Apart from which, sample_weight support in liblinear could be merged from
https://github.com/scikit-learn/scikit-learn/pull/2784 which is dormant,
and merely needs some core contributors to show interest in merging it...
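
For readers following the thread, a minimal runnable sketch (synthetic data,
purely illustrative) of the user-facing behaviour being proposed; as Valentin
notes below, SGDClassifier with loss='log' already accepts per-sample weights:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)
weights = rng.uniform(0.1, 2.0, size=200)

sgd = SGDClassifier(loss='log', random_state=0)
sgd.fit(X, y, sample_weight=weights)

# The proposal is to support the same call for the newton-cg / lbfgs solvers:
#     LogisticRegression(solver='lbfgs').fit(X, y, sample_weight=weights)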

On 27 August 2015 at 10:15, Valentin Stolbunov valentin.stolbu...@gmail.com
 wrote:

 Hello everyone,

 I noticed that two of the three solvers in the logistic regression module
 (newton-cg and lbfgs) accept sample weights, but this feature is hidden
 away from users by not recognizing sample_weight as a parameter in .fit().
 Instead, sample_weight is set to ones (line 555 of logistic.py). To the
 best of my knowledge this is because the default solver (liblinear) does
 not support them?

 Could we instead allow sample_weight as a parameter (default None) and set
 them to ones only if the chosen solver is liblinear (with appropriate
 documentation notes - similar to the way the L1 penalty is supported only
 by liblinear)?

 I realize that SGDClassifier's .fit() accepts sample weights and the loss
 can be set to 'log', however this isn't exactly the same.

 What do you think?

 Valentin


 --

 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Persisting models

2015-08-20 Thread Joel Nothman
I suspect supporting PMML import is a separate and low-priority project.
Higher priority is support for transformers (in pipelines / feature
unions), other predictors, and tests that verify the model against an
existing PMML predictor.

On 21 August 2015 at 01:37, Dale Smith dsm...@nexidia.com wrote:

 Package sklearn_pmml appeared on github:

 https://github.com/alex-pirozhenko/sklearn-pmml

 It's still in the early stages. I have yet to experiment with it, and I
 don't think it supports pmml import.

 Dale Smith, Ph.D.
 Data Scientist
 ​


 d. 404.495.7220 x 4008   f. 404.795.7221
 Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta,
 GA 30305




 -Original Message-
 From: Alexandre Gramfort [mailto:alexandre.gramf...@m4x.org]
 Sent: Thursday, August 20, 2015 4:28 AM
 To: scikit-learn-general
 Subject: Re: [Scikit-learn-general] Persisting models

 hi,

  Agreed—this is exactly the type of use case I want to support.
  Pickling won't work here, but using HDF5 like MNE does would probably
  be close to ideal (thanks to Chris Holdgraf for the
  heads-up):
 
  https://github.com/mne-tools/mne-python/blob/master/mne/_hdf5.py

 For your info Eric Larson has put the file in a separate project to make
 it easier to improve and reuse.

 https://github.com/h5io/h5io

 Alex


 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Persisting models

2015-08-19 Thread Joel Nothman
Frequently the suggestion of supporting PMML or similar is raised, but it's
not clear whether such models would be importable in to scikit-learn, or
how to translate scikit-learn transformation pipelines into its notation
without going mad, etc. Still, even a library of exporters for individual
components would be welcome, IMO, if someone wanted to construct it.
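
To make "a library of exporters for individual components" concrete, here is
a toy sketch (JSON standing in for PMML; this is not an existing scikit-learn
or sklearn-pmml API) of exporting one fitted linear model and rebuilding a
predictor from the artefact:

import json
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.RandomState(0).randn(100, 4)
y = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

# export: nothing scikit-learn-specific survives in the artefact
with open('logreg.json', 'w') as f:
    json.dump({'coef': clf.coef_.tolist(),
               'intercept': clf.intercept_.tolist(),
               'classes': clf.classes_.tolist()}, f)

# import: restore the fitted attributes onto a fresh estimator
with open('logreg.json') as f:
    params = json.load(f)
clf2 = LogisticRegression()
clf2.coef_ = np.array(params['coef'])
clf2.intercept_ = np.array(params['intercept'])
clf2.classes_ = np.array(params['classes'])
assert (clf2.predict(X) == clf.predict(X)).all()

A real exporter would also need to record the estimator class, its
constructor parameters and the library version, which is where the format
design gets harder.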

On 19 August 2015 at 15:08, Sebastian Raschka se.rasc...@gmail.com wrote:

 Oh wow, thanks for the link, I just skimmed over the code, but this is an
 interesting idea snd looks like the sort of thing that would make my life
 easier in future. I will dig into that! That’s great, thanks!


  On Aug 19, 2015, at 12:58 AM, Stefan van der Walt stef...@berkeley.edu
 wrote:
 
  On 2015-08-18 21:37:41, Sebastian Raschka se.rasc...@gmail.com
  wrote:
  I think for “simple” linear models, it would be not a bad idea
  to save the weight coefficients in a log file or so. Here, I
  think that your model is really not that dependent on the
  changes in the scikit-learn code base (for example, imagine that
  you trained a model 10 years ago and published the results in a
  research paper, and today, someone asked you about this
  model). I mean, you know all about how a logistic regression,
  SVM etc. works, in the worst case you just use those weights to
  make the prediction on new data — I think in a typical “model
  persistence” case you don’t “update” your model anyways so
  “efficiency” would not be that big of a deal in a typical “worst
  case use case”.
 
  Agreed—this is exactly the type of use case I want to support.
  Pickling won't work here, but using HDF5 like MNE does would
  probably be close to ideal (thanks to Chris Holdgraf for the
  heads-up):
 
  https://github.com/mne-tools/mne-python/blob/master/mne/_hdf5.py
 
  Stéfan
 
 
 --
  ___
  Scikit-learn-general mailing list
  Scikit-learn-general@lists.sourceforge.net
  https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Persisting models

2015-08-19 Thread Joel Nothman
See https://github.com/scikit-learn/scikit-learn/issues/1596

On 19 August 2015 at 16:35, Joel Nothman joel.noth...@gmail.com wrote:

 Frequently the suggestion of supporting PMML or similar is raised, but
 it's not clear whether such models would be importable in to scikit-learn,
 or how to translate scikit-learn transformation pipelines into its notation
 without going mad, etc. Still, even a library of exporters for individual
 components would be welcome, IMO, if someone wanted to construct it.

 On 19 August 2015 at 15:08, Sebastian Raschka se.rasc...@gmail.com
 wrote:

 Oh wow, thanks for the link, I just skimmed over the code, but this is an
 interesting idea snd looks like the sort of thing that would make my life
 easier in future. I will dig into that! That’s great, thanks!


  On Aug 19, 2015, at 12:58 AM, Stefan van der Walt stef...@berkeley.edu
 wrote:
 
  On 2015-08-18 21:37:41, Sebastian Raschka se.rasc...@gmail.com
  wrote:
  I think for “simple” linear models, it would be not a bad idea
  to save the weight coefficients in a log file or so. Here, I
  think that your model is really not that dependent on the
  changes in the scikit-learn code base (for example, imagine that
  you trained a model 10 years ago and published the results in a
  research paper, and today, someone asked you about this
  model). I mean, you know all about how a logistic regression,
  SVM etc. works, in the worst case you just use those weights to
  make the prediction on new data — I think in a typical “model
  persistence” case you don’t “update” your model anyways so
  “efficiency” would not be that big of a deal in a typical “worst
  case use case”.
 
  Agreed—this is exactly the type of use case I want to support.
  Pickling won't work here, but using HDF5 like MNE does would
  probably be close to ideal (thanks to Chris Holdgraf for the
  heads-up):
 
  https://github.com/mne-tools/mne-python/blob/master/mne/_hdf5.py
 
  Stéfan
 
 
 --
  ___
  Scikit-learn-general mailing list
  Scikit-learn-general@lists.sourceforge.net
  https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] positive / nonnegative least angle regression estimators

2015-08-17 Thread Joel Nothman
Please make a pull request. This looks like a small and useful change,
consistent with Lasso's support of non-negativity.
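
For context, a tiny runnable sketch (synthetic data) of the existing
non-negativity option in the coordinate-descent Lasso, which the proposed
Lars/LassoLars change quoted below would mirror:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
y = X[:, 0] - 2 * X[:, 1] + 0.01 * rng.randn(50)

lasso = Lasso(alpha=0.1, positive=True)   # constrain all coefficients to be >= 0
lasso.fit(X, y)
print(lasso.coef_)                        # no negative entries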

On 18 August 2015 at 14:30, Michael Graber michigra...@gmail.com wrote:


 Dear all,

 I extended the lars_path, Lars and LarsLasso estimators in the
 scikit-learn least_angle.py module with the possibility to restrict
 coefficients to be  0 using the method described in the original paper by
 Efron et al, 2004, chapter 3.4.

 (
 https://github.com/scikit-learn/scikit-learn/compare/master...michigraber:nonnegative-lars
 )

 If you think this would be useful i could issue a pull request.

 I have not extended it to the cross-validated estimators yet but would be
 willing to do so, if requested.

 Cheers,
 Michael

 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Gridsearch pickle error with scipy distributions

2015-08-15 Thread Joel Nothman
This is a known scipy deficiency. See
https://github.com/scipy/scipy/pull/4821 and related issues.
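
As a sketch of the workaround Jason mentions below (persisting only the
refitted best estimator, which contains no frozen scipy distributions), the
final line of the quoted snippet becomes:

from sklearn.externals import joblib

# assumes the fitted `random_search` object from the quoted code below
joblib.dump(random_search.best_estimator_, "final_model.pkl", compress=3)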

On 15 August 2015 at 05:37, Jason Sanchez jason.sanchez.m...@statefarm.com
wrote:

 This code raises a PicklingError:

 from sklearn.datasets import load_boston
 from sklearn.pipeline import Pipeline
 from sklearn.ensemble import RandomForestRegressor
 from sklearn.grid_search import RandomizedSearchCV
 from sklearn.externals import joblib
 from scipy.stats import randint

 X, y = load_boston().data, load_boston().target
 pipe = Pipeline([("rf", RandomForestRegressor())])
 params = {"rf__n_estimators": randint(2, 3)}
 random_search = RandomizedSearchCV(pipe, params, n_iter=1).fit(X, y)
 joblib.dump(random_search, "final_model.pkl", compress=3)


 In params, if randint(2,3) is changed to range(2,3), no pickling error
 occurs.

 In 0.16.2, changing all the parameters in a large grid search to ranges
 causes a memory error (due to all possible combinations being saved to an
 array), so this is not a workable solution.

 Pickling just the best_estimator_ works (which is now what I do), but
 currently there does not seem to be a way to pickle a gridsearch that has a
 large number of hyper-parameters (very common with RandomizedSearchCV) in
 0.16.2.

 You all do amazing work. Thank you all so much for your contributions to
 the project.

 Jason


 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] DecisionTreeClassifier refusing to split

2015-08-15 Thread Joel Nothman
While it's not bad to have more people know the internals of the tree code,
ideally people shouldn't *have* to. Do you have any hints for how
documentation could better serve users to not land in whatever trap you did?

On 15 August 2015 at 16:03, Simon Burton si...@arrowtheory.com wrote:


 My bad. I did something stupid (again).

 On the plus side, I now know my way around the internals of
 the tree code much better.

 Cheers.


 On Sat, 15 Aug 2015 14:11:49 +1000
 Simon Burton si...@arrowtheory.com wrote:

 
  Hi,
 
  I am training a DecisionTreeClassifier on samples with a large (500)
  number of features. I find that the tree refuses to grow (and so
  cannot be used in boosting) unless I remove (zero) some of the
  features. This seems strange. Any ideas why? I tried fiddling
  with the settings, now delving into the implementation.
 
  Simon.
 
 
 --
  ___
  Scikit-learn-general mailing list
  Scikit-learn-general@lists.sourceforge.net
  https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] scikit-learn Truck Factor

2015-08-12 Thread Joel Nothman
I find that list somewhat obscure, and reading your section on Code
Authorship gives me some sense of why. All of those people have been very
important contributors to the project, and I'd think the absence of Gaël,
Andreas and Olivier alone would be very damaging, if only because of their
dedication to the collaborative maintenance involved. Yet despite his top
score Fabian has not actively contributed for years and would be quite
unfamiliar with many of the files he created, while I think Mathieu Blondel
and Alexandre Gramfort, for example, would provide substantial code
coverage without those seven (although they may not be interested in the
maintenance).

I feel the approach is problematic because of the weight it puts on number
of commits (if that's how I should interpret "the number of changes made
in f by D"). Apart from the susceptibility of this measure to individual
author preferences, the project in infancy favoured small commits (because
the team was small), but more recently has preferred large contributions,
and has frequently squashed contributions with large commit histories into
single commits.

Have you considered measures of number of deliveries apart from number of
commits? While counting lines of code presents other problems, the number
of months in which a user committed changes to a file might be a more
realistic representation.
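
As a rough sketch of that alternative measure (this assumes plain git log
output and is not part of the paper's tooling; the helper name is made up),
author-months per file could be counted like so:

import subprocess
from collections import defaultdict

def author_months(path):
    """Map each author to the number of distinct months with a commit touching path."""
    out = subprocess.check_output(
        ['git', 'log', '--follow', '--format=%an\t%ad', '--date=short', '--', path])
    months = defaultdict(set)
    for line in out.decode('utf-8', 'replace').splitlines():
        author, date = line.split('\t')
        months[author].add(date[:7])          # YYYY-MM
    return {author: len(m) for author, m in months.items()}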

A number of factors attenuate developer loss: documentation and overall
code quality; fairly open and wide contribution, with regular in-person
interaction for a large number of contributors; GSoC and other
project-based involvement entailing new contributors become very familiar
with parts of the code; and the standardness of the algorithms implemented
in scikit-learn, meaning they can be maintained on the basis of reference
works (a broader documentation).

On 12 August 2015 at 22:57, Guilherme Avelino gavel...@gmail.com wrote:

 As part of my PhD research on code authorship, we calculated the Truck
 Factor (TF) of some popular GitHub repositories.

 As you probably know, the Truck (or Bus) Factor designates the minimal
 number of developers that have to be hit by a truck (or quit) before a
 project is incapacitated. In our work, we consider that a system is in
 trouble if more than 50% of its files become orphan (i.e., without a main
 author).

 More details on our work in this preprint:
 https://peerj.com/preprints/1233

 We calculated the TF for scikit-learn and obtained a value of 7.

 The developers responsible for this TF are:

 Fabian Pedregosa - author of 22% of the files
 Gael varoquaux - author of 13% of the files
 Andreas Mueller - author of 12% of the files
 Olivier Grisel - author of 10% of the files
 Lars Buitinck - author of 10% of the files
 Jake Vanderplas - author of 6% of the files
 Vlad Niculae - author of 5% of the files

 To validate our results, we would like to ask scikit-learn developers the
 following three brief questions:

 (a) Do you agree that the listed developers are the main developers of
 scikit-learn?

 (b) Do you agree that scikit-learn will be in trouble if the listed
 developers leave the project (e.g., if they win in the lottery, to be less
 morbid)?

 (c) Does scikit-learn have some characteristics that would attenuate the
 loss of the listed developers (e.g., detailed documentation)?

 Thanks in advance for your collaboration,

 Guilherme Avelino
 PhD Student
 Applied Software Engineering Group (ASERG)
 UFMG, Brazil
 http://aserg.labsoft.dcc.ufmg.br/

 --
 Prof. Guilherme Amaral Avelino
 Universidade Federal do Piauí
 Departamento de Computação


 --

 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-06 Thread Joel Nothman
It's nice to see some decent speed-up factors, though the accuracy tradeoff
is still not so great. Still, I'd like to see the code and where we can go
from here. Great work so far!

On 7 August 2015 at 07:50, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:

 I did a rough implementation, setting b = min_hash_match. The result I got
 from running the benchmark is attached. It was able to roughly triple the
 speed-up of kneighbors function for large index sizes. However, this
 implementation adds some overhead to fitting time as there are 2**b *
 n_estimators times numpy searchsorted calls during training. But that may
 most probably be compensated as the number of queries grow since 2**b *
 n_estimators is a constant time.

 I'll send a PR with proper refactoring.


 On Sun, Aug 2, 2015 at 6:41 PM, Joel Nothman joel.noth...@gmail.com
 wrote:

 Thanks, I look forward to this being improved, while I have little
 availability to help myself atm.

 On 2 August 2015 at 22:58, Maheshakya Wijewardena pmaheshak...@gmail.com
  wrote:

 I agree with Joel. Profiling indicated that 69.8% of total time of
 kneighbors is spent on _find_matching_indices and 22.9% is spent on
 _compute_distances. So I'll give priority to work on _find_matching_indices
 with the method you suggested.

 On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi Joel,
 I was on vacation during past 3 days. I''ll look into this asap and let
 you all know.

 I also did some profiling, but only with the usage of
 `pairwise_distance` method. Brute force technique directly uses that for
 the entire query array, but LSH uses that in a loop and I noticed there is
 a huge lag. I'll first confirm your claims. I can start working on this but
 I think I'll need your or some other contributers' reviewing as well . I'll
 do this if it's possible.

 On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 @Maheshakya, will you be able to do work in the near future on
 speeding up the ascending phase instead? Or should another contributor 
 take
 that baton? Not only does it seem to be a major contributor to runtime, 
 but
 it is independent of metric and hashing mechanism (within binary hashes),
 and hence the most fundamental component of LSHForest.

 On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers,
 producing a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com
 wrote:

 One approach to fixing the ascending phase would ensure that
 _find_matching_indices is only searching over parts of the tree that 
 have
 not yet been explored, while currently it searches over the entire 
 index at
 each depth.

 My preferred, but more experimental, solution is to memoize where
 the first 2**b layers of the tree begin and end in the index, for small 
 b.
 So if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift
 down to leave the first b bits [0, 1] remaining, and look them up in the
 array just defined to identify a much narrower search space [3, 6) 
 matching
 that prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) 
 time
 rather than log(n). It is also much more like traditional LSH. However, 
 it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com
 wrote:

 What makes you think this is the main bottleneck? While it is not
 an insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible 
 for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-02 Thread Joel Nothman
Thanks, I look forward to this being improved, while I have little
availability to help myself atm.

On 2 August 2015 at 22:58, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:

 I agree with Joel. Profiling indicated that 69.8% of total time of
 kneighbors is spent on _find_matching_indices and 22.9% is spent on
 _compute_distances. So I'll give priority to work on _find_matching_indices
 with the method you suggested.

 On Sun, Aug 2, 2015 at 10:51 AM, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi Joel,
 I was on vacation during past 3 days. I''ll look into this asap and let
 you all know.

 I also did some profiling, but only with the usage of `pairwise_distance`
 method. Brute force technique directly uses that for the entire query
 array, but LSH uses that in a loop and I noticed there is a huge lag. I'll
 first confirm your claims. I can start working on this but I think I'll
 need your or some other contributers' reviewing as well . I'll do this if
 it's possible.

 On Sun, Aug 2, 2015 at 3:50 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 @Maheshakya, will you be able to do work in the near future on speeding
 up the ascending phase instead? Or should another contributor take that
 baton? Not only does it seem to be a major contributor to runtime, but it
 is independent of metric and hashing mechanism (within binary hashes), and
 hence the most fundamental component of LSHForest.

 On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers,
 producing a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index 
 at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down
 to leave the first b bits [0, 1] remaining, and look them up in the array
 just defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all
of which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this
 faster than the exact approach. Of course, your mileage may vary, but 
 this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena 
 pmaheshak...@gmail.com wrote:

 Hi,

 I've started to look into the matter

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-08-01 Thread Joel Nothman
@Maheshakya, will you be able to do work in the near future on speeding up
the ascending phase instead? Or should another contributor take that baton?
Not only does it seem to be a major contributor to runtime, but it is
independent of metric and hashing mechanism (within binary hashes), and
hence the most fundamental component of LSHForest.

On 30 July 2015 at 22:28, Joel Nothman joel.noth...@gmail.com wrote:

 (sorry, I should have said the first b layers, not 2**b layers, producing
 a memoization of 2**b offsets)

 On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
 leave the first b bits [0, 1] remaining, and look them up in the array just
 defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of
 thing for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when
 ascending, and makes the searchsorted calls run in log(n / (2 ** b)) time
 rather than log(n). It is also much more like traditional LSH. However, it
 complexifies the code, as we now have to consider two strategies for
 descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible
 parameters that users might actually want, e.g. number of features for
 {dense, sparse} cases, index size, target 10NN precision and recall
 (selecting corresponding n_estimators and n_candidates). Ideally we'd
 consider real-world datasets. And of course, these should be sensible for
 whichever metric we're operating over, and whether we're doing KNN or
 Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all of
which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
  wrote:

 Hi,

 I've started to look into the matter of improving performance of
 LSHForest. As we have discussed sometime before(in fact, quite a long
 time), main concern is to Cythonize distance calculations. Currently, this
 done by iteratively moving over all the query vectors when `kneighbors`
 method is called for a set of query vectors. It has been identified that
 iterating over each query with Python loops is a huge overhead. I have
 implemented a few Cython hacks to demonstrate the distance calculation in
 LSHForest and I was able to get an approximate speedup 10x compared to
 current distance calculation with a Python loop. However,  I came across
 some blockers while trying to do this and need some clarifications.

 What I need to know is, do we use a mechanism to release GIL when we
 want to parallelize. One of my observations is `pairwise_distance` uses all
 the cores even when I don't specify `n_jobs` parameter which is 1 in
 default. Is this an expected behavior?

 If I want to release GIL, can I use OpenMP module in Cython? Or is that
 a task of Joblib?
 Any input on this is highly appreciated.

 Best regards

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
What makes you think this is the main bottleneck? While it is not an
insignificant consumer of time, I really doubt this is what's making
scikit-learn's LSH implementation severely underperform with respect to
other implementations.

We need to profile. In order to do that, we need some sensible parameters
that users might actually want, e.g. number of features for {dense, sparse}
cases, index size, target 10NN precision and recall (selecting
corresponding n_estimators and n_candidates). Ideally we'd consider
real-world datasets. And of course, these should be sensible for whichever
metric we're operating over, and whether we're doing KNN or Radius searches.

I don't know if it's realistic, but I've profiled the following
bench_plot_approximate_neighbors settings:

Building NearestNeighbors for 10 samples in 100 dimensions
LSHF parameters: n_estimators = 15, n_candidates = 100
Building LSHForest for 10 samples in 100 dimensions
Done in 1.492s
Average time for lshf neighbor queries: 0.005s
Average time for exact neighbor queries: 0.002s
Average Accuracy : 0.88
Speed up: 0.5x

Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query matrix,
we have:

   - 0.03 spent in _query (hashing and descending)
   - 0.91 spent in _compute_distances (exact distance calculation)
   - 3.80 remaining in _get_candidates (ascending phase), almost all of
   which is spent in _find_matching_indices

Cutting exact distance calculation to 0s will still not get this faster
than the exact approach. Of course, your mileage may vary, but this
suggests to me you're barking up the wrong tree (no pun intended).

On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:

 Hi,

 I've started to look into the matter of improving performance of
 LSHForest. As we have discussed sometime before(in fact, quite a long
 time), main concern is to Cythonize distance calculations. Currently, this
 done by iteratively moving over all the query vectors when `kneighbors`
 method is called for a set of query vectors. It has been identified that
 iterating over each query with Python loops is a huge overhead. I have
 implemented a few Cython hacks to demonstrate the distance calculation in
 LSHForest and I was able to get an approximate speedup 10x compared to
 current distance calculation with a Python loop. However,  I came across
 some blockers while trying to do this and need some clarifications.

 What I need to know is, do we use a mechanism to release GIL when we want
 to parallelize. One of my observations is `pairwise_distance` uses all the
 cores even when I don't specify `n_jobs` parameter which is 1 in default.
 Is this an expected behavior?

 If I want to release GIL, can I use OpenMP module in Cython? Or is that a
 task of Joblib?
 Any input on this is highly appreciated.

 Best regards,
 --

 *Maheshakya Wijewardena,Undergraduate,*
 *Department of Computer Science and Engineering,*
 *Faculty of Engineering.*
 *University of Moratuwa,*
 *Sri Lanka*


 --

 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
One approach to fixing the ascending phase would ensure that
_find_matching_indices is only searching over parts of the tree that have
not yet been explored, while currently it searches over the entire index at
each depth.

My preferred, but more experimental, solution is to memoize where the first
2**b layers of the tree begin and end in the index, for small b. So if our
index stored:
[[0, 0, 0, 1, 1, 0, 0, 0],
 [0, 0, 1, 0, 1, 0, 1, 0],
 [0, 0, 1, 0, 1, 0, 1, 0],
 [0, 1, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 1, 1, 0, 0, 0],
 [0, 1, 1, 0, 0, 1, 1, 0],
 [1, 0, 0, 0, 0, 1, 0, 1],
 [1, 0, 0, 1, 0, 1, 0, 1],
 [1, 1, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 0, 0, 0]]
and b=2, we'd memoize offsets for prefixes of size 2:
[0, # 00
 3, # 01
 6, # 10
 8, # 11
]

Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
leave the first b bits [0, 1] remaining, and look them up in the array just
defined to identify a much narrower search space [3, 6) matching that
prefix in the overall index.

Indeed, given the min_hash_match constraint, not having this sort of thing
for b = min_hash_match seems wasteful.

This provides us O(1) access to the top layers of the tree when ascending,
and makes the searchsorted calls run in log(n / (2 ** b)) time rather than
log(n). It is also much more like traditional LSH. However, it complexifies
the code, as we now have to consider two strategies for descent/ascent.
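
A rough sketch of the memoization just described (names are illustrative, not
the actual LSHForest internals), assuming each tree's index is kept as a
sorted array of integer hash codes:

import numpy as np

def build_prefix_offsets(sorted_hashes, n_bits, b):
    """Record, for every length-b prefix, where its block starts in the index."""
    prefixes = sorted_hashes >> (n_bits - b)          # top b bits, nondecreasing
    return np.searchsorted(prefixes, np.arange(2 ** b + 1), side='left')

def prefix_block(offsets, query_hash, n_bits, b):
    """O(1) lookup of the index slice sharing the query's length-b prefix."""
    p = query_hash >> (n_bits - b)
    return offsets[p], offsets[p + 1]

# toy usage: 8-bit hashes, memoize the first b=2 layers
hashes = np.sort(np.random.RandomState(0).randint(0, 256, size=1000))
offsets = build_prefix_offsets(hashes, n_bits=8, b=2)
start, stop = prefix_block(offsets, query_hash=0b01100000, n_bits=8, b=2)
# deeper searchsorted calls can now be restricted to hashes[start:stop]
# instead of the whole index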



On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible parameters
 that users might actually want, e.g. number of features for {dense, sparse}
 cases, index size, target 10NN precision and recall (selecting
 corresponding n_estimators and n_candidates). Ideally we'd consider
 real-world datasets. And of course, these should be sensible for whichever
 metric we're operating over, and whether we're doing KNN or Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query matrix,
 we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all of
which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
 wrote:

 Hi,

 I've started to look into the matter of improving performance of
 LSHForest. As we have discussed sometime before(in fact, quite a long
 time), main concern is to Cythonize distance calculations. Currently, this
 done by iteratively moving over all the query vectors when `kneighbors`
 method is called for a set of query vectors. It has been identified that
 iterating over each query with Python loops is a huge overhead. I have
 implemented a few Cython hacks to demonstrate the distance calculation in
 LSHForest and I was able to get an approximate speedup 10x compared to
 current distance calculation with a Python loop. However,  I came across
 some blockers while trying to do this and need some clarifications.

 What I need to know is, do we use a mechanism to release GIL when we want
 to parallelize. One of my observations is `pairwise_distance` uses all the
 cores even when I don't specify `n_jobs` parameter which is 1 in default.
 Is this an expected behavior?

 If I want to release GIL, can I use OpenMP module in Cython? Or is that a
 task of Joblib?
 Any input on this is highly appreciated.

 Best regards,
 --

 *Maheshakya Wijewardena,Undergraduate,*
 *Department of Computer Science and Engineering,*
 *Faculty of Engineering.*
 *University of Moratuwa,*
 *Sri Lanka*


 --

 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



--
___
Scikit-learn-general mailing list
Scikit-learn-general

Re: [Scikit-learn-general] Making approximate nearest neighbor search more efficient

2015-07-30 Thread Joel Nothman
(sorry, I should have said the first b layers, not 2**b layers, producing a
memoization of 2**b offsets)

On 30 July 2015 at 22:22, Joel Nothman joel.noth...@gmail.com wrote:

 One approach to fixing the ascending phase would ensure that
 _find_matching_indices is only searching over parts of the tree that have
 not yet been explored, while currently it searches over the entire index at
 each depth.

 My preferred, but more experimental, solution is to memoize where the
 first 2**b layers of the tree begin and end in the index, for small b. So
 if our index stored:
 [[0, 0, 0, 1, 1, 0, 0, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 0, 1, 0, 1, 0, 1, 0],
  [0, 1, 0, 0, 0, 0, 0, 0],
  [0, 1, 0, 1, 1, 0, 0, 0],
  [0, 1, 1, 0, 0, 1, 1, 0],
  [1, 0, 0, 0, 0, 1, 0, 1],
  [1, 0, 0, 1, 0, 1, 0, 1],
  [1, 1, 0, 0, 0, 0, 0, 0],
  [1, 1, 1, 1, 1, 0, 0, 0]]
 and b=2, we'd memoize offsets for prefixes of size 2:
 [0, # 00
  3, # 01
  6, # 10
  8, # 11
 ]

 Given a query like  [0, 1, 1, 0, 0, 0, 0, 0], it's easy to shift down to
 leave the first b bits [0, 1] remaining, and look them up in the array just
 defined to identify a much narrower search space [3, 6) matching that
 prefix in the overall index.

 Indeed, given the min_hash_match constraint, not having this sort of thing
 for b = min_hash_match seems wasteful.

 This provides us O(1) access to the top layers of the tree when ascending,
 and makes the searchsorted calls run in log(n / (2 ** b)) time rather than
 log(n). It is also much more like traditional LSH. However, it complexifies
 the code, as we now have to consider two strategies for descent/ascent.



 On 30 July 2015 at 21:46, Joel Nothman joel.noth...@gmail.com wrote:

 What makes you think this is the main bottleneck? While it is not an
 insignificant consumer of time, I really doubt this is what's making
 scikit-learn's LSH implementation severely underperform with respect to
 other implementations.

 We need to profile. In order to do that, we need some sensible parameters
 that users might actually want, e.g. number of features for {dense, sparse}
 cases, index size, target 10NN precision and recall (selecting
 corresponding n_estimators and n_candidates). Ideally we'd consider
 real-world datasets. And of course, these should be sensible for whichever
 metric we're operating over, and whether we're doing KNN or Radius searches.

 I don't know if it's realistic, but I've profiled the following
 bench_plot_approximate_neighbors settings:

 Building NearestNeighbors for 10 samples in 100 dimensions
 LSHF parameters: n_estimators = 15, n_candidates = 100
 Building LSHForest for 10 samples in 100 dimensions
 Done in 1.492s
 Average time for lshf neighbor queries: 0.005s
 Average time for exact neighbor queries: 0.002s
 Average Accuracy : 0.88
 Speed up: 0.5x

 Of 4.77s total time spent in LSHForest.kneighbors for a 1000-query
 matrix, we have:

- 0.03 spent in _query (hashing and descending)
- 0.91 spent in _compute_distances (exact distance calculation)
- 3.80 remaining in _get_candidates (ascending phase), almost all of
which is spent in _find_matching_indices

 Cutting exact distance calculation to 0s will still not get this faster
 than the exact approach. Of course, your mileage may vary, but this
 suggests to me you're barking up the wrong tree (no pun intended).

 On 30 July 2015 at 19:43, Maheshakya Wijewardena pmaheshak...@gmail.com
 wrote:

 Hi,

 I've started to look into the matter of improving performance of
 LSHForest. As we have discussed sometime before(in fact, quite a long
 time), main concern is to Cythonize distance calculations. Currently, this
 done by iteratively moving over all the query vectors when `kneighbors`
 method is called for a set of query vectors. It has been identified that
 iterating over each query with Python loops is a huge overhead. I have
 implemented a few Cython hacks to demonstrate the distance calculation in
 LSHForest and I was able to get an approximate speedup 10x compared to
 current distance calculation with a Python loop. However,  I came across
 some blockers while trying to do this and need some clarifications.

 What I need to know is, do we use a mechanism to release GIL when we
 want to parallelize. One of my observations is `pairwise_distance` uses all
 the cores even when I don't specify `n_jobs` parameter which is 1 in
 default. Is this an expected behavior?

 If I want to release GIL, can I use OpenMP module in Cython? Or is that
 a task of Joblib?
 Any input on this is highly appreciated.

 Best regards,
 --

 *Maheshakya Wijewardena,Undergraduate,*
 *Department of Computer Science and Engineering,*
 *Faculty of Engineering.*
 *University of Moratuwa,*
 *Sri Lanka*


 --

 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo

Re: [Scikit-learn-general] [scikit-learn-general] Possible bug in RFECV.fit?

2015-07-22 Thread Joel Nothman
This isn't directly a problem with RFECV, it's a problem with what you
provided as an argument to `scoring`. I suspect you provided a function
with signature fn(y_true, y_pred) -> score, where what is required is a
function fn(estimator, X, y_true) -> score. See
http://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules
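
For example, if the metric intended was fbeta_score (as the traceback below
suggests; the beta value here is illustrative), a sketch of the fix is to
wrap it with make_scorer so it gains the required signature:

from sklearn.metrics import fbeta_score, make_scorer
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# make_scorer turns fn(y_true, y_pred) into the fn(estimator, X, y_true)
# callable that RFECV and the CV utilities expect
f2_scorer = make_scorer(fbeta_score, beta=2)
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1, cv=3, scoring=f2_scorer)
# rfecv.fit(X, y)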

Perhaps we should be failing faster in such a case. We could, for instance,
extend check_scoring to smoke-test scoring(estimator, X, y_true), at a cost
that we hope is small relative to fitting.

And where is that parallelism happening? It looks like the RFECV code could
be parallelised, but is not atm.


On 22 July 2015 at 21:34, Dale Smith dsm...@nexidia.com wrote:

  Hello,



 I just ran a four-day fit using RFECV. At the end I got the following
 message. My question is whether this is a bug? If so, I’ll write some
 reproducible code (if I can) and submit a report.



 I have searched for similar messages but didn’t find anything.



 I am using Windows Server 8 R2 Enterprise with Anaconda 2.2.0 64-bit. I
 haven’t patched scikit-learn or any dependencies.



 ………

 Fitting estimator with 4 features.

 [Parallel(n_jobs=20)]: Done   1 out of 300 | elapsed:0.6s remaining:
 3.5min

 [Parallel(n_jobs=20)]: Done 300 out of 300 | elapsed:   10.6s finished

 Fitting estimator with 3 features.

 [Parallel(n_jobs=20)]: Done   1 out of 300 | elapsed:0.4s remaining:
 2.6min

 [Parallel(n_jobs=20)]: Done 300 out of 300 | elapsed:7.1s finished

 Fitting estimator with 2 features.

 [Parallel(n_jobs=20)]: Done   1 out of 300 | elapsed:0.6s remaining:
 3.5min

 [Parallel(n_jobs=20)]: Done 300 out of 300 | elapsed:8.2s finished

 [Parallel(n_jobs=20)]: Done   1 out of 300 | elapsed:0.5s remaining:
 2.9min

 [Parallel(n_jobs=20)]: Done 300 out of 300 | elapsed:8.6s finished

 [Parallel(n_jobs=20)]: Done   1 out of 300 | elapsed:0.5s remaining:
 3.2min

 [Parallel(n_jobs=20)]: Done 300 out of 300 | elapsed:8.6s finished

 Traceback (most recent call last):

   File test_rfecv.py, line 62, in module

 churn.rfe()

   File D:\Research\Churn\python\churn.py, line 805, in rfe

 print(r%s % traceback.format_exc())

   File C:\Anaconda3\lib\site-packages\sklearn\feature_selection\rfe.py,
 line 382, in fit

 score = _score(estimator, X_test[:, indices], y_test, scorer)

   File C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py, line
 1534, in _score

 score = scorer(estimator, X_test, y_test)

   File C:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py,
 line 676, in fbeta_score

 sample_weight=sample_weight)

   File C:\Anaconda3\lib\site-packages\sklearn\metrics\classification.py,
 line 855, in precision_recall_fscore_support

 if beta <= 0:

 ValueError: The truth value of an array with more than one element is
 ambiguous.

 Use a.any() or a.all()




 *Dale Smith, Ph.D.*
 Data Scientist
 ​
 http://nexidia.com/

 * d.* 404.495.7220 x 4008   *f.* 404.795.7221
 Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta,
 GA 30305





 --
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




Re: [Scikit-learn-general] Speed up transformation step with multiple 1 vs rest binary text classifiers.

2015-07-02 Thread Joel Nothman
TfidfVectorizer is just CountVectorizer followed by a TfidfTransformer. The
Tfidf transformation tends to be cheap relative to tokenization, which is
independent of the corpus you want to calculate TF.IDF over. If I
understand correctly, you can perform CountVectorizer on all of your
documents, then select documents pertinent to a topic, then perform TF.IDF
calculation.
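
A minimal sketch of that flow (all_documents and topic_to_doc_indices are
placeholder names, not from the thread):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(all_documents)    # tokenize and count once

    for topic, doc_indices in topic_to_doc_indices.items():
        topic_counts = counts[doc_indices]              # rows relevant to this topic
        topic_tfidf = TfidfTransformer().fit_transform(topic_counts)
        # fit or apply the topic's binary classifier on topic_tfidf here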

If that's not what you mean, maybe you just need to create something like:

from sklearn.externals import joblib  # or a standalone joblib install
from sklearn.feature_extraction.text import TfidfVectorizer

memory = joblib.Memory('/path/to/cache/storage')

class CachedTfidfVectorizer(TfidfVectorizer):
    def transform(self, X, y=None):
        # cache the (relatively expensive) transform on disk, keyed by X
        return memory.cache(super(CachedTfidfVectorizer, self).transform)(X)

handling fit_transform is a bit trickier, though.

On 3 July 2015 at 07:03, Artem barmaley@gmail.com wrote:

 Hi Nikhil

 Do you somehow do topic-specific TF-IDF transformations? Could you provide
 a small (pseudo) code snippet for what you're doing?

 I may be wrong, but judging from what you wrote, it doesn't look like you
 use scikit-learn's OneVsRestClassifier
 http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html.
 It will do all the work of managing multiple classes for you. Also, check
 out Pipeline
 http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html.
 At the moment your pipeline looks simple (just one transformer), but you
 may get interested in more complicated preprocessing in the future.

 On Thu, Jul 2, 2015 at 9:07 PM, nmura...@masonlive.gmu.edu 
 nmura...@masonlive.gmu.edu wrote:

 Hello,

 I have a text classification problem where I have about 50 classes and
 have 50 binary classifiers (1 per topic).  The training set used to train
 each topic classifier is different (some instances might overlap). Each
 instance consists of a text snippet which is
 transformed using tf-idf vectorizer.  I am using LinearSVM for each of
 the classifiers..
 Now I am trying to develop a web-service over this classification
 architecture where, given a new snippet of text, the service returns the
 scores for each of the topics ( [p(Topic) , p(Not-Topic)] in each case.) .
 For the new snippet of text, as I understand it, I will have to do 50
 transformations of the text to the tf-idf vectorizer for each topic and
 then pass the corresponding tf-idf transformed vector into the
 corresponding topic-classifier. I am trying to somehow minimize the number
 of transformation operations wherein, instead of having to do the
 transformation 50 times, I want to somehow combine all the topic
 information and calculate Tf-Idf of the new text once and run it through
 each of the classifiers. Is this possible using Scikit Learn? Any
 particular type of vectorizer that address problems like this?

 Thanks,
 Nikhil









Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Joel Nothman
oh, I missed that one from Omer Levy's debunking word2vec series. Nice!

On 1 July 2015 at 23:52, Mathieu Blondel math...@mblondel.org wrote:



 On Wed, Jul 1, 2015 at 8:43 PM, Dale Smith dsm...@nexidia.com wrote:

  Apparently so; here is a python/cython implementation.



 http://rare-technologies.com/deep-learning-with-word2vec-and-gensim/


 word2vec is *not* deep learning. The skip-gram model has been shown
 recently to reduce to a certain matrix factorization [*]. So it's a shallow
 network with only one hidden layer and without non-linearities.

 Mathieu

 [*] Neural Word Embedding as Implicit Matrix Factorization by O. Levy and
 Y. Goldberg.
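
 (For reference: the cited result states that skip-gram with negative sampling,
 using k negative samples, implicitly factorizes the word-context PMI matrix
 shifted by log k, i.e. w . c ~= PMI(w, c) - log k.)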







Re: [Scikit-learn-general] RandomizedSearchCV error

2015-06-25 Thread Joel Nothman
It's a problem of excessive memory consumption due to an O(# possible
parameter settings) approach to sampling from discrete parameter grids
without replacement.

The fix was merged into master only hours ago. Please feel free to work
with master, or to cherry-pick febefb0.
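
In the meantime, an interim workaround (a sketch, not from the original thread)
is to pass scipy.stats distributions rather than plain lists, so that the
sampler never materialises the full grid:

    from scipy.stats import randint

    params = {"rf__n_estimators": randint(10, 50),
              "rf__max_depth": randint(5, 10),
              "rf__max_features": randint(1, 5),
              "rf__min_samples_split": randint(5, 101),
              "rf__min_samples_leaf": randint(20, 50),
              "rf__max_leaf_nodes": randint(200, 350)}

    random_search = RandomizedSearchCV(pipeline, params, n_iter=20).fit(X, y)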

On 25 June 2015 at 16:22, Jason Sanchez jason.sanchez.m...@statefarm.com
wrote:

 This code that uses RandomizedSearchCV works fine in 0.15.2:

 import pandas as pd
 from sklearn.pipeline import Pipeline
 from sklearn.datasets import load_iris
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.grid_search import RandomizedSearchCV

 iris = load_iris()
 X = iris.data
 y = iris.target

 pipeline = Pipeline([("rf", RandomForestClassifier())])

 params = {"rf__n_estimators": range(10, 50),
           "rf__max_depth": range(5, 10),
           "rf__max_features": range(1, 5),
           "rf__min_samples_split": range(5, 101),
           "rf__min_samples_leaf": range(20, 50),
           "rf__max_leaf_nodes": range(200, 350)}

 random_search = RandomizedSearchCV(pipeline, params).fit(X, y)


 It does not work in 0.16.1. When I kill the process, here is the Traceback:
 ---------------------------------------------------------------------------
 KeyboardInterrupt                         Traceback (most recent call last)
 <ipython-input-108-8794e7d30469> in <module>()
      24 random_search = RandomizedSearchCV(pipeline, params, n_iter=n_iter_search, cv=2, refit=True, n_jobs=1)
      25
 ---> 26 random_search.fit(X_iris, y_iris)

 /.../lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
     896                                           self.n_iter,
     897                                           random_state=self.random_state)
 --> 898         return self._fit(X, y, sampled_params)

 /.../lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
     503                 self.fit_params, return_parameters=True,
     504                 error_score=self.error_score)
 --> 505             for parameters in parameter_iterable
     506             for train, test in cv)
     507

 /.../lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
     656             os.environ['JOBLIB_SPAWNED_PROCESS'] = '1'
     657             self._iterating = True
 --> 658             for function, args, kwargs in iterable:
     659                 self.dispatch(function, args, kwargs)
     660

 /.../lib/python2.7/site-packages/sklearn/grid_search.pyc in <genexpr>(***failed resolving arguments***)
     499             pre_dispatch=pre_dispatch
     500         )(
 --> 501             delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
     502                                     train, test, self.verbose, parameters,
     503                                     self.fit_params, return_parameters=True,

 /.../lib/python2.7/site-packages/sklearn/grid_search.pyc in __iter__(self)
     180         if all_lists:
     181             # get complete grid and yield from it
 --> 182             param_grid = list(ParameterGrid(self.param_distributions))
     183             grid_size = len(param_grid)
     184

 /.../lib/python2.7/site-packages/sklearn/grid_search.pyc in __iter__(self)
     100         keys, values = zip(*items)
     101         for v in product(*values):
 --> 102             params = dict(zip(keys, v))
     103             yield params
     104

 KeyboardInterrupt:


 Any thoughts?





Re: [Scikit-learn-general] Passing kwargs to pipeline predict

2015-06-25 Thread Joel Nothman
As much as possible, parameters to a model should be specified to the class
constructor, not to methods, even if that is where they are applied. This has been the
scikit-learn design for a while in order to enable things like grid search
and a bare-bones pipeline implementation. So external projects adding
additional args to predict may not have been correctly designed (and it's
easy enough for you to inherit from their model to fix the problem by
moving that param to the constructor). It may be valid to add a param to
predict when: (a) it is data-dependent, e.g. a parallel array to the
feature array; or (b) it modifies the output format of predict, e.g. to
return error as in GaussianProcess. Any uses of this are fundamentally
custom approaches for which Pipeline already has reduced utility, but I can
see how forwarding kwargs may give a minor convenience.
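
For what it's worth, a sketch of that inherit-and-move workaround
(ThirdPartyModel and my_predict_option are hypothetical names):

    class FixedThirdPartyModel(ThirdPartyModel):
        # hypothetical subclass: the predict-time kwarg becomes a constructor
        # parameter so the estimator can sit at the end of a plain Pipeline
        def __init__(self, my_predict_option=None):
            super(FixedThirdPartyModel, self).__init__()
            self.my_predict_option = my_predict_option

        def predict(self, X):
            return super(FixedThirdPartyModel, self).predict(
                X, my_predict_option=self.my_predict_option)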

On 26 June 2015 at 03:35, Michael Kneier michael.kne...@gmail.com wrote:

 As far as I know, there aren't any estimators with predict kwargs. This
 doesn't mean that engineers aren't writing their own estimators, which may
 need kwargs. To me, one of sklearns great strengths is its pipeline, and
 extending its functionality to allow for more flexible estimator methods
 seems like a good thing. Perusing the sklearn github, it seems that there
 is demand for extending the parameters pipeline accepts.

 If you and the community prefer not to, that's fine. I think it
 is worth discussion though.



 On Wed, Jun 24, 2015 at 2:47 PM, Joel Nothman joel.noth...@gmail.com
 wrote:

 What estimators have predict with multiple args? Without support for same
 in cross validation routines and scorers, isn't it easier to write this
 functionality in custom code as you need it, leaving the predictor off the
 Pipeline?

 On 25 June 2015 at 06:06, Michael Kneier michael.kne...@gmail.com
 wrote:

 Hi all,

 It doesn't look like pipelines currently support passing kwargs to their
 estimators' predict method. I think it would be great to add this
 functionality, but I want to get your thoughts before I open a PR.

 Thanks,
 Mike














Re: [Scikit-learn-general] What do SGDClassifier weights do mathematically?

2015-06-25 Thread Joel Nothman
Across models, weights should be implemented such that duplicating samples
would give identical results to corresponding integer weights. That is true
here, to my understanding, if we remove the stochasticity such that all
identical samples have their update occur at once.

On 25 June 2015 at 19:28, Daniel Sullivan dbsulliva...@gmail.com wrote:

 Hi Anton,

 The update for each sample is just multiplied by the sample_weight and
 the class_weight before it's applied to the weight vector (coef_). So
 if your sample is [1, 2, 3], your gradient is .1, your weight for this
 sample is .2, and your eta is 5.0, your weight_vector (coef_) would be
 updated by [+(1 * .1 * .2 * 5.0), +(2 * .1 * .2 * 5.0), +(3 * .1 * .2 * 5.0)].
 Does that help?
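
 A worked version of those numbers, as a quick sketch:

     import numpy as np

     x = np.array([1.0, 2.0, 3.0])
     gradient, weight, eta = 0.1, 0.2, 5.0
     update = eta * weight * gradient * x   # per-feature contribution to coef_
     print(update)                          # [ 0.1  0.2  0.3]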

 Danny

 On Thu, Jun 25, 2015 at 10:19 AM, Anton Suchaneck a.suchan...@gmail.com
 wrote:
  Hello!
 
  How can I find out what the precise mathematical treatment of the
  sample_weights for SGDClassifier in the partial_fit setting is?
 
  Cheers,
  Anton
 
 
 
 





Re: [Scikit-learn-general] Passing kwargs to pipeline predict

2015-06-24 Thread Joel Nothman
What estimators have predict with multiple args? Without support for same
in cross validation routines and scorers, isn't it easier to write this
functionality in custom code as you need it, leaving the predictor off the
Pipeline?

On 25 June 2015 at 06:06, Michael Kneier michael.kne...@gmail.com wrote:

 Hi all,

 It doesn't look like pipelines currently support passing kwargs to their
 estimators' predict method. I think it would be great to add this
 functionality, but I want to get your thoughts before I open a PR.

 Thanks,
 Mike






Re: [Scikit-learn-general] differences between metrics.classification_report and own function

2015-06-17 Thread Joel Nothman
To me, those numbers appear identical at 2 decimal places.

On 17 June 2015 at 23:04, Herbert Schulz hrbrt@gmail.com wrote:

 Hello everyone,

 i wrote a function to calculate the sensitivity, specificity, balanced
 accuracy and accuracy from a confusion matrix.


 Now i have a problem: I'm getting different values when I'm comparing my
 values with those from the metrics.classification_report function.
 The general problem is that my computed sensitivity shows up in the
 classification report as the precision value. I'm computing every sensitivity
 with the one vs all approach. So e.g. class 1 == true, classes 2, 3, 4, 5 are
 the rest (not true).

 I did this only to get the specificity, and to compare whether i computed
 everything right.



 --- ensemble ---

              precision    recall  f1-score   support

  1.0       *0.56*      0.68      0.61       129
  2.0       *0.28*      0.15      0.20        78
  3.0       *0.45*      0.47      0.46       116
  4.0       *0.29*      0.05      0.09        40
  5.0       *0.44*      0.66      0.53        70

  avg / total   0.43      0.47      0.43       433


  Class: 1
   sensitivity: *0.556962025316*
   specificity: 0.850909090909
   balanced accuracy: 0.703935558113

  Class: 2
   sensitivity: *0.279069767442*
   specificity: 0.830769230769
   balanced accuracy: 0.554919499106

  Class: 3
   sensitivity: *0.446280991736*
   specificity: 0.801282051282
   balanced accuracy: 0.623781521509

  Class: 4
   sensitivity: *0.285714285714*
   specificity: 0.910798122066
   balanced accuracy: 0.59825620389

  Class: 5
   sensitivity: *0.442307692308*
   specificity: 0.927051671733
   balanced accuracy: 0.68467968202









Re: [Scikit-learn-general] differences between metrics.classification_report and own function

2015-06-17 Thread Joel Nothman
Scikit-learn has had a default of a weighted (micro-)average. This is a bit
non-standard, so from now users are expected to specify the average when
using precision/recall/fscore. Once
https://github.com/scikit-learn/scikit-learn/pull/4622 is merged,
classification_report will show all the common averages.

I might also note that for multiclass problems with all classes included,
micro precision == recall == fscore == accuracy. In the development
version, it is now possible to specify that not all classes should be
included in micro-averages, so micro average is now more useful for
multiclass evaluation...

On 18 June 2015 at 01:42, Sebastian Raschka se.rasc...@gmail.com wrote:

 About the average: The two common scenarios are micro and macro
 average (I think macro is typically the default in scikit-learn) -- you
 calculated the macro average in your example.

 To further explain the difference betw. macro and micro, let's consider a
 simple 2-class scenario and calculate the precision

 a) macro-average precision:
 (PRE1 + PRE2) / 2

 b) micro-average precision:
  (TP1+TP2)/(TP1+TP2+FP1+FP2)
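
  As a sketch, both averages can be read off a confusion matrix (y_true and
  y_pred are assumed given; no sample weights):

      import numpy as np
      from sklearn.metrics import confusion_matrix

      cm = confusion_matrix(y_true, y_pred)
      tp = np.diag(cm).astype(float)
      fp = cm.sum(axis=0) - tp              # column totals minus the diagonal
      macro_precision = np.mean(tp / (tp + fp))
      micro_precision = tp.sum() / (tp.sum() + fp.sum())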

 Hope that helps.

 Best,
 Sebastian


  On Jun 17, 2015, at 10:49 AM, Herbert Schulz hrbrt@gmail.com
 wrote:
 
  Ok i think i have it, thanks everyone for the help!
 
  But there is an another problem.
 
  How are you calculating the avg?
 
  example:
 
  --- k-NN ---
 
              precision    recall  f1-score   support

   1.0       0.50      0.43      0.46       129
   2.0       0.31      0.40      0.35        88
   3.0       0.45      0.36      0.40       107
   4.0       0.06      0.03      0.04        33
   5.0       0.42      0.58      0.49        76

   avg / total   0.40      0.40      0.40       433
 
   so: (0.5+0.31+0.45+0.06+0.42) / 5 = 0.348 ~ 0.35, like i calculated it
  in my avg part. Are you using some weights?
 
   Class: 1
    sensitivity: 0.43
    specificity: 0.81
    balanced accuracy: 0.62
    precision: 0.50
   .
   .
   .
   .

   Class: 5
    sensitivity: 0.58
    specificity: 0.83
    balanced accuracy: 0.70
    precision: 0.42

   avg total:
    sensitivity: 0.36
    specificity: 0.85
    avg balanced accuracy: 0.60
    avg precision: 0.35
 
 
 
 
 
 
 
  On 17 June 2015 at 16:06, Herbert Schulz hrbrt@gmail.com wrote:
  I actually computed it like this, maybe I did something in my
 TP,FP,FN,TN calculation wrong?
 
 
  c1,c2,c3,c4,c5=[1,0,0,0,0],[2,0,0,0,0],[3,0,0,0,0],[4,0,0,0,0],[5,0,0,0,0]
  alle=[c1,c2,c3,c4,c5]
 
 
   # as i mentioned 1 vs all, so the first element in the array is just the class
   # [1,0,0,0,0] == class 1, then in the order: TP,FP,FN,TN
   # maybe here is something wrong:

   for i in alle:
       pred = predicted

       for k in range(len(predicted)):
           if float(i[0]) == y_test[k]:
               if float(i[0]) == pred[k]:
                   i[1] += 1
               else:
                   i[2] += 1
           elif pred[k] == float(i[0]):
               i[3] += 1
           elif pred[k] != float(i[0]) and y_test[k] != float(i[0]):
               i[4] += 1
 
  #specs looks like this: [1, 71, 51, 103, 208]
 
  sens=specs[1]/float(specs[1]+specs[3])
 
 
 
 
   if I'm calculating

   sens = specs[1]/float(specs[1]+specs[2])

   I'm also getting the recall, like in the matrix.
 
  On 17 June 2015 at 15:42, Andreas Mueller t3k...@gmail.com wrote:
  Sensitivity is recall:
  https://en.wikipedia.org/wiki/Sensitivity_and_specificity
 
  Recall is TP / (TP + FN) and precision is TP / (TP + FP).
 
  What did you compute?
 
 
  On 06/17/2015 09:32 AM, Herbert Schulz wrote:
  Yeah i know, thats why I'm asking. i thought precision is not the same
 like recall/sensitivity.
 
  recall == sensitivity!?
 
  But in this matrix, the precision is my calculated sensitivity, or is
 the precision in this case the sensitivity?
 
  On 17 June 2015 at 15:29, Andreas Mueller t3k...@gmail.com wrote:
  Yeah that is the rounding of using %2f in the classification report.
 
 
  On 06/17/2015 09:20 AM, Joel Nothman wrote:
  To me, those numbers appear identical at 2 decimal places.
 
  On 17 June 2015 at 23:04, Herbert Schulz hrbrt@gmail.com wrote:
  Hello everyone,
 
  i wrote a function to calculate the sensitivity,specificity, ballance
 accuracy and accuracy from a confusion matrix.
 
 
  Now i have a Problem, I'm getting different values when I'm comparing
 my Values with those from the metrics.classification_report function.
  The general problem ist, my predicted sensitivity is in the
 classification report the precision value. I'm computing every sensitivity
 with the one vs all approach. So e.g. Class 1 == true, class 2,3,4,5 are
 the rest (not true).
 
  I did this only to get the specificity, and to compare if i computed
 everything right.
 
 
 
  --- ensemble ---
 
               precision    recall  f1-score   support
 
  1.0

Re: [Scikit-learn-general] Incrementally Printing GridSearch Results

2015-06-15 Thread Joel Nothman
I think it gets a bit noisier when using n_jobs != 1, as verbose is passed
to joblib.Parallel. I agree that it's not a very controllable or
well-documented setting.
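
For instance (estimator, param_grid and the data are assumed):

    from sklearn.grid_search import GridSearchCV

    # verbose >= 3 prints each parameter setting with its fold score as it
    # completes; with n_jobs != 1, joblib.Parallel progress lines are added too
    search = GridSearchCV(estimator, param_grid, verbose=3, n_jobs=1)
    search.fit(X, y)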

On 16 June 2015 at 13:24, Adam Goodkind a.goodk...@gmail.com wrote:

 Right. Thank you. I guess I was just overwhelmed by the amount of data
 pouring in.


 On Sun, Jun 14, 2015 at 4:42 PM, Andreas Mueller t3k...@gmail.com wrote:

  Not really. It only outputs parameters and scores, though, right?
 Well, it prints the parameters when it starts a job and after it finishes
 a job.


 On 06/12/2015 06:57 PM, Adam Goodkind wrote:

 Thanks Andy. I see that I have to set verbose to at least 3 to get the
 scores. However, at that level it prints out a lot. Is there a way to
 refine the output to just the parameters and scores?

  Thanks,
 Adam

 On Wed, Jun 10, 2015 at 3:41 PM, Andreas Mueller t3k...@gmail.com
 wrote:

  Yes, set verbose to a nonzero value.


 On 06/10/2015 03:25 PM, Adam Goodkind wrote:

  Is it possible to print the results of a grid search as each iteration
 is completed?

  Thanks,
 Adam

  --
  *Adam Goodkind *
 adamgoodkind.com http://www.adamgoodkind.com
 @adamgreatkind https://twitter.com/#%21/adamgreatkind


  








  --
  *Adam Goodkind *
 adamgoodkind.com http://www.adamgoodkind.com
 @adamgreatkind https://twitter.com/#%21/adamgreatkind










 --
 *Adam Goodkind *
 adamgoodkind.com http://www.adamgoodkind.com
 @adamgreatkind https://twitter.com/#!/adamgreatkind






Re: [Scikit-learn-general] silhouette_score and silhouette_samples

2015-06-15 Thread Joel Nothman
See the sample_size parameter: silhouette score can be calculated on a
random subset of the data, presumably for efficiency. Feel free to submit a
PR improving the docstring.
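
A sketch of the relationship (X and labels assumed):

    import numpy as np
    from sklearn.metrics import silhouette_score, silhouette_samples

    full = silhouette_score(X, labels)   # == np.mean(silhouette_samples(X, labels))
    # with sample_size set, a random subset drawn via random_state is scored instead
    approx = silhouette_score(X, labels, sample_size=200, random_state=0)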

On 16 June 2015 at 13:54, Sebastian Raschka se.rasc...@gmail.com wrote:

 Hi, all,

 I am a little bit confused about the two related metrics silhouette_score
 and silhouette_samples. The silhouette_samples calculates the silhouette
 coefficient for each sample and returns an array of those. However, I am
 wondering if I interpret the silhouette_score correctly. Based on the
 documentation at
 http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
 I assume that it's just the average of the silhouette coefficients, which
 can be confirmed by running, e.g.,

 np.mean(silhouette_samples(X, y, metric='euclidean'))

 Now, I am wondering why silhouette_score has this additional random_state
 parameter?

 Best,
 Sebastian




Re: [Scikit-learn-general] Sample weighting in RandomizedSearchCV

2015-06-09 Thread Joel Nothman
Until sample_weight is directly supported in Pipeline, you need to prefix
`sample_weight` by the step name with '__'. So for Pipeline([('a', A()),
('b', B())]), use fit_params={'a__sample_weight': sample_weight,
'b__sample_weight': sample_weight} or similar.
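
A sketch with RandomizedSearchCV (the step names, distributions and data are
illustrative):

    from sklearn.pipeline import Pipeline
    from sklearn.grid_search import RandomizedSearchCV

    pipe = Pipeline([('a', A()), ('b', B())])
    search = RandomizedSearchCV(
        pipe, param_distributions=param_dist, n_iter=10,
        fit_params={'a__sample_weight': sample_weight,
                    'b__sample_weight': sample_weight})
    search.fit(X, y)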

HTH

On 10 June 2015 at 03:57, José Guilherme Camargo de Souza 
jose.camargo.so...@gmail.com wrote:

 Hi Andy,

 Thanks for your reply. The full traceback is below, weights.shape and
 the training data shape are:

 (773,)
 (773, 82)

 I weas using a ExtraTreeClassifier but the same thing happens with an
 SVC. It doesn't seem to be an estimator-specific issue.

 ...
 ...
 /Users/jgcdesouza/anaconda/lib/python2.7/site-packages/sklearn/pipeline.pyc
 in _pre_transform(self=Pipeline(steps=[('standardscaler',
 StandardScale...one, shrinking=True, tol=0.001, verbose=False))]),
 X=array([[ 16.   ,  16.   ,   1.   ,    1.   ,
   4.   ,   4.   ]]), y=array([ 1.,  1.,  1.,  1.,  1.,
  1.,  1.,  0.,  ...,  0.,
 1.,  1.,  0.,  0.,  1.,  1.,  0.]),
 **fit_params={'sample_weight': array([ 0.54980595,  0.54980595,
 0.54980595,  0...5,
 0.54980595,  0.54980595,  0.45019405])})
     111     # Estimator interface
     112
     113     def _pre_transform(self, X, y=None, **fit_params):
     114         fit_params_steps = dict((step, {}) for step, _ in self.steps)
     115         for pname, pval in six.iteritems(fit_params):
 --> 116             step, param = pname.split('__', 1)
     117             fit_params_steps[step][param] = pval
     118         Xt = X
     119         for name, transform in self.steps[:-1]:
     120             if hasattr(transform, "fit_transform"):

 ValueError: need more than 1 value to unpack
 ___

 Process finished with exit code 1
 



 José Guilherme




[Scikit-learn-general] my silence

2015-05-31 Thread Joel Nothman
Just a quick note that I've been silent lately because I've been Busy With
Life, but also because github was notifying an email address hosted at my
previous employer, which was deactivated a fortnight ago. If there were
issues that sought my particular attention, please let me know.


Re: [Scikit-learn-general] how to know which feature is informative or redundant in make_classification()?

2015-05-28 Thread Joel Nothman
As at
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html

Prior to shuffling, `X` stacks a number of these primary informative
features, redundant linear combinations of these, repeated
duplicates
of sampled features, and arbitrary noise for and remaining features.

If you set shuffle=False, then you can extract the first n_informative
columns as the primary informative features, etc.
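
A sketch (the sizes are arbitrary):

    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               n_redundant=3, n_repeated=2, shuffle=False,
                               random_state=0)
    X_informative = X[:, :5]    # primary informative features
    X_redundant = X[:, 5:8]     # linear combinations of the informative ones
    X_repeated = X[:, 8:10]     # duplicates of sampled features
    X_noise = X[:, 10:]         # the remaining noise features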

HTH

On 28 May 2015 at 19:18, Daniel Homola daniel.homol...@imperial.ac.uk
wrote:

 Hi everyone,

 I'm benchmarking various feature selection methods, and for that I use
 the make_classification helper function which really great. However, is
 there a way to retrieve a list of the informative and redundant features
 after generating the fake data? It would really interesting to see, if
 the algorithm I'm working on is able to tell the difference between
 informative and redundant ones.

 Cheers,
 Daniel





Re: [Scikit-learn-general] how to know which feature is informative or redundant in make_classification()?

2015-05-28 Thread Joel Nothman
I should note however that the informative features already have
covariance, so their differentiation from the redundant features is likely
hard. One difference is that the covariance is per-class in the underlying
features, whereas the redundant features will vary identically
(disregarding added noise in flip_y) across classes with respect to the
informative features.

On 28 May 2015 at 19:57, Joel Nothman joel.noth...@gmail.com wrote:

 As at
 http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html

 Prior to shuffling, `X` stacks a number of these primary informative
 features, redundant linear combinations of these, repeated
 duplicates
 of sampled features, and arbitrary noise for and remaining features.

 If you set shuffle=False, then you can extract the first n_informative
 columns as the primary informative features, etc.

 HTH

 On 28 May 2015 at 19:18, Daniel Homola daniel.homol...@imperial.ac.uk
 wrote:

 Hi everyone,

 I'm benchmarking various feature selection methods, and for that I use
 the make_classification helper function which really great. However, is
 there a way to retrieve a list of the informative and redundant features
 after generating the fake data? It would really interesting to see, if
 the algorithm I'm working on is able to tell the difference between
 informative and redundant ones.

 Cheers,
 Daniel







Re: [Scikit-learn-general] Grid search error

2015-05-17 Thread Joel Nothman
Sorry, I meant https://github.com/scikit-learn/scikit-learn/issues/4301

On 18 May 2015 at 12:10, Joel Nothman joel.noth...@gmail.com wrote:

 Sorry, grid search (and similar) does not support clusterers. This
 probably should be formally tracked as an issue.
 https://github.com/scikit-learn/scikit-learn/issues/4040 might be helpful
 to you.

 On 18 May 2015 at 11:56, Jitesh Khandelwal jk231...@gmail.com wrote:

 I have recently been using grid search to evaluate a custom method for
 dimensionality reduction (DR) along with supervised and unsupervised
 estimators later in the pipeline to discover its usefulness.

 gr = grid_search.GridSearchCV(pipeline, param_grid, cv=None)

 The scoring functions I used are:
 1. make_scorer(adjusted_rand_index)
 2. make_scorer(homogenity_score)

 The two settings that I have used successfully are:
 pipeline1 = [DR method, knn classifier], a case for supervised estimator
 pipeline2 = [DR method, kmean clustering], a case for unsupervised
 estimator

 But I am getting an error for the following:
 pipeline3 = [DR method, DBScan clustering]
 pipeline4 = [DR method, agglomerative clustering]

 The reason being that DBSCAN and agglomerative clustering do not have a
 predict method in their API. Why is this so?

 I am just guessing that maybe this is because it is not possible for
 these 2 algorithms to assign cluster labels to unseen data. Correct me if I
 am wrong.

 Even if this is the case, shouldn't grid search automatically decide to
 use
 either
 pred = est.fit(X1).predict(X2) if cv is not None
 or
 pred = est.fit_predict(X) if cv is None (as in my case above)
 based on the cv parameter?

 Thanks
 Jitesh









Re: [Scikit-learn-general] Grid search error

2015-05-17 Thread Joel Nothman
Sorry, grid search (and similar) does not support clusterers. This probably
should be formally tracked as an issue.
https://github.com/scikit-learn/scikit-learn/issues/4040 might be helpful
to you.
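
In the meantime, a manual alternative is easy enough to write (a sketch; the
parameter values, metric and the X_reduced/y_true names are illustrative):

    from sklearn.grid_search import ParameterGrid
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import adjusted_rand_score

    results = []
    for params in ParameterGrid({'eps': [0.3, 0.5, 1.0], 'min_samples': [5, 10]}):
        labels = DBSCAN(**params).fit_predict(X_reduced)
        results.append((adjusted_rand_score(y_true, labels), params))
    print(max(results, key=lambda r: r[0]))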

On 18 May 2015 at 11:56, Jitesh Khandelwal jk231...@gmail.com wrote:

 I have recently been using grid search to evaluate a custom method for
 dimensionality reduction (DR) along with supervised and unsupervised
 estimators later in the pipeline to discover its usefulness.

 gr = grid_search.GridSearchCV(pipeline, param_grid, cv=None)

 The scoring functions I used are:
 1. make_scorer(adjusted_rand_index)
 2. make_scorer(homogenity_score)

 The two settings that I have used successfully are:
 pipeline1 = [DR method, knn classifier], a case for supervised estimator
 pipeline2 = [DR method, kmean clustering], a case for unsupervised
 estimator

 But I am getting an error for the following:
 pipeline3 = [DR method, DBScan clustering]
 pipeline4 = [DR method, agglomerative clustering]

 The reason being that DBSCAN and agglomerative clustering do not have a
 predict method in their API. Why is this so?

 I am just guessing that maybe this is because it is not possible for
 these 2 algorithms to assign cluster labels to unseen data. Correct me if I
 am wrong.

 Even if this is the case, shouldn't grid search automatically decide to
 use
 either
 pred = est.fit(X1).predict(X2) if cv is not None
 or
 pred = est.fit_predict(X) if cv is None (as in my case above)
 based on the cv parameter?

 Thanks
 Jitesh








Re: [Scikit-learn-general] Divisive Hierarchical Clustering

2015-05-17 Thread Joel Nothman
Hi Sam,

I think this could be interesting. You could allow for learning parameters
on each sub-cluster by accepting a transformer as a parameter, then using
sample = sklearn.base.clone(transformer).fit_transform(sample).

I suspect bisecting k-means is notable enough and different enough for
inclusion. Seeing as you have an implementation, I would suggest
constructing a small example (preferably on real-world data) that
highlights the superiority or distinctiveness of this approach. Once you
have something illustrative, submit a PR with the output and see how people
review it.

In terms of establishing a fixed algorithm, is the criterion for which
cluster is next expanded standard in the literature? Are there
alternatives? (I've not read up yet.)

Thanks,

Joel

On 17 May 2015 at 06:43, Sam Schetterer samsc...@gmail.com wrote:

 Andreas,

 There isn't necessarily a linkage function defined, at least in the sense
 of agglomerative clustering, since this is not comparing clusters to merge
 but rather breaking them up. The clusters are split using another
 clustering algorithm supplied by the caller. The most common one that I've
 found in literature is kmeans with 2 clusters, which leads to a binary tree
 structure and is generally referred to as bisecting kmeans (used for
 example in the first citation). One could use any clustering algorithm,
 even have two different ones that are used in different conditions
  (spectral clustering when n < 1000 and kmeans otherwise, for example).
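
 As a rough sketch of the basic bisecting loop (splitting the largest cluster
 each time is just one possible criterion, and no transformer refitting is
 shown):

     import numpy as np
     from sklearn.cluster import KMeans

     def bisecting_kmeans(X, n_clusters, random_state=0):
         labels = np.zeros(X.shape[0], dtype=int)
         for new_label in range(1, n_clusters):
             biggest = np.bincount(labels).argmax()        # cluster to split next
             mask = labels == biggest
             halves = KMeans(n_clusters=2, random_state=random_state).fit_predict(X[mask])
             labels[np.flatnonzero(mask)[halves == 1]] = new_label
         return labels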

 In addition, with divisive clustering, one can refine the distance metric
 for various tree branches which I don't think is possible with hierarchical
 clustering. I've done this with text clustering to get more accurate tf-idf
 deeper in the hierarchy and the second paper I cited in the original email
 performs the SVD at each new level.

 You bring up a good point about divisive vs agglomerative being an
 implementation detail although I think for certain uses, it may be very
 important. If it's expensive to compute a connectivity matrix, a bisecting
 kmeans will perform significantly better than the agglomerative methods on
 larger datasets.

 Best,
 Sam



Re: [Scikit-learn-general] Why don't we support Neural Network Algorithms?

2015-05-06 Thread Joel Nothman
What Sebastian and Ronnie said. Plus: there are multiple off-the-shelf
neural net pull requests in the process of review, notably those by Issam
Laradji for GSoC 2014. Extreme Learning Machines and Multilayer Perceptrons
should be merged Real Soon Now.


On 7 May 2015 at 14:58, Ronnie Ghose ronnie.gh...@gmail.com wrote:

 neural nets are already well supported in other python libraries and don't
 fit the current transformer model that scikit-learn uses

 On Thu, May 7, 2015 at 12:55 AM, Sebastian Raschka se.rasc...@gmail.com
 wrote:

 I am not one of the core developers, just a typical user, but although I
 think that neural nets would be a nice addition, I have to admit that I
 wouldn't count them as top priority. I think that in applications, neural
 networks require far more flexibility for tweaking than the classic
 off-the-shelve learning algorithms currently implemented in scikit-learn. I
 think that it really requires a lot of planning to implement them in a way
 that allows a user certain flexibility. To me, neural nets are more of an
 research tool in contrast to the currently implemented algos in
 scikit-learn. I really would like to see some way of implementing
 frameworks for neural networks in some useful way in scikit-learn, but I
 can understand that it would really require a different API, a lot of
 planning, and a lot of work. Also, there are many attempts to implement
 neural nets already, like pylearn2, lasagne, and all the other theano
 wrappers




  On May 6, 2015, at 11:15 PM, 赵孽 snakehunt2...@gmail.com wrote:
 
  I was looking for neural network algorithms in sklearn, but I just
 found an RBF in it.
  There are plenty of neural network algorithms, so why do we only support
 RBF, which is not even a typical neural network?
  I thought neural networks should be the largest family among
 sklearn algorithms, but it is far smaller than embeddings, far smaller than
 SVMs.
 











Re: [Scikit-learn-general] clustering on unordered set

2015-04-30 Thread Joel Nothman
The algorithm isn't the issue so much as defining a metric that measures
the distance or affinity between items, or else finding a way to reduce
your data to a more standard metric space.

I have for instance clustered sets of objects by first minhashing them (an
approximate dim reduction for sets) then DBSCAN clustering in hamming
space. One benefit of this was that objects that differed only a little
might be reduced to the same hash, making the number of distinct samples to
cluster smaller, with weighted samples employed in DBSCAN instead.
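
As an illustration only (the Jaccard-style distance here is not from the
thread), one can also hand DBSCAN a precomputed distance matrix over the raw
sets; `samples` is assumed to be a list of sets of bitstrings:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def set_distance(a, b):
        # Jaccard distance between two sets of bitstrings
        return 1.0 - len(a & b) / float(len(a | b))

    data = [frozenset(s) for s in samples]
    D = np.array([[set_distance(a, b) for b in data] for a in data])
    labels = DBSCAN(eps=0.4, min_samples=3, metric='precomputed').fit_predict(D)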

On 1 May 2015 at 06:32, Paul Frandsen paulbfrand...@gmail.com wrote:

 Hello,

 I'm interested in clustering many unordered sets of bitsets. In general, a
 data point would look like: {101110, 010001, 001100,
 11}, where each bitset has the same number of digits and are
 ordered, but the set is unordered. Alternatively (with this particular data
 set), I could represent the same data point as a set of sets of integers:
 {{0,2,3,8},{1,9},{6,7},{4,5}}. Ideally, I'd like to use k-means, but I
 imagine that figuring out centroids would be difficult. Are there any
 clustering algorithms in scikit-learn that could cluster data like these?
 I've looked through the docs, but I am coming up short.

 Thank you,

 Paul Frandsen







Re: [Scikit-learn-general] Topic extraction

2015-04-29 Thread Joel Nothman
Yes, this is not a probabilistic method.

On 29 April 2015 at 14:56, C K Kashyap ckkash...@gmail.com wrote:

 Works like a charm. Just noticed though that the max value is sometimes
 more than 1.0; is that okay?

 Regards,
 Kashyap

 On Wed, Apr 29, 2015 at 10:12 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 mask with np.max(..., axis=1) > threshold

 On 29 April 2015 at 14:35, C K Kashyap ckkash...@gmail.com wrote:

 Thank you so much Joel,

 I understood. Just one more thing please.

 How can I include a document against its highest ranking topic only if
 it crosses a threshold?

 regards,
 Kashyap

 On Wed, Apr 29, 2015 at 9:45 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 Highest ranking topic for each doc is just
 np.argmax(nmf.transform(tfidf), axis=1).

 This is because nmf.transform(tfidf)
 (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF.transform)
 returns a matrix of shape (num samples, num components / topics) scoring
 each topic per sample. An argmax over axis 1 indicates the highest scoring
 topic per sample.

 On 29 April 2015 at 11:44, C K Kashyap ckkash...@gmail.com wrote:

 Thanks Joel and Andreas,

 Joel, I think highest ranking topic for each doc is exactly what I
 am looking for. Could you elaborate on the code please?

 What would be dataset.target_names and dataset.target in my case -
 http://lpaste.net/131649

 Regards,
 Kashyap

 On Wed, Apr 29, 2015 at 3:08 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 This shows the newsgroup name and highest scoring topic for each doc.

 zip(np.take(dataset.target_names, dataset.target),
 np.argmax(nmf.transform(tfidf), axis=1))

 I think something based on this should be added to the example.

 On 29 April 2015 at 07:01, Andreas Mueller t3k...@gmail.com wrote:

  Clusters are one per data point, while topics are not. So the model
 is slightly different.
 You can get the list of topics for each sample using
 NMF().fit_transform(X).


 On 04/28/2015 01:13 PM, C K Kashyap wrote:

 Hi everyone,
 I am new to scikit. I only feel sad for not knowing it earlier -
 it's awesome.

  I am trying to do the following. Extract topics from a bunch of
 tweets. I tried NMF (from the sample here -
 http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html)
 but I was not able to figure out how to list documents corresponding to 
 the
 extracted topics. Could someone please point me to an example that lists
 the documents under each topic?

  When I got stuck with NMF, I thought of using kmeans (min batch).
 I am just wondering though if clustering is a reasonable approach for
 topics.

  I'd really appreciate any advice here.

  Thanks,
 Kashyap















Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
Highest ranking topic for each doc is just np.argmax(nmf.transform(tfidf),
axis=1).

This is because nmf.transform(tfidf)
(http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF.transform)
returns a matrix of shape (num samples, num components / topics) scoring
each topic per sample. An argmax over axis 1 indicates the highest scoring
topic per sample.
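
Putting that together, a sketch that groups documents by their top topic,
optionally keeping only those whose best score clears a threshold (documents
and threshold are placeholder names; nmf and tfidf follow the topics
extraction example):

    import numpy as np

    doc_topic = nmf.transform(tfidf)            # shape (n_samples, n_topics)
    best_topic = doc_topic.argmax(axis=1)
    confident = doc_topic.max(axis=1) > threshold
    for topic_id in range(doc_topic.shape[1]):
        members = np.flatnonzero(confident & (best_topic == topic_id))
        print(topic_id, [documents[i] for i in members[:5]])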

On 29 April 2015 at 11:44, C K Kashyap ckkash...@gmail.com wrote:

 Thanks Joel and Andreas,

 Joel, I think highest ranking topic for each doc is exactly what I am
 looking for. Could you elaborate on the code please?

 What would be dataset.target_names and dataset.target in my case -
 http://lpaste.net/131649

 Regards,
 Kashyap

 On Wed, Apr 29, 2015 at 3:08 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 This shows the newsgroup name and highest scoring topic for each doc.

 zip(np.take(dataset.target_names, dataset.target),
 np.argmax(nmf.transform(tfidf), axis=1))

 I think something based on this should be added to the example.

 On 29 April 2015 at 07:01, Andreas Mueller t3k...@gmail.com wrote:

  Clusters are one per data point, while topics are not. So the model is
 slightly different.
 You can get the list of topics for each sample using
 NMF().fit_transform(X).


 On 04/28/2015 01:13 PM, C K Kashyap wrote:

 Hi everyone,
 I am new to scikit. I only feel sad for not knowing it earlier - it's
 awesome.

  I am trying to do the following. Extract topics from a bunch of
 tweets. I tried NMF (from the sample here -
 http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html)
 but I was not able to figure out how to list documents corresponding to the
 extracted topics. Could someone please point me to an example that lists
 the documents under each topic?

  When I got stuck with NMF, I thought of using kmeans (min batch). I am
 just wondering though if clustering is a reasonable approach for topics.

  I'd really appreciate any advice here.

  Thanks,
 Kashyap










 --
 One dashboard for servers and applications across Physical-Virtual-Cloud
 Widest out-of-the-box monitoring support with 50+ applications
 Performance metrics, stats and reports that give you Actionable Insights
 Deep dive visibility with transaction tracing using APM Insight.
 http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




 --
 One dashboard for servers and applications across Physical-Virtual-Cloud
 Widest out-of-the-box monitoring support with 50+ applications
 Performance metrics, stats and reports that give you Actionable Insights
 Deep dive visibility with transaction tracing using APM Insight.
 http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
 ___
 Scikit-learn-general mailing list
 Scikit-learn-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


--
One dashboard for servers and applications across Physical-Virtual-Cloud 
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y___
Scikit-learn-general mailing list
Scikit-learn

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
mask with np.max(..., axis=1) > threshold
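
For instance (a rough sketch, with a toy doc_topic matrix standing in for
nmf.transform(tfidf) and an arbitrary 0.1 cutoff):

import numpy as np

doc_topic = np.array([[0.02, 0.50],
                      [0.45, 0.03],
                      [0.04, 0.05]])          # rows = documents, columns = topic scores
threshold = 0.1

best_topic = np.argmax(doc_topic, axis=1)
keep = np.max(doc_topic, axis=1) > threshold  # boolean mask: best score crosses the threshold
for i in np.flatnonzero(keep):
    print("doc %d -> topic %d" % (i, best_topic[i]))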

On 29 April 2015 at 14:35, C K Kashyap ckkash...@gmail.com wrote:

 Thank you so much Joel,

 I understood. Just one more thing please.

 How can I include a document against it's highest ranking topic only if it
 crosses a threshold?

 regards,
 Kashyap

 On Wed, Apr 29, 2015 at 9:45 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 Highest ranking topic for each doc is just np.argmax(nmf.transform(tfidf),
 axis=1).

 This is because nmf.transform
 http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF.transform(tfidf)
 returns a matrix of shape (num samples, num components / topics) scoring
 each topic per sample. An argmax over axis 1 indicates the highest scoring
 topic per sample.

 On 29 April 2015 at 11:44, C K Kashyap ckkash...@gmail.com wrote:

 Thanks Joel and Andreas,

 Joel, I think highest ranking topic for each doc is exactly what I am
 looking for. Could you elaborate on the code please?

 What would be dataset.target_names and dataset.target in my case -
 http://lpaste.net/131649

 Regards,
 Kashyap

 On Wed, Apr 29, 2015 at 3:08 AM, Joel Nothman joel.noth...@gmail.com
 wrote:

 This shows the newsgroup name and highest scoring topic for each doc.

 zip(np.take(dataset.target_names, dataset.target),
 np.argmax(nmf.transform(tfidf), axis=1))

 I think something based on this should be added to the example.

 On 29 April 2015 at 07:01, Andreas Mueller t3k...@gmail.com wrote:

  Clusters are one per data point, while topics are not. So the model
 is slightly different.
 You can get the list of topics for each sample using
 NMF().fit_transform(X).


 On 04/28/2015 01:13 PM, C K Kashyap wrote:

 Hi everyone,
 I am new to scikit. I only feel sad for not knowing it earlier - it's
 awesome.

  I am trying to do the following. Extract topics from a bunch of
 tweets. I tried NMF (from the sample here -
 http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html)
 but I was not able to figure out how to list documents corresponding to 
 the
 extracted topics. Could someone please point me to an example that lists
 the documents under each topic?

  When I got stuck with NMF, I thought of using kmeans (min batch). I
 am just wondering though if clustering is a reasonable approach for
 topics.

  I'd really appreciate any advice here.

  Thanks,
 Kashyap



Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
This shows the newsgroup name and highest scoring topic for each doc.

zip(np.take(dataset.target_names, dataset.target),
np.argmax(nmf.transform(tfidf), axis=1))

I think something based on this should be added to the example.

On 29 April 2015 at 07:01, Andreas Mueller t3k...@gmail.com wrote:

  Clusters are one per data point, while topics are not. So the model is
 slightly different.
 You can get the list of topics for each sample using
 NMF().fit_transform(X).


 On 04/28/2015 01:13 PM, C K Kashyap wrote:

 Hi everyone,
 I am new to scikit. I only feel sad for not knowing it earlier - it's
 awesome.

  I am trying to do the following. Extract topics from a bunch of tweets.
 I tried NMF (from the sample here -
 http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html)
 but I was not able to figure out how to list documents corresponding to the
 extracted topics. Could someone please point me to an example that lists
 the documents under each topic?

  When I got stuck with NMF, I thought of using kmeans (min batch). I am
 just wondering though if clustering is a reasonable approach for topics.

  I'd really appreciate any advice here.

  Thanks,
 Kashyap




Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Joel Nothman
I assume you have checked that combine_train_test_dataset produces data of
the correct dimensions in both X and y.

I would be very surprised if the problem were not in PAA, so check it
again: make sure that you test that PAA().fit(X1).transform(X2) gives the
transformation of X2. The error seems to suggest it is returning an array
of X1's size.
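
As a quick sanity check of that invariant, here is a sketch that uses PCA purely
as a stand-in for the custom PAA transformer (the 37/19 shapes echo the error
message quoted below):

import numpy as np
from sklearn.decomposition import PCA

X1 = np.random.rand(37, 256)   # e.g. one cross-validation training split
X2 = np.random.rand(19, 256)   # e.g. the corresponding test split

est = PCA(n_components=10)
Xt = est.fit(X1).transform(X2)
assert Xt.shape == (19, 10)    # rows must follow X2, never the data passed to fit()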

On 28 April 2015 at 05:11, Jitesh Khandelwal jk231...@gmail.com wrote:

 Hi Andreas,

 Thanks for your response.

 No, PAA does not change the number of samples. It just reduces the number
 of features.

 For example if the input matrix is X and X.shape = (100, 100) and the
 n_components = 10 in PAA, then the resultant X.shape = (100, 10).

 Yes, I did try using PAA in the ipython shell (without the grid search) on
 the same dataset and it does the transformation as expected.

 Another interesting observation is that the dataset that I have used in
 the code has dimensions (56, 256) and also 37 + 19 = 56. Does this provide
 any insight about the error?


 [image: --]
 Jitesh Khandelwal
 http://about.me/jitesh.khandelwal?promo=email_sig
 [image: http://]about.me/jitesh.khandelwal
 http://about.me/jitesh.khandelwal?promo=email_sig


 On Tue, Apr 28, 2015 at 12:26 AM, Andreas Mueller t3k...@gmail.com
 wrote:

  Does PAA by any chance change the number of samples?
 The error is:
 ValueError: Found array with dim 37. Expected 19

 Interestingly that happens only in the scoring.

 Does it work without the grid-search?



 On 04/27/2015 07:14 AM, Jitesh Khandelwal wrote:

  Hi all,

  I am trying to use grid search to evaluate some decomposition
 techniques of my own. I have implemented some custom transformers such as
 PAA, DFT, DWT as shown in the code below.

  I am getting a strange ValueError when run the below code and I am
 unable to figure out the origin of the problem.

  I have pasted the code below and attached the error log file.

  Any suggestions on how can I move forward from here would be helpful.

  Thanks.

  Code:
 ===
  from sklearn.pipeline import Pipeline
 from sklearn.grid_search import GridSearchCV
 from sklearn.neighbors import KNeighborsClassifier

  from time_series.decomposition import PAA, DFT, DWT, ShapeX
 from prepare_data import combine_train_test_dataset

  knn = KNeighborsClassifier()
 paa = PAA()

  pipe = Pipeline([
 ('paa', paa),
 ('knn', knn)
 ])

  n_components = [1,2,4,5,10,20,40]
 n_neighbors = range(1,11)
 metrics = ['euclidean']

  datadir = "../keogh_datasets/Coffee"
 X,y = combine_train_test_dataset(datadir)

  model_tunning = GridSearchCV(pipe, {
 'paa__n_components': n_components,
 'knn__n_neighbors': n_neighbors,
 'knn__metric': metrics,
 },
 n_jobs=-1)

  model_tunning.fit(X,y)

  print model_tunning.best_score_
 print model_tunning.best_params_
 ===




Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Joel Nothman
I suspect this method is underreported by any particular name, as it's a
straightforward greedy search. It is also very close to what I think many
researchers do in system development or report in system analysis, albeit
with more automation.

In the case of KNN, I would think metric learning could subsume or
outperform this.
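
For reference, the greedy sequential backward search being discussed can be
sketched in a few lines (iris and 3-NN are placeholders; cross_val_score lives
in sklearn.model_selection in later releases):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score

data = load_iris()
X, y = data.data, data.target
clf = KNeighborsClassifier(n_neighbors=3)

remaining = list(range(X.shape[1]))
while len(remaining) > 1:
    # score every subset obtained by dropping one of the remaining features
    scores = [(cross_val_score(clf, X[:, [f for f in remaining if f != drop]],
                               y, cv=5).mean(), drop)
              for drop in remaining]
    best_score, worst = max(scores)           # keep the subset that scores best
    print("dropping feature %d, CV accuracy %.3f" % (worst, best_score))
    remaining.remove(worst)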

On 28 April 2015 at 08:50, Andreas Mueller t3k...@gmail.com wrote:

 Maybe we would want mrmr first?

 http://penglab.janelia.org/proj/mRMR/


 On 04/27/2015 06:46 PM, Sebastian Raschka wrote:
  I guess that could be done, but has a much higher complexity than RFE.
  Oh yes, I agree, the sequential feature algorithms are definitely
 computationally more costly.
 
  It seems interesting. Is that really used in practice and is there any
  literature evaluating it?
 
  I am not sure how often it is used in practice nowadays, but I think it
 is one of the classic approaches for feature selection -- I learned about
 it a couple of years ago in a pattern classification class, and there is a
 relatively detailed article in
 
  Ferri, F., et al. Comparative study of techniques for large-scale
 feature selection. Pattern Recognition in Practice IV (1994): 403-413.
 
  The optimal solution to feature selection would be to evaluate the
 performance of all possible feature combination, which is a little bit too
 costly in practice. The sequential forward or backward selection (SFS and
 SBS) algorithms are just a suboptimal solution, and there are some minor
 improvements, e.g,. Sequential Floating Forward Selection (SFFS) which
 allows for the removal of added features in later stages etc.
 
  I have an implementation of SBS that uses k-fold cross_val_score, and it
 is actually not a bad idea to use it for KNN to reduce overfitting as
 alternative to dimensionality reduction, for example, KNN cross-val mean
 accuracy on the wine dataset where the features are selected by SBS:
 http://i.imgur.com/ywDTHom.png?1
 
  But for scikit-learn, it may be better to implement SBBS or SFFS which
 is slightly more sophisticated.
 
 
  On Apr 27, 2015, at 6:00 PM, Andreas Mueller t3k...@gmail.com wrote:
 
  That is like a one-step look-ahead feature selection?
  I guess that could be done, but has a much higher complexity than RFE.
  RFE works for anything that returns importances, not just linear
 models.
  It doesn't really work for KNN, as you say. [I wouldn't say
  non-parametric models. Trees are pretty non-parametric].
 
  It seems interesting. Is that really used in practice and is there any
  literature evaluating it?
  There is some discussion here
  http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2
  but there is no empirical comparison or theoretical analysis.
 
  To be added to sklearn, you'd need to show that it is widely used and /
  or widely useful.
 
 
  On 04/27/2015 02:47 PM, Sebastian Raschka wrote:
  Hi, I was wondering if sequential feature selection algorithms are
 currently implemented in scikit-learn. The closest that I could find was
 recursive feature elimination (RFE);
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.
 However, unless the application requires a fixed number of features, I am
 not sure if it is necessarily worthwhile using it over regularized models.
 When I understand correctly, it works like this:
 
  {x1, x2, x3} -- eliminate xi with smallest corresponding weight
 
  {x1, x3} -- eliminate xi with smallest corresponding weight
 
  {x1}
 
  However, this would only work with linear, discriminative models right?
 
  Wouldn't be a classic sequential feature selection algorithm useful
 for non-regularized, nonparametric models e.g,. K-nearest neighbors as an
 alternative to dimensionality reduction for applications where the original
 features may need to be maintained? The RFE, for example, wouldn't work
 with KNN, and maybe the data is non-linearly separable so that RFE with a
 linear model doesn't make sense.
 
  In a nutshell, SFS algorithms simply add or remove one feature at the
 time based on the classifier performance.
 
  e.g., Sequential backward selection:
 
  {x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1,
 x3}, and pick the subset with the best performance
  {x1, x3} --- estimate performance on {x1}, {x3} and pick the subset
 with the best performance
  {x1}
 
  where performance could be e.g., cross-val accuracy.
 
  What do you think?
 
  Best,
  Sebastian
 

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-19 Thread Joel Nothman
On 17 April 2015 at 13:52, Daniel Vainsencher daniel.vainsenc...@gmail.com
wrote:

 On 04/16/2015 05:49 PM, Joel Nothman wrote:
  I more or less agree. Certainly we only need to do one searchsorted per
  query per tree, and then do linear scans. There is a question of how
  close we stay to the original LSHForest algorithm, which relies on
  matching prefixes rather than hamming distance. Hamming distance is
  easier to calculate in NumPy and is probably faster to calculate in C
  too (with or without using POPCNT). Perhaps the only advantage of using
  Cython in your solution is to avoid the memory overhead of unpackbits.
 You obviously know more than I do about Cython vs numpy options.

  However, n_candidates before and after is arguably not sufficient if one
  side has more than n_candidates with a high prefix overlap.
 I disagree. Being able to look at 2*n_candidates that must contain
 n_candidates of the closest ones, rather than as many as happen to
 agree on x number of bits is a feature, not a bug. Especially if we
 then filter them by hamming distance.


But it need not contain the closest ones that would have been retrieved by
LSHForest (assuming we're only looking at a single tree). Let's say
n_candidates is 1, our query is 11 and our index contains

A. 10 agreed = 1
B. 110011 agreed = 3
C. 110100 agreed = 5

A binary search will find A-B. The n-candidates x 2 window includes A and
B. C is closer and has a longer prefix overlap with the query than A does.
My understanding of LSHForest is that its ascent by prefix length would
necessarily find C. Your approach would not.

While that may be a feature of your approach, I think we have reason to
prefer a published algorithm.


Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?

2015-04-16 Thread Joel Nothman
Although I note that I've got LaTeX compilation errors, so I'm not sure how
Andy compiles this.

On 16 April 2015 at 20:25, Joel Nothman joel.noth...@gmail.com wrote:

 I've proposed a better chapter ordering at
 https://github.com/scikit-learn/scikit-learn/pull/4602...

 On 16 April 2015 at 03:48, Andreas Mueller t3k...@gmail.com wrote:

 Hi.
 Yes, run make latexpdf in the doc folder.

 Best,
 Andy


 On 04/15/2015 01:11 PM, Tim wrote:
  Thanks, Andy!
 
  How do you generate the pdf file? Can I also do that?
 
  
  On Wed, 4/15/15, Andreas Mueller t3k...@gmail.com wrote:
 
Subject: Re: [Scikit-learn-general] Is there a pdf documentation for
 the latest stable scikit-learn?
To: scikit-learn-general@lists.sourceforge.net
Date: Wednesday, April 15, 2015, 12:55 PM
 
Hi Tim.
There are pdfs for 0.16.0 and 0.16.1 up now
at
 
http://sourceforge.net/projects/scikit-learn/files/documentation/
 
Let us know if there are
issues with it.
 
Cheers,
Andy
 
 
On
04/15/2015 12:08 PM, Tim wrote:

Hello,

 I am
looking for a pdf file for the documentation for the latest
stable scikit-learn i.e. 0.16.1.

 I followed
 http://scikit-learn.org/stable/support.html#documentation-resources,
which leads me to
 http://sourceforge.net/projects/scikit-learn/files/documentation/,
 But the pdf files are for <= 0.12 version and no
later than 2012.


Can the official team make the pdf files available?

 Thanks!


 
  


Re: [Scikit-learn-general] Is there a pdf documentation for the latest stable scikit-learn?

2015-04-16 Thread Joel Nothman
I've proposed a better chapter ordering at
https://github.com/scikit-learn/scikit-learn/pull/4602...

On 16 April 2015 at 03:48, Andreas Mueller t3k...@gmail.com wrote:

 Hi.
 Yes, run make latexpdf in the doc folder.

 Best,
 Andy


 On 04/15/2015 01:11 PM, Tim wrote:
  Thanks, Andy!
 
  How do you generate the pdf file? Can I also do that?
 
  
  On Wed, 4/15/15, Andreas Mueller t3k...@gmail.com wrote:
 
Subject: Re: [Scikit-learn-general] Is there a pdf documentation for
 the latest stable scikit-learn?
To: scikit-learn-general@lists.sourceforge.net
Date: Wednesday, April 15, 2015, 12:55 PM
 
Hi Tim.
There are pdfs for 0.16.0 and 0.16.1 up now
at
 
http://sourceforge.net/projects/scikit-learn/files/documentation/
 
Let us know if there are
issues with it.
 
Cheers,
Andy
 
 
On
04/15/2015 12:08 PM, Tim wrote:

Hello,

 I am
looking for a pdf file for the documentation for the latest
stable scikit-learn i.e. 0.16.1.

 I followed
 http://scikit-learn.org/stable/support.html#documentation-resources,
which leads me to
 http://sourceforge.net/projects/scikit-learn/files/documentation/,
But the pdf files are for <= 0.12 version and no
later than 2012.


Can the official team make the pdf files available?

 Thanks!


 
  


Re: [Scikit-learn-general] Performance of LSHForest

2015-04-16 Thread Joel Nothman
I more or less agree. Certainly we only need to do one searchsorted per
query per tree, and then do linear scans. There is a question of how close
we stay to the original LSHForest algorithm, which relies on matching
prefixes rather than hamming distance. Hamming distance is easier to
calculate in NumPy and is probably faster to calculate in C too (with or
without using POPCNT). Perhaps the only advantage of using Cython in your
solution is to avoid the memory overhead of unpackbits.

However, n_candidates before and after is arguably not sufficient if one
side has more than n_candidates with a high prefix overlap. But until we
look at the suffixes we can't know if it is closer or farther in hamming
distance.

I also think the use of n_candidates in the current code is somewhat
broken, as suggested by my XXX comment in _get_candidates, which we
discussed but did not resolve clearly. I think it will be hard to make
improvements of this sort without breaking the current results and
parameter sensitivities of the implementation.
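
For what it's worth, the hamming-distance filtering under discussion is cheap to
sketch in plain NumPy (the packed 32-bit codes below are an illustrative
assumption, not the actual LSHForest internals):

import numpy as np

rng = np.random.RandomState(0)
candidates = rng.randint(0, 256, size=(1000, 4)).astype(np.uint8)  # packed candidate hashes
query = rng.randint(0, 256, size=(1, 4)).astype(np.uint8)          # packed query hash

# XOR against every candidate, unpack to bits and count the differing bits
hamming = np.unpackbits(np.bitwise_xor(candidates, query), axis=1).sum(axis=1)

n_candidates = 50
closest = np.argsort(hamming)[:n_candidates]  # candidates nearest the query in hamming distance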

On 17 April 2015 at 00:16, Daniel Vainsencher daniel.vainsenc...@gmail.com
wrote:

 Hi Joel,

 To extend your analysis:
 - when n_samples*n_indices is large enough, the bottleneck is the use of
 the index, as you say.
 - when n_dimensions*n_candidates is large enough, the bottleneck is
 computation of true distances between DB points and the query.

 To serve well both kinds of use cases is perfectly possible, but
 requires use of the index that is both:
 A) Fast
 B) Uses the index optimally to reduce the number of candidates for which
 we compare distances.

 Here is a variant of your proposal (better keep track of context) that
 also requires a little Cython but improves both aspects A and B and
 reduces code complexity.

 Observation I:
 Only a single binary search per index is necessary, the first. After we
 find the correct location for the query binary code, we can restrict
 ourselves to the n_candidates (or even fewer) before and after that
 location.

 So no further binary searches are necessary at all, and the restriction
 to a small linear part of the array should be much more cache friendly.
 This makes full use of our array implementation of the ordered collection,
 instead of acting as if we were still on a binary tree implementation as
 in the original LSH-Forest paper.

 There is a price to pay for this simplification: we are now looking at
 (computing full distance from query for) 2*n_candidates*n_indices
 points, which can be expensive (we improved A at a cost to B).

 But here is where some Cython can be really useful. Observation II:
 The best information we can extract from the binary representation is
 not the distances in the tree structure, but hamming distances to the
 query.

 So after the restriction of I, compute the *hamming distances* of the
 2*n_candidate*n_indices points each from the binary representation of
 the query (corresponding to the appropriate index). Then compute full
 metric only for the n_candidates with the lowest hamming distances.

 This should achieve a pretty good sweet spot of performance, with just a
 bit of Cython.

 Daniel

 On 04/16/2015 12:18 AM, Joel Nothman wrote:
  Once we're dealing with large enough index and n_candidates, most time
  is spent in searchsorted in the synchronous ascending phase, while any
  overhead around it is marginal. Currently we are searching over the
  whole array in each searchsorted, while it could be rewritten to keep
  better track of context to cut down the overall array when searching.
  While possible, I suspect this will look confusing in Python/Numpy, and
  Cython will be a clearer and faster way to present this logic.
 
  On the other hand, time spent in _compute_distances is substantial, and
  yet most of its consumption is /outside/ of pairwise_distances. This
  commit
  
 https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
 
  cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
  https://github.com/scikit-learn/scikit-learn/pull/4603!
 
  On 16 April 2015 at 12:32, Maheshakya Wijewardena
  pmaheshak...@gmail.com mailto:pmaheshak...@gmail.com wrote:
 
  Moreover, this drawback occurs because LSHForest does not vectorize
  multiple queries as in 'ball_tree' or any other method. This slows
  the exact neighbor distance calculation down significantly after
  approximation. This will not be a problem if queries are for
  individual points. Unfortunately, former is the more useful usage of
  LSHForest.
  Are you trying individual queries or multiple queries (for
 n_samples)?
 
  On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher
  daniel.vainsenc...@gmail.com mailto:daniel.vainsenc...@gmail.com
  wrote:
 
  LHSForest is not intended for dimensions at which exact methods
  work well, nor for tiny datasets. Try d500, n_points10, I
  don't remember the switchover point

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
I agree this is disappointing, and we need to work on making LSHForest
faster. Portions should probably be coded in Cython, for instance, as the
current implementation is a bit circuitous in order to work in numpy. PRs
are welcome.

LSHForest could use parallelism to be faster, but so can (and will) the
exact neighbors methods. In theory in LSHForest, each tree could be
stored on entirely different machines, providing memory benefits, but
scikit-learn can't really take advantage of this.

Having said that, I would also try with higher n_features and n_queries. We
have to limit the scale of our examples in order to limit the overall
document compilation time.

On 16 April 2015 at 01:12, Miroslav Batchkarov mbatchka...@gmail.com
wrote:

 Hi everyone,

 was really impressed by the speedups provided by LSHForest compared to
 brute-force search. Out of curiosity, I compared LSHForest to the existing
 ball tree implementation. The approximate algorithm is consistently slower
 (see below). Is this normal and should it be mentioned in the
 documentation? Does approximate search offer any benefits in terms of
 memory usage?


 I ran the same example
 http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py
  with
 a algorithm=ball_tree. I also had to set metric=‘euclidean’ (this may
 affect results). The output is:

 Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy:
 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy:
 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy:
 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy:
 0.92 +/-0.07
 Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy:
 0.84 +/-0.10
 Index size: 10, exact: 0.008s, LSHF: 0.016s, speedup: 0.5, accuracy:
 0.80 +/-0.06

 With n_candidates=100, the output is

 Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy:
 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy:
 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy:
 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy:
 0.90 +/-0.11
 Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy:
 0.82 +/-0.13
 Index size: 10, exact: 0.007s, LSHF: 0.013s, speedup: 0.6, accuracy:
 0.78 +/-0.04



 ---
 Miroslav Batchkarov
 PhD Student,
 Text Analysis Group,
 Department of Informatics,
 University of Sussex







Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604...

On 16 April 2015 at 14:24, Joel Nothman joel.noth...@gmail.com wrote:

 Except apparently that commit breaks the code... Maybe I've misunderstood
 something :(

 On 16 April 2015 at 14:18, Joel Nothman joel.noth...@gmail.com wrote:

 ball tree is not vectorized in the sense of SIMD, but there is
 Python/numpy overhead in LSHForest that is not present in ball tree.

 I think one of the problems is the high n_candidates relative to the
 n_neighbors. This really increases the search time.

 Once we're dealing with large enough index and n_candidates, most time is
 spent in searchsorted in the synchronous ascending phase, while any
 overhead around it is marginal. Currently we are searching over the whole
 array in each searchsorted, while it could be rewritten to keep better
 track of context to cut down the overall array when searching. While
 possible, I suspect this will look confusing in Python/Numpy, and Cython
 will be a clearer and faster way to present this logic.

 On the other hand, time spent in _compute_distances is substantial, and
 yet most of its consumption is *outside* of pairwise_distances. This
 commit
 https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
 cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
 https://github.com/scikit-learn/scikit-learn/pull/4603!

 On 16 April 2015 at 12:32, Maheshakya Wijewardena pmaheshak...@gmail.com
  wrote:

 Moreover, this drawback occurs because LSHForest does not vectorize
 multiple queries as in 'ball_tree' or any other method. This slows the
 exact neighbor distance calculation down significantly after approximation.
 This will not be a problem if queries are for individual points.
 Unfortunately, former is the more useful usage of LSHForest.
 Are you trying individual queries or multiple queries (for n_samples)?

 On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher 
 daniel.vainsenc...@gmail.com wrote:

 LHSForest is not intended for dimensions at which exact methods work
 well, nor for tiny datasets. Try d500, n_points10, I don't remember
 the switchover point.

 The documentation should make this clear, but unfortunately I don't see
 that it does.
 On Apr 15, 2015 7:08 PM, Joel Nothman joel.noth...@gmail.com wrote:

 I agree this is disappointing, and we need to work on making LSHForest
 faster. Portions should probably be coded in Cython, for instance, as the
 current implementation is a bit circuitous in order to work in numpy. PRs
 are welcome.

 LSHForest could use parallelism to be faster, but so can (and will)
 the exact neighbors methods. In theory in LSHForest, each tree could be
 stored on entirely different machines, providing memory benefits, but
 scikit-learn can't really take advantage of this.

 Having said that, I would also try with higher n_features and
 n_queries. We have to limit the scale of our examples in order to limit 
 the
 overall document compilation time.

 On 16 April 2015 at 01:12, Miroslav Batchkarov mbatchka...@gmail.com
 wrote:

 Hi everyone,

 was really impressed by the speedups provided by LSHForest compared
 to brute-force search. Out of curiosity, I compared LSHForest to the
 existing ball tree implementation. The approximate algorithm is
 consistently slower (see below). Is this normal and should it be 
 mentioned
 in the documentation? Does approximate search offer any benefits in terms
 of memory usage?


 I ran the same example
 http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py
  with
 a algorithm=ball_tree. I also had to set metric=‘euclidean’ (this may
 affect results). The output is:

 Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0,
 accuracy: 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1,
 accuracy: 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2,
 accuracy: 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3,
 accuracy: 0.92 +/-0.07
 Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5,
 accuracy: 0.84 +/-0.10
 Index size: 10, exact: 0.008s, LSHF: 0.016s, speedup: 0.5,
 accuracy: 0.80 +/-0.06

 With n_candidates=100, the output is

 Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0,
 accuracy: 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1,
 accuracy: 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2,
 accuracy: 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4,
 accuracy: 0.90 +/-0.11
 Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7,
 accuracy: 0.82 +/-0.13
 Index size: 10, exact: 0.007s, LSHF: 0.013s, speedup: 0.6,
 accuracy: 0.78 +/-0.04



 ---
 Miroslav Batchkarov
 PhD Student,
 Text Analysis Group,
 Department of Informatics

Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
ball tree is not vectorized in the sense of SIMD, but there is Python/numpy
overhead in LSHForest that is not present in ball tree.

I think one of the problems is the high n_candidates relative to the
n_neighbors. This really increases the search time.

Once we're dealing with large enough index and n_candidates, most time is
spent in searchsorted in the synchronous ascending phase, while any
overhead around it is marginal. Currently we are searching over the whole
array in each searchsorted, while it could be rewritten to keep better
track of context to cut down the overall array when searching. While
possible, I suspect this will look confusing in Python/Numpy, and Cython
will be a clearer and faster way to present this logic.

On the other hand, time spent in _compute_distances is substantial, and yet
most of its consumption is *outside* of pairwise_distances. This commit
https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
https://github.com/scikit-learn/scikit-learn/pull/4603!
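
To make the searchsorted bookkeeping above concrete, a rough sketch (sorted
integer hashes and the window size are illustrative assumptions, not the actual
LSHForest data structures):

import numpy as np

rng = np.random.RandomState(0)
sorted_hashes = np.sort(rng.randint(0, 2 ** 30, size=10000))  # one tree's sorted codes
query_hash = int(rng.randint(0, 2 ** 30))
n_candidates = 10

pos = np.searchsorted(sorted_hashes, query_hash)   # one binary search per query per tree
lo = max(pos - n_candidates, 0)                    # then a linear scan over a window
hi = min(pos + n_candidates, len(sorted_hashes))   # of entries around the insertion point
window = sorted_hashes[lo:hi]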

On 16 April 2015 at 12:32, Maheshakya Wijewardena pmaheshak...@gmail.com
wrote:

 Moreover, this drawback occurs because LSHForest does not vectorize
 multiple queries as in 'ball_tree' or any other method. This slows the
 exact neighbor distance calculation down significantly after approximation.
 This will not be a problem if queries are for individual points.
 Unfortunately, former is the more useful usage of LSHForest.
 Are you trying individual queries or multiple queries (for n_samples)?

 On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher 
 daniel.vainsenc...@gmail.com wrote:

 LHSForest is not intended for dimensions at which exact methods work
 well, nor for tiny datasets. Try d500, n_points10, I don't remember
 the switchover point.

 The documentation should make this clear, but unfortunately I don't see
 that it does.
 On Apr 15, 2015 7:08 PM, Joel Nothman joel.noth...@gmail.com wrote:

 I agree this is disappointing, and we need to work on making LSHForest
 faster. Portions should probably be coded in Cython, for instance, as the
 current implementation is a bit circuitous in order to work in numpy. PRs
 are welcome.

 LSHForest could use parallelism to be faster, but so can (and will) the
 exact neighbors methods. In theory in LSHForest, each tree could be
 stored on entirely different machines, providing memory benefits, but
 scikit-learn can't really take advantage of this.

 Having said that, I would also try with higher n_features and n_queries.
 We have to limit the scale of our examples in order to limit the overall
 document compilation time.

 On 16 April 2015 at 01:12, Miroslav Batchkarov mbatchka...@gmail.com
 wrote:

 Hi everyone,

 was really impressed by the speedups provided by LSHForest compared to
 brute-force search. Out of curiosity, I compared LSHForest to the existing
 ball tree implementation. The approximate algorithm is consistently slower
 (see below). Is this normal and should it be mentioned in the
 documentation? Does approximate search offer any benefits in terms of
 memory usage?


 I ran the same example
 http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py
  with
 a algorithm=ball_tree. I also had to set metric=‘euclidean’ (this may
 affect results). The output is:

 Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy:
 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy:
 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy:
 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy:
 0.92 +/-0.07
 Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy:
 0.84 +/-0.10
 Index size: 10, exact: 0.008s, LSHF: 0.016s, speedup: 0.5,
 accuracy: 0.80 +/-0.06

 With n_candidates=100, the output is

 Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy:
 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy:
 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy:
 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy:
 0.90 +/-0.11
 Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy:
 0.82 +/-0.13
 Index size: 10, exact: 0.007s, LSHF: 0.013s, speedup: 0.6,
 accuracy: 0.78 +/-0.04



 ---
 Miroslav Batchkarov
 PhD Student,
 Text Analysis Group,
 Department of Informatics,
 University of Sussex






Re: [Scikit-learn-general] Performance of LSHForest

2015-04-15 Thread Joel Nothman
Except apparently that commit breaks the code... Maybe I've misunderstood
something :(

On 16 April 2015 at 14:18, Joel Nothman joel.noth...@gmail.com wrote:

 ball tree is not vectorized in the sense of SIMD, but there is
 Python/numpy overhead in LSHForest that is not present in ball tree.

 I think one of the problems is the high n_candidates relative to the
 n_neighbors. This really increases the search time.

 Once we're dealing with large enough index and n_candidates, most time is
 spent in searchsorted in the synchronous ascending phase, while any
 overhead around it is marginal. Currently we are searching over the whole
 array in each searchsorted, while it could be rewritten to keep better
 track of context to cut down the overall array when searching. While
 possible, I suspect this will look confusing in Python/Numpy, and Cython
 will be a clearer and faster way to present this logic.

 On the other hand, time spent in _compute_distances is substantial, and
 yet most of its consumption is *outside* of pairwise_distances. This
 commit
 https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a
 cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
 https://github.com/scikit-learn/scikit-learn/pull/4603!

 On 16 April 2015 at 12:32, Maheshakya Wijewardena pmaheshak...@gmail.com
 wrote:

 Moreover, this drawback occurs because LSHForest does not vectorize
 multiple queries as in 'ball_tree' or any other method. This slows the
 exact neighbor distance calculation down significantly after approximation.
 This will not be a problem if queries are for individual points.
 Unfortunately, former is the more useful usage of LSHForest.
 Are you trying individual queries or multiple queries (for n_samples)?

 On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher 
 daniel.vainsenc...@gmail.com wrote:

 LHSForest is not intended for dimensions at which exact methods work
 well, nor for tiny datasets. Try d500, n_points10, I don't remember
 the switchover point.

 The documentation should make this clear, but unfortunately I don't see
 that it does.
 On Apr 15, 2015 7:08 PM, Joel Nothman joel.noth...@gmail.com wrote:

 I agree this is disappointing, and we need to work on making LSHForest
 faster. Portions should probably be coded in Cython, for instance, as the
 current implementation is a bit circuitous in order to work in numpy. PRs
 are welcome.

 LSHForest could use parallelism to be faster, but so can (and will) the
 exact neighbors methods. In theory in LSHForest, each tree could be
 stored on entirely different machines, providing memory benefits, but
 scikit-learn can't really take advantage of this.

 Having said that, I would also try with higher n_features and
 n_queries. We have to limit the scale of our examples in order to limit the
 overall document compilation time.

 On 16 April 2015 at 01:12, Miroslav Batchkarov mbatchka...@gmail.com
 wrote:

 Hi everyone,

 was really impressed by the speedups provided by LSHForest compared to
 brute-force search. Out of curiosity, I compared LSHForest to the existing
 ball tree implementation. The approximate algorithm is consistently slower
 (see below). Is this normal and should it be mentioned in the
 documentation? Does approximate search offer any benefits in terms of
 memory usage?


 I ran the same example
 http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py
  with
 a algorithm=ball_tree. I also had to set metric=‘euclidean’ (this may
 affect results). The output is:

 Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy:
 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy:
 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy:
 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3,
 accuracy: 0.92 +/-0.07
 Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5,
 accuracy: 0.84 +/-0.10
 Index size: 10, exact: 0.008s, LSHF: 0.016s, speedup: 0.5,
 accuracy: 0.80 +/-0.06

 With n_candidates=100, the output is

 Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy:
 1.00 +/-0.00
 Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy:
 0.94 +/-0.05
 Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy:
 0.92 +/-0.07
 Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4,
 accuracy: 0.90 +/-0.11
 Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7,
 accuracy: 0.82 +/-0.13
 Index size: 10, exact: 0.007s, LSHF: 0.013s, speedup: 0.6,
 accuracy: 0.78 +/-0.04



 ---
 Miroslav Batchkarov
 PhD Student,
 Text Analysis Group,
 Department of Informatics,
 University of Sussex






Re: [Scikit-learn-general] reconstruct image after preprocessing

2015-04-14 Thread Joel Nothman
Use preprocessing.StandardScaler()'s transform and inverse_transform
methods.
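
A minimal sketch of that, with a random array standing in for the image:

import numpy as np
from sklearn import preprocessing

img = np.random.rand(64, 64)               # stand-in for the original image

scaler = preprocessing.StandardScaler()
img_scaled = scaler.fit_transform(img)     # same column-wise scaling as preprocessing.scale(img)
img_back = scaler.inverse_transform(img_scaled)

print(np.allclose(img, img_back))          # True: the scaler keeps the mean and scale to undo it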

HTH!

On 14 April 2015 at 19:06, Souad Chaabouni chaabouni_so...@yahoo.fr wrote:

 Hello,

 I'm a beginner,
 I have an image which I preprocessed with sklearn

 img_scaled = preprocessing.scale(img)


 my question: how can I reconstruct my original image just from img_scaled?
 Is it possible or not?

 Is there a function that reverses preprocessing.scale?

 Thx for the reply.

 *Souad CHAABOUNI*
 Computer Science PhD student,
 Bordeaux University, Sfax University
 EMAIL: chaabouni_so...@yahoo.fr
 TEL: (+216) 21 77 17 44




Re: [Scikit-learn-general] Help: Getting ValueError @precision_recall_fscore_support

2015-04-13 Thread Joel Nothman
Ignoring the class label 'O' from evaluation will be possible with #4287
https://github.com/scikit-learn/scikit-learn/pull/4287 merged
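
Once that lands, restricting the report to a subset of labels should look
roughly like this (toy BIO-style tags, not the poster's data):

from sklearn.metrics import classification_report

y_true = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'O']
y_pred = ['B-PER', 'O',     'O', 'B-LOC', 'O', 'B-PER']

# 'O' is deliberately left out of the reported labels
print(classification_report(y_true, y_pred, labels=['B-PER', 'I-PER', 'B-LOC']))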

On 14 April 2015 at 11:43, namma igloo nammaig...@outlook.com wrote:

 I was removing the class 'O' (other) from labels as given in the
 python-crfsuite example [1]. It is interesting that the same code with one
  less label works in the original example, but not with my data. Anyway, I could
 fix the issue by keeping all the labels. Thanks a lot Andreas!

 Cheers

 [1] function bio_classification_report () -
 http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb

 --
 Date: Mon, 13 Apr 2015 18:27:20 -0400
 From: t3k...@gmail.com
 To: scikit-learn-general@lists.sourceforge.net
 Subject: Re: [Scikit-learn-general] Help: Getting ValueError
 @precision_recall_fscore_support

 I think this is because your y_true_combined has five classes, while your
 labels and target names only have four classes.


 On 04/13/2015 11:29 AM, namma igloo wrote:

 Hi,

  I'm new to sklearn and am trying out the python-crfsuite example [1] on my own
 data and following the example code, I'm getting -
 @precision_recall_fscore_support method: ValueError: too many boolean
 indices

  I must note that this method is called by [classification_report]
 function which I'm calling with following signature:

  classification_report(
 y_true_combined,
 y_pred_combined,
 labels = [class_indices[cls] for cls in tagset],
 target_names = tagset
  )

  I do not get any error if I omit the optional parameters 'labels' and
 'target_names'. So I am not sure if the issue is because of values of those
 two parameters or somewhere else.

  I have pasted here [2] the values of all four variables supplied to
 classification_report.
 I'll appreciate in any help in troubleshooting this issue.

  [1]
 http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb
 [2] http://pastebin.com/qaq7Kf1u

  Thanks




Re: [Scikit-learn-general] Micro and Macro F-measure for text classification

2015-04-11 Thread Joel Nothman
Or report macro and micro in classification_report. Micro is equivalent to
accuracy for multiclass without #4287
https://github.com/scikit-learn/scikit-learn/pull/4287.
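A toy sketch (made-up labels) of the two averaging options:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

# Micro-averaging pools all decisions, so for plain multiclass it equals accuracy.
print(f1_score(y_true, y_pred, average='micro'))
# Macro-averaging weights every class equally, regardless of its support.
print(f1_score(y_true, y_pred, average='macro'))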

On 10 April 2015 at 01:00, Andreas Mueller t3k...@gmail.com wrote:

  Hi Jack.
 You mean in the classification report?
 From looking at the code, that gives the micro-average:

 https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/classification.py#L1265

 If you use the f1_score function instead, you can specify the averaging scheme:

 http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score

 I wonder if we should rather call precision_recall_fscore_support again
 here with an average option?

 Hth,
 Andy



 On 04/09/2015 10:20 AM, Jack Alan wrote:

 Hi folks,

  I wonder, for the classification of text documents example available at:

 http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py

  What sort of F-measure has been used? Is it micro or macro? And how do I
 change the default option to use the other?

  ~j




Re: [Scikit-learn-general] Artificial Neural Networks

2015-04-07 Thread Joel Nothman
Issam Laradji implemented a multilayer perceptron and extreme learning
machines for last year's GSoC. Both are awaiting final reviews before being
merged. They should be functional and can be found in the Issue Tracker.
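As a rough illustration of the configuration Vlad describes, this is how the
estimator that was eventually merged as sklearn.neural_network.MLPClassifier
is parametrised; the names below come from that later implementation, not
from the pending pull request being discussed here:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(100, 50),  # two hidden layers
                    activation='relu',             # or 'tanh', 'logistic'
                    solver='sgd',
                    learning_rate_init=0.01,
                    alpha=1e-4,                    # L2 penalty (weight decay)
                    momentum=0.9)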


On 7 April 2015 at 21:09, Vlad Ionescu ionescu.vl...@gmail.com wrote:

 Hello,

 I was wondering why there isn't a classic neural network implementation in
 scikit-learn (a multilayer perceptron). This could have varying levels of
 complexity: it could be hardcoded to just one hidden layer, allowing one to
 specify the type of neurons in it (sigmoid, tanh, rectified linear etc.),
 the learning rate and values for weight decay and momentum.

 It could also be made to accept multiple hidden layers, with the ability
 to specify the number of neurons and their type for each one.

 Has this been considered before, but no one has gotten around to it? Would
 it be of interest to you?

 There are of course more sophisticated methods that would be nice to have
 as well. I'm only asking about the basic type because that is what I
 currently would be willing to help with, but it would be great if more were
 under consideration.




Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-24 Thread Joel Nothman
On 25 March 2015 at 00:01, Gael Varoquaux gael.varoqu...@normalesup.org
wrote:


  To make this more concrete, the MetricLearner().metric_ estimator would
  require specialised set_params or clone behaviour, I assume. I.e. it
  involves hacking API fundamentals.

 It's more a general principle of freeze: to be able to settle down on
 something that we _know_ works and is robust, understandable, bugless...
 we need to stop changing or adding things.


Yes, I get that too. GSoC tends to pull in the opposite direction by way of
being project oriented.


Re: [Scikit-learn-general] GSoC 2015 Proposal: Multiple Metric Learning

2015-03-24 Thread Joel Nothman
I agree with everything Andy says. I think the core developers are very
enthusiastic to have a project along the lines of "finish all the things
that need finishing", but it's very impractical to do so much context
switching both for students and mentors/reviewers.

One of the advantages of GSoC is that it creates specialisation: on the one
hand, a user becomes expert in what they tackle; on the other, reviewers
and mentors can limit their attention to the topic at hand. So please, try
to focus a little more.

On 24 March 2015 at 08:40, Andreas Mueller t3k...@gmail.com wrote:

  Hi Raghav.

 I feel that your proposal lacks some focus.
 I'd remove the two:

 Mallow's Cp for LASSO / LARS
 Implement built in abs max scaler, Nesterov's momentum and finish up the
 Multilayer Perceptron module.

 And as discussed in this thread probably also
 Forge a self sufficient ML tutorial based on scikit-learn.

 If you feel like your proposal does not have enough material (not sure about
 that),
 two things that could be added and are more related to the
 cross-validation and grid-search part
 (but probably difficult from an API standpoint) are making CV objects (aka
 path algorithms, or generalized cross-validation)
 work together with GridSearchCV.
 The other would be how to allow early stopping using a validation set.
 The two are probably related (imho).

 Olivier also mentioned cross-validation for out-of-core (partial_fit)
 algorithms.
 I feel that is not as important, but might also tie into your proposal.

 Finishing the refactoring of model_evaluation in three days seems a bit
 optimistic, if you include reviews.

 For sample_weight support, I'm not sure if there are obvious ways to extend
 sample_weight to all the algorithms that you mentioned.
 How does it work for spectral clustering and agglomerative clustering for
 example?

 In general, I feel you should rather focus on fewer things, and more on the
 details of what to do there.
 Otherwise the proposal looks good.
 For the wiki, having links to the issues might be helpful.

 Thanks for the application :)

 Andy

 On 03/22/2015 08:52 PM, Raghav R V wrote:

 2 things :

  * The subject should have been "Multiple Metric Support in grid_search
 and cross_validation modules and other general improvements" and not
 "multiple metric learning"! Sorry for that!
 * The link was not available due to the trailing . (dot), which has been
 fixed now!

  Thanks
 R

 On Mon, Mar 23, 2015 at 5:47 AM, Raghav R V rag...@gmail.com wrote:

   1. the link is broken


  Ah! Sorry :) -
 https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements
 .

   2. that sounds quite difficult and unfortunately conducive to cheating


  Hmm... Should I then simply opt for adding more examples?



  On Sun, Mar 22, 2015 at 7:57 PM, Raghav R V rag...@gmail.com wrote:

  Hi,

  1. This is my proposal for the multiple metric learning project as a
 wiki page  -
 https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements
 .

  Possible mentors : Andreas Mueller (amueller) and Joel Nothman
 (jnothman)

Any feedback/suggestions/additions/deletions would be awesome. :)

  2. Given that there is a huge interest among students in learning
 about ML, do you think it would be within the scope of/beneficial to skl to
 have all the exercises and/or concepts, from a good quality book (ESL /
 PRML / Murphy) or an academic course like NG's CS229 (not the less rigorous
 coursera version), implemented using sklearn? Or perhaps we could instead
 enhance our tutorials and examples, to be a self study guide to learn about
 ML?
 I have included this in my GSoC proposal but was not quite sure if this
 would be a useful idea!!

  Or would it be better if I simply add more examples?

  Please let me know your views!!

  Thanks


  R



Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-24 Thread Joel Nothman
On 24 March 2015 at 23:56, Gael Varoquaux gael.varoqu...@normalesup.org
wrote:

  So I just thought: what if metric learners were to have an attribute
 `metric`

 Before adding features and API entries, I'd really like to focus on
 having a 1.0 release, with a fixed API that really solves the problems
 that we currently are trying to solve.

 In other words, I would like to get into an API-freeze state where we
 add/modify only essential stuff in the API.

 Gaël


To make this more concrete, the MetricLearner().metric_ estimator would
require specialised set_params or clone behaviour, I assume. I.e. it
involves hacking API fundamentals.


Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-24 Thread Joel Nothman
Hi Artem, I've taken a look at your proposal. I think this is an
interesting contribution, but I suspect your proposal is far too ambitious:

   - The proposal doesn't account well for the need to receive reviews and
   alter the PR accordingly. This is especially so because you are
   developing a new variant of the API which means that even if the algorithm
   works perfectly you won't get a free green light.
   - With an implementation of one or two algorithms, it
   would be much better to add good examples of their utility and their
   features to the example gallery than to implement more algorithms.
   Developing good examples takes time too (and the reviewers are just as
   picky).
   - You will need to package your contributions into manageable PRs, and
   ideally after each is merged, the overall project should still be usable
   (well-tested, documented, etc.). So the documentation will, at least in
   some measure, need to be integrated.
   - As Gaël suggested, there's some cause for concern in that it requires
   developing a new variant of the general API. This means everything is
   slower, including more need for sanity and integration testing than other
   projects may entail.


Re: [Scikit-learn-general] Student looking to contribute to scikit-learn

2015-03-21 Thread Joel Nothman
GSoC isn't the best way to get started. We recommend you get to know the
code structure, API and development process by starting with issues
labelled https://github.com/scikit-learn/scikit-learn/labels/Easy. In
general, look through the Issue Tracker and find something of interest, or
which has stagnated. Please read through
http://scikit-learn.org/dev/developers/index.html.

And we look forward to your contributions. Thanks!

On 21 March 2015 at 13:34, Rohit Shinde rohit.shinde12...@gmail.com wrote:

 Hello everyone,

 It has been 2 days now and I still have not got a reply to my earlier mail. I
 would really like to contribute to this library and I want to know how to.

 Thank you,
 Rohit Shinde

 On Thu, Mar 19, 2015 at 1:58 PM, Rohit Shinde rohit.shinde12...@gmail.com
  wrote:

 Hello everyone,

 I am a final year student of Computer Science from India. I study at the
 Vishwakarma Institute of Technology in Pune. I am interested in various
 areas under Machine Learning and Artificial Intelligence. I have a
 theoretical background in both these subjects and a limited experience of
 some projects in these fields. I also tried to build a Chess program using
 AI techniques. It is in progress right now and I dedicate time to it as and
 when possible. I programmed a small application using backpropagation to
 classify one of the data sets on the UCI dataset repository. It wasn't very
 successful, but it gave me some exposure to neural networks. I have also been
 exposed to Data Mining, so I do know of other algorithms like KNN, SVM and
 K-Medoids and others.

 I have programmed using Python, Java and C++. I am very proficient in all
 these three languages. I have done all my mini as well as major projects in
 these languages and so I have a lot of programming experience in these
 languages.

 I came across scikit-learn when looking for some good Machine Learning
 libraries written in Python. I was also told by one of my professors that
 this library is a highly regarded Machine Learning library. I was told that
 many of the algorithms in this library work off the shelf. I would really
 like to contribute to this library.

 How can I contribute to this library?

 Thank you,
 Rohit Shinde






Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-21 Thread Joel Nothman

 Are there any objections to Joel's variant of y? It serves my needs, but
 is quite different from what one can usually find in scikit-learn.


FWIW, it'll require some changes to cross-validation routines.

On 22 March 2015 at 11:54, Artem barmaley@gmail.com wrote:

 Are there any objections to Joel's variant of y? It serves my needs, but
 is quite different from what one can usually find in scikit-learn.

 --

 Another point I want to bring up is metric-aware KMeans. Currently it
 works with Euclidean distance only, which is not a problem for a
 Mahalanobis distance, but as (and if) we move towards kernel metrics, it
 becomes impossible to transform the data in a way that the Euclidean
 distance between the transformed points accurately reflects the distance
 between the points in a space with the learned metric.

 I think it'd be nice to have non-linear metrics, too. One of the possible
 approaches (widely recognized among researchers on metric learning) is to
 use KernelPCA before learning the metric. This would work really well with
 sklearn's Pipelines.
 But not all the methods are justified to be used with Kernel PCA. Namely,
 ITML uses a special kind of regularization that breaks all theoretical
 guarantees.

 And, it's a bit weird that something that is called metric learning
 actually does space transformation. Maybe we should also add something like
 factories of metrics, whose sole result is a DistanceMetric (in particular
 for those kernel metrics)?
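To make the pipelining idea concrete, here is a minimal sketch. WhiteningStub
is a hypothetical placeholder for a real metric learner: it simply whitens the
KernelPCA output, so that Euclidean distance downstream behaves like a
Mahalanobis distance upstream; none of this is existing scikit-learn API
beyond the imported classes.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

class WhiteningStub(BaseEstimator, TransformerMixin):
    # Placeholder "metric learner": A = inv(cov) stands in for the learned
    # PSD matrix, and G (with A = G.T @ G) is the transformation applied.
    def fit(self, X, y=None):
        A = np.linalg.inv(np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1]))
        self.G_ = np.linalg.cholesky(A).T
        return self

    def transform(self, X):
        return X @ self.G_.T

pipe = Pipeline([
    ('kpca', KernelPCA(n_components=5, kernel='rbf')),   # non-linear lift
    ('metric', WhiteningStub()),                          # hypothetical metric learner
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])

rng = np.random.RandomState(0)
X, y = rng.rand(40, 8), rng.randint(0, 2, 40)
print(pipe.fit(X, y).score(X, y))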

 On Fri, Mar 20, 2015 at 10:01 AM, Gael Varoquaux 
 gael.varoqu...@normalesup.org wrote:

 On Fri, Mar 20, 2015 at 11:50:37AM +1100, Zay Maung Maung Aye wrote:
  Neighborhood Component Analysis is more cited than ITML.

 There is already a pull request on neighborhood component analysis
 https://github.com/scikit-learn/scikit-learn/issues/3213

 A first step of the GSoC could be to complete it.

 Gaël

  On Wed, Mar 18, 2015 at 11:39 PM, Artem barmaley@gmail.com wrote:

  Hello everyone

  Recently I mentioned metric learning as one of possible projects for this
  year's GSoC, and would like to hear your comments.

  Metric learning, as follows from the name, is about learning distance
  functions. Usually the metric that is learned is a Mahalanobis metric, thus
  the problem reduces to finding a PSD matrix A that minimizes some
  functional.

  Metric learning is usually done in a supervised way, that is, a user tells
  which points should be closer and which should be more distant. It can be
  expressed either in the form of similar / dissimilar pairs, or "A is closer
  to B than to C".

  Since metric learning is (mostly) about a PSD matrix A, one can do a
  Cholesky decomposition on it to obtain a matrix G to transform the data. It
  could lead to something like guided clustering, where we first transform
  the data space according to our prior knowledge of similarity.

  Metric learning seems to be quite an active field of research ([1], [2],
  [3]). There are 2 somewhat up-to-date surveys: [1] and [2].

  Top 3 seemingly most cited methods (according to Google Scholar) are:

  - MMC by Xing et al. This is a pioneering work and, according to the
    survey #2: "The algorithm used to solve (1) is a simple projected
    gradient approach requiring the full eigenvalue decomposition of M at
    each iteration. This is typically intractable for medium and
    high-dimensional problems."
  - Large Margin Nearest Neighbor by Weinberger et al. The survey #2
    acknowledges this method as one of the most widely-used Mahalanobis
    distance learning methods: "LMNN generally performs very well in
    practice, although it is sometimes prone to overfitting due to the
    absence of regularization, especially in high dimension."
  - Information-theoretic metric learning by Davis et al. This one features
    a special kind of regularizer called logDet.
  - There are many other methods. If you guys know that other methods rock,
    let me know.

  So the project I'm proposing is about implementing the 2nd or 3rd (or
  both?) algorithms along with a relevant transformer.

 
  

Re: [Scikit-learn-general] GSoC2015 Hyperparameter Optimization topic

2015-03-19 Thread Joel Nothman
This is off-topic, but I should note that there is a patch at
https://github.com/scikit-learn/scikit-learn/pull/2784 awaiting review for
a while now...

On 20 March 2015 at 08:16, Charles Martin charlesmarti...@gmail.com wrote:

 I would like to propose extending the LinearSVC package
 by replacing the liblinear version with a newer version that

 1. allows setting instance weights
 2. provides the dual variables /Lagrange multipliers

 This would facilitate research and development of transductive SVMs
 and related semi-supervised methods.


 Charles H Martin, PhD



 On Thu, Mar 19, 2015 at 2:12 PM, Christof Angermueller
 c.angermuel...@gmail.com wrote:
  Hi All,
 
  you can find my proposal for the hyperparameter optimization topic here:
  * http://goo.gl/XHuav8
  *
 
 https://docs.google.com/document/d/1bAWdiu6hZ6-FhSOlhgH-7x3weTluxRfouw9op9bHBxs/edit?usp=sharing
 
  Please give feedback!
 
  Cheers,
  Christof
 
 
  On 20150310 15:27, Sturla Molden wrote:
  Andreas Mueller t3k...@gmail.com wrote:
  Does emcee implement Bayesian optimization?
  What is the distribution you assume? GPs?
  I thought emcee was a sampler. I need to check in with Dan ;)
  Just pick the mode :-)
 
  The distribution is whatever you want it to be.
 
  Sturla
 
 
 
 
 
  On 03/09/2015 09:27 AM, Sturla Molden wrote:
  For Bayesian optimization with MCMC (which I believe spearmint also
  does) I have found that emcee is very nice:
 
  http://dan.iel.fm/emcee/current/
 
  It is much faster than naïve MCMC methods and all we need to do is
  compute a callback that computes the loglikelihood given the parameter
  set (which can just as well be hyperparameters).
 
  To do this computation in parallel one can simply evaluate the walkers
  in parallel and do a barrier synchronization after each step. The
  contention due to the barrier can be reduced by increasing the number
 of
  walkers as needed. Also one should use something like DCMT for random
  numbers to make sure there is no contention for the PRNG and to
 ensure
  that each thread (or process) gets an independent stream of random
 numbers.
 
  emcee implements this kind of optimization using multiprocessing, but
 it
  passes parameter sets around using pickle and is therefore not very
  efficient compared to just storing the current parameter for each
 walker
  in shared memory. So there is a lot of room for improvement here.
 
 
  Sturla
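(For readers unfamiliar with emcee, a very rough sketch of the callback idea
Sturla describes; the bounded search box and the CV setup are purely
illustrative, and the EnsembleSampler / get_chain calls assume a recent emcee
release.)

import numpy as np
import emcee
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def log_prob(theta):
    log_C, log_gamma = theta
    if not (-3 < log_C < 3 and -4 < log_gamma < 2):
        return -np.inf                       # flat prior over a bounded box
    clf = SVC(C=10 ** log_C, gamma=10 ** log_gamma)
    # Treat mean CV accuracy as an (unnormalised) log-likelihood surrogate.
    return cross_val_score(clf, X, y, cv=3).mean()

nwalkers, ndim = 8, 2
p0 = np.random.uniform(-1, 1, size=(nwalkers, ndim))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 25)

flat = sampler.get_chain(flat=True)
print(flat[np.argmax(sampler.get_log_prob(flat=True))])   # "just pick the mode"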
 
 
 
  On 07/03/15 15:06, Kyle Kastner wrote:
  I think finding one method is indeed the goal. Even if it is not the
  best every time, a 90% solution for 10% of the complexity would be
  awesome. I think GPs with parameter space warping are *probably* the
  best solution but only a good implementation will show for sure.
 
  Spearmint and hyperopt exist and work for more complex stuff but with
  far more moving parts and complexity. Having a tool which is easy to
 use
  as the grid search and random search modules currently are would be a
  big benefit.
 
  My .02c
 
  Kyle
 
  On Mar 7, 2015 7:48 AM, Christof Angermueller
  c.angermuel...@gmail.com
  mailto:c.angermuel...@gmail.com wrote:
 
Hi Andreas (and others),
 
I am a PhD student in Bioinformatics at the University of
 Cambridge,
(EBI/EMBL), supervised by Oliver Stegle and Zoubin Ghahramani.
 In my
PhD, I apply and develop different machine learning algorithms
 for
analyzing biological data.
 
There are different approaches for hyperparameter
 optimization, some
of which you mentioned on the topics page:
* Sequential Model-Based Global Optimization (SMBO) -
http://www.cs.ubc.ca/labs/beta/Projects/SMAC/
* Gaussian Processes (GP) - Spearmint;
https://github.com/JasperSnoek/spearmint
* Tree-structured Parzen Estimator Approach (TPE) - Hyperopt:
http://hyperopt.github.io/hyperopt/
 
And more recent approaches based on neural networks:
* Deep Networks for Global Optimization (DNGO) -
http://arxiv.org/abs/1502.05700
 
    The idea is to implement ONE of these approaches, right?
 
Do you prefer a particular approach due to theoretical or
 practical
reasons?
 
Spearmint also supports distributing jobs on a cluster (SGE). I
imagine that this requires platform specific code, which could
 be
difficult to maintain. What do you think?
 
Spearmint and hyperopt are already established python packages.
Another sklearn implementation might be considered as
 redundant, and
hard to establish. Do you have a particular new feature in
 mind?
 
 
Cheers,
Christof
 
--
Christof Angermueller
cangermuel...@gmail.com
  mailto:cangermuel...@gmail.com
http://cangermueller.com
 
 
 
  

Re: [Scikit-learn-general] [GSoC] Metric Learning

2015-03-18 Thread Joel Nothman
I don't know a lot about metric learning either, but it sounded like from
your initial statement that fit(X, D) where D is the target/known distance
between each point in X might be appropriate. I have no idea if this is how
it is formulated in the literature (your mention of asymmetric metrics
means it might be), but it seems an intuitive representation of the problem.

Your suggestion of similar and dissimilar groups could be represented
by D being a symmetric matrix with some distances 1 (dissimilar) and others
0 (similar), but you imply that some or the majority of cells would be
unknown (in which case a sparse D interpreting all non-explicit values as
unknown may be appropriate).

I would have thought in the case of Mahalanobis distances that transform
would transform each feature such that the resulting feature space was
Euclidean.
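A small numpy sketch of that equivalence, with a stand-in PSD matrix A playing
the role of the learned metric (this is just an illustration, not anyone's
proposed implementation):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10, 4)

B = rng.rand(4, 4)
A = B.T @ B + 1e-3 * np.eye(4)      # pretend this is the learned PSD metric

L = np.linalg.cholesky(A)           # A = L @ L.T
Z = X @ L                           # "transform": each row becomes L.T @ x

d = X[0] - X[1]
# Mahalanobis distance under A equals Euclidean distance after the transform.
print(np.isclose(d @ A @ d, np.sum((Z[0] - Z[1]) ** 2)))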

On 19 March 2015 at 08:47, Andreas Mueller t3k...@gmail.com wrote:

  In summary, I think this does look like a good basis for a proposal :)



 On 03/18/2015 05:14 PM, Artem wrote:

  Do you think this interface would be useful enough?

 One of the mentioned methods (LMNN) actually uses prior knowledge in exactly
 the same way, by comparing labels' equality. Though, it was designed to
 facilitate KNN.

 Authors of the other one (ITML) explicitly mention in the paper that one
 can construct those sets S and D from labels.

 Do you think it would make sense to use such a transformer in a pipeline
 with a KNN classifier?
 I feel that training both on the same labels might be a bit of an issue
 with overfitting.

 Pipelining looks like a good way to combine these methods, but overfitting
 could be a problem, indeed.
 Not sure how severe it can be.

 On Wed, Mar 18, 2015 at 10:07 PM, Andreas Mueller t3k...@gmail.com
 wrote:


 On 03/18/2015 02:53 PM, Artem wrote:

  I mean that if we were solving classification, we would have y that
 tells us which class each example belongs to. So if we pass this
 classification's ground truth vector y to metric learning's fit, we can
 form S and D inside by saying that observations from the same class should
 be similar.
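As a tiny illustration of that construction, the sets S and D can be built
from such a y with a couple of numpy comparisons (hypothetical labels):

import numpy as np

y = np.array([0, 0, 1, 1, 2])
S = y[:, None] == y[None, :]      # mask of pairs that should be similar
D = ~S                            # mask of pairs that should be dissimilar
np.fill_diagonal(S, False)        # a point is not paired with itself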

  Ah, I got it now.


 Only being able to transform to a distance to the training set is a
 bit limiting.

 Sorry, I don't understand what you mean by this. Can you elaborate?

  The metric does not memorize training samples, it finds a (linear
 unless kernelized) transformation that makes similar examples cluster
 together. Moreover, since the metric is completely determined by a PSD
 matrix, we can compute its square root, and use it to transform new data
 without any supervision.

  Ah, I think I misunderstood your proposal for the transformer interface.
 Never mind.


 Do you think this interface would be useful enough? I can think of a
 couple of applications.
 It would definitely fit well into the current scikit-learn framework.

 Do you think it would make sense to use such a transformer in a pipeline
 with a KNN classifier?
 I feel that training both on the same labels might be a bit of an issue
 with overfitting.



