Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-08 Thread Amita Misra
Thanks for the pointers and papers. I'll definitely go through this approach
and see whether it can be applied to my problem.

Thanks,
Amita

On Fri, Aug 5, 2016 at 4:40 PM, Albert Thomas 
wrote:

> Hi,
>
> About your question on how to learn the parameters of anomaly detection
> algorithms using only the negative samples in your case, Nicolas and I
> worked on this aspect recently. If you are interested, you can have a look at:
>
> - Learning hyperparameters for unsupervised anomaly detection:
> https://drive.google.com/file/d/0B8Dg3PBX90KNUTg5NGNOVnFPX0hDNmJsSTcybzZMSHNPYkd3/view
> - How to evaluate the quality of unsupervised anomaly detection
> algorithms?:
> https://drive.google.com/file/d/0B8Dg3PBX90KNenV3WjRkR09Bakx5YlNyMF9BUXVNem1hb0NR/view
>
> Best,
> Albert
>
> On Fri, Aug 5, 2016 at 9:34 PM Sebastian Raschka <
> m...@sebastianraschka.com> wrote:
>
>> > But this might be the kind of problem where you should seriously ask how hard
>> it would be to gather more data.
>>
>>
>> Yeah, I agree, but this scenario is then typical of an anomaly detection
>> problem rather than a classification problem. I.e., you don’t have enough
>> positive labels to fit the model, and thus you need to do unsupervised
>> learning to learn from the negative class only.
>>
>> Sure, supervised learning could work well, but I would also explore
>> unsupervised learning here and see how that works for you; maybe a one-class
>> SVM as suggested, or EM-algorithm-based mixture models (
>> http://scikit-learn.org/stable/modules/mixture.html).
>>
>> Best,
>> Sebastian
>>
>> > On Aug 5, 2016, at 2:55 PM, Jared Gabor  wrote:
>> >
>> > Lots of great suggestions on how to model your problem. But this might
>> be the kind of problem where you should seriously ask how hard it would be to
>> gather more data.
>> >
>> > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra  wrote:
>> > Hi,
>> >
>> > I am currently exploring the problem of speed bump detection using
>> accelerometer time series data.
>> > I have extracted some features based on mean, std deviation, etc. within
>> a time window.
>> >
>> > Since the dataset is highly skewed (I have just 5 positive samples
>> for every 300+ samples)
>> > I was looking into
>> >
>> > OneClassSVM
>> > covariance.EllipticEnvelope
>> > sklearn.ensemble.IsolationForest
>> > but I am not sure how to use them.
>> >
>> > What I get from the docs:
>> >
>> > separate the positive examples and train using only negative examples
>> > clf.fit(X_train)
>> > and then
>> > predict the positive examples using
>> > clf.predict(X_test)
>> >
>> >
>> > I am not sure what the role of the positive examples is in my training
>> dataset, or how I can use them to improve my classifier so that I can
>> predict better on new samples.
>> >
>> >
>> > Can we do something like cross-validation to learn the parameters, as in
>> normal binary SVM classification?
>> >
>> > Thanks,
>> > Amita
>> >
>> > Amita Misra
>> > Graduate Student Researcher
>> > Natural Language and Dialogue Systems Lab
>> > Baskin School of Engineering
>> > University of California Santa Cruz
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Amita Misra
>> > Graduate Student Researcher
>> > Natural Language and Dialogue Systems Lab
>> > Baskin School of Engineering
>> > University of California Santa Cruz
>> >
>> >
>> > ___
>> > scikit-learn mailing list
>> > scikit-learn@python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> > ___
>> > scikit-learn mailing list
>> > scikit-learn@python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-05 Thread Albert Thomas
Hi,

About your question on how to learn the parameters of anomaly detection
algorithms using only the negative samples in your case, Nicolas and I
worked on this aspect recently. If you are interested, you can have a look at:

- Learning hyperparameters for unsupervised anomaly detection:
https://drive.google.com/file/d/0B8Dg3PBX90KNUTg5NGNOVnFPX0hDNmJsSTcybzZMSHNPYkd3/view
- How to evaluate the quality of unsupervised anomaly detection algorithms?:
https://drive.google.com/file/d/0B8Dg3PBX90KNenV3WjRkR09Bakx5YlNyMF9BUXVNem1hb0NR/view


Best,
Albert

On Fri, Aug 5, 2016 at 9:34 PM Sebastian Raschka 
wrote:

> > But this might be the kind of problem where you should seriously ask how hard
> it would be to gather more data.
>
>
> Yeah, I agree, but this scenario is then typical of an anomaly detection
> problem rather than a classification problem. I.e., you don’t have enough
> positive labels to fit the model, and thus you need to do unsupervised
> learning to learn from the negative class only.
>
> Sure, supervised learning could work well, but I would also explore
> unsupervised learning here and see how that works for you; maybe a one-class
> SVM as suggested, or EM-algorithm-based mixture models (
> http://scikit-learn.org/stable/modules/mixture.html).
>
> Best,
> Sebastian
>
> > On Aug 5, 2016, at 2:55 PM, Jared Gabor  wrote:
> >
> > Lots of great suggestions on how to model your problem. But this might
> be the kind of problem where you should seriously ask how hard it would be to
> gather more data.
> >
> > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra  wrote:
> > Hi,
> >
> > I am currently exploring the problem of speed bump detection using
> accelerometer time series data.
> > I have extracted some features based on mean, std deviation, etc. within
> a time window.
> >
> > Since the dataset is highly skewed (I have just 5 positive samples for
> every 300+ samples)
> > I was looking into
> >
> > OneClassSVM
> > covariance.EllipticEnvelope
> > sklearn.ensemble.IsolationForest
> > but I am not sure how to use them.
> >
> > What I get from the docs:
> >
> > separate the positive examples and train using only negative examples
> > clf.fit(X_train)
> > and then
> > predict the positive examples using
> > clf.predict(X_test)
> >
> >
> > I am not sure what the role of the positive examples is in my training
> dataset, or how I can use them to improve my classifier so that I can
> predict better on new samples.
> >
> >
> > Can we do something like cross-validation to learn the parameters, as in
> normal binary SVM classification?
> >
> > Thanks,
> > Amita
> >
> > Amita Misra
> > Graduate Student Researcher
> > Natural Language and Dialogue Systems Lab
> > Baskin School of Engineering
> > University of California Santa Cruz
> >
> >
> >
> >
> >
> > --
> > Amita Misra
> > Graduate Student Researcher
> > Natural Language and Dialogue Systems Lab
> > Baskin School of Engineering
> > University of California Santa Cruz
> >
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-05 Thread Sebastian Raschka
> But this might be the kind of problem where you should seriously ask how hard it 
> would be to gather more data.


Yeah, I agree, but this scenario is then typical of an anomaly detection 
problem rather than a classification problem. I.e., you don’t have enough 
positive labels to fit the model, and thus you need to do unsupervised 
learning to learn from the negative class only.

Sure, supervised learning could work well, but I would also explore 
unsupervised learning here and see how that works for you; maybe a one-class SVM 
as suggested, or EM-algorithm-based mixture models 
(http://scikit-learn.org/stable/modules/mixture.html).
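
As a rough illustration of the mixture-model route, here is a minimal sketch,
assuming scikit-learn >= 0.18 (where sklearn.mixture.GaussianMixture exists);
the arrays and the 2% threshold are made-up placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X_train = rng.randn(300, 4)  # placeholder: feature windows from the negative class only
X_test = rng.randn(20, 4)    # placeholder: new windows to score

# Fit a density model on the normal class only (EM runs inside fit()).
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

# Score new windows: a low log-likelihood marks a candidate anomaly.
scores = gmm.score_samples(X_test)
threshold = np.percentile(gmm.score_samples(X_train), 2.0)  # e.g. flag the lowest 2%
is_anomaly = scores < threshold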

Best,
Sebastian

> On Aug 5, 2016, at 2:55 PM, Jared Gabor  wrote:
> 
> Lots of great suggestions on how to model your problem. But this might be 
> the kind of problem where you should seriously ask how hard it would be to 
> gather more data.
> 
> On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra  wrote:
> Hi,
> 
> I am currently exploring the problem of speed bump detection using 
> accelerometer time series data.
> I have extracted some features based on mean, std deviation, etc. within a 
> time window.
> 
> Since the dataset is highly skewed (I have just 5 positive samples for 
> every 300+ samples)
> I was looking into 
> 
> OneClassSVM 
> covariance.EllipticEnvelope
> sklearn.ensemble.IsolationForest
> but I am not sure how to use them. 
> 
> What I get from the docs:
> 
> separate the positive examples and train using only negative examples
> clf.fit(X_train)
> and then
> predict the positive examples using
> clf.predict(X_test)
> 
> 
> I am not sure what the role of the positive examples is in my training 
> dataset, or how I can use them to improve my classifier so that I can predict 
> better on new samples.
> 
> 
> Can we do something like cross-validation to learn the parameters, as in 
> normal binary SVM classification?
> 
> Thanks,
> Amita
> 
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
> 
> 
> 
> 
> 
> -- 
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-05 Thread Amita Misra
Thanks everyone for the suggestions.

Actually, we thought of gathering more data, but the problem is that we do not
have many speed bumps in our driving area. Driving over the same speed bump
again and again may not add anything really novel to the data.

I think a combination of oversampling and sample_weight, along with ROC as the
evaluation metric, may be a good starting point for me.

Thanks,
Amita

On Fri, Aug 5, 2016 at 11:55 AM, Jared Gabor  wrote:

> Lots of great suggestions on how to model your problem. But this might be
> the kind of problem where you should seriously ask how hard it would be to gather
> more data.
>
> On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra  wrote:
>
>> Hi,
>>
>> I am currently exploring the problem of speed bump detection using
>> accelerometer time series data.
>> I have extracted some features based on mean, std deviation, etc. within a
>> time window.
>>
>> Since the dataset is highly skewed (I have just 5 positive samples for
>> every 300+ samples)
>> I was looking into
>>
>> OneClassSVM
>> covariance.EllipticEnvelope
>> sklearn.ensemble.IsolationForest
>>
>> but I am not sure how to use them.
>>
>> What I get from the docs:
>> separate the positive examples and train using only negative examples
>>
>> clf.fit(X_train)
>>
>> and then
>> predict the positive examples using
>> clf.predict(X_test)
>>
>>
>> I am not sure what the role of the positive examples is in my training
>> dataset, or how I can use them to improve my classifier so that I can
>> predict better on new samples.
>>
>>
>> Can we do something like cross-validation to learn the parameters, as in
>> normal binary SVM classification?
>>
>> Thanks,
>> Amita
>>
>> Amita Misra
>> Graduate Student Researcher
>> Natural Language and Dialogue Systems Lab
>> Baskin School of Engineering
>> University of California Santa Cruz
>>
>>
>>
>>
>>
>> --
>> Amita Misra
>> Graduate Student Researcher
>> Natural Language and Dialogue Systems Lab
>> Baskin School of Engineering
>> University of California Santa Cruz
>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-05 Thread Qingkai Kong
I also worked on something similar. Instead of using algorithms that deal
with unbalanced data, you can also try to create a balanced dataset, either
by oversampling or downsampling. scikit-learn-contrib already has a
project dealing with unbalanced data:
https://github.com/scikit-learn-contrib/imbalanced-learn.

Whether you treat it as a classification problem or an anomaly detection
problem (I prefer to treat it as a classification problem first), you still
need to find a good set of features in the time or frequency domain.
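
A minimal sketch of the oversampling route, assuming imbalanced-learn is
installed and recent enough to expose fit_resample (older releases called it
fit_sample); X and y below are synthetic placeholders for the windowed
features and labels:

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Placeholder data shaped like the problem: ~500 negatives vs ~5 positives.
X, y = make_classification(n_samples=505, weights=[0.99], random_state=0)

# Randomly duplicate minority-class rows until the two classes are balanced.
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(np.bincount(y_res))  # class counts are roughly equal after resampling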

On Fri, Aug 5, 2016 at 7:09 AM, Dale T Smith <dale.t.sm...@macys.com> wrote:

> To analyze unbalanced classifiers, use
>
>
>
> from sklearn.metrics import classification_report
>
>
>
>
>
> 
> __
> *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data
> Science and Capacity Planning
> | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.sm...@macys.com
>
>
>
> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=
> macys@python.org] *On Behalf Of *Pedro Pazzini
> *Sent:* Friday, August 5, 2016 9:33 AM
>
> *To:* Scikit-learn user and developer mailing list
> *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series
>
>
>
>
> Just to add a few things to the discussion:
>
>    1. For unbalanced problems, as far as I know, one of the best scores
>    with which to evaluate a classifier is the area under the ROC curve:
>    http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html.
>    For that you will have to use clf.predict_proba(X_test) instead of
>    clf.predict(X_test). I think that using the 'sample_weight' parameter as
>    Smith said is a promising choice.
>    2. Normalization of each time series is usually recommended before
>    comparing them. Z-score normalization is one of the most used [Ref:
>    http://wan.poly.edu/KDD2012/docs/p262.pdf].
>    3. There are some interesting dissimilarity measures such as DTW
>    (Dynamic Time Warping), CID (Complex Invariant Distance), and others for
>    comparing time series [Ref:
>    https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. And there
>    are also other approaches for comparing time series in the frequency domain,
>    such as FFT and DWT [Ref:
>    http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf].
>
> I hope it helps.
>
>
>
> 2016-08-05 9:26 GMT-03:00 Dale T Smith <dale.t.sm...@macys.com>:
>
> I don’t think you should treat this as an outlier detection problem. Why
> not try it as a classification problem? The dataset is highly unbalanced.
> Try
>
>
>
> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
>
>
>
> Use sample_weight to tell the fit method about the class imbalance. But be
> sure to read up about unbalanced classification and the class_weight
> parameter to ExtraTreesClassifier. You cannot use the accuracy to find the
> best model, so read up on model validation in the sklearn User’s Guide. And
> when you do cross-validation to get the best hyperparameters, be sure you
> pass the sample weights as well.
>
>
>
> Time series data is a bit different to use with cross-validation. You may
> want to add features such as minutes since midnight, day of week,
> weekday/weekend. And make sure your cross-validation folds respect the time
> series nature of the problem.
>
>
>
> http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data
>
>
>
>
>
> 
> __
> *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data
> Science and Capacity Planning
> | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.sm...@macys.com
>
>
>
> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=
> macys@python.org] *On Behalf Of *Nicolas Goix
> *Sent:* Thursday, August 4, 2016 9:13 PM
> *To:* Scikit-learn user and developer mailing list
> *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series
>
>
>
>
> There are different ways of aggregating estimators. A possibility can be to
> take the majority vote, or averaging decision functions.

Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-05 Thread Dale T Smith
To analyze unbalanced classifiers, use

from sklearn.metrics import classification_report
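
For example (a quick sketch; the label arrays are made-up placeholders):

from sklearn.metrics import classification_report

# Placeholder true and predicted labels for an unbalanced problem.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0]

# Per-class precision, recall and F1 are far more informative than plain
# accuracy when one class dominates.
print(classification_report(y_true, y_pred, target_names=["normal", "bump"]))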


__
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and 
Capacity Planning
| 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.sm...@macys.com

From: scikit-learn 
[mailto:scikit-learn-bounces+dale.t.smith=macys@python.org] On Behalf Of 
Pedro Pazzini
Sent: Friday, August 5, 2016 9:33 AM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Supervised anomaly detection in time series

Just to add a few things to the discussion:

  1.  For unbalanced problems, as far as I know, one of the best scores with 
which to evaluate a classifier is the area under the ROC curve: 
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html.
 For that you will have to use clf.predict_proba(X_test) instead of 
clf.predict(X_test). I think that using the 'sample_weight' parameter as Smith 
said is a promising choice.
  2.  Normalization of each time series is usually recommended before comparing 
them. Z-score normalization is one of the most used [Ref: 
http://wan.poly.edu/KDD2012/docs/p262.pdf].
  3.  There are some interesting dissimilarity measures such as DTW (Dynamic 
Time Warping), CID (Complex Invariant Distance), and others for comparing time 
series [Ref: https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. And 
there are also other approaches for comparing time series in the frequency 
domain, such as FFT and DWT [Ref: 
http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf].

I hope it helps.

2016-08-05 9:26 GMT-03:00 Dale T Smith <dale.t.sm...@macys.com>:
I don’t think you should treat this as an outlier detection problem. Why not 
try it as a classification problem? The dataset is highly unbalanced. Try

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

Use sample_weight to tell the fit method about the class imbalance. But be sure 
to read up about unbalanced classification and the class_weight parameter to 
ExtraTreesClassifier. You cannot use the accuracy to find the best model, so 
read up on model validation in the sklearn User’s Guide. And when you do 
cross-validation to get the best hyperparameters, be sure you pass the sample 
weights as well.

Time series data is a bit different to use with cross-validation. You may want 
to add features such as minutes since midnight, day of week, weekday/weekend. 
And make sure your cross-validation folds respect the time series nature of the 
problem.

http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data


__
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and 
Capacity Planning
| 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.sm...@macys.com

From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys@python.org] 
On Behalf Of Nicolas Goix
Sent: Thursday, August 4, 2016 9:13 PM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Supervised anomaly detection in time series


There are different ways of aggregating estimators. A possibility can be to 
take the majority vote, or averaging decision functions.

On Aug 4, 2016 8:44 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
If I train multiple algorithms on different subsamples, then how do I get the 
final classifier that predicts unseen data?
I have very few positive samples since it is speed bump detection and we have 
very few speed bumps in a drive.
However, I think that unseen new data would be quite similar to what I have in 
the training data; hence, if I can correctly learn a classifier for these 5, I 
hope it will work well for unseen speed bumps.
Thanks,
Amita

On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix <goix.nico...@gmail.com> wrote:

You can evaluate your hyper-parameters on a few samples. Just 
don't use the accuracy as your performance measure.

For supervised classification, training multiple algorithms on small balanced 
subsamples usually works well, but 5 anomalies does indeed seem to be very few.

Nicolas

On Aug 4, 2016 7:51 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
Subsampling would remove a lot of information from the negative class.
I have more than 500 samples of negative class and just 5 samples of positive 
class.
Amita

On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix <goix.nico..

Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-05 Thread Pedro Pazzini
Just to add a few things to the discussion:


   1. For unbalanced problems, as far as I know, one of the best scores with
   which to evaluate a classifier is the area under the ROC curve:
   http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html.
   For that you will have to use clf.predict_proba(X_test) instead of
   clf.predict(X_test). I think that using the 'sample_weight' parameter as
   Smith said is a promising choice (see the sketch after this list).
   2. Normalization of each time series is usually recommended before
   comparing them. Z-score normalization is one of the most used [Ref:
   http://wan.poly.edu/KDD2012/docs/p262.pdf].
   3. There are some interesting dissimilarity measures such as DTW
   (Dynamic Time Warping), CID (Complex Invariant Distance), and others for
   comparing time series [Ref:
   https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. And there are
   also other approaches for comparing time series in the frequency domain,
   such as FFT and DWT [Ref:
   http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf].

I hope it helps.
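
Here is the sketch mentioned in point 1, assuming scikit-learn >= 0.18; the
dataset is a made-up placeholder whose shape just mimics the ~5-vs-500
imbalance:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder unbalanced data standing in for the windowed features.
X, y = make_classification(n_samples=505, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# roc_auc_score needs a continuous score, so feed it the positive-class
# probability rather than the hard labels from clf.predict().
proba = clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, proba))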

2016-08-05 9:26 GMT-03:00 Dale T Smith <dale.t.sm...@macys.com>:

> I don’t think you should treat this as an outlier detection problem. Why
> not try it as a classification problem? The dataset is highly unbalanced.
> Try
>
>
>
> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
>
>
>
> Use sample_weight to tell the fit method about the class imbalance. But be
> sure to read up about unbalanced classification and the class_weight
> parameter to ExtraTreesClassifier. You cannot use the accuracy to find the
> best model, so read up on model validation in the sklearn User’s Guide. And
> when you do cross-validation to get the best hyperparameters, be sure you
> pass the sample weights as well.
>
>
>
> Time series data is a bit different to use with cross-validation. You may
> want to add features such as minutes since midnight, day of week,
> weekday/weekend. And make sure your cross-validation folds respect the time
> series nature of the problem.
>
>
>
> http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data
>
>
>
>
>
> 
> __
> *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data
> Science and Capacity Planning
> | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.sm...@macys.com
>
>
>
> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=
> macys@python.org] *On Behalf Of *Nicolas Goix
> *Sent:* Thursday, August 4, 2016 9:13 PM
> *To:* Scikit-learn user and developer mailing list
> *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series
>
>
>
>
> There are different ways of aggregating estimators. A possibility can be
> to take the majority vote, or averaging decision functions.
>
>
>
> On Aug 4, 2016 8:44 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
>
> If I train multiple algorithms on different subsamples, then how do I get
> the final classifier that predicts unseen data?
>
> I have very few positive samples since it is speed bump detection and we
> have very few speed bumps in a drive.
> However, I think that unseen new data would be quite similar to what I
> have in the training data; hence, if I can correctly learn a classifier for
> these 5, I hope it will work well for unseen speed bumps.
>
> Thanks,
> Amita
>
>
>
> On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix <goix.nico...@gmail.com>
> wrote:
>
> You can evaluate your hyper-parameters on a few samples.
> Just don't use the accuracy as your performance measure.
>
> For supervised classification, training multiple algorithms on small
> balanced subsamples usually works well, but 5 anomalies does indeed seem
> to be very few.
>
> Nicolas
>
>
>
> On Aug 4, 2016 7:51 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
>
> Subsampling would remove a lot of information from the negative class.
>
> I have more than 500 samples of negative class and just 5 samples of
> positive class.
>
> Amita
>
>
>
> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix <goix.nico...@gmail.com>
> wrote:
>
> Hi,
>
>
>
> Yes, you can use your labeled data (you will need to sub-sample your normal
> class to get similar normal-abnormal proportions) to learn your
> hyper-parameters through CV.
>
>
>
> You can also try to use supervised classification algorithms on `not too
> highly unbalanced' sub-samples.
>
>
>
> 

Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-05 Thread Dale T Smith
I don’t think you should treat this as an outlier detection problem. Why not 
try it as a classification problem? The dataset is highly unbalanced. Try

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

Use sample_weight to tell the fit method about the class imbalance. But be sure 
to read up about unbalanced classification and the class_weight parameter to 
ExtraTreesClassifier. You cannot use the accuracy to find the best model, so 
read up on model validation in the sklearn User’s Guide. And when you do 
cross-validation to get the best hyperparameters, be sure you pass the sample 
weights as well.

Time series data is a bit different to use with cross-validation. You may want 
to add features such as minutes since midnight, day of week, weekday/weekend. 
And make sure your cross-validation folds respect the time series nature of the 
problem.

http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data
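
To make that concrete, here is a minimal sketch, again assuming scikit-learn
>= 0.18 (where model_selection provides TimeSeriesSplit and GridSearchCV); the
data and the tiny parameter grid are made-up placeholders:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.RandomState(0)
X = rng.randn(505, 6)        # placeholder windowed features, in time order
y = np.zeros(505, dtype=int)
y[::101] = 1                 # 5 positives spread through the drive

# class_weight='balanced' reweights classes inversely to their frequency.
clf = ExtraTreesClassifier(n_estimators=200, class_weight="balanced",
                           random_state=0)

# TimeSeriesSplit keeps each validation fold strictly after its training data.
cv = TimeSeriesSplit(n_splits=3)
search = GridSearchCV(clf, {"max_depth": [3, 5, None]},
                      scoring="roc_auc", cv=cv)
search.fit(X, y)
print(search.best_params_, search.best_score_)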


__
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and 
Capacity Planning
| 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.sm...@macys.com

From: scikit-learn 
[mailto:scikit-learn-bounces+dale.t.smith=macys@python.org] On Behalf Of 
Nicolas Goix
Sent: Thursday, August 4, 2016 9:13 PM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Supervised anomaly detection in time series


There are different ways of aggregating estimators. A possibility can be to 
take the majority vote, or averaging decision functions.

On Aug 4, 2016 8:44 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
If I train multiple algorithms on different subsamples, then how do I get the 
final classifier that predicts unseen data?

I have very few positive samples since it is speed bump detection and we have 
very few speed bumps in a drive.
However, I think that unseen new data would be quite similar to what I have in 
the training data; hence, if I can correctly learn a classifier for these 5, I 
hope it will work well for unseen speed bumps.
Thanks,
Amita

On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix <goix.nico...@gmail.com> wrote:

You can evaluate your hyper-parameters on a few samples. Just 
don't use the accuracy as your performance measure.

For supervised classification, training multiple algorithms on small balanced 
subsamples usually works well, but 5 anomalies does indeed seem to be very few.

Nicolas

On Aug 4, 2016 7:51 PM, "Amita Misra" <amis...@ucsc.edu> wrote:
Subsampling would remove a lot of information from the negative class.
I have more than 500 samples of negative class and just 5 samples of positive 
class.
Amita

On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix <goix.nico...@gmail.com> wrote:
Hi,

Yes, you can use your labeled data (you will need to sub-sample your normal 
class to get similar normal-abnormal proportions) to learn your 
hyper-parameters through CV.

You can also try to use supervised classification algorithms on `not too highly 
unbalanced' sub-samples.

Nicolas

On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra <amis...@ucsc.edu> wrote:
Hi,

I am currently exploring the problem of speed bump detection using 
accelerometer time series data.
I have extracted some features based on mean, std deviation, etc. within a time 
window.
Since the dataset is highly skewed (I have just 5 positive samples for every 
300+ samples)
I was looking into

OneClassSVM
covariance.EllipticEnvelope
sklearn.ensemble.IsolationForest

but I am not sure how to use them.

What I get from the docs:
separate the positive examples and train using only negative examples

clf.fit(X_train)
and then
predict the positive examples using
clf.predict(X_test)

I am not sure what the role of the positive examples is in my training dataset, 
or how I can use them to improve my classifier so that I can predict better on 
new samples.

Can we do something like cross-validation to learn the parameters, as in normal 
binary SVM classification?

Thanks,
Amita

Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz





--
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


___
scikit-learn mailing list
scikit-learn@python.org

Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-04 Thread Nicolas Goix
There are different ways of aggregating estimators. A possibility can be to
take the majority vote, or averaging decision functions.
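
For instance, a small sketch of the decision-function-averaging idea;
fit_balanced_ensemble and ensemble_score are made-up helper names, and the SVC
base estimator is just one possible choice:

import numpy as np
from sklearn.svm import SVC

def fit_balanced_ensemble(X, y, n_estimators=10, seed=0):
    # Train one SVC per balanced subsample: all positives plus an equally
    # sized random draw of negatives.
    rng = np.random.RandomState(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    models = []
    for _ in range(n_estimators):
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
        models.append(SVC(kernel="rbf").fit(X[idx], y[idx]))
    return models

def ensemble_score(models, X):
    # Average of the decision functions; scores > 0 lean towards the
    # positive (anomalous) class.
    return np.mean([m.decision_function(X) for m in models], axis=0)

A majority vote would instead average the hard m.predict(X) outputs and
threshold at 0.5.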

On Aug 4, 2016 8:44 PM, "Amita Misra"  wrote:

> If I train multiple algorithms on different subsamples, then how do I get
> the final classifier that predicts unseen data?
>
>
> I have very few positive samples since it is speed bump detection and we
> have very few speed bumps in a drive.
> However, I think that unseen new data would be quite similar to what I
> have in the training data; hence, if I can correctly learn a classifier for
> these 5, I hope it will work well for unseen speed bumps.
>
> Thanks,
> Amita
>
> On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix 
> wrote:
>
>> You can evaluate your hyper-parameters on a few samples.
>> Just don't use the accuracy as your performance measure.
>>
>> For supervised classification, training multiple algorithms on small
>> balanced subsamples usually works well, but 5 anomalies does indeed seem
>> to be very few.
>>
>> Nicolas
>>
>> On Aug 4, 2016 7:51 PM, "Amita Misra"  wrote:
>>
>>> Subsampling would remove a lot of information from the negative class.
>>> I have more than 500 samples of negative class and just 5 samples of
>>> positive class.
>>>
>>> Amita
>>>
>>> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix 
>>> wrote:
>>>
 Hi,

 Yes, you can use your labeled data (you will need to sub-sample your
 normal class to get similar normal-abnormal proportions) to learn your
 hyper-parameters through CV.

 You can also try to use supervised classification algorithms on `not
 too highly unbalanced' sub-samples.

 Nicolas

 On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra  wrote:

> Hi,
>
> I am currently exploring the problem of speed bump detection using
> accelerometer time series data.
> I have extracted some features based on mean, std deviation, etc.
> within a time window.
>
> Since the dataset is highly skewed (I have just 5 positive samples
> for every 300+ samples)
> I was looking into
>
> OneClassSVM
> covariance.EllipticEnvelope
> sklearn.ensemble.IsolationForest
>
> but I am not sure how to use them.
>
> What I get from the docs:
> separate the positive examples and train using only negative examples
>
> clf.fit(X_train)
>
> and then
> predict the positive examples using
> clf.predict(X_test)
>
>
> I am not sure what the role of the positive examples is in my
> training dataset, or how I can use them to improve my classifier so that I
> can predict better on new samples.
>
>
> Can we do something like cross-validation to learn the parameters as
> in normal binary SVM classification?
>
> Thanks,
> Amita
>
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
>
>
>
>
>
> --
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>

 ___
 scikit-learn mailing list
 scikit-learn@python.org
 https://mail.python.org/mailman/listinfo/scikit-learn


>>>
>>>
>>> --
>>> Amita Misra
>>> Graduate Student Researcher
>>> Natural Language and Dialogue Systems Lab
>>> Baskin School of Engineering
>>> University of California Santa Cruz
>>>
>>>
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
>
> --
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-04 Thread Amita Misra
If I train multiple algorithms on different subsamples, then how do I get
the final classifier that predicts unseen data?


I have very few positive samples since it is speed bump detection and we
have very few speed bumps in a drive.
However, I think that unseen new data would be quite similar to what I
have in the training data; hence, if I can correctly learn a classifier for
these 5, I hope it will work well for unseen speed bumps.

Thanks,
Amita

On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix  wrote:

> You can evaluate your hyper-parameters on a few samples.
> Just don't use the accuracy as your performance measure.
>
> For supervised classification, training multiple algorithms on small
> balanced subsamples usually works well, but 5 anomalies does indeed seem
> to be very few.
>
> Nicolas
>
> On Aug 4, 2016 7:51 PM, "Amita Misra"  wrote:
>
>> Subsampling would remove a lot of information from the negative class.
>> I have more than 500 samples of negative class and just 5 samples of
>> positive class.
>>
>> Amita
>>
>> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix 
>> wrote:
>>
>>> Hi,
>>>
>>> Yes, you can use your labeled data (you will need to sub-sample your
>>> normal class to get similar normal-abnormal proportions) to learn your
>>> hyper-parameters through CV.
>>>
>>> You can also try to use supervised classification algorithms on `not too
>>> highly unbalanced' sub-samples.
>>>
>>> Nicolas
>>>
>>> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra  wrote:
>>>
 Hi,

 I am currently exploring the problem of speed bump detection using
 accelerometer time series data.
 I have extracted some features based on mean, std deviation, etc. within
 a time window.

 Since the dataset is highly skewed (I have just 5 positive samples
 for every 300+ samples)
 I was looking into

 OneClassSVM
 covariance.EllipticEnvelope
 sklearn.ensemble.IsolationForest

 but I am not sure how to use them.

 What I get from the docs:
 separate the positive examples and train using only negative examples

 clf.fit(X_train)

 and then
 predict the positive examples using
 clf.predict(X_test)


 I am not sure what the role of the positive examples is in my training
 dataset, or how I can use them to improve my classifier so that I can
 predict better on new samples.


 Can we do something like cross-validation to learn the parameters as in
 normal binary SVM classification?

 Thanks,
 Amita

 Amita Misra
 Graduate Student Researcher
 Natural Language and Dialogue Systems Lab
 Baskin School of Engineering
 University of California Santa Cruz





 --
 Amita Misra
 Graduate Student Researcher
 Natural Language and Dialogue Systems Lab
 Baskin School of Engineering
 University of California Santa Cruz


 ___
 scikit-learn mailing list
 scikit-learn@python.org
 https://mail.python.org/mailman/listinfo/scikit-learn


>>>
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>>
>> --
>> Amita Misra
>> Graduate Student Researcher
>> Natural Language and Dialogue Systems Lab
>> Baskin School of Engineering
>> University of California Santa Cruz
>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-04 Thread Amita Misra
Subsampling would remove a lot of information from the negative class.
I have more than 500 samples of negative class and just 5 samples of
positive class.

Amita

On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix  wrote:

> Hi,
>
> Yes, you can use your labeled data (you will need to sub-sample your normal
> class to get similar normal-abnormal proportions) to learn your
> hyper-parameters through CV.
>
> You can also try to use supervised classification algorithms on `not too
> highly unbalanced' sub-samples.
>
> Nicolas
>
> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra  wrote:
>
>> Hi,
>>
>> I am currently exploring the problem of speed bump detection using
>> accelerometer time series data.
>> I have extracted some features based on mean, std deviation, etc. within a
>> time window.
>>
>> Since the dataset is highly skewed (I have just 5 positive samples for
>> every 300+ samples)
>> I was looking into
>>
>> OneClassSVM
>> covariance.EllipticEnvelope
>> sklearn.ensemble.IsolationForest
>>
>> but I am not sure how to use them.
>>
>> What I get from the docs:
>> separate the positive examples and train using only negative examples
>>
>> clf.fit(X_train)
>>
>> and then
>> predict the positive examples using
>> clf.predict(X_test)
>>
>>
>> I am not sure what the role of the positive examples is in my training
>> dataset, or how I can use them to improve my classifier so that I can
>> predict better on new samples.
>>
>>
>> Can we do something like cross-validation to learn the parameters as in
>> normal binary SVM classification?
>>
>> Thanks,
>> Amita
>>
>> Amita Misra
>> Graduate Student Researcher
>> Natural Language and Dialogue Systems Lab
>> Baskin School of Engineering
>> University of California Santa Cruz
>>
>>
>>
>>
>>
>> --
>> Amita Misra
>> Graduate Student Researcher
>> Natural Language and Dialogue Systems Lab
>> Baskin School of Engineering
>> University of California Santa Cruz
>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 
Amita Misra
Graduate Student Researcher
Natural Language and Dialogue Systems Lab
Baskin School of Engineering
University of California Santa Cruz
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Supervised anomaly detection in time series

2016-08-04 Thread Nicolas Goix
Hi,

Yes, you can use your labeled data (you will need to sub-sample your normal
class to get similar normal-abnormal proportions) to learn your
hyper-parameters through CV.

You can also try to use supervised classification algorithms on `not too
highly unbalanced' sub-samples.
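
A rough sketch of that idea for, e.g., a one-class SVM; the data, the little
grid, and the split below are all made-up placeholders, and recent
scikit-learn versions return higher decision_function values for more normal
points:

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_normal = rng.randn(500, 4)     # placeholder: normal driving windows
X_bumps = rng.randn(5, 4) + 3.0  # placeholder: the 5 speed-bump windows

X_fit = X_normal[:400]                         # fit on normal data only
X_val = np.vstack([X_normal[400:], X_bumps])   # held-out mix for scoring
y_val = np.r_[np.zeros(100), np.ones(5)]

best = None
for nu in (0.01, 0.05, 0.1):
    for gamma in (0.01, 0.1, 1.0):
        clf = OneClassSVM(nu=nu, gamma=gamma).fit(X_fit)
        # Negate the decision function so that larger scores mean "more
        # anomalous"; AUC is insensitive to the class imbalance.
        auc = roc_auc_score(y_val, -clf.decision_function(X_val).ravel())
        if best is None or auc > best[0]:
            best = (auc, nu, gamma)
print("best AUC %.3f with nu=%s, gamma=%s" % best)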

Nicolas

On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra  wrote:

> Hi,
>
> I am currently exploring the problem of speed bump detection using
> accelerometer time series data.
> I have extracted some features based on mean, std deviation, etc. within a
> time window.
>
> Since the dataset is highly skewed (I have just 5 positive samples for
> every 300+ samples)
> I was looking into
>
> OneClassSVM
> covariance.EllipticEnvelope
> sklearn.ensemble.IsolationForest
>
> but I am not sure how to use them.
>
> What I get from the docs:
> separate the positive examples and train using only negative examples
>
> clf.fit(X_train)
>
> and then
> predict the positive examples using
> clf.predict(X_test)
>
>
> I am not sure what the role of the positive examples is in my training
> dataset, or how I can use them to improve my classifier so that I can
> predict better on new samples.
>
>
> Can we do something like cross-validation to learn the parameters, as in
> normal binary SVM classification?
>
> Thanks,
> Amita
>
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
>
>
>
>
>
> --
> Amita Misra
> Graduate Student Researcher
> Natural Language and Dialogue Systems Lab
> Baskin School of Engineering
> University of California Santa Cruz
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn