Hi,
Regarding your question on how to learn the parameters of anomaly detection
algorithms using only the negative samples: Nicolas and I
worked on this aspect recently. If you are interested, you can have a look at:
- Learning hyperparameters for unsupervised anomaly detection:
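As a minimal sketch of the negatives-only setting (on hypothetical synthetic data, not the referenced work): a one-class model such as OneClassSVM is fit on normal samples alone, and its nu hyperparameter bounds the fraction of training points treated as outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Train only on "normal" samples (hypothetical 2-feature data).
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu bounds the fraction of training points flagged as outliers;
# it is the main hyperparameter to tune when no positives are available.
clf = OneClassSVM(nu=0.05, gamma="scale").fit(X_normal)

# A point near the training distribution vs. one far from it.
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])
pred = clf.predict(X_test)  # +1 = inlier, -1 = outlier
```
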
> But this might be the kind of problem where you seriously ask how hard it
> would be to gather more data.
Yeah, I agree, but this scenario is typical in the sense that it is an
anomaly detection problem rather than a classification problem, i.e., you don't
have enough positive samples.
Thanks everyone for the suggestions.
Actually we thought of gathering more data but the point is we do not have
many speed bumps in our driving area. If we drive over the same speed bump
again and again it may not add anything really novel to the data.
I think a combination of oversampling and undersampling could help here.
I also worked on something similar. Instead of using algorithms that deal
with unbalanced data, you can also try to create a balanced dataset, either
by oversampling or downsampling. scikit-learn-contrib already has a
project dealing with unbalanced data:
To evaluate classifiers on unbalanced data, use
from sklearn.metrics import classification_report
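For instance, on toy labels for an imbalanced binary problem, the per-class precision/recall/F1 it prints reveal what plain accuracy hides:

```python
from sklearn.metrics import classification_report

# Toy ground truth and predictions for an imbalanced binary problem.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Accuracy is 80% here, yet the rare class is only half recovered.
print(classification_report(y_true, y_pred, digits=3))
```
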
__
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and
Capacity Planning
Just to add a few things to the discussion:
1. For unbalanced problems, as far as I know, one of the best scores to
evaluate a classifier is the Area Under the ROC curve:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html.
For that you will have to pass continuous scores (e.g. from predict_proba or
decision_function) rather than hard 0/1 predictions.
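A small sketch of that point, on hypothetical synthetic data: roc_auc_score takes the positive-class probability column, not the output of predict.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
# Hypothetical imbalanced data: the rare class is shifted by 2 sigma.
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

clf = LogisticRegression().fit(X, y)
# roc_auc_score needs scores/probabilities, not hard class predictions.
scores = clf.predict_proba(X)[:, 1]
auc = roc_auc_score(y, scores)
```
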
I don’t think you should treat this as an outlier detection problem. Why not
try it as a classification problem? The dataset is highly unbalanced. Try
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
Use sample_weight to tell the fit method about the relative importance of each
sample.
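A minimal sketch of that suggestion, assuming a hypothetical 95/5 synthetic dataset: weights are chosen so that each class contributes equally to the fit (class_weight="balanced" in the constructor achieves the same thing).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
# Hypothetical imbalanced data: 95 "no bump" vs 5 "speed bump" samples.
X = np.vstack([rng.normal(0, 1, (95, 4)), rng.normal(3, 1, (5, 4))])
y = np.array([0] * 95 + [1] * 5)

# Upweight the rare class so both classes contribute equally to the fit.
w = np.where(y == 1, len(y) / (2 * 5), len(y) / (2 * 95))

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=w)
```
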