[Scikit-learn-general] what is the right strategy to set up decisionTree for this fraud detection problem?

2015-08-14 Thread Rex X
The data sets are online transactions, each labelled "fraud" or "good", so this is a binary classification problem. With a decision tree, we can identify the combined conditions that are likely to indicate a "fraud". Any advice is welcome. The features include: transaction amount,
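A minimal sketch of the setup described above, on synthetic stand-in data (the real transaction features are not shown in the thread). `class_weight="balanced"` up-weights the rare "fraud" class, which is usually necessary in fraud detection:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
# Synthetic stand-in for the transaction data: 1000 samples, 5 features,
# with roughly 5% of samples labelled as fraud (class 1)
X = rng.rand(1000, 5)
y = (X[:, 0] + X[:, 1] > 1.7).astype(int)

clf = DecisionTreeClassifier(
    max_depth=4,                 # keep the tree small enough to read
    class_weight="balanced",     # compensate for the rare "fraud" class
    random_state=0,
)
clf.fit(X, y)
train_acc = clf.score(X, y)
n_fraud = int(y.sum())
```

A shallow tree like this can then be inspected (e.g. with `sklearn.tree.export_text`) to read off the combined conditions that lead to a "fraud" leaf.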

Re: [Scikit-learn-general] what is the right strategy to set up decisionTree for this fraud detection problem?

2015-08-14 Thread Rex X
.r-project.org/web/packages/party/party.pdf > > here for each split a statistical test is performed and this should make > the model more robust and easier to interpret. > > I do not know if there is something similar here on sklearn. > > Best, > Luca > > On Fri, Aug 14, 20

[Scikit-learn-general] How to extract the decision tree rule of each leaf node into Pandas Dataframe query?

2015-08-18 Thread Rex X
Hi everyone, Is it possible to extract the decision rule associated with each leaf node of a decision tree into a Pandas DataFrame query, so that one can view the corresponding DataFrame content by feeding in the decision rule?
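One way to do this is to walk the fitted `tree_` arrays and accumulate the split conditions on the path to each leaf. The helper name `leaf_queries` below is my own (the thread's `get_queries` helper is cut off in the archive); column names are sanitised first because `DataFrame.query` needs valid Python identifiers:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# pandas queries need valid Python identifiers for column names
cols = [c.replace(" ", "_").replace("(", "").replace(")", "")
        for c in iris.feature_names]
df = pd.DataFrame(iris.data, columns=cols)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)
tree = clf.tree_

def leaf_queries(tree, feature_names):
    """Return {leaf_node_id: pandas query string} by walking the tree."""
    queries = {}

    def recurse(node, conds):
        if tree.children_left[node] == -1:      # -1 marks a leaf
            queries[node] = " and ".join(conds)
            return
        name = feature_names[tree.feature[node]]
        thr = tree.threshold[node]
        recurse(tree.children_left[node],  conds + ["%s <= %.6f" % (name, thr)])
        recurse(tree.children_right[node], conds + ["%s > %.6f" % (name, thr)])

    recurse(0, [])
    return queries

queries = leaf_queries(tree, cols)
# Each query selects exactly the rows that fall into that leaf,
# so the leaf sizes sum to the full data set
sizes = {leaf: len(df.query(q)) for leaf, q in queries.items()}
```

Because the leaves partition the feature space, running every leaf query should cover each row exactly once.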

[Scikit-learn-general] How to encode categorical feature values containing special characters?

2015-08-20 Thread Rex X
Hi fellows, I found that some of the generated numbers were not right, and it took me half a day of debugging. Finally, I found it was because the loaded CSV file contains special characters in a few categorical features. These categorical values are all in Unicode in different languages, Hindi, Chinese, English
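One way to sidestep manual byte-level encoding of such values is to let pandas do the one-hot encoding directly on the Unicode strings, assuming the CSV is read with the correct encoding in the first place (e.g. `pd.read_csv(path, encoding="utf-8")`). A small sketch with made-up values in two scripts:

```python
# -*- coding: utf-8 -*-
import pandas as pd

# Made-up categorical values in several scripts, as described above
df = pd.DataFrame({
    "city": [u"北京", u"दिल्ली", u"London", u"北京"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# get_dummies handles Unicode category values without manual encoding;
# each distinct value becomes its own indicator column
encoded = pd.get_dummies(df, columns=["city"])
```

The indicator column names then carry the original Unicode value (e.g. `city_北京`), so no lossy transliteration step is needed.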

Re: [Scikit-learn-general] How to extract the decision tree rule of each leaf node into Pandas Dataframe query?

2015-08-20 Thread Rex X
> Why are you interested in a single leaf? The query seems to be able to >> return "only" a single boolean. >> It is probably more efficient to traverse the full tree for each data >> point if you are interested in all the leaves. >> >> >> On 08/18/2

Re: [Scikit-learn-general] How to extract the decision tree rule of each leaf node into Pandas Dataframe query?

2015-08-27 Thread Rex X
ty points to be higher than those of bad > activity points. The classifier will automatically try to fit the good > activity better. > > If I misinterpreted what you asked, let me know. I wasn't exactly sure > what your first sentence meant. > > On Thu, Aug 20, 2015 at 10:53

Re: [Scikit-learn-general] How to extract the decision tree rule of each leaf node into Pandas Dataframe query?

2015-08-27 Thread Rex X
> > print np.arange(n)[ (raw[:,1] >= 0.05) & (norm[:,1] >= 0.30) ] > ``` > > I hope that makes sense! > Let me know if you have any other questions > > On Thu, Aug 27, 2015 at 11:12 AM, Rex X wrote: > >> Hi Jacob, >> >> Let's consider

Re: [Scikit-learn-general] How to extract the decision tree rule of each leaf node into Pandas Dataframe query?

2015-08-27 Thread Rex X
a, iris.target) > > # pandas queries must have valid python identifiers for column names > feature_names = [name.replace(' ','_').replace('(','').replace(')','') for > name in iris.feature_names] > leaf_queries = get_queri

[Scikit-learn-general] How to do tree Pruning with scikit-learn?

2015-08-30 Thread Rex X
The tree pruning process is important for getting a better decision tree. One idea is to recursively remove the leaf node whose removal hurts the decision tree least. Any idea how to do this for the following sample case? import pandas as pd > from sklearn.datasets import load_iris > from sklearn i
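Pruning was not supported in scikit-learn at the time of this thread (as the reply below confirms), but later releases (0.22+) added exactly the idea described here, minimal cost-complexity pruning, via the `ccp_alpha` parameter. A sketch on the iris data, assuming a recent scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Fully grown, unpruned tree
full = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# cost_complexity_pruning_path returns the effective alphas at which
# subtrees would be collapsed; a larger ccp_alpha prunes more aggressively
path = full.cost_complexity_pruning_path(iris.data, iris.target)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # pick a middle value

# Refit with pruning enabled; in practice alpha is chosen by cross-validation
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
pruned.fit(iris.data, iris.target)
```

The pruned tree has strictly fewer nodes than the full tree for any positive `ccp_alpha` on this data.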

[Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-30 Thread Rex X
DecisionTreeClassifier.tree_.n_node_samples is the total number of samples in all classes of one node, and DecisionTreeClassifier.tree_.value is the computed weight for each class of one node. Only if the sample_weight and class_weight of this DecisionTreeClassifier are all one does this attribute equa
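With unit weights, the per-class counts are in fact recoverable from `tree_.value`. One caveat: how `value` is stored has changed across scikit-learn versions (recent versions store per-node class fractions rather than raw counts), so this sketch handles both cases:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Flatten tree_.value from (n_nodes, 1, n_classes) to (n_nodes, n_classes)
value = clf.tree_.value.reshape(clf.tree_.node_count, -1)

if np.allclose(value.sum(axis=1), 1.0):
    # Recent scikit-learn: rows are class fractions; scale back to counts
    counts = value * clf.tree_.weighted_n_node_samples[:, None]
else:
    # Older scikit-learn: rows are already the (weighted) class counts
    counts = value

root_counts = counts[0]  # per-class sample counts at the root node
```

On iris the root node should show 50 samples of each of the three classes.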

Re: [Scikit-learn-general] How to do tree Pruning with scikit-learn?

2015-08-30 Thread Rex X
: > Tree pruning is currently not supported in sklearn. > > On Sun, Aug 30, 2015 at 6:44 AM, Rex X wrote: > >> Tree pruning process is very important to get a better decision tree. >> >> One idea is to recursively remove the leaf node which cause least hurt to >> the

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-30 Thread Rex X
: > This value is computed while building the tree, but is not kept in the > tree. > > On Sun, Aug 30, 2015 at 7:02 AM, Rex X wrote: > >> DecisionTreeClassifier.tree_.n_node_samples is the total number of >> samples in all classes of one node, and >> Decision

Re: [Scikit-learn-general] How to do tree Pruning with scikit-learn?

2015-08-30 Thread Rex X
On Sun, Aug 30, 2015 at 10:37 AM, Rex X wrote: > >> Hi Jacob, >> >> Is there anything we can do to get better generalized decision rules? >> >> For example, after one tree fitting, select top (N-1) features by >> feature_importance, and then do the fitting a
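The iterative scheme asked about above (select the top N-1 features by importance, then refit) can be sketched as follows; whether the refit tree actually generalises better should of course be checked with cross-validation, not training accuracy:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# First fit on all N features
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Keep the top N-1 features by feature_importances_ and refit
n = X.shape[1]
top = np.argsort(clf.feature_importances_)[::-1][: n - 1]
clf2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, top], y)
acc2 = clf2.score(X[:, top], y)
```

The loop can be repeated, dropping one feature per round, until the validation score starts to degrade.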

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-30 Thread Rex X
is for your own personal >> use, it may be easier to write a small script which successively applies >> the rules of the tree to your data to see how many points from each class >> are present. >> >> On Sun, Aug 30, 2015 at 10:50 AM, Rex X wrote: >> >>>

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-30 Thread Rex X
track of the nodes they > > traverse. You should have a look at the implementation of `apply` to > > get started. > > > > Hope this helps, > > Gilles > > > > On 30 August 2015 at 21:55, Rex X wrote: > >> Hi Trevor, > >> > >> Ye

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-08-30 Thread Rex X
o write a small script which successively applies > the rules of the tree to your data to see how many points from each class > are present. > > On Sun, Aug 30, 2015 at 10:50 AM, Rex X wrote: > >> Hi Jacob and Trevor, >> >> Which part of the sou

[Scikit-learn-general] How to prune combination of logical operators in python?

2015-09-02 Thread Rex X
This question arose when I tried to merge decision tree rules. For example, 1. Given one combination of logical operators: 1a. 'A>1 and B=2 and A>3' should be pruned to 'A>3 and B=2'; 1b. 'A="abc"<0.5 and B=2 and A="xyz">=0.5' should be pruned to 'B=2 and A="xyz"', where A="
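For case 1a, a minimal sketch is to parse each conjunct, keep only the tightest lower bound per variable, and pass everything else through untouched. This only handles the simple `VAR>NUM` / `VAR=NUM` forms from the example; the one-hot case (1b) would need its own handling:

```python
import re

def prune_conjunction(rule):
    """Collapse redundant numeric '>' bounds in an 'and'-joined rule."""
    lower = {}      # tightest '>' bound seen per variable
    others = []     # conditions we do not try to simplify
    for cond in [c.strip() for c in rule.split(" and ")]:
        m = re.match(r"^(\w+)>(-?\d+(?:\.\d+)?)$", cond)
        if m:
            var, val = m.group(1), float(m.group(2))
            # 'A>1 and A>3' is equivalent to 'A>3': keep the larger bound
            lower[var] = max(lower.get(var, val), val)
        else:
            others.append(cond)
    kept = ["%s>%g" % (v, b) for v, b in lower.items()] + others
    return " and ".join(kept)

pruned = prune_conjunction("A>1 and B=2 and A>3")
```

The same idea extends to `<` upper bounds by keeping the minimum instead of the maximum.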

Re: [Scikit-learn-general] How to do tree Pruning with scikit-learn?

2015-09-02 Thread Rex X
bles, too. > > > > On 08/30/2015 03:44 PM, Rex X wrote: > > Jacob, I agree with both of your points about the ensemble methods. They > can give quite good prediction result. > > But the question is to interpret these models. We want to extract specific > decision rules, as fraud

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-09-02 Thread Rex X
each > > node, you can iterate over your data X and increment counters, e.g., > > by doing counters[path(clf.tree_, X[i])] += 1, where counters is a > > numpy array of size tree_.node_count. > > > > Hope this helps, > > Gilles > > > > On 30 August 2015 at 2

Re: [Scikit-learn-general] Is there any attribute saying the number of samples of each class in one decision tree node?

2015-09-02 Thread Rex X
100]) > > > > # [0, 2, 12] > > > > Now to derive statistics like the number of samples reaching each > > node, you can iterate over your data X and increment counters, e.g., > > by doing counters[path(clf.tree_, X[i])] += 1, where counters is a > > numpy array
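The counter scheme Gilles describes above is now available directly: later scikit-learn releases (0.18+) expose `decision_path`, which returns a sparse `(n_samples, n_nodes)` indicator matrix, so the per-node counters are just its column sums:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Entry (i, j) is 1 iff sample i passes through node j
indicator = clf.decision_path(iris.data)

# Column sums give the number of samples reaching each node,
# i.e. the counters array described in the reply above
counters = np.asarray(indicator.sum(axis=0)).ravel()
```

Every sample passes through the root, so the first counter equals the data set size.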

[Scikit-learn-general] DecisionTree: How to split categorical features into two subsets instead of a single value and the rest?

2015-09-11 Thread Rex X
Given categorical attributes, for instance city = ['a', 'b', 'c', 'd', 'e', 'f'] With DictVectorizer(), we can transform "city" into a sparse matrix, using 1-of-k representation. But for each split, the decision tree evaluates only a single attribute, say city == 'a' - True or False? What I want

Re: [Scikit-learn-general] DecisionTree: How to split categorical features into two subsets instead of a single value and the rest?

2015-09-12 Thread Rex X
pe > wrote: > >> Hi Rex, >> >> This is currently not supported in scikit-learn. >> >> Gilles >> >> On 12 September 2015 at 05:02, Rex X wrote: >> > Given categorical attributes, for instance >> > city = ['a', 'b',

[Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Rex X
What is the best way to migrate existing scikit-learn code to a PySpark cluster? Then we can bring together the full power of both scikit-learn and Spark, to do scalable machine learning. Currently I use Python's multiprocessing module to boost the speed. But this only works for one node, while the
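For a real cluster, the reply below points at Spark's MLlib, which implements forests natively. The single-node parallelism mentioned above can be sketched with the standard library alone; this uses a thread pool as a stand-in for the multiprocessing approach (how much it helps depends on how much of `predict` runs outside the GIL), but the chunk-and-map structure is the same one a cluster migration would distribute:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

def predict_chunk(chunk):
    # Each worker scores an independent slice of the data
    return clf.predict(chunk)

# Split the data into chunks and map predict over them in parallel
chunks = np.array_split(iris.data, 4)
with ThreadPoolExecutor(max_workers=2) as ex:
    parts = list(ex.map(predict_chunk, chunks))
preds = np.concatenate(parts)
```

On a cluster the same pattern becomes "broadcast the fitted model, map over data partitions", which is what Spark's `mapPartitions`-style APIs provide.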

Re: [Scikit-learn-general] What is the best way to migrate existing scikit-learn code to PySpark cluster to do scalable machine learning?

2015-09-12 Thread Rex X
use case. >> >> Maybe you should have a look at MLlib >> (https://spark.apache.org/mllib/), which implements a bunch of machine >> learning algorithms (including forests) on top of Spark. >> >> Best, >> Gilles >> >> On 12 September 2015 at 20:11, Rex