[scikit-learn] Unable to connect HDInsight hive to python

2018-08-12 Thread Debabrata Ghosh
Hi All,
   Greetings! Hope you are doing well. I am
reaching out in case you have an answer, or can direct me to the
right forum, please:

We are facing Hive connectivity issues from Python on Azure HDInsight.
We have installed the required SASL, thrift_sasl (0.2.1) and Thrift (0.9.3)
packages on Ubuntu, but somehow when we try to connect to Hive using the
following packages we get errors. It would be a really great help if
you could provide some pointers based on your experience.

Example 1:

from impala.dbapi import connect
conn = connect(host="localhost", port=10001, auth_mechanism="PLAIN",
               user="admin", password="PWD")

(tried both 127.0.0.1:1/10001)

Example 2:

import pyhs2
conn = pyhs2.connect(host='localhost ', port=1, authMechanism="PLAIN",
                     user='admin', password=, database='default')

Example 3:

from pyhive import hive
conn = hive.Connection(host="localhost", port=10001, username="admin",
                       password=None, auth='NONE')

Across all of the above examples we are getting the error message:
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
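Not a fix, but a minimal connectivity sanity check (a sketch using only the Python
standard library, assuming the same host and port as in the examples above). It can
help distinguish "nothing is listening on that port" from "the server accepts the TCP
connection and then closes it", which is what a TSocket read of 0 bytes usually
indicates; HiveServer2 on HDInsight is often configured for HTTP transport rather
than binary Thrift, in which case the HTTP probe below gets some response:

import socket

HOST, PORT = "localhost", 10001  # assumed; match the host/port used above

try:
    # Open a plain TCP connection to the HiveServer2 port.
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        print("TCP connect OK")
        sock.settimeout(10)
        # Send a trivial HTTP probe; an HTTP-mode HiveServer2 endpoint will
        # answer something, while a binary Thrift endpoint typically stays
        # silent until the read times out.
        sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        print("first bytes from server:", sock.recv(256)[:80])
except OSError as exc:
    print("connection failed:", exc)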
Thanks,
Debu


Re: [scikit-learn] class_weight: How to assign a higher weightage to values in a specific column as opposed to values in another column

2017-01-23 Thread Debabrata Ghosh
What would be a sample command for achieving this? Sorry, I am a bit new in this
area, and that's why I will be better able to understand it through a few
example commands.

Thanks again !

On Tue, Jan 24, 2017 at 6:58 AM, Josh Vredevoogd <cleverl...@gmail.com>
wrote:

> If you do not want the weights to be uniform by class, then you need to
> generate weights for each sample and pass the sample weight vector to the
> fit method of the classifier.
>
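A minimal sketch of what Josh describes, i.e. building one weight per sample and
passing it to fit (toy data; the column names and the 10x factor mirror the
"column x vs column y" example discussed below and are only assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: X is the feature matrix, y the labels, and x_flag marks
# the rows whose positive labels should carry 10x the weight (hypothetical
# names, standing in for the "column x" / "column y" distinction below).
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.randint(0, 2, size=200)
x_flag = rng.randint(0, 2, size=200).astype(bool)

# One weight per sample: 1.0 by default, 10.0 for the flagged positives.
sample_weight = np.ones(len(y))
sample_weight[(y == 1) & x_flag] = 10.0

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)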
> On Mon, Jan 23, 2017 at 4:48 PM, Debabrata Ghosh <mailford...@gmail.com>
> wrote:
>
>> Thanks Josh for your quick feedback ! It's quite helpful indeed .
>>
>> Further to that, I have another pressing question. In my sample
>> dataset, I have 2 label columns (let's say x and y).
>>
>> My objective is to give the labels within column 'x' 10 times more weight
>> as compared to labels within column y.
>>
>> My question is the parameter class_weight={0: 1, 1: 10} works for a
>> single column, i.e., within a single column I have assigned 10 times weight
>> to the positive labels.
>>
>> But my objective is to provide a 10 times weight to the positive labels
>> within column 'x' as compared to the positive labels within column 'y'.
>>
>> May I please get your feedback on how to achieve this?
>> Thanks for your help in advance !
>>
>> On Mon, Jan 23, 2017 at 9:56 AM, Josh Vredevoogd <cleverl...@gmail.com>
>> wrote:
>>
>>> If you undersample, taking only 10% of the negative class, the
>>> classifier will see different combinations of attributes and produce a
>>> different fit to explain those distributions. In the worst case, imagine
>>> you are classifying birds and through sampling you eliminate all `red`
>>> examples. Your classifier likely now will not understand that red objects
>>> can be birds. That's an overly simple example, but given a classifier
>>> capable of exploring and explaining feature combinations, less obvious
>>> versions of this are bound to happen.
>>>
>>> The extrapolation only works in the other direction: if you manually
>>> duplicate samples by the sampling factor, you should get the exact same fit
>>> as if you increased the class weight.
>>>
>>> Hope that helps,
>>> Josh
>>>
>>>
>>> On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh <mailford...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Josh !
>>>>
>>>> I have used the parameter class_weight={0: 1, 1: 10} and the model code
>>>> has run successfully. However, just to get further clarity on its
>>>> concept, I have another question for you. I did the following 2
>>>> tests:
>>>>
>>>> 1. In my dataset , I have 1 million negative classes and 10,000
>>>> positive classes. First I ran my model code without supplying any
>>>> class_weight parameter and it gave me certain True Positive and False
>>>> Positive results.
>>>>
>>>> 2. Now in the second test, I had the same 1 million negative classes
>>>> but reduced the positive classes to 1000 . But this time, I supplied the
>>>> parameter class_weight={0: 1, 1: 10} and got my True Positive and False
>>>> Positive Results
>>>>
>>>> My question is, when I multiply the results obtained from my second
>>>> test by a factor of 10, they don't match the results obtained from my
>>>> first test. In other words, say I get the true positive against a threshold
>>>> from the second test as 8 , while the true positive from the first test
>>>> against the same threshold is 260. I am getting similar observations for
>>>> the false positive results wherein if I multiply the results obtained in
>>>> the second test by 10, I don't come close to the results obtained from the
>>>> first set.
>>>>
>>>> Is my expectation correct ? Is my way of executing the test (i.e.,
>>>> reducing the positive classes by 10 times and then feeding a class
>>>> weight of 10 times the negative classes) and comparing the results with a
>>>> model run without any class weight parameter correct ?
>>>>
>>>> Please let me know at your convenience, as this will greatly help me
>>>> understand the concept further.
>>>>
>>>> Thanks in advance !
>>>>
>>>> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd <cleverl...@gmail.com>
>>>> wrote:
>>&

Re: [scikit-learn] Query regarding parameter class_weight in Random Forest Classifier

2017-01-22 Thread Debabrata Ghosh
Thanks Josh !

I have used the parameter class_weight={0: 1, 1: 10} and the model code has
run successfully. However, just to get further clarity on its
concept, I have another question for you. I did the following 2
tests:

1. In my dataset , I have 1 million negative classes and 10,000 positive
classes. First I ran my model code without supplying any class_weight
parameter and it gave me certain True Positive and False Positive results.

2. Now in the second test, I had the same 1 million negative classes but
reduced the positive classes to 1000 . But this time, I supplied the
parameter class_weight={0: 1, 1: 10} and got my True Positive and False
Positive Results

My question is, when I multiply the results obtained from my second test
by a factor of 10, they don't match the results obtained from my first
test. In other words, say I get the true positive against a threshold from
the second test as 8 , while the true positive from the first test against
the same threshold is 260. I am getting similar observations for the false
positive results wherein if I multiply the results obtained in the second
test by 10, I don't come close to the results obtained from the first set.

Is my expectation correct ? Is my way of executing the test (i.e., reducing
the positive classes by 10 times and then feeding a class weight of 10
times the negative classes) and comparing the results with a model run
without any class weight parameter correct ?

Please let me know at your convenience, as this will greatly help me
understand the concept further.

Thanks in advance !

On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd <cleverl...@gmail.com>
wrote:

> The class_weight parameter doesn't behave the way you're expecting.
>
> The value in class_weight is the weight applied to each sample in that
> class - in your example, each class zero sample has weight 0.001 and each
> class one sample has weight 0.999, so each class one sample carries 999
> times the weight of a class zero sample.
>
> If you would like each class one sample to have ten times the weight, you
> would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}`
> equivalently.
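A small sketch of that equivalence on toy data (the numbers are only illustrative);
because only the ratio between the class weights matters, the two settings below
should produce the same fitted forest:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data standing in for the real dataset in the thread.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Positive class weighted 10x the negative class, written two equivalent ways.
clf_a = RandomForestClassifier(n_estimators=100, random_state=0,
                               class_weight={0: 1, 1: 10}).fit(X, y)
clf_b = RandomForestClassifier(n_estimators=100, random_state=0,
                               class_weight={0: 0.1, 1: 1}).fit(X, y)

# With the same random_state the two forests should make identical predictions.
print((clf_a.predict(X) == clf_b.predict(X)).all())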
>
>
> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh <mailford...@gmail.com>
> wrote:
>
>> Hi All,
>>  Greetings !
>>
>>   I have a very basic question regarding the usage of the
>> parameter class_weight in scikit learn's Random Forest Classifier's fit
>> method.
>>
>>   I have a fairly unbalanced sample and my positive class :
>> negative class ratio is 1:100. In other words, I have a million records
>> corresponding to negative class and 10,000 records corresponding to
>> positive class. I have trained the random forest classifier model using the
>> above record set successfully.
>>
>>   Further, for a different problem, I want to test the
>> parameter class_weight. So, I am setting the class_weight as [0:0.001 ,
>> 1:0.999] and I have tried running my model on the same dataset as mentioned
>> in the above paragraph but with the positive class records reduced to 1000
>> [because now each positive class is given approximately 10 times more
>> weight than a negative class]. However, the model run results are very very
>> different between the 2 runs (with and without class_weight), and I
>> expected similar results.
>>
>> Would you please be able to let me know where I am
>> going wrong. I know it's something silly, but I just want to improve my
>> concept.
>>
>> Thanks !
>>


Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-28 Thread Debabrata Ghosh
Thanks Naoya ! This has worked and I am able to generate the .dot files.

Cheers,

Debu

On Thu, Dec 29, 2016 at 10:20 AM, Naoya Kanai <nao...@gmail.com> wrote:

> The ‘tree’ name is clashing between the sklearn.tree module and the
> DecisionTreeClassifier objects in the loop.
>
> You can change the import to
>
> from sklearn.tree import export_graphviz
>
> and modify the method call accordingly.
> ​
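Putting that together, a sketch of the corrected loop on a small toy forest (toy
data; in the thread clf would be the already fitted RandomForestClassifier):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Toy stand-in for the thread's fitted forest (5 trees, so only 5 .dot files).
X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Rename the loop variable so it no longer shadows the sklearn.tree module,
# and call the directly imported export_graphviz on each individual tree.
for idx_tree, estimator in enumerate(clf.estimators_):
    export_graphviz(estimator, out_file='{}.dot'.format(idx_tree))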
>
> On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh <mailford...@gmail.com>
> wrote:
>
>> Hi Guillaume,
>>   Thanks for your feedback ! I am
>> still getting an error, while attempting to print the trees. Here is a
>> snapshot of my code. I know I may be missing something very silly, but
>> still wanted to check and see how this works.
>>
>> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>> >>> clf.fit(p_features_train,p_labels_train)
>> RandomForestClassifier(bootstrap=True, class_weight=None,
>> criterion='gini',
>> max_depth=None, max_features='auto', max_leaf_nodes=None,
>> min_samples_leaf=1, min_samples_split=2,
>> min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
>> oob_score=False, random_state=None, verbose=0,
>> warm_start=False)
>> >>> for idx_tree, tree in enumerate(clf.estimators_):
>> ... export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>> ...
>> Traceback (most recent call last):
>>   File "", line 2, in 
>> NameError: name 'export_graphviz' is not defined
>> >>> for idx_tree, tree in enumerate(clf.estimators_):
>> ... tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>> ...
>> Traceback (most recent call last):
>>   File "", line 2, in 
>> AttributeError: 'DecisionTreeClassifier' object has no attribute
>> 'export_graphviz'
>>
>> Just to give you a background on the libraries, I have imported the
>> following libraries:
>>
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn import tree
>>
>> Thanks again as always !
>>
>> Cheers,
>>
>> On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lemaître <
>> g.lemaitr...@gmail.com> wrote:
>>
>>> after the fit you need this call:
>>> for idx_tree, tree in enumerate(clf.estimators_):
>>>  export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>>>
>>>
>>>
>>> On 28 December 2016 at 20:25, Debabrata Ghosh <mailford...@gmail.com>
>>> wrote:
>>>
>>>> Hi Guillaume,
>>>>   With respect to the following point you
>>>> mentioned:
>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tre
>>>> e.export_graphviz.html
>>>>
>>>> I couldn't find a direct method for exporting the
>>>> RandomForestClassifier trees. Accordingly, I attempted for a workaround
>>>> using the following code but still no success:
>>>>
>>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>>>> clf.fit(p_features_train,p_labels_train)
>>>> for i, tree in enumerate(clf.estimators_):
>>>> with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>>>>  tree.export_graphviz(clf, dotfile)
>>>>
>>>> Would you please be able to help me with the piece of code which I need
>>>> to execute for exporting the RandomForestClassifier trees.
>>>>
>>>> Cheers,
>>>>
>>>> Debu
>>>>
>>>>
>>>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <
>>>> g.lemaitr...@gmail.com> wrote:
>>>>
>>>>> On 27 December 2016 at 18:17, Debabrata Ghosh <mailford...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear Joel, Andrew and Roman,
>>>>>> Thank you very
>>>>>> much for your individual feedback ! It's very helpful indeed ! A few more
>>>>>> points related to my model execution:
>>>>>>
>>>>>> 1. By the term "scoring" I meant the process of executing the model
>>>>>> once again without retraining it. So , for training the model I used
>>>>>> RandomForestClassifer library and for my scoring (execution withou

Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-28 Thread Debabrata Ghosh
Hi Guillaume,
  Thanks for your feedback ! I am still
getting an error, while attempting to print the trees. Here is a snapshot
of my code. I know I may be missing something very silly, but still wanted
to check and see how this works.

>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>>> clf.fit(p_features_train,p_labels_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
>>> for idx_tree, tree in enumerate(clf.estimators_):
... export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
...
Traceback (most recent call last):
  File "", line 2, in 
NameError: name 'export_graphviz' is not defined
>>> for idx_tree, tree in enumerate(clf.estimators_):
... tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
...
Traceback (most recent call last):
  File "", line 2, in 
AttributeError: 'DecisionTreeClassifier' object has no attribute
'export_graphviz'

Just to give you a background on the libraries, I have imported the
following libraries:

from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

Thanks again as always !

Cheers,

On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lemaître <g.lemaitr...@gmail.com>
wrote:

> after the fit you need this call:
> for idx_tree, tree in enumerate(clf.estimators_):
>  export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>
>
>
> On 28 December 2016 at 20:25, Debabrata Ghosh <mailford...@gmail.com>
> wrote:
>
>> Hi Guillaume,
>>   With respect to the following point you
>> mentioned:
>> You can visualize the trees with sklearn.tree.export_graphviz:
>> http://scikit-learn.org/stable/modules/generated/sklearn.tre
>> e.export_graphviz.html
>>
>> I couldn't find a direct method for exporting the RandomForestClassifier
>> trees. Accordingly, I attempted for a workaround using the following code
>> but still no success:
>>
>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>> clf.fit(p_features_train,p_labels_train)
>> for i, tree in enumerate(clf.estimators_):
>> with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>>  tree.export_graphviz(clf, dotfile)
>>
>> Would you please be able to help me with the piece of code which I need
>> to execute for exporting the RandomForestClassifier trees.
>>
>> Cheers,
>>
>> Debu
>>
>>
>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <
>> g.lemaitr...@gmail.com> wrote:
>>
>>> On 27 December 2016 at 18:17, Debabrata Ghosh <mailford...@gmail.com>
>>> wrote:
>>>
>>>> Dear Joel, Andrew and Roman,
>>>> Thank you very
>>>> much for your individual feedback ! It's very helpful indeed ! A few more
>>>> points related to my model execution:
>>>>
>>>> 1. By the term "scoring" I meant the process of executing the model
>>>> once again without retraining it. So , for training the model I used
>>>> RandomForestClassifer library and for my scoring (execution without
>>>> retraining) I have used joblib.dump and joblib.load
>>>>
>>>
>>> Go probably with the terms: training, validating, and testing.
>>> This is pretty much standard. Scoring is just the value of a
>>> metric given some data (training data, validation data, or
>>> testing data).
>>>
>>>
>>>>
>>>> 2. I have used the parameter n_estimator = 5000 while training my
>>>> model. Besides it , I have used n_jobs = -1 and haven't used any other
>>>> parameter
>>>>
>>>
>>> You should probably check those other parameters and understand
>>>  what their effects are. You should really check the link of Roman
>>> since GridSearchCV can help you to decide how to fix the parameters.
>>> http://scikit-learn.org/stable/modules/generated/sklearn.mod
>>> el_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>> Additionally, 5000 trees seems a lot to me.
>>>
>>>
>>>>
>>>> 3. For my "scoring" activity (executing the model without retraining
>>>> it) is there an alternate approach to joblib library ?
>>>>
>&g

Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-28 Thread Debabrata Ghosh
Hi Guillaume,
  With respect to the following point you mentioned:
You can visualize the trees with sklearn.tree.export_graphviz:
http://scikit-learn.org/stable/modules/generated/sklearn.tre
e.export_graphviz.html

I couldn't find a direct method for exporting the RandomForestClassifier
trees. Accordingly, I attempted a workaround using the following code,
but still with no success:

clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
clf.fit(p_features_train,p_labels_train)
for i, tree in enumerate(clf.estimators_):
    with open('tree_' + str(i) + '.dot', 'w') as dotfile:
        tree.export_graphviz(clf, dotfile)

Would you please be able to help me with the piece of code which I need to
execute for exporting the RandomForestClassifier trees.

Cheers,

Debu


On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <g.lemaitr...@gmail.com
> wrote:

> On 27 December 2016 at 18:17, Debabrata Ghosh <mailford...@gmail.com>
> wrote:
>
>> Dear Joel, Andrew and Roman,
>> Thank you very much
>> for your individual feedback ! It's very helpful indeed ! A few more points
>> related to my model execution:
>>
>> 1. By the term "scoring" I meant the process of executing the model once
>> again without retraining it. So , for training the model I used
>> RandomForestClassifer library and for my scoring (execution without
>> retraining) I have used joblib.dump and joblib.load
>>
>
> Go probably with the terms: training, validating, and testing.
> This is pretty much standard. Scoring is just the value of a
> metric given some data (training data, validation data, or
> testing data).
>
>
>>
>> 2. I have used the parameter n_estimator = 5000 while training my model.
>> Besides it , I have used n_jobs = -1 and haven't used any other parameter
>>
>
> You should probably check those other parameters and understand
>  what their effects are. You should really check the link of Roman
> since GridSearchCV can help you to decide how to fix the parameters.
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.
> GridSearchCV.html#sklearn.model_selection.GridSearchCV
> Additionally, 5000 trees seems a lot to me.
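A minimal sketch of the kind of parameter search being suggested here (toy data;
the grid and the scoring choice are only illustrative, not a recommendation):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the real features/labels discussed in the thread.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Illustrative grid only; sensible ranges depend on the data.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=0),
    param_grid=param_grid,
    scoring="f1",  # a metric other than plain recall, as suggested in the thread
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)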
>
>
>>
>> 3. For my "scoring" activity (executing the model without retraining it)
>> is there an alternate approach to joblib library ?
>>
>
> Joblib only stores data. It has no link with scoring (check Roman's answer).
>
>
>>
>> 4. When I execute my scoring job (joblib method) on a dataset , which is
>> completely different to my training dataset then I get similar True
>> Positive Rate and False Positive Rate as of training
>>
>
> It is what you should get.
>
>
>>
>> 5. However, when I execute my scoring job on the same dataset used for
>> training my model then I get very high TPR and FPR.
>>
>
> You are testing on some data which you used while training. Probably,
> one of the first rules is not to do that. If you want to evaluate in some
> way your classifier, have a separate set (test set) and only test on that
> one. As previously mentioned by Roman, 80% of your data are already
> known by the RandomForestClassifier and will be perfectly classified.
>
>
>>
>>   Is there a mechanism
>> through which I can visualise the trees created by my RandomForestClassifier
>> algorithm ? While I dumped the model using joblib.dump , there are a bunch
>> of .npy files created. Will those contain the trees ?
>>
>
> You can visualize the trees with sklearn.tree.export_graphviz:
> http://scikit-learn.org/stable/modules/generated/
> sklearn.tree.export_graphviz.html
>
> The bunch of npy are the data needed to load the RandomForestClassifier
> which
> you previously dumped.
>
>
>>
>> Thanks in advance !
>>
>> Cheers,
>>
>> Debu
>>
>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.noth...@gmail.com>
>> wrote:
>>
>>> Your model is overfit to the training data. Not to say that it's
>>> necessarily possible to get a better fit. The default settings for trees
>>> lean towards a tight fit, so you might modify their parameters to increase
>>> regularisation. Still, you should not expect that evaluating a model's
>>> performance on its training data will be indicative of its general
>>> performance. This is why we use held-out test sets and cross-validation.
>>>
>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurc...@gmail.com>
>>> wrote:
>>>
>>>> Hi Debu,
>

Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-27 Thread Debabrata Ghosh
Thanks Guillaume for your quick feedback ! Appreciate it a lot.

I will definitely try out the links you have given. Another quick one,
please. My objective is to execute the model without retraining it. Let me
give you an example to elaborate: I train my model on a huge set
of data (6 months of historical data) and finalise my model. Going
forward I need to run my model against a smaller set of data (daily data), and
for that I wouldn't need to retrain my model daily.

Given the above scenario, I wanted to confirm once more whether it is a good
approach to use joblib.dump after training the model and joblib.load when
executing the model on a daily basis. I am
using joblib.dump(clf, 'model.pkl') and, for loading, I am using
joblib.load('model.pkl'). I am not leveraging any of the *.npy files
generated in the folder.
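A minimal sketch of that workflow on toy data (using 'model.pkl' as in the message;
at the time of this thread joblib was typically imported as sklearn.externals.joblib,
while in current scikit-learn it is the standalone joblib package):

import joblib  # in older scikit-learn: from sklearn.externals import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train once on (toy) historical data and persist the fitted model.
X_hist, y_hist = make_classification(n_samples=1000, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_hist, y_hist)
joblib.dump(clf, 'model.pkl')

# Later, e.g. in the daily run: load the fitted model and predict on new
# data without retraining.
clf_loaded = joblib.load('model.pkl')
X_new, _ = make_classification(n_samples=50, random_state=1)
proba = clf_loaded.predict_proba(X_new)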

Now, as you mentioned that joblib is a mechanism to save the data but my
objective is not to load the data used during the model training but only
the algorithm so that I can run the model on a fresh set of data after
loading data. And indeed my model is running fine after I execute the
joblib.load ('model.pkl) command but I wanted to confirm what it's doing
internally.

Thanks in advance !

Cheers,

Debu

On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <g.lemaitr...@gmail.com
> wrote:

> On 27 December 2016 at 18:17, Debabrata Ghosh <mailford...@gmail.com>
> wrote:
>
>> Dear Joel, Andrew and Roman,
>> Thank you very much
>> for your individual feedback ! It's very helpful indeed ! A few more points
>> related to my model execution:
>>
>> 1. By the term "scoring" I meant the process of executing the model once
>> again without retraining it. So , for training the model I used
>> RandomForestClassifer library and for my scoring (execution without
>> retraining) I have used joblib.dump and joblib.load
>>
>
> Go probably with the terms: training, validating, and testing.
> This is pretty much standard. Scoring is just the value of a
> metric given some data (training data, validation data, or
> testing data).
>
>
>>
>> 2. I have used the parameter n_estimator = 5000 while training my model.
>> Besides it , I have used n_jobs = -1 and haven't used any other parameter
>>
>
> You should probably check those other parameters and understand
>  what their effects are. You should really check the link of Roman
> since GridSearchCV can help you to decide how to fix the parameters.
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.
> GridSearchCV.html#sklearn.model_selection.GridSearchCV
> Additionally, 5000 trees seems a lot to me.
>
>
>>
>> 3. For my "scoring" activity (executing the model without retraining it)
>> is there an alternate approach to joblib library ?
>>
>
> Joblib only stores data. It has no link with scoring (check Roman's answer).
>
>
>>
>> 4. When I execute my scoring job (joblib method) on a dataset , which is
>> completely different to my training dataset then I get similar True
>> Positive Rate and False Positive Rate as of training
>>
>
> It is what you should get.
>
>
>>
>> 5. However, when I execute my scoring job on the same dataset used for
>> training my model then I get very high TPR and FPR.
>>
>
> You are testing on some data which you used while training. Probably,
> one of the first rules is not to do that. If you want to evaluate in some
> way your classifier, have a separate set (test set) and only test on that
> one. As previously mentioned by Roman, 80% of your data are already
> known by the RandomForestClassifier and will be perfectly classified.
>
>
>>
>>   Is there a mechanism
>> through which I can visualise the trees created by my RandomForestClassifier
>> algorithm ? While I dumped the model using joblib.dump , there are a bunch
>> of .npy files created. Will those contain the trees ?
>>
>
> You can visualize the trees with sklearn.tree.export_graphviz:
> http://scikit-learn.org/stable/modules/generated/
> sklearn.tree.export_graphviz.html
>
> The bunch of npy are the data needed to load the RandomForestClassifier
> which
> you previously dumped.
>
>
>>
>> Thanks in advance !
>>
>> Cheers,
>>
>> Debu
>>
>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.noth...@gmail.com>
>> wrote:
>>
>>> Your model is overfit to the training data. Not to say that it's
>>> necessarily possible to get a better fit. The default settings for trees
>>> lean towards 

Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-27 Thread Debabrata Ghosh
Dear Joel, Andrew and Roman,
Thank you very much for
your individual feedback ! It's very helpful indeed ! A few more points
related to my model execution:

1. By the term "scoring" I meant the process of executing the model once
again without retraining it. So, for training the model I used the
RandomForestClassifier library, and for my scoring (execution without
retraining) I have used joblib.dump and joblib.load

2. I have used the parameter n_estimators = 5000 while training my model.
Besides that, I have used n_jobs = -1 and haven't used any other parameters.

3. For my "scoring" activity (executing the model without retraining it) is
there an alternative approach to the joblib library?

4. When I execute my scoring job (joblib method) on a dataset which is
completely different from my training dataset, I get a True
Positive Rate and False Positive Rate similar to those from training.

5. However, when I execute my scoring job on the same dataset used for
training my model, I get a very high TPR and FPR.

  Is there a mechanism
through which I can visualise the trees created by my RandomForestClassifier
algorithm? When I dumped the model using joblib.dump, a bunch
of .npy files were created. Will those contain the trees?

Thanks in advance !

Cheers,

Debu

On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman 
wrote:

> Your model is overfit to the training data. Not to say that it's
> necessarily possible to get a better fit. The default settings for trees
> lean towards a tight fit, so you might modify their parameters to increase
> regularisation. Still, you should not expect that evaluating a model's
> performance on its training data will be indicative of its general
> performance. This is why we use held-out test sets and cross-validation.
>
> On 27 December 2016 at 20:51, Roman Yurchak  wrote:
>
>> Hi Debu,
>>
>> On 27/12/16 08:18, Andrew Howe wrote:
>> >  5. I got a prediction result with True Positive Rate (TPR) as 10-12
>> > % on probability thresholds above 0.5
>>
>> Getting a high True Positive Rate (recall) is not a sufficient condition
>> for a well behaved model. Though 0.1 recall is still pretty bad. You
>> could look at the precision at the same time (or consider, for instance,
>> the F1 score).
>>
>> >  7. I reloaded the model in a different python instance from the
>> > pickle file mentioned above and did my scoring , i.e., used
>> > joblib library load method and then instantiated prediction
>> > (predict_proba method) on the entire set of my original 600 K
>> > records
>> >   Another question – is there an alternate model scoring
>> > library (apart from joblib, the one I am using) ?
>>
>> Joblib is not a scoring library; once you load a model from disk with
>> joblib you should get ~ the same RandomForestClassifier estimator object
>> as before saving it.
>>
>> >  8. Now when I am running (scoring) my model using
>> > joblib.predict_proba on the entire set of original data (600 K),
>> > I am getting a True Positive rate of around 80%.
>>
>> That sounds normal, considering what you are doing. Your entire set
>> consists of the 80% training set (for which the recall, I imagine, would
>> be close to 1.0) and the 20% test set (with a recall of 0.1), so on
>> average you would get a recall close to 0.8 for the complete set. Unless
>> I missed something.
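A quick back-of-the-envelope check of that point, using the recall values mentioned
in the thread (pure arithmetic, assuming the positives are spread proportionally
across the two subsets):

# 80% of the data was seen during training (recall ~1.0 there),
# 20% is the held-out test set (recall ~0.1 there).
train_frac, test_frac = 0.8, 0.2
recall_train, recall_test = 1.0, 0.1

overall_recall = train_frac * recall_train + test_frac * recall_test
print(overall_recall)  # 0.82, close to the ~80% TPR observed on the full 600 K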
>>
>>
>> >  9. I did some  further analysis and figured out that during the
>> > training process, when the model was predicting on the test
>> > sample of 120K it could only predict 10-12% of 120K data beyond
>> > a probability threshold of 0.5. When I am now trying to score my
>> > model on the entire set of 600 K records, it appears that the
>> > model is remembering some of it’s past behavior and data and
>> > accordingly throwing 80% True positive rate
>>
>> It feels like your RandomForestClassifier is not properly tuned. A
>> recall of 0.1 on the test set is quite low. It could be worth trying to
>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
>> other metric than the recall to evaluate the performance.
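For example, a small sketch of looking at several metrics at once on the held-out
predictions (the two arrays below are placeholders standing in for the thread's
actual test labels and predictions):

from sklearn.metrics import f1_score, precision_score, recall_score

# Placeholder arrays; in the thread these would be y_test and clf.predict(X_test).
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 1, 1, 0]

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))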
>>
>>
>> Roman


Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-26 Thread Debabrata Ghosh
Hi Joel,

Thanks for your quick feedback – I certainly understand
what you mean and please allow me to explain one more time through a
sequence of steps corresponding to the approach I followed:



   1. I considered a dataset containing 600 K (0.6 million) records for
   training my model using scikit learn’s Random Forest Classifier library

   2. I did a training and test sample split on the 600 K – forming a 480 K
   training dataset and a 120 K test dataset (80:20 split)

   3. I trained scikit learn’s Random Forest Classifier model on the 480 K
   (80% split) training sample

   4. Then I ran prediction (predict_proba method of scikit learn’s RF
   library) on the 120 K test sample

   5. I got a prediction result with a True Positive Rate (TPR) of 10-12 % at
   probability thresholds above 0.5

   6. I saved the above Random Forest Classifier model using scikit learn’s
   joblib library (dump method) in the form of a pickle file

   7. I reloaded the model in a different python instance from the pickle
   file mentioned above and did my scoring, i.e., used the joblib library load
   method and then instantiated prediction (predict_proba method) on the
   entire set of my original 600 K records

   8. Now when I am running (scoring) my model using joblib.predict_proba
   on the entire set of original data (600 K), I am getting a True Positive
   Rate of around 80%.

   9. I did some further analysis and figured out that during the training
   process, when the model was predicting on the test sample of 120 K it could
   only predict 10-12% of the 120 K data beyond a probability threshold of 0.5.
   When I am now trying to score my model on the entire set of 600 K records,
   it appears that the model is remembering some of its past behavior and
   data and accordingly throwing an 80% True Positive Rate

   10. When I tried to score the model using joblib.predict_proba on a
   completely disjoint dataset from the one used for training (i.e., no
   overlap between training and scoring data) then it’s giving me the right
   True Positive Rate (in the range of 10 – 12%)

  *Here lies my question once again:* Should I be using 2 different
input datasets (completely exclusive / disjoint) for training and scoring
the model? If the input datasets for scoring and training overlap,
then I get incorrect results. Will that be a fair assumption?
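A minimal sketch of the evaluation pattern the replies point to, on toy data (the
key point being that the honest number is the one computed on rows the model never
saw during fit):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the 600 K records; 80:20 split as in the steps above.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Recall on rows the model has already seen is optimistic...
print("recall on training rows:", recall_score(y_train, clf.predict(X_train)))
# ...so the number to report is the one on the held-out rows.
print("recall on held-out rows:", recall_score(y_test, clf.predict(X_test)))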

  Another question – is there an alternate model scoring library
(apart from joblib, the one I am using) ?


 Thanks once again for your feedback in advance !


Cheers,


Debu

On Tue, Dec 27, 2016 at 1:56 AM, Joel Nothman <joel.noth...@gmail.com>
wrote:

> Hi Debu,
>
> Your post is terminologically confusing, so I'm not sure I've understood
> your problem. Where is the "different sample" used for scoring coming from?
> Is it possible it is more related to the training data than the test sample?
>
> Joel
>
> On 27 December 2016 at 05:28, Debabrata Ghosh <mailford...@gmail.com>
> wrote:
>
>> Dear All,
>>
>> Greetings!
>>
>> I need some urgent guidance and help
>> from you all in model scoring. What I mean by model scoring is around the
>> following steps:
>>
>>
>>
>>1. I have trained a Random Forest Classifier model using scikit-learn
>>(RandomForestClassifier library)
>>2. Then I have generated the True Positive and False Positive
>>predictions on my test data set using predict_proba method (I have 
>> split
>>my data into training and test samples in 80:20 ratio)
>>3. Finally, I have dumped the model into a pkl file.
>>4. Next in another instance, I have loaded the .pkl file
>>5. I have initiated job_lib.predict_proba method for predicting the
>>True Positive and False positives on a different sample. I am terming this
>>step as scoring, wherein I am predicting without retraining the model
>>
>> My question is when I generate the True Positive Rate on
>> the test data set (as part of model training approach), the rate which I am
>> getting is 10 – 12%. But when I do the scoring (using the steps mentioned
>> above), my True Positive Rate is shooting up to 80%. Although I am
>> happy to get a very high TPR but my question is whether getting such a high
>> TPR during the scoring phase is an expected outcome? In other words,
>> whether achieving a high TPR through joblib is an accepted outcome
>> vis-à-vis getting the TPR on training / test data set.
>>
>> Your views on the above ask will be really helpful as I
>> am very confused whether to consider scoring the model using joblib.
>> Otherwis

[scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib

2016-12-14 Thread Debabrata Ghosh
Hi All,
  I have run scikit-learn Random Forest Classifier
algorithm against a dataset, and here are my TPR and FPR values at various
thresholds:

[image: Inline image 1]

Further, I have plotted the above values in matplotlib and am getting a very
low AUC. Here is my matplotlib code. Could you help me interpret the
graph? Is my model OK, or is there something wrong?
I would appreciate a quick response, please.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics

plt.title('Receiver Operating Characteristic')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

fpr = [0.0002337345394340, 0.0001924870472260, 0.0001626973851550,
       0.950977673794, 0.721826427097, 0.538505429739,
       0.389557119386, 0.263523933702, 0.137490748018]

tpr = [0.196736382441, 0.189841415766, 0.181222707424,
       0.170555108608, 0.164348925411, 0.157894736842,
       0.151344518501, 0.144104803493, 0.132383360147]

roc_auc = metrics.auc(fpr, tpr)

plt.plot([0, 1], [0, 1], 'r--')
plt.plot(fpr, tpr, 'bo-', label='AUC = %0.9f' % roc_auc)
plt.legend(loc='lower right')

plt.show()

[image: Inline image 2]
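For reference, a small sketch of letting scikit-learn compute the ROC points and AUC
directly from prediction scores rather than typing fpr/tpr values in by hand (toy
labels and scores; in the thread's case they would come from the test set and
predict_proba):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy labels and scores standing in for the real test labels / probabilities.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=500)
y_score = np.clip(0.3 * y_true + 0.7 * rng.rand(500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

plt.plot([0, 1], [0, 1], 'r--')
plt.plot(fpr, tpr, label='AUC = %0.3f' % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()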


Re: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError

2016-12-09 Thread Debabrata Ghosh
Thanks Piotr for your feedback !

I did look into sparkit-learn yesterday but couldn't confirm
that it contains a RandomForestClassifier method. I would need to
request that the customer download it for me, as I don't have permission for
that. Could you please help me confirm whether sparkit-learn has
the following methods (corresponding to scikit-learn):

1.sklearn.ensemble -> RandomForestClassifier

2.sklearn.cross_validation -> StratifiedKFold

3.sklearn.cross_validation -> train_test_split

Is there a URL for sparkit-learn, similar to scikit-learn's, where all the
methods are listed?

I have figured out that sparkit-learn needs to be downloaded from
https://pypi.python.org/pypi/sparkit-learn, but apart from that, does anything
else need to be downloaded?

Just wanted to check once before requesting my customer as otherwise it
would be a bit embarrassing.

Thanks again !

Cheers,

Debu

On Fri, Dec 9, 2016 at 3:37 PM, Piotr Bialecki <piotr.biale...@hotmail.de>
wrote:

> Hi Debu,
>
> I have not worked with pyspark yet and cannot resolve your error,
> but have you tried out sparkit-learn?
> https://github.com/lensacom/sparkit-learn
>
> It seems to be a package combining pyspark with sklearn and it also has a
> RandomForest and other classifiers:
> (SparkRandomForestClassifier, https://github.com/lensacom/
> sparkit-learn/blob/master/splearn/ensemble/__init__.py)
>
>
> Greets,
> Piotr
>
> On 09.12.2016 10:56, Debabrata Ghosh wrote:
>
> Hi Piotr,
>  Yes, I did use n_jobs = - 1 as well. But the code
> didn't run successfully. On my output screen , I got the following message
> instead of the JobLibMemoryError:
>
> 16/12/08 22:12:26 INFO YarnExtensionServices: In shutdown hook for
> org.apache.spark.scheduler.cluster.YarnExtensionServices$$anon$1@176b071d
> 16/12/08 22:12:26 INFO YarnHistoryService: Shutting down: pushing out 0
> events
> 16/12/08 22:12:26 INFO YarnHistoryService: Event handler thread stopping
> the service
> 16/12/08 22:12:26 INFO YarnHistoryService: Stopping dequeue service, final
> queue size is 0
> 16/12/08 22:12:26 INFO YarnHistoryService: Stopped: Service History
> Service in state History Service: STOPPED endpoint=
> http://servername.com:8188/ws/v1/timeline/; bonded to
> ATS=false; listening=true; batchSize=3; flush count=17; current queue
> size=0; total number queued=52, processed=50; post failures=0;
> 16/12/08 22:12:26 INFO SparkContext: Invoking stop() from shutdown hook
> 16/12/08 22:12:26 INFO YarnHistoryService: History service stopped;
> ignoring queued event : [1481256746854]: SparkListenerApplicationEnd(14
> 81256746854)
>
>  Just to give you some background, I am executing the
> scikit-learn Random Forest Classifier via a pyspark command. I am not sure what
> has gone wrong while using n_jobs = -1, such that the program suddenly shuts
> down certain services. Please can you suggest a remedy, as I have been given
> the task to run this via pyspark itself.
>
>   Thanks in advance !
>
> Cheers,
>
> Debu
>
> On Fri, Dec 9, 2016 at 2:48 PM, Piotr Bialecki <piotr.biale...@hotmail.de>
> wrote:
>
>> Hi Debu,
>>
>> it seems that you run out of memory.
>> Try using fewer processes.
>> I don't think that n_jobs = 1000 will perform as you wish.
>>
>> Setting n_jobs to -1 uses the number of cores in your system.
>>
>>
>> Greets,
>> Piotr
>>
>>
>> On 09.12.2016 08:16, Debabrata Ghosh wrote:
>>
>> Hi All,
>>
>>   Greetings !
>>
>>
>>
>> I am getting JoblibMemoryError while executing a scikit-learn
>> RandomForestClassifier code. Here is my algorithm in short:
>>
>>
>>
>> from sklearn.ensemble import RandomForestClassifier
>>
>> from sklearn.cross_validation import train_test_split
>>
>> import pandas as pd
>>
>> import numpy as np
>>
>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000)
>>
>> clf.fit(p_input_features_train,p_input_labels_train)
>>
>>
>> The dataframe p_input_features contains 134 columns (features) and 5
>> million rows (observations). The exact *error message* is given below:
>>
>>
>> Executing Random Forest Classifier
>> Traceback (most recent call last):
>>   File "/home/user/rf_fold.py", line 43, in 
>> clf.fit(p_features_train,p_labels_tra

Re: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError

2016-12-09 Thread Debabrata Ghosh
Hi Piotr,
 Yes, I did use n_jobs = -1 as well, but the code
didn't run successfully. On my output screen, I got the following message
instead of the JoblibMemoryError:

16/12/08 22:12:26 INFO YarnExtensionServices: In shutdown hook for
org.apache.spark.scheduler.cluster.YarnExtensionServices$$anon$1@176b071d
16/12/08 22:12:26 INFO YarnHistoryService: Shutting down: pushing out 0
events
16/12/08 22:12:26 INFO YarnHistoryService: Event handler thread stopping
the service
16/12/08 22:12:26 INFO YarnHistoryService: Stopping dequeue service, final
queue size is 0
16/12/08 22:12:26 INFO YarnHistoryService: Stopped: Service History Service
in state History Service: STOPPED endpoint=
http://servername.com:8188/ws/v1/timeline/; bonded to
ATS=false; listening=true; batchSize=3; flush count=17; current queue
size=0; total number queued=52, processed=50; post failures=0;
16/12/08 22:12:26 INFO SparkContext: Invoking stop() from shutdown hook
16/12/08 22:12:26 INFO YarnHistoryService: History service stopped;
ignoring queued event : [1481256746854]: SparkListenerApplicationEnd(
1481256746854)

 Just to give you some background, I am executing the
scikit-learn Random Forest Classifier via a pyspark command. I am not sure what
has gone wrong while using n_jobs = -1, such that the program suddenly shuts
down certain services. Please can you suggest a remedy, as I have been given
the task to run this via pyspark itself.

  Thanks in advance !

Cheers,

Debu

On Fri, Dec 9, 2016 at 2:48 PM, Piotr Bialecki <piotr.biale...@hotmail.de>
wrote:

> Hi Debu,
>
> it seems that you run out of memory.
> Try using fewer processes.
> I don't think that n_jobs = 1000 will perform as you wish.
>
> Setting n_jobs to -1 uses the number of cores in your system.
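A sketch of the kind of configuration being suggested, on toy data (the
n_estimators value is only an illustration of something far smaller than 5000):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the 5-million-row, 134-feature dataframe in the thread.
X, y = make_classification(n_samples=10000, n_features=134, random_state=0)

# n_jobs=-1 uses one worker per CPU core; n_jobs=1000 asks for 1000 parallel
# workers, which (as noted above) is unlikely to behave as intended.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)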
>
>
> Greets,
> Piotr
>
>
> On 09.12.2016 08:16, Debabrata Ghosh wrote:
>
> Hi All,
>
>   Greetings !
>
>
>
> I am getting JoblibMemoryError while executing a scikit-learn
> RandomForestClassifier code. Here is my algorithm in short:
>
>
>
> from sklearn.ensemble import RandomForestClassifier
>
> from sklearn.cross_validation import train_test_split
>
> import pandas as pd
>
> import numpy as np
>
> clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000)
>
> clf.fit(p_input_features_train,p_input_labels_train)
>
>
> The dataframe p_input_features contains 134 columns (features) and 5
> million rows (observations). The exact *error message* is given below:
>
>
> Executing Random Forest Classifier
> Traceback (most recent call last):
>   File "/home/user/rf_fold.py", line 43, in 
> clf.fit(p_features_train,p_labels_train)
>   File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py",
> line 290, in fit
> for i, t in enumerate(trees))
>   File 
> "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py",
> line 810, in __call__
> self.retrieve()
>   File "/var/opt/lib 
> /python2.7/site-packages/sklearn/externals/joblib/parallel.py",
> line 757, in retrieve
> raise exception
> sklearn.externals.joblib.my_exceptions.JoblibMemoryError:
> JoblibMemoryError
> 
> ___
> Multiprocessing exception:
> 
> ...
>
> /var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in
> fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None,
> verbose=0,
> warm_start=False), X=array([[ 0.,  0.,
> 0., 0.,  0.]], dtype=float32),
> y=array([[ 0.],
>[ 0.],
>[ 0.],
>...,
>[ 0.],
>[ 0.],
>[ 0.]]), sample_weight=None)
> 285 trees = Parallel(n_jobs=self.n_jobs,
> verbose=self.verbose,
> 286  backend="threading")(
> 287 delayed(_parallel_build_trees)(
> 288 t, self, X, y, sample_weight, i, len(trees),
> 289 verbose=self.verbose, class_weight=self.class_
> weight)
> --> 290 for i, t in enumerate(trees))
> i = 4999
> 291
> 292 # Collect newly grown trees
> 293 self.estimators_.extend(trees)
> 294
>
> 
> ...
>
>
>
> Please can you help me to identify a possible reso

[scikit-learn] Need Urgent help please in resolving JobLibMemoryError

2016-12-08 Thread Debabrata Ghosh
Hi All,

  Greetings !



I am getting JoblibMemoryError while executing a scikit-learn
RandomForestClassifier code. Here is my algorithm in short:



from sklearn.ensemble import RandomForestClassifier

from sklearn.cross_validation import train_test_split

import pandas as pd

import numpy as np

clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000)

clf.fit(p_input_features_train,p_input_labels_train)


The dataframe p_input_features contains 134 columns (features) and 5 million
rows (observations). The exact *error message* is given below:


Executing Random Forest Classifier
Traceback (most recent call last):
  File "/home/user/rf_fold.py", line 43, in 
clf.fit(p_features_train,p_labels_train)
  File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py",
line 290, in fit
for i, t in enumerate(trees))
  File
"/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py",
line 810, in __call__
self.retrieve()
  File "/var/opt/lib
/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 757,
in retrieve
raise exception
sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError
___
Multiprocessing exception:
...

/var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in
fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None,
verbose=0,
warm_start=False), X=array([[ 0.,  0.,
0., 0.,  0.]], dtype=float32),
y=array([[ 0.],
   [ 0.],
   [ 0.],
   ...,
   [ 0.],
   [ 0.],
   [ 0.]]), sample_weight=None)
285 trees = Parallel(n_jobs=self.n_jobs,
verbose=self.verbose,
286  backend="threading")(
287 delayed(_parallel_build_trees)(
288 t, self, X, y, sample_weight, i, len(trees),
289 verbose=self.verbose,
class_weight=self.class_weight)
--> 290 for i, t in enumerate(trees))
i = 4999
291
292 # Collect newly grown trees
293 self.estimators_.extend(trees)
294

...



Please can you help me to identify a possible resolution to this.


Thanks,

Debu