We should survey what other packages use. I'll have a look at what
lightning uses later.
Mathieu
On Sat, Sep 13, 2014 at 2:23 AM, Andy wrote:
> +1 for cleaning up __init__.py (maybe no implementations at all?)
> +1 for making private methods start with underscore (which will break
> everything ^^)
Hi all,
The following solved my issue:
from sklearn.feature_extraction.text import TfidfVectorizer

def pre_tokenized(doc):
    """doc is a list of tokens (pre-tokenized values) that is
    passed through unchanged to bypass sklearn's analyzer"""
    return doc

# pass the function itself (no self here: it's a module-level function)
tfidf = TfidfVectorizer(analyzer=pre_tokenized)
tfidf.fit(content)
Seems lik
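For illustration, a minimal usage sketch; the token lists below are
hypothetical stand-ins for MeSH-style tags:

from sklearn.feature_extraction.text import TfidfVectorizer

# Each "document" is already a list of tokens, so nothing gets re-tokenized.
content = [["protein", "binding"], ["gene", "expression", "binding"]]
tfidf = TfidfVectorizer(analyzer=lambda doc: doc)  # identity analyzer, same idea as pre_tokenized above
X = tfidf.fit_transform(content)
print(tfidf.get_feature_names())  # ['binding', 'expression', 'gene', 'protein']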
Hi all,
I am trying to do tfidf/LSA on pre-tokenized data (MeSH tags, for any
biology folks out there) and want to skip tokenization since
pre-processing has already done it.
Unfortunately I am having trouble following the 'tips and tricks' in the docs:
Some tips and tricks:
If documents are pre
I’m now getting this:
'Quantizer' object has no attribute 'get_params'
Do I need to add some other classes to the declaration?
Thanks,
From: Joel Nothman [mailto:[email protected]]
Sent: Thursday, September 11, 2014 9:37 PM
To: scikit-learn-general
Subject: Re: [Scikit-learn-general] binarizer with more levels
Here is the link to the issue:
https://github.com/scikit-learn/scikit-learn/issues/3455
Arnaud
On 12 Sep 2014, at 20:01, Arnaud Joly wrote:
> If you want to work on custom oob scoring, there is an issue opened
> for it.
>
> Best regards,
> Arnaud
>
> On 12 Sep 2014, at 19:01, Josh Wasserstein wrote:
If you want to work on custom oob scoring, there is an issue opened
for it.
Best regards,
Arnaud
On 12 Sep 2014, at 19:01, Josh Wasserstein wrote:
> Thanks! Couldn't find it in the documentation. I may try adding that to a PR.
>
> Josh
>
> On Fri, Sep 12, 2014 at 10:07 AM, Arnaud Joly wrote:
Thanks for the suggestions.
With that fix, scaling+gridsearch is giving me the same results (w.r.t. my own
gridsearch). I will try to add binning as well.
Thank you again!
From: Andy [mailto:[email protected]]
Sent: Friday, September 12, 2014 1:18 PM
To: [email protected]
+1 for cleaning up __init__.py (maybe no implementations at all?)
+1 for making private methods start with underscore (which will break
everything ^^)
Also we need to add utils to the References then.
No idea how to decide what should be public and what not, though.
On 09/08/2014 04:01 PM, Mat
On 09/12/2014 06:20 PM, Pagliari, Roberto wrote:
I added
import sklearn.base.TransformerMixin
but it says no module named TransformerMixin
Because TransformerMixin is not a module but a class.
You have to do
from sklearn.base import TransformerMixin
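A minimal sketch of such a transformer (the Quantizer body here is made
up for illustration); inheriting from BaseEstimator is also what provides
get_params, which the earlier 'Quantizer' error was about:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Quantizer(BaseEstimator, TransformerMixin):
    # BaseEstimator supplies get_params/set_params, so the object works
    # inside Pipeline and GridSearchCV; TransformerMixin adds fit_transform.
    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self  # nothing to learn for a fixed threshold

    def transform(self, X):
        # hypothetical 2-level quantization around the threshold
        return (np.asarray(X) > self.threshold).astype(int)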
From: Joel Nothman [mailto:joel.noth..
As Laurent said, using StandardScaler again is not necessary.
If you don't provide code for your custom grid-search, it is hard to say
what the difference might be ;)
Are the same parameters selected and are the scores during the
grid-search the same?
On 09/12/2014 06:31 PM, Pagliari, Robert
Yes, exactly.
On 12 Sep 2014 18:31, "Luca Puggini" wrote:
> Hey thanks a lot,
> so basically in random forests the split is done like in the algorithm
> described in your thesis, except that the search is not done on all the
> variables but only on a random subset of them? (usually sqrt(p) or
> something like that)
Thanks! Couldn't find it in the documentation. I may try adding that to a
PR.
Josh
On Fri, Sep 12, 2014 at 10:07 AM, Arnaud Joly wrote:
> Hi,
>
> The r2_score metric is used.
>
> Best regards,
> Arnaud
>
> On 12 Sep 2014, at 16:04, Josh Wasserstein wrote:
>
> What error metric is used for this?
Hi Roberto,
You do not need to scale here (you can remove the first 3 lines); the
point of the pipeline is precisely that you do not have to do this:
After this I make the predictions
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_pr
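Something like the following (a sketch only; X_train etc. are assumed from
the thread, and in the 0.15-era API GridSearchCV lives in sklearn.grid_search):

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later releases
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

pipe = Pipeline([("scale", StandardScaler()), ("svm", LinearSVC())])
grid = GridSearchCV(pipe, param_grid={"svm__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)     # the scaler is re-fit on each training fold
y_pred = grid.predict(X_test)  # the best pipeline scales X_test automatically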
Hi Andy,
I don't think the accuracy is an issue. I explicitly provided a score function
and the problem persists.
With my own gridsearch I don't use a pipeline, just StratifiedKFold and
averaging over every combination of the parameters.
This is an example with scaling+svm using sklearn pipeline:
Hey thanks a lot,
so basically in random forests the split is done like in the algorithm
described in your thesis, except that the search is not done on all the
variables but only on a random subset of them? (usually sqrt(p) or
something like that)
Let me know.
Thanks,
Luca
Hi Luca,
>
> The "best"
I added
import sklearn.base.TransformerMixin
but it says no module named TransformerMixin
From: Joel Nothman [mailto:[email protected]]
Sent: Thursday, September 11, 2014 9:37 PM
To: scikit-learn-general
Subject: Re: [Scikit-learn-general] binarizer with more levels
Good point. It should
Hi Roberto.
GridSearchCV uses accuracy for selection if no other method is
specified, so there should be no difference.
Could you provide code?
Do you also create a pipeline when using your own grid search? I would
imagine there is some difference in how you do the fitting in the pipeline.
Thank you,
I’m not seeing “sklearn.base”. Which module do I need to import to be able to
use it?
Thanks,
From: Joel Nothman [mailto:[email protected]]
Sent: Thursday, September 11, 2014 9:37 PM
To: scikit-learn-general
Subject: Re: [Scikit-learn-general] binarizer with more levels
Good p
Regarding my previous question, I suspect the difference lies in the scoring
function.
What is the default scoring function used by gridsearch?
In my own implementation I am using
number of correctly classified samples (no weighting) / total number of samples
sklearn gridsearch function must b
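For what it's worth, a sketch making the scoring explicit (clf and
param_grid are placeholders, not code from this thread):

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later releases

# With scoring=None (the default), GridSearchCV falls back on clf.score,
# which for classifiers is plain unweighted accuracy: n_correct / n_total.
# Passing scoring="accuracy" just makes the same choice explicit.
grid = GridSearchCV(clf, param_grid, scoring="accuracy", cv=5)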
Hi Luca,
The "best" strategy consists in finding the best threshold, that is the one
that maximizes impurity decrease, when trying to partition a node into a
left and right nodes. By contrast, "random" does not look for the best
split and simply draw the discretization threshold at random.
For fu
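A small sketch of the two strategies as exposed in the tree API
(max_features="sqrt" mirrors the sqrt(p) subset mentioned above):

from sklearn.tree import DecisionTreeClassifier

# splitter="best" searches each candidate feature for the threshold that
# maximizes impurity decrease; splitter="random" draws the candidate
# thresholds at random. Both consider only max_features features per split.
best_tree = DecisionTreeClassifier(splitter="best", max_features="sqrt")
rand_tree = DecisionTreeClassifier(splitter="random", max_features="sqrt")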
I am comparing the results of sklearn cross-validation and my own cross
validation.
I tested linearSVC under the following conditions:
- Data scaling per grid search
- Data scaling + 2-level quantization, per grid search
Specifically, I have done the following:
Sklearn gridSe
Hi,
I am using the random forest classifier, and this algorithm trains a tree
defined as:
DecisionTreeClassifier(criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       random_state=198200
Hi,
The r2_score metric is used.
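A minimal sketch of where this shows up (make_regression is just stand-in
demo data): the forest's oob_score_ attribute is the out-of-bag R^2.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, random_state=0)
forest = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)  # out-of-bag R^2 (i.e. r2_score) on the training data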
Best regards,
Arnaud
On 12 Sep 2014, at 16:04, Josh Wasserstein wrote:
> What error metric is used for this?
>
> Josh
What error metric is used for this?
Josh
Thanks for all the suggestions. I will try them and let you know.
On 10/09/14 16:46, "Andy" wrote:
>On 09/10/2014 09:07 AM, Gael Varoquaux wrote:
>> How are you measuring your errors? If you are using the zero-one loss
>> (accuracy score), you are taking into account only the binary decisions,
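A tiny illustration of that point (labels made up): the zero-one loss and
accuracy only count the hard decisions, nothing else.

from sklearn.metrics import accuracy_score, zero_one_loss

y_true = [0, 1, 1, 0]  # hypothetical ground truth
y_pred = [0, 1, 0, 0]  # one hard decision is wrong
print(accuracy_score(y_true, y_pred))  # 0.75
print(zero_one_loss(y_true, y_pred))   # 0.25, i.e. 1 - accuracy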