t3k...@gmail.com> wrote:
> The 280k were the start of the sequence, while the 70k were from a
> shuffled bit, right?
>
>
> On 04/12/2016 08:35 PM, Joel Nothman wrote:
>
> I don't think we can deny this is strange, certainly for real-world, IID
> data!
>
> On 13 April 2016
I believe this is a software problem rather than a data
> problem. If your data was accidentally a duplicate of the dataset, you
> could certainly get 100%.
>
> On 13 April 2016 at 10:10, Juan Nunez-Iglesias <jni.s...@gmail.com> wrote:
>
>> Hallelujah! I'd given up on
while your confirmation
> used the beginning of the dataset vs the rest.
> Your data is probably not IID.
>
>
>
> On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote:
>
> Hi all,
>
> TL;DR: when I run GridSearchCV with RandomForestClassifier and "many"
>
Hi all,
TL;DR: when I run GridSearchCV with RandomForestClassifier and "many"
samples (280K), it falsely shows accuracy of 1.0 for full trees
(max_depth=None). This doesn't happen for fewer samples.
Longer version:
I'm trying to optimise RF hyperparameters using GridSearchCV for the first
time.
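For readers following along, a minimal sketch of the kind of search described (standard sklearn API; a toy dataset stands in for the real 280K-sample one, which isn't reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the real 280K-sample dataset.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

param_grid = {
    "max_depth": [5, 20, None],   # None grows full trees
    "n_estimators": [10, 25],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,  # scores are cross-validated, not training accuracy
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that a cross-validated score of exactly 1.0 usually points to leakage between folds (e.g. duplicated or non-IID, ordered samples), which is where the rest of this thread goes.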
This ship has probably sailed, but imho predict_proba is a much more common
method to use... I would call the current predict_proba just predict, and
rename predict something like predict_thresholded, predict_discrete or
predict_labels. (This was my very first experience with sklearn... I used
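To illustrate the naming point being made, here is the distinction in the current API (toy data, LogisticRegression as an arbitrary classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:3])  # continuous class probabilities, shape (3, 2)
labels = clf.predict(X[:3])       # the probabilities thresholded to hard labels
print(proba)
print(labels)
```

The argument above is that the first of these is the more natural "predict", and the second is really a thresholded/discretised convenience on top of it.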
Hmm, not sure why this didn't render:
Hi all,
Sorry for cross posting but we are trying to get as many great submissions
as possible! I'll keep things short with Raniere Silva's summary:
*Long version:*
http://ilovesymposia.com/2015/02/04/call-for-code-nominations-for-elegant-scipy/
.
*Short
On 02/02/2015 10:46 AM, Juan Nunez-Iglesias wrote:
Hi all,
TL;DR version:
I'm looking for a classifier that will get the *exact same model* for shuffled
versions of the training data. I thought GaussianNB would do the trick but
either I don't understand it, or some kind of numerical instability prevents it
from achieving the same model on
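A quick way to check the shuffle-invariance in question, keeping in mind that floating-point summation order differs between shuffles, so the fitted parameters may only agree up to rounding (hence allclose rather than exact equality):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)

rng = np.random.RandomState(42)
perm = rng.permutation(len(y))

a = GaussianNB().fit(X, y)
b = GaussianNB().fit(X[perm], y[perm])

# Exact bitwise equality can fail because the per-class means are
# accumulated in a different order; allclose shows the two models
# agree up to floating-point noise.
print(np.allclose(a.theta_, b.theta_))
```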
Nope, the Py2 RF was saved with joblib!
The SO response might work for standard pickling though, I'll give that a try,
thanks!
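For reference, the joblib round-trip being discussed looks like the following within a single Python version; cross-version loading (the Py2-to-Py3 case here) is exactly where it tends to break:

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "rf.joblib")
dump(rf, path)   # older joblib versions may also write .npy sidecar files
rf2 = load(path)
print((rf.predict(X) == rf2.predict(X)).all())
```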
On Fri, Jan 23, 2015 at 11:18 AM, Sebastian Raschka se.rasc...@gmail.com
wrote:
Sorry, I think my previous message was a little bit ambiguous.
What I would try
.
On Fri, Jan 23, 2015 at 1:38 PM, Joel Nothman joel.noth...@gmail.com
wrote:
Could you provide the traceback when using pickle? The joblib error is
about zipping, which should not be applicable there...
On 23 January 2015 at 13:30, Juan Nunez-Iglesias jni.s...@gmail.com wrote:
Nope, the Py2 RF
Juan Nunez-Iglesias jni.s...@gmail.com
wrote:
Since this question went unanswered:
By the way, is there a mailing list for scikit-image?
https://groups.google.com/forum/#!forum/scikit-image
Sorry for the delay... Long
Hey Robert,
I'm going to be at PyCon-AU, including the sprints. I don't really have a
sprint topic yet! So if you're thinking of some sklearn sprinting, I might
be up for it!
Juan.
On Tue, May 13, 2014 at 11:43 PM, Gael Varoquaux
gael.varoqu...@normalesup.org wrote:
On Wed, May 14, 2014 at
Hi All,
I have a biggish dataset (to use Gaël's terminology ;), 45K samples x 300
features, that I want to cluster. I have very heterogeneous features -- some
are continuous, others are quasi-continuous (high counts), others are
discrete (counts of rare events), others are angles (uniformly
On Mon, Feb 3, 2014 at 5:49 AM, Andy t3k...@gmail.com wrote:
We should have an FAQ.
It should include
What is the project name? scikit-learn, not scikit or SciKit nor sci-kit
learn.
How do you pronounce the project name? sy-kit learn. sci stands for
science!
Do you want to add this
Of course, some feature = some value can also be expressed as F(some
feature), so really, moving all of the feature transformation up front
should allow you to do everything you suggested. I understand the
convenience of using custom functions in some cases, but at least the
workaround here is
On Sat, Dec 21, 2013 at 10:28 AM, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
Actually, I'd propose to turn off multiprocessing at prediction time -
this might backfire quite easily.
For the more ignorant among us, can you give an example? I don't understand
why this would be
On Fri, Dec 20, 2013 at 9:15 AM, Su, Jian, Ph.D. su.j...@mayo.edu wrote:
As Ryan pointed out, joblib is the solution. One bad thing is it creates
multiple files.
If I remember correctly, I fixed the multiple files issue by passing
compress=3 as a keyword argument to joblib.dump. That does
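A sketch of that workaround, using joblib's documented compress keyword (3 is a moderate compression level; any nonzero value also forces everything into a single file):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

big = {"weights": np.zeros((1000, 100))}
path = os.path.join(tempfile.mkdtemp(), "model.joblib")

# compress=3 bundles all arrays into one compressed file instead of
# writing separate .npy files alongside the pickle.
dump(big, path, compress=3)
restored = load(path)
print(restored["weights"].shape)
```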
Hi all,
Can anyone tell me why a simple import statement is resulting in a warning
on an unrelated import?
from skimage.segmentation import slic
Contrast with:
import sklearn.ensemble
from skimage.segmentation import slic
Ah, mystery solved, thanks Joel!
On Fri, Dec 13, 2013 at 8:38 AM, Joel Nothman joel.noth...@gmail.com wrote:
It relates to a recently-fixed issue other than the one Olivier notes (see
https://github.com/scikit-learn/scikit-learn/issues/2531).
Because scikit-learn considers
I'll also point out that np.copy has an order argument, so you can get
back a Fortran-ordered array by doing
X_train = X_train.copy(order='F')  # let's materialize the view
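A self-contained version of that snippet, with a check that the copy really comes back Fortran-ordered:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)    # C-ordered by default
X_train = X[:, 1:]                 # a view, neither C- nor F-contiguous

X_train = X_train.copy(order='F')  # materialize the view, Fortran-ordered
print(np.isfortran(X_train))       # True
```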
On Tue, Nov 26, 2013 at 11:53 PM, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
2013/11/26 Olivier Grisel
@Olivier, you just blew my mind, as I did not know about git grep! =D
On Fri, Oct 4, 2013 at 12:06 AM, Olivier Grisel olivier.gri...@ensta.org wrote:
Sounds good. Please also add a minmax_scale function while you are at
it. I often miss that one too when doing interactive data exploration
in
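For reference, the function Olivier mentions did land in sklearn.preprocessing, and is exactly the one-liner convenience he describes for interactive use:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

X = np.array([[1.0, -5.0],
              [3.0,  0.0],
              [5.0,  5.0]])

# Scales each column to [0, 1] without instantiating a MinMaxScaler,
# which is handy for quick interactive data exploration.
print(minmax_scale(X))
```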
I'd vote for HierarchicalClustering, since, as Robert said, agglomerative
is not necessarily hierarchical. Is Agglomerative really any more
descriptive? That's not obvious to me.
Also, the equivalent standard function in R is hclust, so that's something.
=)
On Tue, Jul 23, 2013 at 9:33 PM,
On Wed, Jul 24, 2013 at 1:58 AM, Lars Buitinck l.j.buiti...@uva.nl wrote:
And hierarchical isn't necessarily agglomerative. The alternative is
something like HAClustering, which to me sounds like high-availability
computer clusters.
Are you saying you could do a top-down hierarchy? I
On Tue, Jun 4, 2013 at 8:16 PM, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
I believe more in my results than in my expertise - and so should you :-)
+1! There are very few examples of theory trumping data in history...
And a bajillion of the converse.
I also think Joel put
@Alex: I don't have a workaround for you but this seems like a useful
addition. I don't know how hard it would be, but you should definitely
raise it as an issue on the github issues page for the project:
https://github.com/scikit-learn/scikit-learn/issues?sort=updated&state=open
On Wed, Apr 24,
Amazingly, it's 0.11!
http://www.enthought.com/products/epdlibraries.php
However iirc sudo easy_install -U sklearn should work within EPD to get
the latest stable... Which still doesn't help for AdaBoost. =P
On Fri, Mar 8, 2013 at 7:41 AM, Andreas Mueller amuel...@ais.uni-bonn.de wrote:
Hi
More precisely, I think David wants a function that will take a set of RFs
and return a new classifier object that does all the weighted averaging
Andy suggested for you transparently. And the answer is no, sklearn doesn't
have such a function. =)
As an aside, the OOB values will no longer be
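That said, an unofficial workaround sometimes passed around is to splice the fitted trees of several forests into one estimator. This leans on fitted attributes like estimators_ that sklearn does not guarantee as a public combination API, so treat it as a fragile sketch; note it also gives an equal-weight average, not the weighted one discussed above, and (per the aside) the OOB values of the result are meaningless:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
b = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# Graft the fitted trees of both forests onto a new estimator;
# predict_proba then averages over the union of trees.
combined = RandomForestClassifier(n_estimators=20)
combined.estimators_ = a.estimators_ + b.estimators_
combined.n_estimators = len(combined.estimators_)
combined.classes_ = a.classes_
combined.n_classes_ = a.n_classes_
combined.n_outputs_ = a.n_outputs_
combined.n_features_in_ = a.n_features_in_

print(combined.predict(X[:5]))
```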
I don't think that a system running Cython to regenerate C files based on
timestamp is an option. Indeed, because timestamps are not a reliable
indicator, it would run too often, and we would end up with new C code
checked in git by mistake.
For similar reasons, I'd like running Cython
I hear about make, cmake, md5 hashes, git post-commit hooks... It seems to
me that we don't really have any problems with the current system. Its
drawback is that a developer has to run Cython after modifying a
.pyx file.
-1
+1 to make/cmake. This is the kind of manual, error-prone
Hi All,
I'm aiming to contribute a pull request to sklearn speeding up some metrics
code, but I have a couple of questions.
1. What is the convention for contributing Cython code? In particular:
- is a .pxd always necessary?
- I notice that in sklearn/tree/setup.py, no mention is made of the