Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-14 Thread Juan Nunez-Iglesias
t3k...@gmail.com> wrote: > The 280k were the starting of the sequence, while the 70k were from a > shuffled bit, right? > > > On 04/12/2016 08:35 PM, Joel Nothman wrote: > > I don't think we can deny this is strange, certainly for real-world, IID > data! > > On 13 April 2016

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Juan Nunez-Iglesias
o believe this is a software problem rather than a data > problem. If your data was accidentally a duplicate of the dataset, you > could certainly get 100%. > > On 13 April 2016 at 10:10, Juan Nunez-Iglesias <jni.s...@gmail.com> wrote: > >> Hallelujah! I'd given up on

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Juan Nunez-Iglesias
hile your confirmation > used the beginning of the dataset vs the rest. > Your data is probably not IID. > > > > On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote: > > Hi all, > > TL;DR: when I run GridSearchCV with RandomForestClassifier and "many" >

[Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-03-09 Thread Juan Nunez-Iglesias
Hi all, TL;DR: when I run GridSearchCV with RandomForestClassifier and "many" samples (280K), it falsely shows accuracy of 1.0 for full trees (max_depth=None). This doesn't happen for fewer samples. Longer version: I'm trying to optimise RF hyperparameters using GridSearchCV for the first time.
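A minimal sketch of the kind of check described in this post: run GridSearchCV over max_depth for a RandomForestClassifier, then compare the cross-validated score with the score on a split the search never saw. The synthetic data, the grid values and the split sizes are assumptions made for illustration, not the poster's setup, and the import path is the current sklearn.model_selection one rather than the module used in 2016.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: the real dataset had ~280K samples).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Hold out a split that the grid search never touches.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [3, 10, None]},  # None means fully grown trees
    cv=3,
)
search.fit(X_tr, y_tr)

cv_score = search.best_score_          # mean cross-validated accuracy of the best setting
held_out = search.score(X_te, y_te)    # accuracy of the refit model on unseen data

print(search.best_params_, cv_score, held_out)
```

If the cross-validated score is far above the held-out score, the folds are probably not independent of each other, which is one way non-IID or duplicated data can show up.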

Re: [Scikit-learn-general] CV scores vs scores on a manual split

2015-02-19 Thread Juan Nunez-Iglesias
This ship has probably sailed, but imho predict_proba is a much more common method to use... I would call the current predict_proba just predict, and rename predict something like predict_thresholded, predict_discrete or predict_labels. (This was my very first experience with sklearn... I used

[Scikit-learn-general] Call for code nominations for Elegant SciPy!

2015-02-05 Thread Juan Nunez-Iglesias

Re: [Scikit-learn-general] Call for code nominations for Elegant SciPy!

2015-02-05 Thread Juan Nunez-Iglesias
Hmm, not sure why this didn't render: Hi all, Sorry for cross posting but we are trying to get as many great submissions as possible! I'll keep things short with Raniere Silva's summary: *Long version:* http://ilovesymposia.com/2015/02/04/call-for-code-nominations-for-elegant-scipy/ . *Short

Re: [Scikit-learn-general] Classifier that is perfectly stable given shuffled training data

2015-02-03 Thread Juan Nunez-Iglesias
On 02/02/2015 10:46 AM, Juan Nunez-Iglesias wrote: Hi all, *TL;DR version:* I'm looking for a classifier that will get the *exact same model* for shuffled versions of the training data. I thought GaussianNB would do the trick but either I don't understand it, or some kind of numerical

[Scikit-learn-general] Classifier that is perfectly stable given shuffled training data

2015-02-02 Thread Juan Nunez-Iglesias
Hi all, TL;DR version: I'm looking for a classifier that will get the *exact same model* for shuffled versions of the training data. I thought GaussianNB would do the trick but either I don't understand it, or some kind of numerical instability prevents it from achieving the same model on

Re: [Scikit-learn-general] Sharing objects between Python 2 and 3

2015-01-22 Thread Juan Nunez-Iglesias
Nope, the Py2 RF was saved with joblib! The SO response might work for standard pickling though, I'll give that a try, thanks! On Fri, Jan 23, 2015 at 11:18 AM, Sebastian Raschka se.rasc...@gmail.com wrote: Sorry, I think my previous message was a little bit ambiguous. What I would try

Re: [Scikit-learn-general] Sharing objects between Python 2 and 3

2015-01-22 Thread Juan Nunez-Iglesias
. On Fri, Jan 23, 2015 at 1:38 PM, Joel Nothman joel.noth...@gmail.com wrote: Could you provide the traceback when using pickle? The joblib error is about zipping, which should not be applicable there... On 23 January 2015 at 13:30, Juan Nunez-Iglesias jni.s...@gmail.com wrote: Nope, the Py2 RF

Re: [Scikit-learn-general] scikit-image paper

2014-08-17 Thread Juan Nunez-Iglesias
Juan Nunez-Iglesias jni.s...@gmail.com wrote: Since this question went unanswered: By the way, is there a mailing list for scikit-image? https://groups.google.com/forum/#!forum/scikit-image Sorry for the delay... Long

Re: [Scikit-learn-general] My talk was approved for PyCon AU 2014!

2014-07-18 Thread Juan Nunez-Iglesias
Hey Robert, I'm going to be at PyCon-AU, including the sprints. I don't really have a sprint topic yet! So if you're thinking of some sklearn sprinting, I might be up for it! Juan. On Tue, May 13, 2014 at 11:43 PM, Gael Varoquaux gael.varoqu...@normalesup.org wrote: On Wed, May 14, 2014 at

[Scikit-learn-general] Removing confounding factors before clustering

2014-02-18 Thread Juan Nunez-Iglesias
Hi All, I have a biggish dataset (to use Gaël's terminology ;), 45K samples x 300 features, that I want to cluster. I have very heterogeneous features -- some are continuous, others are quasi-continuous (high counts), others are discrete (counts of rare events), others are angles (uniformly

Re: [Scikit-learn-general] Contributing to Scikit

2014-02-02 Thread Juan Nunez-Iglesias
On Mon, Feb 3, 2014 at 5:49 AM, Andy t3k...@gmail.com wrote: We should have an FAQ. It should include What is the project name? scikit-learn, not scikit or SciKit nor sci-kit learn. How do you pronounce the project name? sy-kit learn. sci stands for science! Do you want to add this

Re: [Scikit-learn-general] Custom splitting criterion for decision tree classifier

2014-01-12 Thread Juan Nunez-Iglesias
Of course, some feature = some value can also be expressed as F(some feature), so really, moving all of the feature transformation up front should allow you to do everything you suggested. I understand the convenience of using custom functions in some cases, but at least the workaround here is

Re: [Scikit-learn-general] Releasing joblib 0.8a

2013-12-21 Thread Juan Nunez-Iglesias
On Sat, Dec 21, 2013 at 10:28 AM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Actually, I'd propose to turn off multiprocessing at prediction time - this might backfire quite easily. For the more ignorant among us, can you give an example? I don't understand why this would be

Re: [Scikit-learn-general] Save trained classifier

2013-12-19 Thread Juan Nunez-Iglesias
On Fri, Dec 20, 2013 at 9:15 AM, Su, Jian, Ph.D. su.j...@mayo.edu wrote: As Ryan pointed out, joblib is the solution. One bad thing is it creates multiple files. If I remember correctly, I fixed the multiple files issue by passing compress=3 as a keyword argument to joblib.dump. That does
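A small sketch of the compress=3 suggestion above. The dictionary of arrays is a hypothetical stand-in for a trained classifier, and the temporary directory is only there to keep the example self-contained. As far as I know, recent joblib releases write a single file with or without compression, so the multiple-files behaviour may only apply to the versions in use at the time.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a trained model: any picklable object holding NumPy arrays.
model = {"coef": np.random.rand(1000, 10), "classes": np.array([0, 1])}

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "model.pkl")

# compress=3 asks for moderate compression; dump returns the list of files written.
filenames = joblib.dump(model, path, compress=3)

# Loading needs only the path that was passed to dump.
restored = joblib.load(path)

print(filenames, os.listdir(tmpdir))
```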

[Scikit-learn-general] import sklearn.ensemble alters behaviour of scikit-image

2013-12-12 Thread Juan Nunez-Iglesias
Hi all, Can anyone tell me why a simple import statement is resulting in a warning on an unrelated import? from skimage.segmentation import slic Contrast with: import sklearn.ensemble from skimage.segmentation import

Re: [Scikit-learn-general] import sklearn.ensemble alters behaviour of scikit-image

2013-12-12 Thread Juan Nunez-Iglesias
Ah, mystery solved, thanks Joel! On Fri, Dec 13, 2013 at 8:38 AM, Joel Nothman joel.noth...@gmail.com wrote: It relates to a recently-fixed issue other than the one Olivier notes (see https://github.com/scikit-learn/scikit-learn/issues/2531). Because scikit-learn considers

Re: [Scikit-learn-general] Array memory layout and slicing

2013-11-26 Thread Juan Nunez-Iglesias
I'll also point out that np.copy has an order argument, so you can get back a Fortran-ordered array by doing X_train = X_train.copy(order='F') # lets materialize the view On Tue, Nov 26, 2013 at 11:53 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: 2013/11/26 Olivier Grisel

Re: [Scikit-learn-general] sklearn.preprocessing: robust scaling and general refactoring of scaling functionality

2013-10-06 Thread Juan Nunez-Iglesias
@Olivier, you just blew my mind, as I did not know about git grep! =D On Fri, Oct 4, 2013 at 12:06 AM, Olivier Grisel olivier.gri...@ensta.org wrote: Sounds good. Please also add a minmax_scale function while you are at it. I often miss that one too when doing interactive data exploration in

Re: [Scikit-learn-general] Name of a hierarchical agglomerative clustering object

2013-07-23 Thread Juan Nunez-Iglesias
I'd vote for HierarchicalClustering, since, as Robert said, agglomerative is not necessarily hierarchical. Is Agglomerative really any more descriptive? That's not obvious to me. Also, the equivalent standard function in R is hclust, so that's something. =) On Tue, Jul 23, 2013 at 9:33 PM,

Re: [Scikit-learn-general] Name of a hierarchical agglomerative clustering object

2013-07-23 Thread Juan Nunez-Iglesias
On Wed, Jul 24, 2013 at 1:58 AM, Lars Buitinck l.j.buiti...@uva.nl wrote: And hierarchical isn't necessarily agglomerative. The alternative is something like HAClustering, which to me sounds like high-availability computer clusters. Are you saying you could do a top-down hierarchy? I

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-06 Thread Juan Nunez-Iglesias
On Tue, Jun 4, 2013 at 8:16 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: I believe more in my results than in my expertise - and so should you :-) +1! There are very, very few examples of theory trumping data in history... And a bajillion of the converse. I also think Joel put

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-24 Thread Juan Nunez-Iglesias
@Alex: I don't have a workaround for you but this seems like a useful addition. I don't know how hard it would be, but you should definitely raise it as an issue on the github issues page for the project: https://github.com/scikit-learn/scikit-learn/issues?sort=updated&state=open On Wed, Apr 24,

Re: [Scikit-learn-general] Get every package once and for all

2013-03-07 Thread Juan Nunez-Iglesias
Amazingly, it's 0.11! http://www.enthought.com/products/epdlibraries.php However iirc sudo easy_install -U sklearn should work within EPD to get the latest stable... Which still doesn't help for AdaBoost. =P On Fri, Mar 8, 2013 at 7:41 AM, Andreas Mueller amuel...@ais.uni-bonn.de wrote: Hi

Re: [Scikit-learn-general] Combining Random Forests

2013-01-09 Thread Juan Nunez-Iglesias
More precisely, I think David wants a function that will take a set of RFs and return a new classifier object that does all the weighted averaging Andy suggested for you transparently. And the answer is no, sklearn doesn't have such a function. =) As an aside, the OOB values will no longer be
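Since sklearn does not ship such a function, here is a rough sketch of what a wrapper could look like. The class name AveragedForests, the choice to weight each forest by its number of trees, and the synthetic data are all assumptions for illustration; the sketch also assumes every forest was trained on the same set of classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


class AveragedForests:
    """Hypothetical wrapper: weighted average of predict_proba over fitted forests."""

    def __init__(self, forests, weights=None):
        self.forests = list(forests)
        if weights is None:
            # Assumption: weight each forest by how many trees it contains.
            weights = [len(f.estimators_) for f in self.forests]
        self.weights = np.asarray(weights, dtype=float)
        # Assumes all forests were fitted on the same classes.
        self.classes_ = self.forests[0].classes_

    def predict_proba(self, X):
        # Stack to shape (n_forests, n_samples, n_classes), then average over forests.
        probas = np.stack([f.predict_proba(X) for f in self.forests])
        return np.average(probas, axis=0, weights=self.weights)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rf_a = RandomForestClassifier(n_estimators=10, random_state=1).fit(X[:150], y[:150])
rf_b = RandomForestClassifier(n_estimators=30, random_state=2).fit(X[150:], y[150:])

combined = AveragedForests([rf_a, rf_b])
proba = combined.predict_proba(X)
labels = combined.predict(X)
```

A forest's predict_proba is itself the mean over its trees, so weighting by tree count should behave like pooling all the trees into one ensemble. The wrapper keeps no out-of-bag information.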

Re: [Scikit-learn-general] rebuilding cython extensions from .pyx file

2012-10-15 Thread Juan Nunez-Iglesias
I don't think that a system running Cython to regenerate C files based on timestamp is an option. Indeed, because timestamps are not a reliable indicator, it would run too often, and we would end up with new C code checked in git by mistake. For similar reasons, I'd like running Cython

Re: [Scikit-learn-general] rebuilding cython extensions from .pyx file

2012-10-14 Thread Juan Nunez-Iglesias
I hear about make, cmake, md5hashes, git post-commit hooks... It seems to me that we don't really have any problems with the current system. Its drawback is that a developer does have to run cython after modifying a .pyx file. -1 +1 to make/cmake. This is the kind of manual, error-prone

[Scikit-learn-general] Contributing Cython to sklearn

2012-08-21 Thread Juan Nunez-Iglesias
Hi All, I'm aiming to contribute a pull request to sklearn speeding up some metrics code, but I have a couple of questions. 1. What is the convention for contributing Cython code? In particular: - is a .pxd always necessary? - I notice that in sklearn/tree/setup.py, no mention is made of the