Re: [Scikit-learn-general] Distributed RandomForests

2013-04-24 Thread Brian Holt
Hi Youssef, You're trying to do exactly what I did. First thing to note is that the Microsoft guys don't precompute the features, rather they compute them on the fly. That means that they only need enough memory to store the depth images, and since they have a 1000 core cluster, computing the feat

Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

2013-04-23 Thread Brian Holt
At the moment your three options are 1) get more memory 2) do feature selection - 400k features on 200k samples seems to me to contain a lot of redundant information or irrelevant features 3) submit a PR to support sparse matrices - this is going to be a lot of work and I doubt it's worth it. All t

Re: [Scikit-learn-general] Our own Olivier Grisel giving a scipy keynote

2013-04-16 Thread Brian Holt
Congratulations Olivier! On Apr 17, 2013 7:13 AM, "Gilles Louppe" wrote: > Congratulations are in order :-) > > > On 17 April 2013 08:06, Peter Prettenhofer > wrote: > >> That's great - congratulations Olivier! >> >> Definitely, no pressure ;-) >> >> >> 2013/4/17 Ronnie Ghose >> >>> wow :O cong

Re: [Scikit-learn-general] Finding dimentions of faces on an image

2013-03-19 Thread Brian Holt
As Gilles says, the scanning windows approach is pretty common for object (and face) detection. Have you looked at the Viola Jones paper? It's the standard for face detection and now that we have adaboost classifiers you should be able to knock up an example quite quickly. Scikit Image might be qui
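
For readers unfamiliar with the scanning-window idea mentioned above, here is a minimal sketch in plain Python. The function name, the box layout and the fixed stride are assumptions made for illustration; this is not code from scikit-learn or Scikit Image, and a real detector would also scan over several scales.

```python
def sliding_windows(image_height, image_width, window_size, stride):
    """Yield (top, left, bottom, right) boxes that tile the image.

    Hypothetical helper: each box is a candidate region that a
    classifier (e.g. a face/non-face model) would then score.
    """
    win_h, win_w = window_size
    for top in range(0, image_height - win_h + 1, stride):
        for left in range(0, image_width - win_w + 1, stride):
            yield top, left, top + win_h, left + win_w


# Enumerate 4x4 windows over a 6x8 image, moving 2 pixels at a time.
boxes = list(sliding_windows(6, 8, window_size=(4, 4), stride=2))
for box in boxes:
    print(box)
```

Each box would be cropped out and passed to the classifier; the boxes scoring above some threshold give the position and size of the detected object.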

Re: [Scikit-learn-general] sdss_photoz NaN problem in Exercise 7.2

2013-03-15 Thread Brian Holt
Unfortunately I recently moved to Ubuntu so I'm not going to be of much help right now... On Mar 15, 2013 11:48 AM, "george manus" wrote: > Brian Holt writes: > > > > > > > Up until very recently I was working on windows 7 64bit without any > troub

Re: [Scikit-learn-general] sdss_photoz NaN problem in Exercise 7.2

2013-03-14 Thread Brian Holt
Up until very recently I was working on windows 7 64bit without any trouble. Are you using the Enthought Python Distribution or pythonxy or are you building scikit learn for yourself? On Mar 14, 2013 9:46 PM, "george manus" wrote: > > > Leon Palafox writes: > > > > > > > What is the issue you'v

Re: [Scikit-learn-general] LOF implementation

2013-01-30 Thread Brian Holt
Is it any one of these? acronyms.thefreedictionary.com/LOF On Jan 30, 2013 2:21 PM, "Andreas Mueller" wrote: > On 01/30/2013 03:15 PM, Oğuz Yarımtepe wrote: > > I haven't seen any LOF implementation at the library. Any further > > plans about it or a way to implement it? > > > > > What is LOF? T

Re: [Scikit-learn-general] Panda / Tree and Random Forest

2012-10-24 Thread Brian Holt
I'm with GIGO. The name of the model (classifier or regressor) should be enough clue to the user which they should use for their problem. On Oct 24, 2012 5:59 PM, "Andreas Mueller" wrote: > Am 24.10.2012 18:53, schrieb Mathieu Blondel: > > > > On Thu, Oct 25, 2012 at 1:39 AM, Gael Varoquaux < >

Re: [Scikit-learn-general] Panda / Tree and Random Forest

2012-10-24 Thread Brian Holt
If you want rules you can create an exporter similar to the graphviz one. But just to be clear this tree implementation is CART not C4.5, so you shouldn't be expecting that the tree stores rules in your format. Brian On Oct 24, 2012 5:19 PM, "Didier Vila" wrote: > >>>Ok - then that's the problem

Re: [Scikit-learn-general] rebuilding cython extensions from .pyx file

2012-10-14 Thread Brian Holt
> Just to make it clear: adding a dependency on make or cmake is just not an option. These tools are not part of the standard Python build chain. Are you sure? We already use make in scikit-learn... On 15 October 2012 07:45, Andreas Mueller wrote: > Am 15.10.2012 08:36, schrieb Mathieu Blonde

Re: [Scikit-learn-general] rebuilding cython extensions from .pyx file

2012-10-12 Thread Brian Holt
If we wanted to support MSVC then I'd strongly suggest using CMake, in fact I'd recommend CMake anyway and just generate makefiles.

Re: [Scikit-learn-general] rebuilding cython extensions from .pyx file

2012-10-12 Thread Brian Holt
Make is bundled with cygwin so I see no reason why it wouldn't work under windows.

Re: [Scikit-learn-general] Progress and difficulties for 0.12.1 Bugfix release

2012-10-08 Thread Brian Holt
Or (1000[L], 200[L])? The ellipses are a bit general in that they can match anything.
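
The "(1000[L], 200[L])" idea, an optional 'L' rather than a wildcard, can be sketched as a regular expression. Note the assumption: doctest itself only offers the ELLIPSIS wildcard, so this is a stand-alone illustration of the matching rule, and `shape_matches` is a hypothetical helper, not part of doctest or scikit-learn.

```python
import re

# Accept the shape repr with or without the Python 2 long suffix 'L',
# but nothing else (unlike '...', which would match anything).
SHAPE_PATTERN = re.compile(r"\(1000L?, 200L?\)")


def shape_matches(text):
    """Return True if text is '(1000, 200)' with optional 'L' suffixes."""
    return SHAPE_PATTERN.fullmatch(text) is not None


print(shape_matches("(1000L, 200L)"))
print(shape_matches("(1000, 200)"))
print(shape_matches("(1000, 201)"))
```

The point of the stricter pattern is that a wrong shape still fails the comparison, which an ellipsis would silently let through.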

Re: [Scikit-learn-general] 0.12.1 Bugfix release

2012-10-07 Thread Brian Holt
The latest build still has the L suffix doctest failures and still has the fastMCD bad n_trials exception. However the spectral tests are looking a bit better with 1 new failure: == ERROR: Tests the FastMCD algorithm implementatio

Re: [Scikit-learn-general] Progress and difficulties for 0.12.1 Bugfix release

2012-10-07 Thread Brian Holt
Gael, Your idea of using `print`, which calls str(), does actually work on longs, as does calling int(): In [9]: int(2000L) Out[9]: 2000 In [10]: str(2000L) Out[10]: '2000' However, it doesn't have the desired effect on a tuple of longs In [11]: str( (1000L,200L) ) Out[11]: '(1000L, 200L)' So

Re: [Scikit-learn-general] Progress and difficulties for 0.12.1 Bugfix release

2012-10-07 Thread Brian Holt
Hi Gael, I'm not sure it's what you want to hear: In [3]: import sklearn.datasets In [4]: digits = sklearn.datasets.load_digits() In [5]: digits.data.shape Out[5]: (1797L, 64L) In [6]: print digits.data.shape (1797L, 64L) On 7 October 2012 15:57, Gael Varoquaux wrote: > Thanks a lot Brian, >

Re: [Scikit-learn-general] 0.12.1 Bugfix release

2012-10-07 Thread Brian Holt
Sorry guys, I've had loads of stuff on and I might have a chance to look at it still tonight but don't bank on it... On Oct 7, 2012 8:12 PM, "Andreas Müller" wrote: > > > > > Can you reproduce the docstring issues? I cannot. I think that they > > can > > be solved simply by adding a 'print' in th

Re: [Scikit-learn-general] Progress and difficulties for 0.12.1 Bugfix release

2012-10-07 Thread Brian Holt
Doctest failures: == FAIL: Doctest: sklearn.datasets.base.load_boston -- Traceback (most recent call last): File "C:\Python27\lib\doctest.py", line 2201, in run

Re: [Scikit-learn-general] 0.12.1 Bugfix release

2012-10-06 Thread Brian Holt
It seems that 0.12.X fixes these 2 errors that are present in master without introducing others: == ERROR: test_locally_linear.test_lle_manifold -- Traceback (mos

Re: [Scikit-learn-general] 0.12.1 Bugfix release

2012-10-06 Thread Brian Holt
Hi Gael, Here are the results of Win7 64bit build EPD64bit 7.1.3, cygwin, numpy 1.6.1 Ran 1294 tests in 110.393s FAILED (SKIP=11, errors=3, failures=9) The 9 failures are all Doctest failures where integers suffixed by 'L' on 64bit machines fail string comparisons to the number without an 'L

Re: [Scikit-learn-general] 0.12.1 Bugfix release

2012-09-30 Thread Brian Holt
I can help with the windows build... Brian On Sep 30, 2012 4:18 PM, "Gael Varoquaux" wrote: > Hey list, > > Next week end Andy and I are going to release an 0.12.1 bugfix release. > This will be a bug fix release: no additional feature compared to the > 0.12. > > If you want to help us, you can

Re: [Scikit-learn-general] optimizing ensemble method based classifier

2012-09-12 Thread Brian Holt
imators=10 > clf = RandomForestClassifier(n_estimators=10, oob_score=True) > clf.fit(X,y) > print clf.oob_score_ > > clf.oob_score_ will give oob accuracy. > > But I would also like to know what percent of data is used to calculate > this score? > > > > > On Wed,
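
On the quoted question of how much data feeds the OOB score, a small simulation may help. It assumes each tree is trained on a bootstrap sample of the same size as the training set, drawn with replacement; that is the textbook random forest setup, and the thread does not confirm it for this scikit-learn version. The function name is made up for the sketch.

```python
import random


def out_of_bag_fraction(n_samples, seed=0):
    """Fraction of samples never drawn in one bootstrap sample.

    Assumption: the bootstrap draws n_samples indices with replacement.
    The samples that are never drawn are the tree's out-of-bag set.
    """
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples


fraction = out_of_bag_fraction(10000)
print(fraction)
```

Under that assumption the fraction is close to (1 - 1/n)^n, i.e. roughly a third of the samples are out-of-bag for any single tree, and each sample is scored only by the trees that did not see it.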

Re: [Scikit-learn-general] optimizing ensemble method based classifier

2012-09-12 Thread Brian Holt
You're absolutely right, you can simply use the oob estimate as your measure of generalisability. No need for GridSearchCV... On Sep 12, 2012 12:09 PM, "Sheila the angel" wrote: > Hello all, > I want to optimize n_estimators and max_features for ensemble methods (say > for RandomForestClassifier )

Re: [Scikit-learn-general] ANN: scikit-learn 0.12

2012-09-08 Thread Brian Holt
Hi Aliabbas, By coincidence I've just spent the last 2 hours debugging my windows build and I've just finally got it sorted, so I can empathise with you! May I suggest that you download the Enthought 64bit distribution? It comes with sklearn 0.11 already and works out of the box. You'll need to s

Re: [Scikit-learn-general] How to upgrade to development

2012-08-31 Thread Brian Holt
Hi Marcos, The easiest option is always to uninstall version 0.11. Failing that, try putting the new location at the beginning of your PYTHONPATH. Cheers Brian On Sep 1, 2012 3:36 AM, "Marcos Wolff" wrote: > for compiling yes: > > git clone git://github.com/scikit-learn/scikit-learn.git > cd sc
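
As a rough in-process equivalent of putting the new location first on PYTHONPATH, the sketch below moves a directory to the front of `sys.path`. The helper name and the example path are hypothetical; the path stands in for wherever the development checkout lives.

```python
import sys


def prefer_location(path):
    """Move (or add) path to the front of sys.path so it is searched first."""
    if path in sys.path:
        sys.path.remove(path)
    sys.path.insert(0, path)


# Hypothetical checkout location, used only for illustration.
prefer_location("/home/user/src/scikit-learn")
print(sys.path[0])
```

After importing, printing the package's `__file__` attribute shows which copy was actually picked up, which is a quick way to confirm that the old 0.11 install is no longer shadowing the new one.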

Re: [Scikit-learn-general] congrats to emanuele !

2012-08-30 Thread Brian Holt
Woohoo! I might be a bit biased though :) Well done Emanuele and well done Scikit-Learn for being such an awesome project! On 30 August 2012 16:10, Alexandre Gramfort wrote: >> Congrats indeed! Which of the 2 competitions did you / he win? > > the first and guess with what? ... Random forest ...

Re: [Scikit-learn-general] ndarray is not fortran contiguous

2012-08-02 Thread Brian Holt
Thanks Jim, I'm on numpy 1.3.0, which might be the problem. It's not a show stopper for me, I think I've found a way not to end up with this case. Regards Brian On 2 August 2012 15:54, Jim Vickroy wrote: > On 8/2/2012 8:27 AM, Brian Holt wrote: >> Thanks Jim, >> >

Re: [Scikit-learn-general] ndarray is not fortran contiguous

2012-08-02 Thread Brian Holt
Thanks Jim, Could you try it again with X = np.array([[0]]) Note the double "[" bracket - this is what causes the problem for me. Cheers Brian On 2 August 2012 15:23, Jim Vickroy wrote: > On 8/2/2012 6:05 AM, Brian Holt wrote: >> Hi list, >> >> I'm refa

[Scikit-learn-general] ndarray is not fortran contiguous

2012-08-02 Thread Brian Holt
Hi list, I'm refactoring the tree module to introduce lazy argsorting and my unit tests are failing with: Exception ValueError: ValueError(u'ndarray is not Fortran contiguous',) in 'sklearn.tree._tree.Tree.recursive_partition' ignored I think I've pinned down the problem to this minimal samp
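
For anyone hitting the same ValueError, this is a generic way to inspect and change memory layout in NumPy. It is only a sketch of the layout check, using a 2x3 array chosen for illustration; it is not the minimal sample from the post (which is cut off above) and not necessarily the fix that was applied to the tree module.

```python
import numpy as np

# A freshly created 2-D array is C-ordered (row-major) by default.
X = np.arange(6, dtype=np.float32).reshape(2, 3)
print(X.flags['C_CONTIGUOUS'], X.flags['F_CONTIGUOUS'])

# np.asfortranarray returns a column-major copy with the same values.
X_f = np.asfortranarray(X)
print(X_f.flags['F_CONTIGUOUS'])
```

Code that requires Fortran order can call `np.asfortranarray` on its input up front; if the array is already Fortran contiguous, no copy is made.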

Re: [Scikit-learn-general] Unable to call fit() on random forest classifier when it is encapsulated in separate class

2012-07-19 Thread Brian Holt
0.4441, 0.011 , 0.046 , 0.4921, 0.078 ], > dtype=float32) > > Also, Y has some values = -1.0. > > regards > shankar. > > > > > > > On Thu, Jul 19, 2012 at 4:58 PM, Brian Holt wrote: >> >> Hi Shankar, >> >

Re: [Scikit-learn-general] Unable to call fit() on random forest classifier when it is encapsulated in separate class

2012-07-19 Thread Brian Holt
Hi Shankar, Can you paste a small snippet of your data (X_train, Y_train) that reproduces this behaviour? Cheers Brian

Re: [Scikit-learn-general] Question about Scikit-learn Decision Tree using Mixed Type inputs and String Inputs

2012-06-07 Thread Brian Holt
Hi Randy, You're right that the current implementation doesn't support non-numeric types (for efficiency and compatibility with other sklearn classifiers), but you're also right that trees can theoretically support any type as input so long as the < operator is defined for it. I'm not sure whet
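
To illustrate the point that a tree split only needs the < operator, here is a toy threshold split in plain Python. It is a sketch, not the scikit-learn implementation (which, as said above, works on numeric arrays); the function name is invented for the example.

```python
def split_by_threshold(values, threshold):
    """Partition values into (left, right) using only the < operator."""
    left = [v for v in values if v < threshold]
    right = [v for v in values if not v < threshold]
    return left, right


# Works for numbers...
print(split_by_threshold([1, 5, 3], 4))
# ...and equally for strings, which compare lexicographically.
print(split_by_threshold(["apple", "pear", "fig"], "grape"))
```

Any type with a consistent ordering could be routed through a node this way; the cost of supporting it in the real implementation is giving up the fast typed-array code paths.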

Re: [Scikit-learn-general] Decision tree pruning

2012-03-13 Thread Brian Holt
Decision trees tend to overfit, so they are most often used (unpruned) in a forest. That said, I think it would be a useful contribution to our offering. Brian -Original Message- From: Charanpal Dhanjal Date: Tue, 13 Mar 2012 11:20:45 To: Reply-To: scikit-learn-general@lists.sourcefo

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-10 Thread Brian Holt
http://research.microsoft.com/pubs/12/decisionForests_MSR_TR_2011_114.pdf

Re: [Scikit-learn-general] Question and comments on RandomForests

2012-01-10 Thread Brian Holt
Hi Andy, The best way to understand the min_density parameter is to think of it as 'the minimum subset population density'. The idea is that if the density of the current subset falls below this parameter, then the program should copy the points and proceed to split using the copied subset. As an example, assume that the
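
A toy version of that rule, for illustration only. It assumes the current subset is tracked as a boolean mask over the full sample list, which the text above does not spell out, and `maybe_compact` is a made-up name rather than anything in the tree module.

```python
def maybe_compact(samples, mask, min_density):
    """Copy the active subset if the mask has become too sparse.

    density = (number of active samples) / (total samples tracked).
    Below min_density, return a compact copy with an all-True mask;
    otherwise keep working on the original data through the mask.
    """
    density = sum(mask) / len(mask)
    if density < min_density:
        active = [s for s, keep in zip(samples, mask) if keep]
        return active, [True] * len(active)
    return samples, mask


samples = list(range(10))
mask = [True, True] + [False] * 8  # only 2 of 10 samples still active

print(maybe_compact(samples, mask, min_density=0.5))
print(maybe_compact(samples, mask, min_density=0.1))
```

The trade-off is copying cost against scanning cost: a high min_density copies often but keeps every scan short, a low one avoids copies but keeps iterating over samples that are masked out.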

Re: [Scikit-learn-general] Tutorial on decision trees

2011-11-29 Thread Brian Holt
As a follow up, I found a description of the parallel tree training algorithm [2] that MSR used. Regards, Brian [2] http://budiu.info/work/budiu-biglearn11.pdf

[Scikit-learn-general] Tutorial on decision trees

2011-11-16 Thread Brian Holt
For those who might be interested, there was a very interesting tutorial on decision trees[1] presented by Antonio Criminisi and Jamie Shotton (the guys at MSR behind the human pose estimation algorithm for the Kinect) at ICCV last week. Their approach differs from the implementation that exists i

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Brian Holt
>I have myself made a lot of changes in tree.py and _tree.pyx in a lot of places in the code. Wouldn't it be easier for you to merge your code into my files? As I see in [1, 2] your changes are localized, and hence it would be quicker for you to merge them into my files than for me merging all my c

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-11-04 Thread Brian Holt
@pprett: Thanks for doing the hard work to change the tree into a numpy representation. I have been thinking a lot about it, and I was just about to implement it, but you've got there first. I have a few suggestions after looking at your code that I'd like to try out, so I might make a clone.

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Brian Holt
> Right, but it seems to me that this is exactly what we want to test the> > hyothesis. Maybe I am being dense, as I m a bit rushing through my mail,> but > it seems to me that if you keep a reference to a, then you compensate> for > the difference that was pointed out in the discussion below, i

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Brian Holt
>Still, almost 4 minutes just to extend the python heap and reallocate >a bunch of already allocated objects seems unlikely. Also I don't >understand why the Python interpreter would need to "move" allocated >object: it can just grow the heap, reallocate a larger buffer list (if >needed, with just

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Brian Holt
> Interesting. This hypothesis should be testable, for instance by keeping> a > reference on 'a', appending it to a list. I'd be interested in the> results, > if you mind trying out Brian. I'm not sure I understand. I thought that by appending to a list I am keeping a reference to the object.

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-28 Thread Brian Holt
cPickle with HIGHEST_PROTOCOL is significantly faster, it averages 15 seconds to load the 10 tree forest compared to the 5 minutes without. What still confuses me is why loading the forests and storing them in a list should be any slower than loading them individually. In other words, why should
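
For reference, selecting the protocol looks roughly like this. The sketch uses the Python 3 `pickle` module (which picks up the C implementation automatically; `cPickle` was its Python 2 name), and a small dict stands in for a trained forest, so the payload and function names are placeholders.

```python
import pickle


def dump_fast(obj):
    """Serialise obj with the most efficient binary protocol available."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


# Placeholder payload standing in for a trained classifier.
payload = {"thresholds": [0.25, 0.5, 0.75], "depth": 20}

blob = dump_fast(payload)
restored = pickle.loads(blob)  # the protocol is detected on load
print(restored == payload)
```

The protocol only has to be chosen when writing; `pickle.load` and `pickle.loads` read whichever protocol the data was written with.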

Re: [Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-27 Thread Brian Holt
Firstly, thanks for all the helpful comments. I didn't know that the protocol made such a big difference, so until now in ignorance I've been using the default. That said, I left a test running last night on one of our centre's servers and it took 8hrs to load 20 forests ( each with 10 trees, dep

[Scikit-learn-general] Storing and loading decision tree classifiers

2011-10-26 Thread Brian Holt
Once a Decision Tree ( or a forest ) has been trained, I almost always want to save the resulting classifier to disk and then load the classifier at a later stage for testing. My dataset is 5.2GB on disk: (690K * 2K) float32s. I can load this into memory using `np.load('dataset.npy')` in 20 secon

Re: [Scikit-learn-general] bibtex entry for the 0.9 release

2011-10-22 Thread Brian Holt
I'd like to cite this paper, but I can't find it anywhere in www.jmlr.org? Does anyone have a link?

Re: [Scikit-learn-general] np.matrix: accept or reject?

2011-10-21 Thread Brian Holt
What is the difference between `asarray` and `asanyarray`? The documentation for `asanyarray` says: Convert the input to an ndarray, but pass ndarray subclasses through. The documentation for `asarray` says: Convert the input to an array. What I don't get is why `asanyarray` won't convert a `matr
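
The difference is easiest to see by running both functions on a `matrix`. A short sketch (the variable names are arbitrary):

```python
import numpy as np

m = np.matrix([[1, 2], [3, 4]])

as_base = np.asarray(m)    # always a plain ndarray
as_any = np.asanyarray(m)  # ndarray subclasses are passed through

print(type(as_base).__name__)
print(type(as_any).__name__)

# The subclass keeps its own semantics: '*' is a matrix product for
# np.matrix but an element-wise product for a plain ndarray.
print(as_any * as_any)
print(as_base * as_base)
```

So `asanyarray` leaves a `matrix` alone because it already is an ndarray subclass, while `asarray` strips the subclass; which one an estimator should call depends on whether it wants `*` to keep its matrix meaning.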

Re: [Scikit-learn-general] np.matrix: accept or reject?

2011-10-20 Thread Brian Holt
> I vote for CONVERTING and in addition we should implement a common test suite that checks for input types/shape of our estimators (AFAIR this was proposed by Mathieu a while ago). +1 On 20 October 2011 14:15, Peter Prettenhofer wrote: > Thanks for raising this issue Lars. > > I vote for CONVER

[Scikit-learn-general] A possible solution to templated types in cython

2011-10-19 Thread Brian Holt
This is cross-posted from the scikits.image mailing list; It was so interesting, I thought it a waste not to use the opportunity. We've had a number of discussions on cython types, and how we wish that cython would support some sort of templates. This would be very useful for the `tree` module (t

Re: [Scikit-learn-general] Log approximation in tree entropy criterion

2011-10-17 Thread Brian Holt
+1 even though it's not as accurate. If the tests pass, then it's accurate enough IMHO.

[Scikit-learn-general] Meanshift with one cluster centre

2011-10-05 Thread Brian Holt
Is there a way to specify the number of cluster centres required by meanshift? From the documentation and a bit of playing around, it seems like the algorithm decides how many cluster centres to discover...

[Scikit-learn-general] Request for comments on pull request #310 (Decision Tree)

2011-10-05 Thread Brian Holt
Hi, PR310 is nearly ready to be merged, if anyone has any further comment, please let me know. Link: https://github.com/scikit-learn/scikit-learn/pull/310 This pull request contains an implementation of Classification and Regression Trees. This version is highly optimised and is significantly fas

Re: [Scikit-learn-general] Bayesian inference

2011-09-19 Thread Brian Holt
> As for the Bayesian inference setting, there is already PyMC, and I think the > focus should be on improving that project rather than trying to make > scikit-learn do everything. Thanks David! I've spent hours looking for a package that does inference in python (hence this email) and PyMC looks

[Scikit-learn-general] Bayesian inference

2011-09-19 Thread Brian Holt
Does [Bayesian Inference](http://en.wikipedia.org/wiki/Bayesian_inference) fall under the scope of scikit-learn? Probabilistic graphical models are an exciting field in machine learning, with the theory going back at least as far as 1982. If it is of interest, then the obvious question is: do we r