Re: [Scikit-learn-general] Distributed RandomForests

2013-04-27 Thread Andreas Mueller
Hi Youssef. I would strongly advise you to use an image-specific random forest implementation. There is a very good implementation by some other MSRC people: http://research.microsoft.com/en-us/downloads/03e0ca05-8aa9-49f6-801f-bb23846dc147/ It implements a much more complicated model, decision

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-26 Thread Youssef Barhomi
Thank you Peter, I found that the feature extraction was taking a lot of extra memory and that was not related to wiseRF, so you were right. Actually, from top it seems the training part was taking only about 20% more memory than the size of the dataset itself, which is pretty impressive. So at

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-25 Thread Gilles Louppe
Hi Youssef, Regarding memory usage, you should know that it'll basically blow up if you increase the number of jobs. With the current implementation, you'll need O(n_jobs * |X| * 2) in memory space (where |X| is the size of X, in bytes). That issue stems from the use of joblib, which basically
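A back-of-envelope sketch of the memory growth Gilles describes (illustrative only, not an sklearn API; the factor-of-two and per-worker copy follow his O(n_jobs * |X| * 2) estimate):

```python
import numpy as np

# With the joblib-based parallelism described above, each worker process
# receives its own copy of the training matrix X, so peak memory grows
# roughly as n_jobs * |X| * 2 (|X| = X.nbytes).
X = np.zeros((1_000_000, 10), dtype=np.float64)  # 1e6 samples, 10 features: ~80 MB
n_jobs = 4

estimated_peak_bytes = n_jobs * X.nbytes * 2
print(estimated_peak_bytes / 1e9)  # ~0.64 GB for this toy shape
```

So even a modest n_jobs multiplies the footprint of a dataset that already fits comfortably in RAM when trained serially.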

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-25 Thread Peter Prettenhofer
Hi Youssef, please make sure that you use the latest version of sklearn (>= 0.13) - we did some enhancements to the sub-sampling procedure lately. Looking at the RandomForest code, it seems that n_jobs=-1 should not be the issue for the parallel training of the trees, since ``n_jobs =

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-25 Thread Youssef Barhomi
Thank you very much Peter, you are right about the n_jobs; something was going wrong with that. When n_jobs = -1, for a larger dataset (1e6 samples in this case), no CPU was being used and the process was hanging for a while. Setting n_jobs = 1 made everything work. Yes, I will look into the IPython
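A minimal sketch of the work-around Youssef describes (synthetic data and small sizes for illustration; the thread's dataset was around 1e6 samples):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Work-around from the thread: train serially with n_jobs=1 instead of
# n_jobs=-1, which hung on large inputs (and, per Gilles, multiplies
# memory usage by the number of workers).
clf = RandomForestClassifier(n_estimators=10, n_jobs=1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

This trades training speed for predictable memory use and avoids the hang observed with the parallel code path.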

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-25 Thread Youssef Barhomi
Ohh, makes total sense now!! Thank you Gilles!! Y On Thu, Apr 25, 2013 at 2:38 AM, Gilles Louppe g.lou...@gmail.com wrote: Hi Youssef, Regarding memory usage, you should know that it'll basically blow up if you increase the number of jobs. With the current implementation, you'll need

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-25 Thread Youssef Barhomi
Hi Brian, thanks for your feedback. Were you able to reproduce their results? How big was the dataset that you have processed so far with an RF? The MS people used a distributed RF, so yes, I am guessing the features were being computed in parallel on all these cores. Though, I am still new

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-25 Thread Ronnie Ghose
I've tried larger data sets. It wasn't pretty; much fewer features, though. On Apr 25, 2013 4:03 AM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Hi Youssef, please make sure that you use the latest version of sklearn (>= 0.13) - we did some enhancements to the sub-sampling procedure

Re: [Scikit-learn-general] Distributed RandomForests

2013-04-24 Thread Brian Holt
Hi Youssef, You're trying to do exactly what I did. The first thing to note is that the Microsoft guys don't precompute the features; rather, they compute them on the fly. That means they only need enough memory to store the depth images, and since they have a 1000-core cluster, computing the
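A hypothetical sketch of the "compute features on the fly" idea Brian mentions: keep only the raw depth images in memory and evaluate a feature for a given pixel on demand, instead of materialising a full n_samples x n_features matrix. The depth-difference feature below is in the style of the Kinect body-part work (offsets scaled by the depth at the reference pixel); the function name and signature are my own illustration, not the MSRC code.

```python
import numpy as np

def depth_difference_feature(depth, y, x, u, v):
    """Depth-invariant offset comparison: probe two offsets around (y, x),
    each scaled by the reference depth, and return the depth difference.
    Out-of-bounds probes are treated as 'very far' (infinite depth)."""
    d = depth[y, x]
    uy, ux = int(y + u[0] / d), int(x + u[1] / d)
    vy, vx = int(y + v[0] / d), int(x + v[1] / d)
    h, w = depth.shape
    du = depth[uy, ux] if 0 <= uy < h and 0 <= ux < w else np.inf
    dv = depth[vy, vx] if 0 <= vy < h and 0 <= vx < w else np.inf
    return du - dv

# On a constant-depth image every offset comparison is zero.
depth = np.ones((64, 64)) * 2.0
print(depth_difference_feature(depth, 32, 32, (10.0, 0.0), (0.0, 10.0)))  # 0.0
```

The point is the memory profile: a tree node only ever asks for individual (pixel, feature) values, so nothing beyond the depth images themselves needs to be stored, which is what makes the approach feasible on a large cluster.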