Re: [Scikit-learn-general] threading error when training a RFC on a big dataset
Chris Lin, IIRC, has advocated partitioning the examples and then concatenating the individual classifiers. You could do that and then do a second pass of learning: find the 1% of examples that are the hardest for the ensemble and learn over them.

Regardless, it will be ad hoc unless you use an out-of-core algorithm.

Sent from my iPhone

On Sep 24, 2012, at 12:23 PM, Christian Jauvin cjau...@gmail.com wrote:

> Thank you Olivier for these suggestions. I'd try/test them with pleasure, but meanwhile I discovered that there was just no way the dataset I was trying to use would ever fit in the 72GB of memory of the machine I'm using. So I just scaled it down, and obviously this error is not happening anymore.
>
> But I'd be curious to know whether there is any mechanism I could use to allow a Random Forest classifier to work with datasets bigger than what fits in memory?
>
> Thanks!

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
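The partition-then-combine idea described above can be sketched roughly as follows. This is only an illustration, not the method from any particular paper: the helper names (`fit_partitioned_forests`, `predict_proba_combined`, `hardest_examples`) are hypothetical, the forests are combined by averaging their class probabilities, and it assumes every chunk contains every class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_partitioned_forests(chunks, n_estimators=10, random_state=0):
    """Fit one forest per (X, y) chunk, so only one chunk is needed at a time."""
    forests = []
    for i, (X_chunk, y_chunk) in enumerate(chunks):
        forest = RandomForestClassifier(
            n_estimators=n_estimators, random_state=random_state + i
        )
        forest.fit(X_chunk, y_chunk)
        forests.append(forest)
    return forests


def predict_proba_combined(forests, X):
    """Average the class probabilities of the per-chunk forests."""
    classes = forests[0].classes_
    for forest in forests:
        # Assumption: every chunk saw every class, so the columns line up.
        if not np.array_equal(forest.classes_, classes):
            raise ValueError("all chunks must contain the same set of classes")
    return np.mean([f.predict_proba(X) for f in forests], axis=0)


def hardest_examples(forests, X, y, fraction=0.01):
    """Indices of the examples whose true class gets the lowest probability."""
    proba = predict_proba_combined(forests, X)
    # classes_ is sorted, so searchsorted maps each label to its column.
    true_col = np.searchsorted(forests[0].classes_, y)
    p_true = proba[np.arange(len(y)), true_col]
    n_hard = max(1, int(fraction * len(y)))
    return np.argsort(p_true)[:n_hard]


# Small synthetic demo; a real run would stream the chunks from disk.
rng = np.random.RandomState(0)
X = rng.randn(600, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
chunks = [(X[i:i + 200], y[i:i + 200]) for i in range(0, 600, 200)]

forests = fit_partitioned_forests(chunks)
proba = predict_proba_combined(forests, X)
hard_idx = hardest_examples(forests, X, y, fraction=0.01)
```

The indices in `hard_idx` are the candidates for the second pass; how to learn over them (an extra forest, reweighting, and so on) is left open here, as it is in the message.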
Re: [Scikit-learn-general] threading error when training a RFC on a big dataset
2012/9/24 Joseph Turian jos...@metaoptimize.com:
> Chris Lin, IIRC, has advocated partitioning the examples and then concatenating the individual classifiers. You could do that and then do a second pass of learning: find the 1% of examples that are the hardest for the ensemble and learn over them.
>
> Regardless, it will be ad hoc unless you use an out-of-core algorithm.

Interesting, do you have a link to the paper?

Gilles' paper that I mentioned previously is here:

http://www.cs.bris.ac.uk/~flach/ECMLPKDD2012papers/1125540.pdf

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Re: [Scikit-learn-general] threading error when training a RFC on a big dataset
Thank you Olivier for these suggestions. I'd try/test them with pleasure, but meanwhile I discovered that there was just no way the dataset I was trying to use would ever fit in the 72GB of memory of the machine I'm using. So I just scaled it down, and obviously this error is not happening anymore.

But I'd be curious to know whether there is any mechanism I could use to allow a Random Forest classifier to work with datasets bigger than what fits in memory?

Thanks!

On 22 September 2012 16:18, Olivier Grisel olivier.gri...@ensta.org wrote:
> 2012/9/22 Christian Jauvin cjau...@gmail.com:
>> Hi,
>>
>> I have been doing multiple experiments using a RandomForestClassifier (trained with the parallel code option) recently, without encountering any particular problem. However, as soon as I began using a much bigger dataset (with the exact same code), I got this threading error:
>>
>> Exception in thread Thread-2:
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
>>     self.run()
>>   File "/usr/lib/python2.7/threading.py", line 504, in run
>>     self.__target(*self.__args, **self.__kwargs)
>>   File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
>>     put(task)
>> SystemError: NULL result without error in PyObject_Call
>>
>> I can provide additional details of course, but first maybe there is something in particular I should be aware of, about size or memory limit of the underlying objects in question?
>
> It can be a memory error, as the current implementation is very bad at managing memory.
>
> You can try replacing the joblib folder in the sklearn source tree with the pickling-pool branch of my repo:
>
> https://github.com/joblib/joblib/pull/44
>
> That should help a lot. You can further memmap your original dataset, as explained in the following doc, to reduce memory usage even more:
>
> https://github.com/ogrisel/joblib/blob/pickling-pool/doc/parallel_numpy.rst
>
> You might also want to set the TMP environment variable to a folder on a big partition.
>
> I am very interested in any feedback while using this branch.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
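The memmap suggestion quoted above can be illustrated with plain NumPy. This is a minimal sketch and not the pickling-pool branch itself: the helper `to_memmap` is hypothetical, and whether a given estimator then avoids copying the data depends on the library version and the array dtype. The `folder` argument plays the role of the big partition; when it is omitted, Python's `tempfile` module picks the directory, and it consults the TMPDIR, TEMP and TMP environment variables.

```python
import os
import tempfile

import numpy as np


def to_memmap(array, folder=None):
    """Save `array` to a .npy file and reopen it as a read-only memory map.

    `folder` should sit on a partition with enough free space; by default
    the directory reported by tempfile.gettempdir() is used.
    """
    folder = folder or tempfile.gettempdir()
    fd, path = tempfile.mkstemp(suffix=".npy", dir=folder)
    os.close(fd)  # np.save reopens the file by name
    np.save(path, array)
    # mmap_mode="r" maps the file instead of reading it all into RAM.
    return np.load(path, mmap_mode="r"), path


# Tiny demo array; a real dataset would be written to disk once, up front.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
X_mm, path = to_memmap(X)
print(type(X_mm).__name__, X_mm.shape, path)
```

Worker processes that open the same file share its pages through the operating system's cache instead of each holding a private copy, which is the memory saving the advice is after.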
Re: [Scikit-learn-general] threading error when training a RFC on a big dataset
I think @glouppe is likely to contribute some changes to the ensemble-of-trees models once he gets back from ECML 2012, where he has a paper on those issues.
[Scikit-learn-general] threading error when training a RFC on a big dataset
Hi,

I have been doing multiple experiments using a RandomForestClassifier (trained with the parallel code option) recently, without encountering any particular problem. However, as soon as I began using a much bigger dataset (with the exact same code), I got this threading error:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
SystemError: NULL result without error in PyObject_Call

I can provide additional details of course, but first maybe there is something in particular I should be aware of, about size or memory limit of the underlying objects in question?

Thanks,

Christian
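For the question about size and memory limits, a back-of-the-envelope estimate is a cheap first check before training. The sketch below is only that: the helper names are hypothetical, the dataset shape is made up (the thread never states the real dimensions), and `copies=2` is a guess at how often the data ends up duplicated, for example when it is pickled and sent to worker processes.

```python
def dense_array_bytes(n_samples, n_features, itemsize=8):
    """Bytes needed for a dense n_samples x n_features array (8 = float64)."""
    return n_samples * n_features * itemsize


def fits_in_memory(n_samples, n_features, ram_bytes, itemsize=8, copies=2):
    """True if `copies` dense copies of the data fit in `ram_bytes`."""
    needed = copies * dense_array_bytes(n_samples, n_features, itemsize)
    return needed <= ram_bytes


GIB = 1024 ** 3

# Hypothetical shape, chosen only for illustration.
n_samples, n_features = 100_000_000, 200
needed = dense_array_bytes(n_samples, n_features)
print("one copy: %.1f GiB" % (needed / GIB))
print("fits in 72 GiB:", fits_in_memory(n_samples, n_features, 72 * GIB))
```

If even a single copy exceeds the available RAM, no change to the parallel backend will help, and the data has to be subsampled, partitioned, or handled out of core.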