Re: [Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-25 Thread Joseph Turian
Chris Lin, IIRC, has advocated partitioning the examples and then concatenating the 
individual classifiers.

You could do that and then do a second pass of learning: find the 1% of 
examples that are the hardest for the ensemble and learn over them.

Regardless, it will be ad hoc unless you use an out-of-core algorithm.
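
A minimal sketch of the partition-then-combine idea described above, under stated assumptions: the helper names (train_on_partitions, predict_majority, hardest_indices) are hypothetical, the models are combined by simple majority vote, and any estimator with scikit-learn style fit(X, y) and predict(X) methods is assumed to work as make_model. It is not taken from the paper mentioned above.

```python
from collections import Counter


def train_on_partitions(partitions, make_model):
    """Fit one fresh model per (X, y) partition and return them as a list.

    `partitions` can be a generator that loads one chunk at a time, so only
    a single chunk has to be in memory while its model is being trained.
    """
    models = []
    for X_chunk, y_chunk in partitions:
        model = make_model()
        model.fit(X_chunk, y_chunk)
        models.append(model)
    return models


def vote_counts(models, X):
    """For each row of X, count how many models predict each label."""
    counts = [Counter() for _ in X]
    for model in models:
        for row_counts, label in zip(counts, model.predict(X)):
            row_counts[label] += 1
    return counts


def predict_majority(models, X):
    """Combine the per-partition models by simple majority vote."""
    return [c.most_common(1)[0][0] for c in vote_counts(models, X)]


def hardest_indices(models, X, y, fraction=0.01):
    """Return indices of the examples whose true label gets the fewest votes."""
    counts = vote_counts(models, X)
    support = [c[label] for c, label in zip(counts, y)]
    n_hard = max(1, int(len(y) * fraction))
    return sorted(range(len(y)), key=lambda i: support[i])[:n_hard]
```

For the second pass, one option would be to fit one more model on the rows selected by hardest_indices and append it to the list. With scikit-learn, make_model could be something like lambda: RandomForestClassifier(n_estimators=10), but that pairing is an assumption and has not been tested against the dataset discussed in this thread.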

Sent from my iPhone

On Sep 24, 2012, at 12:23 PM, Christian Jauvin cjau...@gmail.com wrote:

 Thank you Olivier for these suggestions.
 
 I'd try/test them with pleasure, but meanwhile I discovered that there
 was just no way the dataset I was trying to use would ever fit in the
 72GB of memory of the machine I'm using. So I just scaled it down, and
 obviously this error is not happening anymore.
 
 But I'd be curious to know if there is any mechanism I could use to
 allow a Random Forest classifier to work with bigger datasets (than
 what simply fits in memory)?
 
 Thanks!
 
 

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-25 Thread Olivier Grisel
2012/9/24 Joseph Turian jos...@metaoptimize.com:
 Chris Lin, IIRC, has advocated partitioning the examples and then concatenating the 
 individual classifiers.

 You could do that and then do a second pass of learning: find the 1% of 
 examples that are the hardest for the ensemble and learn over them.

 Regardless, it will be ad hoc unless you use an out-of-core algorithm.

Interesting, do you have a link to the paper?

Gilles' paper I was mentioning previously is here:
http://www.cs.bris.ac.uk/~flach/ECMLPKDD2012papers/1125540.pdf

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel



Re: [Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-24 Thread Christian Jauvin
Thank you Olivier for these suggestions.

I'd try/test them with pleasure, but meanwhile I discovered that there
was just no way the dataset I was trying to use would ever fit in the
72GB of memory of the machine I'm using. So I just scaled it down, and
obviously this error is not happening anymore.
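
A back-of-the-envelope check of this kind can be sketched as follows. The shape used here is hypothetical (the thread does not give the real dimensions), and the function name dense_array_gib is made up for illustration. It only counts a dense array of fixed-width values and ignores any extra copies made for worker processes.

```python
def dense_array_gib(n_samples, n_features, bytes_per_value=8, n_copies=1):
    """Size in GiB of n_copies dense arrays of shape (n_samples, n_features)."""
    return n_copies * n_samples * n_features * bytes_per_value / 2**30


# Hypothetical shape, not taken from the thread: 50 million rows, 200 float64 features.
size = dense_array_gib(50_000_000, 200)
print(f"{size:.1f} GiB")  # prints "74.5 GiB", already more than 72 GB before any copy
```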

But I'd be curious to know if there is any mechanism I could use to
allow a Random Forest classifier to work with bigger datasets (than
what simply fits in memory)?

Thanks!


On 22 September 2012 16:18, Olivier Grisel olivier.gri...@ensta.org wrote:
 2012/9/22 Christian Jauvin cjau...@gmail.com:
 Hi,

 I have been doing multiple experiments using a RandomForestClassifier
 (trained with the parallel code option) recently, without encountering
 any particular problem. However as soon as I began using a much bigger
 dataset (with the exact same code), I got this threading error:

 Exception in thread Thread-2:
 Traceback (most recent call last):
   File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
     self.run()
   File "/usr/lib/python2.7/threading.py", line 504, in run
     self.__target(*self.__args, **self.__kwargs)
   File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
     put(task)
 SystemError: NULL result without error in PyObject_Call

 I can provide additional details of course, but first maybe there is
 something in particular I should be aware of, about size or memory
 limit of the underlying objects in question?


 It can be a memory error as the current implementation is very bad at
 managing the memory.

 You can try to replace the joblib folder in the sklearn source tree with
 the pickling-pool branch of my repo:

 https://github.com/joblib/joblib/pull/44

 That should help a lot. You can further memmap your original dataset,
 as explained in the following doc, to get an even better reduction in
 memory usage:

 https://github.com/ogrisel/joblib/blob/pickling-pool/doc/parallel_numpy.rst
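
As a rough illustration of what memory-mapping a dataset means, here is a sketch that uses plain numpy.memmap rather than the API of the joblib branch linked above (which is not reproduced here). The tiny array and the file name X.dat are placeholders.

```python
import os
import tempfile

import numpy as np

# Hypothetical small array standing in for the real training data.
X = np.arange(12, dtype=np.float64).reshape(4, 3)

# Write the data once to a file on disk as a raw binary buffer.
path = os.path.join(tempfile.mkdtemp(), "X.dat")
X_out = np.memmap(path, dtype=X.dtype, mode="w+", shape=X.shape)
X_out[:] = X
X_out.flush()
del X_out

# Re-open read-only: pages are read lazily from disk, and several processes
# mapping the same file can share them through the OS page cache.
X_mm = np.memmap(path, dtype=np.float64, mode="r", shape=(4, 3))
print(X_mm[2])
```

The dtype and shape are not stored in the raw file, so they have to be supplied again when re-opening it.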

 You might also want to set the TMP environment variable to a folder on
 a big partition.
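
The variable is normally set in the shell before launching Python. The sketch below shows the same effect from inside a process, assuming the temporary files are created through Python's standard tempfile module (whether the branch above does exactly that is an assumption). The helper name use_scratch_dir is hypothetical.

```python
import os
import tempfile


def use_scratch_dir(path):
    """Point Python's tempfile module at `path` via the TMP variable."""
    # tempfile checks TMPDIR, then TEMP, then TMP; clear the first two so TMP wins.
    os.environ.pop("TMPDIR", None)
    os.environ.pop("TEMP", None)
    os.environ["TMP"] = path
    # Drop the cached default so the environment is consulted again.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

The directory must already exist and be writable, otherwise tempfile silently falls back to its other candidate locations.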

 I am very interested in any feedback while using this branch.

 --
 Olivier
 http://twitter.com/ogrisel - http://github.com/ogrisel




Re: [Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-24 Thread Olivier Grisel
I think @glouppe is likely to contribute some improvements to the
ensemble-of-trees models once he gets back from ECML 2012, where he has a
paper on those issues.


[Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-22 Thread Christian Jauvin
Hi,

I have been doing multiple experiments using a RandomForestClassifier
(trained with the parallel code option) recently, without encountering
any particular problem. However as soon as I began using a much bigger
dataset (with the exact same code), I got this threading error:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
SystemError: NULL result without error in PyObject_Call

I can provide additional details of course, but first maybe there is
something in particular I should be aware of, about size or memory
limit of the underlying objects in question?

Thanks,

Christian
