I fear you have to do the plumbing yourself. This is the same for all commercial and non-commercial libraries/analytics packages. How you distribute the work also often depends on your functional requirements.
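For example, here is a minimal sketch of the kind of plumbing involved, under the assumption that the training data fits on the driver and you mainly need to scale out prediction (the SparkContext name and the toy data below are placeholders, not anything from your code): fit a DecisionTreeClassifier locally, broadcast the pickled model to the executors, and score partitions of an RDD in parallel with mapPartitions.

    from pyspark import SparkContext
    from sklearn.tree import DecisionTreeClassifier

    sc = SparkContext(appName="sklearn-on-spark")

    # Fit on the driver; the training set must fit in driver memory.
    X_train = [[0, 0], [1, 1], [0, 1], [1, 0]]
    y_train = [0, 1, 1, 0]
    clf = DecisionTreeClassifier().fit(X_train, y_train)

    # Ship the fitted (picklable) model to every executor once.
    clf_broadcast = sc.broadcast(clf)

    def predict_partition(rows):
        # Runs on the executors; one model lookup per partition.
        model = clf_broadcast.value
        return model.predict(list(rows))

    # Score a large RDD of feature vectors in parallel.
    features = sc.parallelize([[0, 0], [1, 1]] * 1000, numSlices=8)
    predictions = features.mapPartitions(predict_partition)
    print(predictions.take(5))

    sc.stop()

Note that this only parallelizes the embarrassingly parallel parts (prediction, and similarly cross-validation or grid search over parameter settings). Training a single model across nodes is exactly the part scikit-learn does not provide, which is why MLlib reimplements the algorithms.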
On Sat, Sep 12, 2015 at 20:18, Rex X <dnsr...@gmail.com> wrote:
> Hi everyone,
>
> What is the best way to migrate existing scikit-learn code to a PySpark
> cluster? Then we can bring together the full power of both scikit-learn
> and Spark to do scalable machine learning. (I know we have MLlib, but the
> existing code base is big, and some functions are not fully supported yet.)
>
> Currently I use the multiprocessing module of Python to boost the speed.
> But this only works on one node, and only while the data set is small.
>
> For many real cases, we may need to deal with gigabytes or even terabytes
> of data, with thousands of raw categorical attributes, which can lead to
> millions of discrete features using a 1-of-k representation.
>
> For these cases, one solution is to use distributed memory. That's why I
> am considering Spark. And Spark supports Python! With PySpark, we can
> import scikit-learn.
>
> But the question is how to make the scikit-learn code, a DecisionTree
> classifier for example, run in distributed computing mode, to benefit
> from the power of Spark?
>
> Best,
> Rex