I should point out that I'm not sure what the performance of that project is.
I'd expect that the native DataFrame in PySpark will be significantly more
efficient than their DictRDD.
It would be interesting to see a performance comparison for the pipelines
relative to native Spark ML.
I fear you have to do the plumbing all yourself. This is the same for all
commercial and non-commercial libraries/analytics packages. How you
distribute also often depends on the functional requirements.
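To make the "plumbing" concrete: the usual pattern is to fit a scikit-learn model on the driver and then score partitions in parallel on the executors. Below is a minimal sketch under assumptions not stated in the thread (toy data, a hypothetical helper name `predict_partition`, and a SparkContext `sc` shown only in comments, since the scoring function itself is plain scikit-learn):

```python
# Hypothetical sketch of doing the plumbing yourself: fit a scikit-learn
# model on the driver, then apply it per partition on the executors.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on the driver (toy, perfectly separable data for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

def predict_partition(rows):
    # Scores one partition's worth of feature rows with the fitted model;
    # returns an iterator, as mapPartitions expects.
    rows = list(rows)
    return iter(model.predict(np.array(rows))) if rows else iter([])

# On a cluster the same function runs per partition (requires pyspark,
# with `sc` an existing SparkContext -- sketched only as a comment):
# bc = sc.broadcast(model)
# preds = (sc.parallelize(X.tolist())
#            .mapPartitions(lambda rows: bc.value.predict(
#                np.array(list(rows))))
#            .collect())

# Local check of the per-partition helper:
local_preds = list(predict_partition([[0.0], [3.0]]))
```

Broadcasting the fitted model once (rather than closing over it in every task) keeps serialization overhead down; this is the kind of manual work that native Spark ML pipelines handle for you.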
On Sat, Sep 12, 2015 at 8:18 PM, Rex X wrote:
> Hi everyone,
>
>
Jörn and Nick,
Thanks for answering.
Nick, the sparkit-learn project looks interesting. Thanks for mentioning it.
Rex
On Sat, Sep 12, 2015 at 12:05 PM, Nick Pentreath wrote:
> You might want to check out https://github.com/lensacom/sparkit-learn
>
Hi everyone,
What is the best way to migrate existing scikit-learn code to PySpark
cluster? Then we can bring together the full power of both scikit-learn and
Spark to do scalable machine learning. (I know we have MLlib, but the
existing code base is big, and some functions are not fully supported.)
You might want to check out https://github.com/lensacom/sparkit-learn
Though it's true that for random forests / trees you will need to use MLlib.
On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke wrote:
> I fear you have to do the plumbing all yourself.