Hi! I'm looking for the most performant way to run prediction with a trained model. Some users may want to predict just a couple of samples (literally one or two), while others run prediction on tens of thousands. Unsurprisingly, there is overhead in loading data into the cluster even for a couple of samples, so to avoid that overhead one might run prediction directly on the driver.
That's possible with the MLlib API, because RandomForestModel lets me pass just a feature Vector instead of an RDD. But it doesn't seem possible with the PipelineModel API, which accepts only a DataFrame (which in turn requires an SQLContext and a SparkContext to build). Is there any workaround for the PipelineModel API? Do you think it would be useful to anyone else besides me? If so, I'll file a feature request. -- Be well! Jean Morozov
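For reference, here is a minimal sketch of the difference I mean (the model path and feature values are hypothetical, and this assumes the Spark 1.x APIs):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

// mllib API: predict() takes a single Vector, so a one-off prediction
// runs locally on the driver with no RDD or cluster round-trip.
val model: RandomForestModel = RandomForestModel.load(sc, "path/to/model")
val prediction: Double = model.predict(Vectors.dense(0.1, 0.2, 0.3))

// ml PipelineModel API: the only entry point is transform(DataFrame),
// so even a single sample must be wrapped in a DataFrame first:
// import sqlContext.implicits._
// val df = sqlContext.createDataFrame(
//   Seq(Tuple1(Vectors.dense(0.1, 0.2, 0.3)))).toDF("features")
// val result = pipelineModel.transform(df)
```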