Hi! I'm looking for the most performant way to run prediction with a trained model. Some users may want to predict just a couple of samples (literally one or two), while others run prediction on tens of thousands. Unsurprisingly, there is overhead in loading data into the cluster even for a couple of samples, so to avoid that overhead one might run prediction directly on the driver.
That's possible with the MLlib API, because RandomForestModel lets me pass just a feature Vector instead of an RDD. But it doesn't seem possible with the PipelineModel API, which accepts only a DataFrame (which in turn requires an SQLContext and a SparkContext to build). Is there any workaround for the PipelineModel API? Do you think it would be useful to anyone else besides me? If so, I'll file a feature request. -- Be well! Jean Morozov
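For reference, here is a minimal sketch of the difference I mean (the model path and feature values are hypothetical, and this assumes the Spark 1.x APIs):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

// mllib API: predict() takes a single Vector, so a one-off prediction
// runs locally on the driver with no RDD or cluster round-trip.
val model: RandomForestModel = RandomForestModel.load(sc, "path/to/model")
val prediction: Double = model.predict(Vectors.dense(0.1, 0.2, 0.3))

// ml PipelineModel API: the only entry point is transform(DataFrame),
// so even a single sample must be wrapped in a DataFrame first:
// import sqlContext.implicits._
// val df = sqlContext.createDataFrame(
//   Seq(Tuple1(Vectors.dense(0.1, 0.2, 0.3)))).toDF("features")
// val result = pipelineModel.transform(df)
```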