Hello,

I have been prototyping a text classification model that my company would eventually like to put into production. Our technology stack is currently Java-based, but we would like to build our models in Spark/MLlib and then export something like a PMML file that can be used for real-time model scoring.
I have been using scikit-learn, where I convert the text in the training data into a sparse format and then use scikit-learn's DictVectorizer to one-hot encode the other, categorical features. All of those steps seem to be possible in MLlib, but I am still puzzled about how they can be packaged so that incoming data is first turned into feature vectors and then scored. Are there any best practices for this type of thing in Spark?

I hope this is clear, but if anything is confusing, please let me know.

Thanks,
Chirag
