Hello,
I am working on optimising the performance of a Java ML/NLP application
built on Spark / Spark NLP. For prediction, I apply a trained model to a
Spark dataset consisting of a single column with a single row. The
dataset is created like this:
List<String> textList = Collections.singletonList(text);
Dataset<Row> data = sparkSession
        .createDataset(textList, Encoders.STRING())
        .withColumnRenamed(COL_VALUE, COL_TEXT);
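One variant I am considering, sketched below, is hoisting the encoder out of the per-prediction path: `Encoders.STRING()` appears to construct a fresh encoder on every call, so caching it in a constant might remove that cost. The column names and the helper class here are placeholders for our own `COL_VALUE` / `COL_TEXT` constants, not part of any Spark API (`"value"` is the default column name of a `Dataset<String>`):

```java
import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class SingleTextDataset {

    // Placeholder column names standing in for our COL_VALUE / COL_TEXT
    // constants; "value" is the default column name of a Dataset<String>.
    private static final String COL_VALUE = "value";
    private static final String COL_TEXT = "text";

    // Create the encoder once and reuse it, instead of calling
    // Encoders.STRING() inside every prediction call.
    private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

    // Build the one-column, one-row dataset from a single input string.
    static Dataset<Row> of(SparkSession sparkSession, String text) {
        List<String> textList = Collections.singletonList(text);
        return sparkSession
                .createDataset(textList, STRING_ENCODER)
                .withColumnRenamed(COL_VALUE, COL_TEXT);
    }
}
```

I have not yet verified whether the cached encoder is safe to share across calls, so treat this as an untested sketch rather than a known fix.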
The predictions are created like this:
PipelineModel fittedPipeline = pipeline.fit(data);
Dataset<Row> prediction = fittedPipeline.transform(data);
We noticed that the performance isn't quite as good as expected. After
profiling the application with VisualVM, I found that the bottleneck is
the call to org.apache.spark.sql.Encoders.STRING() during dataset
creation: it alone accounts for about 75% of the total time of the
prediction method call.
So, is there a simpler and more efficient way to create the required
dataset, i.e. one column holding a single String row?
Thanks a lot.
Cheers,
Martin