Hello,

I am working on optimising the performance of a Java ML/NLP application based on Spark / SparkNLP. For prediction, I apply a trained model to a Spark dataset consisting of a single column with a single row. The dataset is created like this:

    List<String> textList = Collections.singletonList(text);
    Dataset<Row> data = sparkSession
        .createDataset(textList, Encoders.STRING())
        .withColumnRenamed(COL_VALUE, COL_TEXT);

The predictions are created like this:

    PipelineModel fittedPipeline = pipeline.fit(data);

    Dataset<Row> prediction = fittedPipeline.transform(data);

We noticed that the performance isn't quite as good as expected. After profiling the application with VisualVM, I found that the bottleneck is org.apache.spark.sql.Encoders.STRING() in the creation of the dataset: that call by itself accounts for about 75% of the time of the whole prediction method call.
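One workaround I am experimenting with (just a sketch, I have not yet confirmed it eliminates the overhead) is to create the encoder once and reuse it across predictions, since each call to Encoders.STRING() appears to construct a fresh encoder:

    import java.util.Collections;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;

    // Build the encoder once and cache it, instead of calling
    // Encoders.STRING() on every prediction request.
    private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

    private Dataset<Row> createInputDataset(String text) {
        // Same single-column, single-row dataset as before, but
        // reusing the cached encoder.
        return sparkSession
            .createDataset(Collections.singletonList(text), STRING_ENCODER)
            .withColumnRenamed(COL_VALUE, COL_TEXT);
    }

I don't know whether the encoder is safe to share this way in all cases, so I'd appreciate confirmation either way.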

So, is there a simpler and more efficient way of creating the required dataset, i.e. a single String column with a single row?

Thanks a lot.

Cheers,

Martin
