Hello,
I am working on optimising the performance of a Java ML/NLP application
built on Spark / Spark NLP. For prediction, I apply a trained model to a
Spark dataset consisting of a single column with a single row. The
dataset is created like this:
List<String> textList = Collections.singletonList(text);
Dataset<Row> data = sparkSession
        .createDataset(textList, Encoders.STRING())
        .withColumnRenamed(COL_VALUE, COL_TEXT);
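One variant I am considering, sketched below, is hoisting the encoder out of the per-prediction path: `Encoders.STRING()` appears to construct a fresh encoder on every call, so caching it in a constant might remove that cost. The column names and the helper class here are placeholders for our own `COL_VALUE` / `COL_TEXT` constants, not part of any Spark API (`"value"` is the default column name of a `Dataset<String>`):

```java
import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class SingleTextDataset {

    // Placeholder column names standing in for our COL_VALUE / COL_TEXT
    // constants; "value" is the default column name of a Dataset<String>.
    private static final String COL_VALUE = "value";
    private static final String COL_TEXT = "text";

    // Create the encoder once and reuse it, instead of calling
    // Encoders.STRING() inside every prediction call.
    private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

    // Build the one-column, one-row dataset from a single input string.
    static Dataset<Row> of(SparkSession sparkSession, String text) {
        List<String> textList = Collections.singletonList(text);
        return sparkSession
                .createDataset(textList, STRING_ENCODER)
                .withColumnRenamed(COL_VALUE, COL_TEXT);
    }
}
```

I have not yet verified whether the cached encoder is safe to share across calls, so treat this as an untested sketch rather than a known fix.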
The predictions are created like this:
PipelineModel fittedPipeline = pipeline.fit(data);
Dataset<Row> prediction = fittedPipeline.transform(data);
We noticed that the performance isn't quite as good as expected. After
profiling the application with VisualVM, I found that the bottleneck is
the call to org.apache.spark.sql.Encoders.STRING() during dataset
creation: it alone accounts for about 75% of the total time of the
prediction method call.
So, is there a simpler and more efficient way to create the required
dataset, i.e. one column holding a single String row?
Thanks a lot.
Cheers,
Martin