Addendum: I have tried to replace toLocalIterator() with a foreach()
call on the dataset directly, but this hasn't improved the performance.
If the iteration itself is the bottleneck, there probably isn't much
that can be done to improve things further, other than perhaps batching
the prediction calls instead of running them line by line on the input
file.
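For the record, the batched variant I have in mind would look roughly
like this (untested sketch; inputFile stands in for the path to the
input file, and sparkSession, fittedPipeline, COL_VALUE and COL_TEXT are
the same objects and constants as in the code further down the thread):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Read the whole input file up front...
List<String> lines = Files.readAllLines(Paths.get(inputFile));
// ...build one multi-row dataset from it...
Dataset<Row> batch = sparkSession
    .createDataset(lines, Encoders.STRING())
    .withColumnRenamed(COL_VALUE, COL_TEXT);
// ...and run a single transform() over all lines instead of one per line.
Dataset<Row> predictions = fittedPipeline.transform(batch);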
Cheers,
Martin
On 2022-02-18 09:41, mar...@wunderlich.com wrote:
I have been able to partially fix this issue by creating a static final
field (i.e. a constant) for Encoders.STRING(). This removes the
bottleneck associated with re-instantiating the Encoder on every call.
However, it merely moved the performance issue to these two methods:
org.apache.spark.sql.SparkSession.createDataset (in the code below)
org.apache.spark.sql.Dataset.toLocalIterator()
(ca. 40% of the execution time each)
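For reference, the constant is just this (the field name STRING_ENCODER
is simply what I picked):

private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

It is then passed to createDataset() in place of the inline
Encoders.STRING() call.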
The second one is called when extracting the prediction results from
the dataset:
Dataset<Row> datasetWithPredictions = predictor.predict(text);
Dataset<Row> tokensWithPredictions =
    datasetWithPredictions.select(TOKEN_RESULT, TOKEN_BEGIN,
        TOKEN_END, PREDICTION_RESULT);
Iterator<Row> rowIt = tokensWithPredictions.toLocalIterator();
while (rowIt.hasNext()) {
    Row row = rowIt.next();
    [...] // do stuff here to convert the row
}
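One variant I have not benchmarked yet (sketch only; it assumes the
per-call result always fits comfortably in driver memory, which it
should for a single input row): replacing the iterator with
collectAsList(). As far as I understand, toLocalIterator() fetches the
partitions one at a time and can trigger a separate Spark job per
partition, whereas collectAsList() brings everything back in one job:

List<Row> rows = tokensWithPredictions.collectAsList();
for (Row row : rows) {
    // same per-row conversion as in the loop above
}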
Any ideas on how I might further optimize this?
Cheers,
Martin
On 2022-02-18 07:42, mar...@wunderlich.com wrote:
Hello,
I am working on optimising the performance of a Java ML/NLP
application based on Spark / SparkNLP. For prediction, I am applying a
trained model to a Spark dataset that consists of a single column and
a single row. The dataset is created like this:
List<String> textList = Collections.singletonList(text);
Dataset<Row> data = sparkSession
    .createDataset(textList, Encoders.STRING())
    .withColumnRenamed(COL_VALUE, COL_TEXT);
The predictions are created like this:
PipelineModel fittedPipeline = pipeline.fit(data);
Dataset<Row> prediction = fittedPipeline.transform(data);
We noticed that the performance isn't quite as good as expected. After
profiling the application with VisualVM, I found that the bottleneck is
org.apache.spark.sql.Encoders.STRING() in the creation of the dataset:
by itself, it accounts for about 75% of the execution time of the whole
prediction method call.
So, is there a simpler and more efficient way of creating the required
dataset, consisting of one column and one String row?
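One idea I have considered, but not yet tested (sketch only; the schema
is built by hand here), would be to bypass the Encoder and the rename
entirely by creating the single-row DataFrame with an explicit schema:

import java.util.Collections;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType schema = new StructType().add(COL_TEXT, DataTypes.StringType);
Dataset<Row> data = sparkSession.createDataFrame(
    Collections.singletonList(RowFactory.create(text)), schema);

I don't know whether this is actually any cheaper, though.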
Thanks a lot.
Cheers,
Martin