Addendum: I have tried replacing the toLocalIterator() call with a forEach() call on the dataset directly, but this hasn't improved performance.

If the iteration over the results is itself the bottleneck, there probably isn't much more that can be done, other than perhaps batching the prediction calls instead of running them line by line on the input file (a rough sketch of what I mean follows).
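For illustration, a minimal sketch of the batching idea, reusing the names from the snippets further down in this thread; the input path and reading via Files.readAllLines() are placeholders of mine, not our actual I/O code:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Read all input lines up front (placeholder path).
List<String> lines = Files.readAllLines(Paths.get("input.txt"));

// Create a single Dataset containing every line ...
Dataset<Row> batch = sparkSession
        .createDataset(lines, Encoders.STRING())
        .withColumnRenamed(COL_VALUE, COL_TEXT);

// ... and run one transform over the whole batch instead of one per line.
Dataset<Row> predictions = fittedPipeline.transform(batch);

// Fetch the selected columns once on the driver
// (assumes the results fit in driver memory).
for (Row row : predictions
        .select(TOKEN_RESULT, TOKEN_BEGIN, TOKEN_END, PREDICTION_RESULT)
        .collectAsList()) {
    // convert each row here, as in the per-line version
}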

Cheers,

Martin

On 2022-02-18 09:41, mar...@wunderlich.com wrote:

I have been able to partially fix this issue by creating a static final field (i.e. a constant) for Encoders.STRING(), sketched after the list below. This removes the bottleneck associated with instantiating this Encoder. However, it only moved the performance issue to these two methods:

org.apache.spark.sql.SparkSession.createDataset (in the code below)

org.apache.spark.sql.Dataset.toLocalIterator()

(ca. 40% of the execution time each)
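For reference, here is a sketch of the constant change; the field name STRING_ENCODER is my own choice:

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Instantiate the String encoder once and reuse it for every prediction call,
// instead of calling Encoders.STRING() per request.
private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

// Inside the prediction method, the dataset creation then becomes:
Dataset<Row> data = sparkSession
        .createDataset(textList, STRING_ENCODER)
        .withColumnRenamed(COL_VALUE, COL_TEXT);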

The second method, Dataset.toLocalIterator(), is called when extracting the prediction results from the dataset:

Dataset<Row> datasetWithPredictions = predictor.predict(text);

Dataset<Row> tokensWithPredictions = datasetWithPredictions
        .select(TOKEN_RESULT, TOKEN_BEGIN, TOKEN_END, PREDICTION_RESULT);

Iterator<Row> rowIt = tokensWithPredictions.toLocalIterator();

while (rowIt.hasNext()) {
    Row row = rowIt.next();
    [...] // do stuff here to convert the row
}

Any ideas on how I might be able to further optimise this?

Cheers,

Martin

On 2022-02-18 07:42, mar...@wunderlich.com wrote:

Hello,

I am working on optimising the performance of a Java ML/NLP application based on Spark / SparkNLP. For prediction, I am applying a trained model to a Spark dataset consisting of one column with only one row. The dataset is created like this:

List<String> textList = Collections.singletonList(text);

Dataset<Row> data = sparkSession
        .createDataset(textList, Encoders.STRING())
        .withColumnRenamed(COL_VALUE, COL_TEXT);

The predictions are created like this:

PipelineModel fittedPipeline = pipeline.fit(dataset);

Dataset<Row> prediction = fittedPipeline.transform(dataset);

We noticed that the performance isn't quite as good as expected. After profiling the application with VisualVM, I found that the bottleneck is org.apache.spark.sql.Encoders.STRING() in the creation of the dataset, which by itself accounts for about 75% of the time of the whole prediction method call.

So, is there a simpler and more efficient way of creating the required dataset, consisting of one column and one String row?

Thanks a lot.

Cheers,

Martin
