I have been able to partially fix this issue by creating a static final field (i.e. a constant) for Encoders.STRING(); a sketch is included after the list below. This removes the bottleneck associated with instantiating the Encoder, but the cost has merely shifted to these two methods:

org.apache.spark.sql.SparkSession.createDataset (in the code below)

org.apache.spark.sql.Dataset.toLocalIterator()

(each accounting for ca. 40% of the execution time)
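
For reference, the constant I am using looks roughly like this (a minimal sketch; the field name STRING_ENCODER is just my naming):

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Instantiate the String encoder once and reuse it for every prediction call,
// instead of calling Encoders.STRING() on each invocation.
private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

// ... later, in the prediction method:
Dataset<Row> data = sparkSession
        .createDataset(textList, STRING_ENCODER)
        .withColumnRenamed(COL_VALUE, COL_TEXT);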

The second one is called when extracting the prediction results from the dataset:

Dataset<Row> datasetWithPredictions = predictor.predict(text);

Dataset<Row> tokensWithPredictions = datasetWithPredictions.select(TOKEN_RESULT, TOKEN_BEGIN, TOKEN_END, PREDICTION_RESULT);

Iterator<Row> rowIt = tokensWithPredictions.toLocalIterator();

while(rowIt.hasNext()) {
    Row row = rowIt.next();
    [...] // do stuff here to convert the row
}
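
The conversion step then just reads the four selected columns off each Row, roughly along these lines (a hypothetical sketch; the concrete types String/Integer are my assumption based on the column names):

// Read each selected column by name via Row.getAs (types assumed).
String token = row.getAs(TOKEN_RESULT);
Integer begin = row.getAs(TOKEN_BEGIN);
Integer end = row.getAs(TOKEN_END);
String prediction = row.getAs(PREDICTION_RESULT);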

Any ideas on how I might be able to further optimize this?

Cheers,

Martin

On 2022-02-18 07:42, mar...@wunderlich.com wrote:

Hello,

I am working on optimising the performance of a Java ML/NLP application based on Spark / SparkNLP. For prediction, I am applying a trained model to a Spark dataset that consists of a single column with only one row. The dataset is created like this:

List<String> textList = Collections.singletonList(text);
Dataset<Row> data = sparkSession
        .createDataset(textList, Encoders.STRING())
        .withColumnRenamed(COL_VALUE, COL_TEXT);

The predictions are created like this:

PipelineModel fittedPipeline = pipeline.fit(dataset);

Dataset<Row> prediction = fittedPipeline.transform(dataset);

We noticed that the performance isn't quite as good as expected. After profiling the application with VisualVM, I found that the bottleneck is org.apache.spark.sql.Encoders.STRING() in the creation of the dataset, which by itself takes up about 75% of the time for the whole prediction method call.

So, is there a simpler and more efficient way of creating the required dataset, consisting of one column and one String row?

Thanks a lot.

Cheers,

Martin
