I have been able to partially fix this issue by creating a static final
field (i.e. a constant) for Encoders.STRING(), sketched below. This removes
the bottleneck caused by instantiating the Encoder on every call. However,
it only shifts the cost to these two methods:
org.apache.spark.sql.SparkSession.createDataset (in the code below)
org.apache.spark.sql.Dataset.toLocalIterator()
(ca. 40% of execution time each)
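For reference, the change amounts to roughly the following (a minimal
sketch; the constant name and the surrounding class are placeholders I
chose, not our actual code):

    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;

    // Reuse a single String encoder instead of calling Encoders.STRING()
    // for every prediction request.
    private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

    // ... later, when building the single-row dataset:
    Dataset<Row> data = sparkSession
        .createDataset(textList, STRING_ENCODER)
        .withColumnRenamed(COL_VALUE, COL_TEXT);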
The second one is called when extracting the prediction results from the
dataset:
Dataset<Row> datasetWithPredictions = predictor.predict(text);
Dataset<Row> tokensWithPredictions =
    datasetWithPredictions.select(TOKEN_RESULT, TOKEN_BEGIN, TOKEN_END,
        PREDICTION_RESULT);
Iterator<Row> rowIt = tokensWithPredictions.toLocalIterator();
while (rowIt.hasNext()) {
    Row row = rowIt.next();
    [...] // do stuff here to convert the row
}
Any ideas on how I might be able to further optimise this?
Cheers,
Martin
On 2022-02-18 07:42, mar...@wunderlich.com wrote:
Hello,
I am working on optimising the performance of a Java ML/NLP application
based on Spark / SparkNLP. For prediction, I am applying a trained model
to a Spark dataset that consists of a single column with a single row.
The dataset is created like this:
List<String> textList = Collections.singletonList(text);
Dataset<Row> data = sparkSession
    .createDataset(textList, Encoders.STRING())
    .withColumnRenamed(COL_VALUE, COL_TEXT);
The predictions are created like this:
PipelineModel fittedPipeline = pipeline.fit(data);
Dataset<Row> prediction = fittedPipeline.transform(data);
We noticed that the performance isn't quite as good as expected. After
profiling the application with VisualVM, I found that the bottleneck is
org.apache.spark.sql.Encoders.STRING() in the creation of the dataset,
which by itself accounts for about 75% of the total time of the
prediction method call.
So, is there a simpler and more efficient way of creating the required
dataset, consisting of one column and one String row?
Thanks a lot.
Cheers,
Martin