I have been able to partially fix this issue by creating a static final
field (i.e. a constant) for Encoders.STRING(), sketched below. This removes
the bottleneck caused by instantiating the Encoder on every call. However,
it only shifts the cost to these two methods:
org.apache.spark.sql.SparkSession.createDataset (in the code below)
org.apache.spark.sql.Dataset.toLocalIterator()
(ca. 40% of execution time each)
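For reference, the change amounts to roughly the following (a minimal
sketch; the constant name and the surrounding class are placeholders I
chose, not our actual code):

    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;

    // Reuse a single String encoder instead of calling Encoders.STRING()
    // for every prediction request.
    private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

    // ... later, when building the single-row dataset:
    Dataset<Row> data = sparkSession
        .createDataset(textList, STRING_ENCODER)
        .withColumnRenamed(COL_VALUE, COL_TEXT);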
The second one is called when extracting the prediction results from the
dataset:
Dataset<Row> datasetWithPredictions = predictor.predict(text);
Dataset<Row> tokensWithPredictions =
    datasetWithPredictions.select(TOKEN_RESULT, TOKEN_BEGIN, TOKEN_END,
        PREDICTION_RESULT);
Iterator<Row> rowIt = tokensWithPredictions.toLocalIterator();
while (rowIt.hasNext()) {
    Row row = rowIt.next();
    [...] // do stuff here to convert the row
}
Any ideas on how I might be able to further optimise this?
Cheers,
Martin
On 2022-02-18 07:42, mar...@wunderlich.com wrote:
Hello,
I am working on optimising the performance of a Java ML/NLP application
based on Spark / SparkNLP. For prediction, I am applying a trained model
to a Spark dataset that consists of a single column with a single row.
The dataset is created like this:
List<String> textList = Collections.singletonList(text);
Dataset<Row> data = sparkSession
    .createDataset(textList, Encoders.STRING())
    .withColumnRenamed(COL_VALUE, COL_TEXT);
The predictions are created like this:
PipelineModel fittedPipeline = pipeline.fit(data);
Dataset<Row> prediction = fittedPipeline.transform(data);
We noticed that the performance isn't quite as good as expected. After
profiling the application with VisualVM, I found that the bottleneck is
org.apache.spark.sql.Encoders.STRING() in the creation of the dataset,
which by itself accounts for about 75% of the total time of the
prediction method call.
So, is there a simpler and more efficient way of creating the required
dataset, consisting of one column and one String row?
Thanks a lot.
Cheers,
Martin