Addendum: I have tried to replace toLocalIterator() with a foreach()
call on the dataset directly, but this hasn't improved the performance.
If the iteration itself is the bottleneck, there probably isn't much
that can be done to improve things further, other than perhaps batching
the prediction calls instead of running them line by line on the input
file.
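For the record, the batched variant I have in mind would look roughly
like this (untested sketch; inputFile stands in for the path to the
input file, and sparkSession, fittedPipeline, COL_VALUE and COL_TEXT are
the same objects and constants as in the code further down the thread):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Read the whole input file up front...
List<String> lines = Files.readAllLines(Paths.get(inputFile));
// ...build one multi-row dataset from it...
Dataset<Row> batch = sparkSession
    .createDataset(lines, Encoders.STRING())
    .withColumnRenamed(COL_VALUE, COL_TEXT);
// ...and run a single transform() over all lines instead of one per line.
Dataset<Row> predictions = fittedPipeline.transform(batch);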
Cheers,
Martin
On 2022-02-18 09:41, mar...@wunderlich.com wrote:
I have been able to partially fix this issue by creating a static final
field (i.e. a constant) for Encoders.STRING(). This removes the
bottleneck associated with re-instantiating the Encoder on every call.
However, it merely moved the performance issue to these two methods:
org.apache.spark.sql.SparkSession.createDataset (in the code below)
org.apache.spark.sql.Dataset.toLocalIterator()
(ca. 40% of the execution time each)
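For reference, the constant is just this (the field name STRING_ENCODER
is simply what I picked):

private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

It is then passed to createDataset() in place of the inline
Encoders.STRING() call.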
The second one is called when extracting the prediction results from
the dataset:
Dataset<Row> datasetWithPredictions = predictor.predict(text);
Dataset<Row> tokensWithPredictions =
    datasetWithPredictions.select(TOKEN_RESULT, TOKEN_BEGIN,
        TOKEN_END, PREDICTION_RESULT);
Iterator<Row> rowIt = tokensWithPredictions.toLocalIterator();
while (rowIt.hasNext()) {
    Row row = rowIt.next();
    [...] // do stuff here to convert the row
}
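One variant I have not benchmarked yet (sketch only; it assumes the
per-call result always fits comfortably in driver memory, which it
should for a single input row): replacing the iterator with
collectAsList(). As far as I understand, toLocalIterator() fetches the
partitions one at a time and can trigger a separate Spark job per
partition, whereas collectAsList() brings everything back in one job:

List<Row> rows = tokensWithPredictions.collectAsList();
for (Row row : rows) {
    // same per-row conversion as in the loop above
}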
Any ideas on how I might further optimize this?
Cheers,
Martin
On 2022-02-18 07:42, mar...@wunderlich.com wrote:
Hello,
I am working on optimising the performance of a Java ML/NLP
application based on Spark / SparkNLP. For prediction, I am applying a
trained model to a Spark dataset that consists of a single column and
a single row. The dataset is created like this:
List<String> textList = Collections.singletonList(text);
Dataset<Row> data = sparkSession
    .createDataset(textList, Encoders.STRING())
    .withColumnRenamed(COL_VALUE, COL_TEXT);
The predictions are created like this:
PipelineModel fittedPipeline = pipeline.fit(data);
Dataset<Row> prediction = fittedPipeline.transform(data);
We noticed that the performance isn't quite as good as expected. After
profiling the application with VisualVM, I found that the bottleneck is
org.apache.spark.sql.Encoders.STRING() in the creation of the dataset:
by itself, it accounts for about 75% of the execution time of the whole
prediction method call.
So, is there a simpler and more efficient way of creating the required
dataset, consisting of one column and one String row?
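One idea I have considered, but not yet tested (sketch only; the schema
is built by hand here), would be to bypass the Encoder and the rename
entirely by creating the single-row DataFrame with an explicit schema:

import java.util.Collections;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType schema = new StructType().add(COL_TEXT, DataTypes.StringType);
Dataset<Row> data = sparkSession.createDataFrame(
    Collections.singletonList(RowFactory.create(text)), schema);

I don't know whether this is actually any cheaper, though.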
Thanks a lot.
Cheers,
Martin