Thanks a lot, Sean, for the comments. I realize I didn't provide enough
background information to properly diagnose this issue.
In the meantime, I have created some test cases for isolating the
problem and running some specific performance tests. The numbers are
quite revealing: running our Spark model individually on each String takes
about 8 s for the test data, whereas it takes 88 ms when run on the
entire data in a single Dataset. That is a factor of roughly 100, and it
gets even worse for larger datasets.
So, the root cause here is the way the Spark model is being called for
one string at a time by the self-built prediction pipeline (which is
also using other ML techniques apart from Spark). That part needs some
refactoring.
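A minimal sketch of the refactoring I have in mind, in plain Java so that it runs without a Spark session. The names BatchPredictionSketch, runModel, predictOneByOne and predictBatched are made up for illustration, and runModel is only a placeholder: in the real pipeline it would be the step that builds one Dataset from all texts and calls transform on the fitted pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Sketch: call the model once per batch instead of once per string. */
final class BatchPredictionSketch {

    /** Counts model invocations so the two call patterns can be compared. */
    static int modelCalls = 0;

    /**
     * Hypothetical stand-in for the Spark step. Assumption: in the real
     * application this would create a single Dataset from all texts and
     * run the fitted pipeline on it once.
     */
    static List<String> runModel(List<String> texts) {
        modelCalls++;
        List<String> predictions = new ArrayList<>();
        for (String text : texts) {
            predictions.add(text.toUpperCase(Locale.ROOT)); // placeholder "prediction"
        }
        return predictions;
    }

    /** Slow pattern: one model call per input string. */
    static List<String> predictOneByOne(List<String> texts) {
        List<String> predictions = new ArrayList<>();
        for (String text : texts) {
            predictions.addAll(runModel(Collections.singletonList(text)));
        }
        return predictions;
    }

    /** Batched pattern: a single model call for the whole input. */
    static List<String> predictBatched(List<String> texts) {
        return runModel(texts);
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("first text", "second text", "third text");

        modelCalls = 0;
        predictOneByOne(texts);
        System.out.println("one by one: " + modelCalls + " model calls");

        modelCalls = 0;
        predictBatched(texts);
        System.out.println("batched:    " + modelCalls + " model calls");
    }
}
```

Both variants return the same predictions; only the number of model calls differs, and in our case each of those calls carries the fixed overhead of a separate Spark job.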
Thanks again for the help.
Cheers,
Martin
On 2022-02-18 13:41, Sean Owen wrote:
That doesn't make a lot of sense. Are you profiling the driver, rather
than executors where the work occurs?
Is your data set quite small such that small overheads look big?
Do you even need Spark if your data is not distributed - coming from
the driver anyway?
The fact that a static final field did anything suggests something is
amiss with your driver program. Are you perhaps inadvertently
serializing your containing class with a bunch of other data by using
its methods in a closure?
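To illustrate the closure point, here is a sketch in plain Java, not Spark code. It relies only on java.io serialization, which is also what Spark uses by default to ship closures to executors. The class ClosureCaptureSketch, its fields, and the helper SerFn are hypothetical names for illustration. A lambda that reads an instance field captures `this`, so the whole containing object gets serialized with it; copying the field to a local variable first avoids that.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

/** Sketch: a lambda that reads a field drags its containing object along. */
final class ClosureCaptureSketch implements Serializable {

    /** A serializable function type, similar in spirit to Spark's Java function interfaces. */
    interface SerFn extends Function<String, String>, Serializable {}

    /** Stands in for "a bunch of other data" held by the containing class. */
    private final byte[] unrelatedData = new byte[1_000_000];

    /** Set in the constructor so that it is a real field read, not an inlined constant. */
    private final String prefix;

    ClosureCaptureSketch(String prefix) {
        this.prefix = prefix;
    }

    /** Reads the field inside the lambda, so the lambda captures `this`. */
    SerFn capturingThis() {
        return s -> prefix + s;
    }

    /** Copies the field to a local first, so the lambda captures only the String. */
    SerFn capturingLocal() {
        final String localPrefix = prefix;
        return s -> localPrefix + s;
    }

    /** Returns the number of bytes Java serialization needs for the given object. */
    static int serializedSize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        ClosureCaptureSketch sketch = new ClosureCaptureSketch("p:");
        System.out.println("captures this:  " + serializedSize(sketch.capturingThis()) + " bytes");
        System.out.println("captures local: " + serializedSize(sketch.capturingLocal()) + " bytes");
    }
}
```

The first lambda serializes to more than a megabyte because the unrelated array travels with it, while the second stays small. If the containing class were not Serializable at all, the first variant would instead fail with a NotSerializableException.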
If your data is small, it's not surprising that the overhead could be in
just copying the data around (the two methods you cite) rather than in
the compute.
Too many things here to really say what's going on.
On Fri, Feb 18, 2022 at 12:42 AM <mar...@wunderlich.com> wrote:
Hello,
I am working on optimising the performance of a Java ML/NLP
application based on Spark / SparkNLP. For prediction, I am applying a
trained model on a Spark dataset which consists of one column with
only one row. The dataset is created like this:
    List<String> textList = Collections.singletonList(text);
    Dataset<Row> data = sparkSession
        .createDataset(textList, Encoders.STRING())
        .withColumnRenamed(COL_VALUE, COL_TEXT);
The predictions are created like this:
    PipelineModel fittedPipeline = pipeline.fit(dataset);
    Dataset<Row> prediction = fittedPipeline.transform(dataset);
We noticed that the performance isn't quite as good as expected. After
profiling the application with VisualVM, I noticed that the problem is
with org.apache.spark.sql.Encoders.STRING() in the creation of the
dataset, which by itself takes up about 75% of the time for the whole
prediction method call.
So, is there a simpler and more efficient way of creating the required
dataset, consisting of one column and one String row?
Thanks a lot.
Cheers,
Martin