Oh, yes, of course. If you run an entire distributed Spark job for one row,
over and over, that's much slower. It would make much more sense to run the
whole data set at once; parallelism is the point here.
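For example, a rough sketch of the batched approach (names like spark, model,
and texts are assumed here, not taken from the thread):

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// One distributed job over the whole data set, instead of one job per row.
Dataset<Row> allTexts = spark
        .createDataset(texts, Encoders.STRING())  // texts: List<String>
        .withColumnRenamed("value", "text");      // SparkNLP pipelines usually read a "text" column
Dataset<Row> predictions = model.transform(allTexts);  // single pass, fully parallel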
On Mon, Feb 21, 2022 at 2:36 AM wrote:
Thanks a lot, Sean, for the comments. I realize I didn't provide enough
background information to properly diagnose this issue.
In the meantime, I have created some test cases for isolating the
problem and running some specific performance tests. The numbers are
quite revealing: Running
That doesn't make a lot of sense. Are you profiling the driver, rather than
the executors, where the work occurs?
Is your data set quite small, such that small overheads look big?
Do you even need Spark if your data is not distributed and comes from the
driver anyway?
The fact that a static final field
Addendum: I have tried to replace toLocalIterator() with a foreach() call on
the dataset directly, but this hasn't improved the performance.
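For reference, the two variants being compared would look roughly like this
(predictions and process are illustrative names, not from the thread):

import java.util.Iterator;
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Row;

// Variant 1: stream the result rows back to the driver, partition by partition
Iterator<Row> it = predictions.toLocalIterator();
while (it.hasNext()) {
    process(it.next());
}

// Variant 2: run the consumer on the executors (the function must be serializable)
predictions.foreach((ForeachFunction<Row>) row -> process(row));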
If the foreach() call is the issue, there probably isn't much that can be
done to further improve things, other than perhaps trying to batch the
predictions.
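A batched variant could be sketched like this (purely illustrative; the
buffering logic and BATCH_SIZE are assumptions, not from the thread):

import java.util.ArrayList;
import java.util.List;

// Hypothetical micro-batching: collect incoming texts, run one job per batch
List<String> buffer = new ArrayList<>();
// buffer.add(...) as prediction requests arrive
if (buffer.size() >= BATCH_SIZE) {
    Dataset<Row> batch = spark
            .createDataset(buffer, Encoders.STRING())
            .withColumnRenamed("value", "text");
    List<Row> results = model.transform(batch).collectAsList();
    buffer.clear();
}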
I have been able to partially fix this issue by creating a static final
field (i.e. a constant) for Encoders.STRING(), as sketched below. This
removes the bottleneck associated with instantiating the Encoder. However,
this only moved the performance issue to these two methods:
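The constant described above would look something like this (a sketch of the
partial fix, not the exact code from the application):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Created once per JVM instead of once per prediction call
private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

// later, in the prediction path:
Dataset<String> data = spark.createDataset(textList, STRING_ENCODER);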
Hello,
I am working on optimising the performance of a Java ML/NLP application
based on Spark / SparkNLP. For prediction, I am applying a trained model
to a Spark dataset that consists of a single column with only one row. The
dataset is created like this:
List<String> textList =
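Presumably something along these lines (a sketch; the variable text holding
the single input string and the rename to a "text" column are assumptions):

List<String> textList = Collections.singletonList(text);
Dataset<Row> data = spark
        .createDataset(textList, Encoders.STRING())
        .withColumnRenamed("value", "text");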