Re: Encoders.STRING() causing performance problems in Java application

2022-02-21 Thread Sean Owen
Oh, yes of course. If you run an entire distributed Spark job for one row, over and over, that's much slower. It would make much more sense to run the whole data set at once - the point is parallelism here.
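Sean's suggestion above can be sketched as follows. This is a minimal sketch only, assuming Spark 3.x on the classpath; the `text` column name and the commented-out `model` (a fitted SparkNLP `PipelineModel`) are assumptions, not details taken from the thread:

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BatchPrediction {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("batch-prediction")
                .master("local[*]")
                .getOrCreate();

        // Instead of submitting one Spark job per input string
        // (a one-row dataset inside a loop), submit the whole
        // collection at once: the per-job scheduling overhead is
        // paid once and the rows are processed in parallel.
        List<String> inputs = List.of("first text", "second text", "third text");
        Dataset<Row> all = spark.createDataset(inputs, Encoders.STRING()).toDF("text");

        // Hypothetical: `model` would be the fitted PipelineModel here.
        // Dataset<Row> predictions = model.transform(all);
        // predictions.collectAsList();

        spark.stop();
    }
}
```

This sketch is not runnable without the spark-sql (and, for the model, spark-mllib/SparkNLP) dependencies.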

Re: Encoders.STRING() causing performance problems in Java application

2022-02-21 Thread martin
Thanks a lot, Sean, for the comments. I realize I didn't provide enough background information to properly diagnose this issue. In the meantime, I have created some test cases for isolating the problem and running some specific performance tests. The numbers are quite revealing: Running

Re: Encoders.STRING() causing performance problems in Java application

2022-02-18 Thread Sean Owen
That doesn't make a lot of sense. Are you profiling the driver, rather than executors where the work occurs? Is your data set quite small such that small overheads look big? Do you even need Spark if your data is not distributed - coming from the driver anyway? The fact that a static final field

Re: Encoders.STRING() causing performance problems in Java application

2022-02-18 Thread martin
Addendum: I have tried replacing toLocalIterator() with a forEach() call on the dataset directly, but this hasn't improved the performance. If the forEach call is the issue, there probably isn't much that can be done to further improve things, other than perhaps trying to batch the prediction
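One standard way to batch work inside the distributed pass is `Dataset.mapPartitions`, which hands the function an iterator over a whole partition rather than one row at a time. A sketch under assumptions: `predictBatch` is a hypothetical stand-in for whatever per-batch prediction call the application exposes, and is not from the thread:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

public class PartitionBatching {
    // Batch per partition instead of per row: collect the partition's
    // rows, run one prediction call over the batch, return the results.
    static Dataset<String> predictAll(Dataset<String> texts) {
        return texts.mapPartitions(
                (MapPartitionsFunction<String, String>) rows -> {
                    List<String> batch = new ArrayList<>();
                    rows.forEachRemaining(batch::add);
                    return predictBatch(batch).iterator();
                },
                Encoders.STRING());
    }

    // Hypothetical stand-in for the real model call.
    static List<String> predictBatch(List<String> batch) {
        return batch;
    }
}
```

This requires the spark-sql dependency to compile; it is an illustration of the batching idea, not the application's actual code.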

Re: Encoders.STRING() causing performance problems in Java application

2022-02-18 Thread martin
I have been able to partially fix this issue by creating a static final field (i.e. a constant) for Encoders.STRING(). This removes the bottleneck associated with instantiating this Encoder. However, this moved the performance issue only to these two methods:
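The fix described above is the standard initialize-once pattern; in Spark terms it is roughly `private static final Encoder<String> STRING_ENCODER = Encoders.STRING();`. A minimal plain-Java sketch of the pattern (all class and field names here are illustrative, not from the original application), with a counter to show the expensive constructor runs only once:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for an expensive-to-construct object such as the
// encoder returned by Encoders.STRING().
class ExpensiveEncoder {
    static final AtomicInteger constructions = new AtomicInteger();
    ExpensiveEncoder() { constructions.incrementAndGet(); }
}

public class EncoderCache {
    // Constructed once at class load, analogous to
    // `private static final Encoder<String> STRING = Encoders.STRING();`
    private static final ExpensiveEncoder CACHED = new ExpensiveEncoder();

    static ExpensiveEncoder get() { return CACHED; }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) get();
        System.out.println(ExpensiveEncoder.constructions.get()); // prints 1
    }
}
```

With the field cached, a thousand lookups still trigger exactly one construction, which is why the instantiation bottleneck disappears.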

Encoders.STRING() causing performance problems in Java application

2022-02-17 Thread martin
Hello, I am working on optimising the performance of a Java ML/NLP application based on Spark / SparkNLP. For prediction, I am applying a trained model on a Spark dataset which consists of one column with only one row. The dataset is created like this: List textList =
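The message is cut off at the dataset construction. A hypothetical reconstruction of a one-row, one-column string dataset of the kind described (the `text` column name and method names are assumptions):

```java
import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SingleRowDataset {
    // One string column, one row - the shape described in the post.
    // Note that calling Encoders.STRING() here on every prediction is
    // exactly the per-call cost the thread goes on to discuss.
    static Dataset<Row> singleRow(SparkSession spark, String inputText) {
        List<String> textList = Collections.singletonList(inputText);
        return spark.createDataset(textList, Encoders.STRING()).toDF("text");
    }
}
```

This sketch requires the spark-sql dependency and is offered only to make the described setup concrete.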