Re: Encoders.STRING() causing performance problems in Java application

martin Mon, 21 Feb 2022 00:36:55 -0800

Thanks a lot, Sean, for the comments. I realize I didn't provide enough background information to properly diagnose this issue.

In the meantime, I have created some test cases to isolate the problem and run some specific performance tests. The numbers are quite revealing: running our Spark model individually on Strings takes about 8 s for the test data, whereas it takes 88 ms when run on the entire data in a single Dataset. That is a factor of roughly 100, and it gets even worse for larger datasets.

So, the root cause here is the way the Spark model is being called for one string at a time by the self-built prediction pipeline (which is also using other ML techniques apart from Spark). Needs some refactoring...
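
A minimal sketch of that refactoring, under stated assumptions: the class BatchingPredictor and its methods are hypothetical names, and the model is abstracted as a plain function from a list of texts to a list of predictions. In the real pipeline that function would presumably wrap something like sparkSession.createDataset(texts, Encoders.STRING()) followed by a single fittedPipeline.transform(...) call; it is kept Spark-free here so the calling pattern can be seen in isolation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical helper contrasting per-string calls with one batched call. */
class BatchingPredictor {
    // Stand-in (assumption) for the expensive model invocation, e.g. building
    // one Dataset from the whole list and transforming it once.
    private final Function<List<String>, List<String>> batchModel;
    private int modelCalls = 0;

    BatchingPredictor(Function<List<String>, List<String>> batchModel) {
        this.batchModel = batchModel;
    }

    /** Pattern described in the thread: one model invocation per string. */
    List<String> predictOneByOne(List<String> texts) {
        List<String> out = new ArrayList<>();
        for (String text : texts) {
            modelCalls++;
            out.addAll(batchModel.apply(Collections.singletonList(text)));
        }
        return out;
    }

    /** Batched pattern: a single model invocation for all strings. */
    List<String> predictBatch(List<String> texts) {
        modelCalls++;
        return batchModel.apply(texts);
    }

    int getModelCalls() {
        return modelCalls;
    }

    public static void main(String[] args) {
        // Toy model (assumption): upper-cases each input text.
        Function<List<String>, List<String>> toyModel = texts ->
                texts.stream().map(String::toUpperCase).collect(Collectors.toList());
        List<String> texts = Arrays.asList("first", "second", "third");

        BatchingPredictor perString = new BatchingPredictor(toyModel);
        perString.predictOneByOne(texts);
        BatchingPredictor batched = new BatchingPredictor(toyModel);
        batched.predictBatch(texts);

        System.out.println("per-string model calls: " + perString.getModelCalls());
        System.out.println("batched model calls: " + batched.getModelCalls());
    }
}
```

Both methods return the same predictions; the difference is how often the fixed per-call overhead is paid, which is where the measurements above suggest the time goes.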


Thanks again for the help.

Cheers,

Martin

On 2022-02-18 13:41, Sean Owen wrote:

That doesn't make a lot of sense. Are you profiling the driver, rather than executors where the work occurs?
Is your data set quite small such that small overheads look big?
Do you even need Spark if your data is not distributed - coming from the driver anyway?
The fact that a static final field did anything suggests something is amiss with your driver program. Are you perhaps inadvertently serializing your containing class with a bunch of other data by using its methods in a closure?
If your data is small it's not surprising that the overhead could be in just copying the data around, the two methods you cite, rather than the compute.
Too many things here to really say what's going on.

On Fri, Feb 18, 2022 at 12:42 AM <mar...@wunderlich.com> wrote:
Hello,
I am working on optimising the performance of a Java ML/NLP application based on Spark / SparkNLP. For prediction, I am applying a trained model on a Spark dataset which consists of one column with only one row. The dataset is created like this:
List<String> textList = Collections.singletonList(text);
Dataset<Row> data = sparkSession
    .createDataset(textList, Encoders.STRING())
    .withColumnRenamed(COL_VALUE, COL_TEXT);

The predictions are created like this:

PipelineModel fittedPipeline = pipeline.fit(dataset);

Dataset<Row> prediction = fittedPipeline.transform(dataset);
We noticed that the performance isn't quite as good as expected. After profiling the application with VisualVM, I noticed that the problem is with org.apache.spark.sql.Encoders.STRING() in the creation of the dataset, which by itself takes up about 75% of the time for the whole prediction method call.
So, is there a simpler and more efficient way of creating the required dataset, consisting of one column and one String row?
Thanks a lot.

Cheers,

Martin
