That doesn't make a lot of sense. Are you profiling the driver, rather than
the executors, where the work actually occurs?
Is your data set small enough that fixed overheads dominate the measurement?
Do you even need Spark if your data is not distributed and comes from the
driver anyway?

The fact that a static final field changed anything suggests something is
amiss with your driver program. Are you perhaps inadvertently serializing
your containing class, along with a bunch of other data, by using its
instance methods in a closure?
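For example, a pattern like this would do it (a hypothetical sketch; the
names Predictor, bigState, and normalize are illustrative, not taken from
your code):

    import java.io.Serializable;

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    public class Predictor implements Serializable {
        // Large driver-side state that has nothing to do with the map itself.
        private final byte[] bigState = new byte[100 * 1024 * 1024];

        public Dataset<String> normalizeAll(Dataset<String> ds) {
            // The lambda calls the instance method normalize(), so it
            // captures `this`: the entire Predictor, bigState included,
            // gets serialized and shipped to the executors with the closure.
            return ds.map((MapFunction<String, String>) s -> normalize(s),
                    Encoders.STRING());
        }

        private String normalize(String s) {
            return s.trim().toLowerCase();
        }
    }

Making normalize static, or copying only what the lambda needs into a local
variable before the map, keeps the closure small.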
If your data is small, it's not surprising that the overhead lies in simply
copying the data around (the two methods you cite) rather than in the
compute.
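That said, if the profile really is dominated by Encoders.STRING(), one
cheap experiment is to construct the encoder once and reuse it across
prediction calls, so any one-time setup cost is paid only once. An untested
sketch, assuming COL_VALUE is "value" (the default column name from
createDataset) and COL_TEXT is "text":

    import java.util.Collections;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class SingleRowDatasets {
        // Built once and reused; avoids repeating whatever setup work
        // Encoders.STRING() does on every prediction call.
        private static final Encoder<String> STRING_ENCODER = Encoders.STRING();

        public static Dataset<Row> ofText(SparkSession spark, String text) {
            // createDataset names the single column "value" by default.
            return spark
                    .createDataset(Collections.singletonList(text), STRING_ENCODER)
                    .withColumnRenamed("value", "text");
        }
    }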
Too many things here to really say what's going on.


On Fri, Feb 18, 2022 at 12:42 AM <mar...@wunderlich.com> wrote:

> Hello,
>
> I am working on optimising the performance of a Java ML/NLP application
> based on Spark / SparkNLP. For prediction, I am applying a trained model to
> a Spark dataset that consists of a single column with only one row. The
> dataset is created like this:
>
>     List<String> textList = Collections.singletonList(text);
>     Dataset<Row> data = sparkSession
>         .createDataset(textList, Encoders.STRING())
>         .withColumnRenamed(COL_VALUE, COL_TEXT);
>
>
> The predictions are created like this:
>
>     PipelineModel fittedPipeline = pipeline.fit(dataset);
>
>     Dataset<Row> prediction = fittedPipeline.transform(dataset);
>
>
> We noticed that the performance isn't quite as good as expected. After
> profiling the application with VisualVM, I found that the bottleneck is
> org.apache.spark.sql.Encoders.STRING() in the creation of the dataset; it
> alone accounts for about 75% of the time of the whole prediction method
> call.
>
> So, is there a simpler and more efficient way of creating the required
> dataset, consisting of one column and one String row?
>
> Thanks a lot.
>
> Cheers,
>
> Martin
>
