Hi, we're working on Spark NLP, which includes multiple ML Estimators and Transformers.
We're seeing a significant performance hit on the Python side, because columns are recalculated recursively (and then some) on each stage.transform() call. I haven't been able to trace the root of the problem, since serialization seems to happen on the JVM side, through the _jvm wrappers in PySpark ML.

Printing a log line each time a stage actually executes, and loading the same *PipelineModel* in both Scala and Python, I get the following log in Scala:

```
scala> val result = pipeline.transform(data).cache()
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
annotating NER_CONVERTER_f12a17e51b45
annotating DEEP SENTENCE DETECTOR_4a08f41f1d47
annotating LEMMATIZER_eff31d5f9d97
annotating STEMMER_552360206a2d
annotating POS_2b9b0142f847
annotating SPELL_7c55d8e48423
result: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [text: string, document: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>,sentence_embeddings:array<float>>> ... 9 more fields]
```

In Python, however, cache() does not only execute the cache() operation: it also retraces multiple calls through the columns, as if it had a shorter memory:

```
result_df.show()
[Stage 37:===================> (1 + 2) / 3]Really annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
[Stage 37:======================================> (2 + 1) / 3]Really annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
annotating NER_CONVERTER_f12a17e51b45
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
annotating NER_CONVERTER_f12a17e51b45
annotating DEEP SENTENCE DETECTOR_4a08f41f1d47
annotating REGEX_TOKENIZER_b39e97328de5
annotating LEMMATIZER_eff31d5f9d97
annotating REGEX_TOKENIZER_b39e97328de5
annotating STEMMER_552360206a2d
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating REGEX_TOKENIZER_b39e97328de5
annotating WORD_EMBEDDINGS_MODEL_82c6ed12d8f5
annotating NerDLModel_7b95d7750b3b
annotating NER_CONVERTER_f12a17e51b45
annotating DEEP SENTENCE DETECTOR_4a08f41f1d47
annotating REGEX_TOKENIZER_b39e97328de5
annotating POS_2b9b0142f847
annotating REGEX_TOKENIZER_b39e97328de5
annotating SPELL_7c55d8e48423
```

If you have any insights into how we can trace the problem, we would greatly appreciate them!
I have tried various approaches, such as transforming step by step or calling cache() on the input, but none of them seems to have an impact.

Best,
Saif