Hello, I'm having a performance problem writing to MongoDB with the Mongo Spark connector.

I want to write 0.3M (300,000) records, each consisting of a string and a hex-encoded binary of about 4 kB, through the Mongo Spark connector. The Spark session is created like this:
spark = create_c3s_spark_session(app_name, spark_config=[
        ("spark.executor.cores", "1"),
        ("spark.executor.memory", "6g"),
        ("spark.executor.instances", "50"),
        ("spark.archives", f'{GIT_SOURCE_BASE}/{pyenv_path}.tar.gz#environment'),
        ("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"),
        ("spark.mongodb.output.uri", default_mongodb_uri),
    ],
    c3s_username=c3s_username)
However, the write takes about 30 hours with executor instances set to 1, and it still takes about 30 hours with 50 instances. The write itself is:
writer = (list_df.write.format("mongo")
          .mode("append")
          .option("database", mongo_config.get('database'))
          .option("collection", f'{collection_name}'))
writer.save()
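For what it's worth, the connector also seems to expose write options such as maxBatchSize and ordered; a variant of the write with those options set explicitly would look something like the sketch below, but the values are only guesses I have not validated:

writer = (list_df.write.format("mongo")
          .mode("append")
          .option("database", mongo_config.get('database'))
          .option("collection", f'{collection_name}')
          .option("maxBatchSize", "512")    # connector's documented default, as far as I know
          .option("ordered", "false"))      # allow unordered bulk inserts
writer.save()

Is tuning these options the right direction for large documents?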
What I don't understand is that the total data size is only about 1.2 GB, yet the job takes the same time even when the number of executor instances is increased.
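Should I be repartitioning the DataFrame before the write so that the save is actually spread across the executors? A minimal sketch of what I mean (the partition count of 200 is just a placeholder):

print(list_df.rdd.getNumPartitions())   # number of tasks the save() will run
list_df = list_df.repartition(200)      # placeholder count; spread rows across the 50 executors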

The strange thing is that if I test with a 400 B hex binary instead of 4 kB, the job completes within an hour, and increasing the number of instances clearly reduces the time required.
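For reference, the test documents are roughly of this shape; the field names and the way the payload is generated here are simplified placeholders, not my real pipeline:

import os
from pyspark.sql import Row

# 2048 random bytes hex-encoded -> a 4096-character string, i.e. roughly 4 kB per document
sample_rows = [Row(key=f"id-{i}", payload=os.urandom(2048).hex()) for i in range(1000)]
sample_df = spark.createDataFrame(sample_rows)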

What actions are needed to address this performance issue?
