Hi team,

We had Beam SDK 2.5 running on the AWS EMR 5.17 Spark distribution.

Essentially our Beam code just reads a bunch of files from GCS and
pushes them to Elasticsearch in AWS using Beam's ElasticsearchIO class (
https://beam.apache.org/releases/javadoc/2.0.0/index.html?org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.html).
So there is only a map step, no reduce/group-by/etc. in the Beam code.

Basically my code is doing:
PCollection<String> coll = ...; // read from GCS
coll.apply(ElasticsearchIO.write());
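
For context, a slightly fuller, self-contained sketch of that pipeline (the class and step names, bucket path, Elasticsearch hosts, and index/type below are placeholders, not our real code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class GcsToEsExport {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Target cluster: hosts, index and type here are placeholders.
    ElasticsearchIO.ConnectionConfiguration connConfig =
        ElasticsearchIO.ConnectionConfiguration.create(
            new String[] {"https://es.example.com:9200"}, "my-index", "_doc");

    // Read the exported files from GCS (one JSON document per line)...
    PCollection<String> docs =
        p.apply("ReadFromGCS", TextIO.read().from("gs://my-bucket/export/*"));

    // ...and bulk-index them into Elasticsearch. No GroupByKey anywhere.
    docs.apply("WriteToES",
        ElasticsearchIO.write().withConnectionConfiguration(connConfig));

    p.run().waitUntilFinish();
  }
}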

We submit the Spark job using 'spark-submit':
spark-submit --deploy-mode cluster \
  --conf spark.executor.extraJavaOptions=-DCLOUD_PLATFORM=AWS \
  --conf spark.driver.extraJavaOptions=-DCLOUD_PLATFORM=AWS \
  --conf spark.yarn.am.waitTime=300s \
  --conf spark.executor.extraClassPath=__app__.jar \
  --driver-memory 8G --num-executors 5 --executor-memory 20G --executor-cores 8 \
  --jars s3://snap-search-spark/cloud-dataflow-1.0.jar \
  --class com.snapchat.beam.common.pipeline.EMRSparkStartPipeline \
  s3://snap-search-spark/cloud-dataflow-1.0.jar \
  --job=fgi-export --isSolr=false --dateTime=2020-01-31T00:00:00 --isDev=true \
  --incrementalExport=false
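
For reference, the wiring inside our entry class boils down to creating the Beam pipeline against the Spark runner; the sketch below is the standard Beam way of doing that, not a copy of EMRSparkStartPipeline itself:

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkRunnerWiring {
  public static void main(String[] args) {
    // spark-submit starts the driver; Beam's SparkRunner then translates the
    // pipeline into ordinary Spark stages on the YARN cluster.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    Pipeline p = Pipeline.create(options);
    // ... apply the read/write transforms shown above, then:
    p.run().waitUntilFinish();
  }
}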

The dump to ES was finishing in at most 1 hour.

This week we upgraded the Beam SDK to 2.18, still running on the AWS EMR 5.17
Spark distribution. Now the export process has become really slow, taking
about 9 hours. The GCS input is ~50 GB (500 files of ~100 MB each).

I am new to the Spark UI and AWS EMR, but I still tried to find out why this
slowness is happening. A few observations:

1) Some executors died after getting SIGTERM. I then tried the fix described here:
https://dev.sobeslavsky.net/apache-spark-sigterm-mystery-with-dynamic-allocation/
but had no luck.

2) I will try upgrading to the AWS EMR 5.29 Spark distribution, but I will
have to test it.

Has anyone seen similar issues in the past? Any suggestions would be highly
appreciated.

Thanks
Vivek
