Can someone comment here?

On Fri, Feb 7, 2020, 11:52 PM vivek chaurasiya <[email protected]> wrote:
> Hi team,
>
> We had Beam SDK 2.5 running on the AWS EMR Spark distribution 5.17.
>
> Essentially, our Beam code was just reading a bunch of files from GCS and
> pushing them to Elasticsearch in AWS using Beam's ElasticsearchIO class (
> https://beam.apache.org/releases/javadoc/2.0.0/index.html?org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.html).
> So there is just a Map step; no reduce/groupBy/etc. in the Beam code.
>
> Basically, my code is doing:
>
> PCollection<String> coll = // read GCS
> coll.apply(ElasticsearchIO.write())
>
> We submit the Spark job using 'spark-submit':
>
> spark-submit --deploy-mode cluster \
>   --conf spark.executor.extraJavaOptions=-DCLOUD_PLATFORM=AWS \
>   --conf spark.driver.extraJavaOptions=-DCLOUD_PLATFORM=AWS \
>   --conf spark.yarn.am.waitTime=300s \
>   --conf spark.executor.extraClassPath=__app__.jar \
>   --driver-memory 8G --num-executors 5 --executor-memory 20G --executor-cores 8 \
>   --jars s3://snap-search-spark/cloud-dataflow-1.0.jar \
>   --class com.snapchat.beam.common.pipeline.EMRSparkStartPipeline \
>   s3://snap-search-spark/cloud-dataflow-1.0.jar \
>   --job=fgi-export --isSolr=false --dateTime=2020-01-31T00:00:00 \
>   --isDev=true --incrementalExport=false
>
> The dump to ES was finishing in at most 1 hour.
>
> This week we upgraded the Beam SDK to 2.18, still running on the AWS EMR
> Spark distribution 5.17. We observe that the export process has become
> really slow, taking about 9 hours. The GCS input is ~50 GB (500 files of
> 100 MB each).
>
> I am new to the Spark UI and AWS EMR, but I still tried to see why this
> slowness is happening. A few observations:
>
> 1) Some executors died after getting SIGTERM. I then tried this:
> https://dev.sobeslavsky.net/apache-spark-sigterm-mystery-with-dynamic-allocation/
> No luck.
>
> 2) I will try upgrading to AWS EMR Spark distribution 5.29, but will have
> to test it first.
>
> Has anyone seen similar issues in the past? Any suggestions would be
> highly appreciated.
>
> Thanks,
> Vivek
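To make the pipeline concrete for anyone trying to reproduce this, here is a
minimal, self-contained sketch of what the quoted snippet describes, assuming
Beam's TextIO for the GCS read; the bucket, Elasticsearch endpoint, and
index/type names are placeholders, not taken from Vivek's job:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class GcsToEsSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Read one JSON document per line from GCS. The gs:// filesystem needs the
    // Beam GCP filesystem extension on the classpath to be registered.
    PCollection<String> docs =
        p.apply("ReadGcs", TextIO.read().from("gs://my-bucket/export/part-*"));

    // Write each line to Elasticsearch. The batching knobs below are the ones
    // I would compare between 2.5 and 2.18, since a change in flush behavior
    // would show up exactly as a map-only job getting slower.
    docs.apply("WriteEs", ElasticsearchIO.write()
        .withConnectionConfiguration(
            ElasticsearchIO.ConnectionConfiguration.create(
                new String[] {"https://my-es-endpoint:9200"}, "my-index", "_doc"))
        .withMaxBatchSize(1000L)
        .withMaxBatchSizeBytes(5L * 1024 * 1024));

    p.run().waitUntilFinish();
  }
}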
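For context on observation (1): EMR enables Spark dynamic allocation by
default, and the approach in the linked article essentially amounts to pinning
the executor count instead, roughly:

spark-submit ... \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=5

Since Vivek reports that did not help, the SIGTERMs may have another cause
(e.g. YARN killing containers for exceeding memory limits), which the YARN
NodeManager logs would show.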
