Hi Gabe:

You can try configuring 'kylin.query.spark-conf.spark.yarn.stagingDir' in kylin.properties to make this setting take effect in Kylin. Sparder is launched with the Spark options given under the 'kylin.query.spark-conf.' prefix, which is likely why the value you set in spark-defaults.conf did not take effect.
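For example, a minimal sketch for kylin.properties. The staging root below is an assumption inferred from your working spark-submit log further down (Spark appends <user>/.sparkStaging/<application id> to it, which matches the hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/... paths you saw); substitute whatever HDFS directory you actually use:

    # Sparder (query) Spark context; staging root assumed from your spark-submit log
    kylin.query.spark-conf.spark.yarn.stagingDir=hdfs://xxxxx:8020/tmp/spark-staging
    # Optionally apply the same staging dir to the cube-build Spark jobs (same assumption)
    kylin.engine.spark-conf.spark.yarn.stagingDir=hdfs://xxxxx:8020/tmp/spark-staging

After editing kylin.properties, restart Kylin and rerun a query so Sparder is resubmitted with the new configuration. You can also sanity-check the value outside Kylin by flipping your own reproduction from file:// to HDFS, which should then succeed:

    spark-submit --master yarn --deploy-mode client --conf spark.yarn.stagingDir=hdfs://xxxxx:8020/tmp/spark-staging /home/hadoop/foo.py
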
> On Sep 13, 2021, at 9:56 PM, Michael, Gabe <[email protected]> wrote:
>
> Thank you for your reply.
>
> HADOOP_CONF_DIR is set correctly to /usr/local/kylin/hadoop_conf.
> fs.defaultFS in /usr/local/kylin/hadoop_conf/core-site.xml is set to hdfs://xxxxx:8020 (domain name omitted).
>
> I also tested submitting a simple Spark app from the command line with spark-submit, and it succeeds.
> According to the log messages, it uploads the files to HDFS when I submit directly with spark-submit:
>
> 21/09/13 13:49:19 INFO Client: Preparing resources for our AM container
> 21/09/13 13:49:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
> 21/09/13 13:49:23 INFO Client: Uploading resource file:/mnt/tmp/spark-7256648b-ffe0-4455-8a80-d56f1a7fd707/__spark_libs__3285017367714177339.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/__spark_libs__3285017367714177339.zip
> 21/09/13 13:49:25 INFO Client: Uploading resource file:/usr/local/kylin/spark/python/lib/pyspark.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/pyspark.zip
> 21/09/13 13:49:25 INFO Client: Uploading resource file:/usr/local/kylin/spark/python/lib/py4j-0.10.9-src.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/py4j-0.10.9-src.zip
> 21/09/13 13:49:25 INFO Client: Uploading resource file:/mnt/tmp/spark-7256648b-ffe0-4455-8a80-d56f1a7fd707/__spark_conf__6717448128964414860.zip -> hdfs://xxxxx/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/__spark_conf__.zip
>
> However, I can reproduce the same problem I encounter with Kylin by setting the spark.yarn.stagingDir configuration:
>
> spark-submit --master yarn --conf spark.yarn.stagingDir=file:///home/hadoop --deploy-mode client /home/hadoop/foo.py
>
> It will try to upload to a local destination "file:/home/hadoop/.sparkStaging/application_1631282030708_2945/…" and the application will fail.
>
> I am able to set spark.yarn.stagingDir to an HDFS location in /usr/local/kylin/spark/conf/spark-defaults.conf, and spark-submit succeeds.
>
> However, it seems Kylin ignores the value set for spark.yarn.stagingDir?
>
> If I am able to set spark.yarn.stagingDir correctly, I think it would work.
>
> Thank you for your assistance,
>
> Gabe
>
> From: Yaqian Zhang <[email protected]>
> Date: Sunday, September 12, 2021 at 22:45
> To: [email protected]
> Subject: Re: Kylin v4.0.0 GA on EMR 6.3.0 fail to start Sparder due to YARN staging files missing
>
> Hi:
> I noticed this in your kylin.log:
>
> "Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_libs__7584573487901234438.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip
> 2021-09-10 18:45:51,487 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/lib/kylin-parquet-job-4.0.0.jar -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/kylin-parquet-job-4.0.0.jar
> 2021-09-10 18:45:51,597 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/conf/spark-executor-log4j.properties -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/spark-executor-log4j.properties
> 2021-09-10 18:45:51,718 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_conf__5546014978595262008.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_conf__.zip"
>
> This does not look normal. When a Spark application is submitted, these libs need to be uploaded to HDFS or S3, but the paths here show they were uploaded to a local directory on the node running the driver, so the other nodes cannot find them.
>
> I'm not sure what caused these libs not to be uploaded to the correct path, but you can check whether the configuration 'HADOOP_CONF_DIR' is shown on the front page of Kylin, as in the following figure:
> <image001.png>
> If so, check whether 'fs.defaultFS' in core-site.xml under that path is configured to the correct directory.
>
> By the way, the configuration 'kylin.query.spark-conf.spark.executor.extraJavaOptions' in kylin.properties does not need to be modified manually; Kylin configures those variables automatically at runtime.
>
> On Sep 11, 2021, at 2:57 AM, Michael, Gabe <[email protected]> wrote:
>
> Hello,
>
> When running Kylin 4.0.0 on AWS EMR 6.3.0, I am able to successfully build a cube.
>
> But when I try to query it, the Sparder application cannot start.
>
> Kylin attempts to upload some files to a local directory, then the Spark job fails because it cannot read files from that directory.
>
> 2021-09-10 18:45:47,407 INFO [Thread-9] yarn.Client:57 : Preparing resources for our AM container
> 2021-09-10 18:45:47,428 WARN [Thread-9] yarn.Client:69 : Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
> 2021-09-10 18:45:50,861 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_libs__7584573487901234438.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip
> 2021-09-10 18:45:51,487 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/lib/kylin-parquet-job-4.0.0.jar -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/kylin-parquet-job-4.0.0.jar
> 2021-09-10 18:45:51,597 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/conf/spark-executor-log4j.properties -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/spark-executor-log4j.properties
> 2021-09-10 18:45:51,718 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_conf__5546014978595262008.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_conf__.zip
> 2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : Changing view acls to: hadoop
> 2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : Changing modify acls to: hadoop
> 2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : Changing view acls groups to:
> 2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : Changing modify acls groups to:
> 2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
> 2021-09-10 18:45:51,814 INFO [Thread-9] yarn.Client:57 : Submitting application application_1631282030708_2863 to ResourceManager
> 2021-09-10 18:45:51,861 INFO [Thread-9] impl.YarnClientImpl:329 : Submitted application application_1631282030708_2863
> 2021-09-10 18:45:52,863 INFO [Thread-9] yarn.Client:57 : Application report for application_1631282030708_2863 (state: FAILED)
> 2021-09-10 18:45:52,866 INFO [Thread-9] yarn.Client:57 :
>      client token: N/A
>      diagnostics: Application application_1631282030708_2863 failed 2 times due to AM Container for appattempt_1631282030708_2863_000002 exited with exitCode: -1000
> Failing this attempt.Diagnostics: [2021-09-10 18:45:52.033]File file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip does not exist
> java.io.FileNotFoundException: File file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:671)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:992)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:661)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:464)
>         at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
>         at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
>         at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
>         at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> For more detailed output, check the application tracking page: http://ip-10-240-102-189.bamtech.test.us-east-1.bamgrid.net:8088/cluster/app/application_1631282030708_2863
> Then click on links to logs of each attempt.
> . Failing the application.
>      ApplicationMaster host: N/A
>      ApplicationMaster RPC port: -1
>      queue: default
>      start time: 1631299551829
>      final status: FAILED
>      tracking URL: http://ip-10-240-102-189.bamtech.test.us-east-1.bamgrid.net:8088/cluster/app/application_1631282030708_2863
>      user: hadoop
> 2021-09-10 18:45:52,941 INFO [Thread-9] yarn.Client:57 : Deleted staging directory file:/home/hadoop/.sparkStaging/application_1631282030708_2863
> 2021-09-10 18:45:52,942 ERROR [Thread-9] cluster.YarnClientSchedulerBackend:73 : The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
> 2021-09-10 18:45:52,943 ERROR [Thread-9] spark.SparkContext:94 : Error initializing SparkContext.
>
> Here are my kylin.properties with irrelevant/sensitive values removed:
>
> kylin.env.hdfs-working-dir=s3a://XXXXX/qa/kylin/hdfs/
> kylin.env=QA
> kylin.server.mode=all
> kylin.server.cluster-servers=localhost:7070
> kylin.engine.default=6
> kylin.storage.default=4
> kylin.server.external-acl-provider=
> kylin.source.hive.database-for-flat-table=default
> kylin.web.default-time-filter=1
> kylin.storage.clean-after-delete-operation=false
> kylin.job.retry=1
> kylin.job.max-concurrent-jobs=1
> kylin.job.sampling-percentage=100
> kylin.job.scheduler.provider.100=org.apache.kylin.job.impl.curator.CuratorScheduler
> kylin.job.scheduler.default=2
> kylin.spark-conf.auto.prior=true
> kylin.engine.spark-conf.spark.master=yarn
> kylin.engine.spark-conf.spark.submit.deployMode=client
> kylin.engine.spark-conf.spark.yarn.queue=default
> kylin.engine.spark-conf.spark.eventLog.enabled=true
> kylin.engine.spark-conf.spark.eventLog.dir=hdfs:///kylin/spark-history
> kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs:///kylin/spark-history
> kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
> kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=${hdfs.working.dir} -Dkylin.metadata.identifier=kylin_metadata -Dkylin.spark.category=job -Dkylin.spark.project=${job.project} -Dkylin.spark.identifier=${job.id} -Dkylin.spark.jobName=${job.stepId} -Duser.timezone=${user.timezone}
> kylin.engine.spark-conf.spark.driver.extraJavaOptions=-XX:+CrashOnOutOfMemoryError
> kylin.query.auto-sparder-context-enabled-enabled=false
> kylin.query.spark-conf.spark.master=yarn
> kylin.query.spark-conf.spark.driver.cores=1
> kylin.query.spark-conf.spark.driver.memory=4G
> kylin.query.spark-conf.spark.driver.memoryOverhead=1G
> kylin.query.spark-conf.spark.executor.cores=1
> kylin.query.spark-conf.spark.executor.instances=1
> kylin.query.spark-conf.spark.executor.memory=4G
> kylin.query.spark-conf.spark.executor.memoryOverhead=1G
> kylin.query.spark-conf.spark.serializer=org.apache.spark.serializer.JavaSerializer
> kylin.query.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
> kylin.query.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=s3a://dataeng-data-test/qa/kylin/hdfs/ -Dkylin.metadata.identifier=kylin_metadata -Dkylin.spark.category=sparder -Dkylin.spark.identifier={{APP_ID}}
> kylin.source.hive.redistribute-flat-table=false
> kylin.metadata.jdbc.dialect=mysql
> kylin.metadata.jdbc.json-always-small-cell=true
> kylin.job.lock=org.apache.kylin.storage.hbase.util.ZookeeperDistributedJobLock
> kylin.web.set-config-enable=true
> kylin.job.allow-empty-segment=false
> kylin.env.hadoop-conf-dir=/etc/hadoop/conf
> kylin.query.lazy-query-enabled=true
> kylin.query.cache-signature-enabled=true
> kylin.query.segment-cache-enabled=false
> kylin.engine.spark-fact-distinct=true
> kylin.engine.spark-dimension-dictionary=false
> kylin.engine.spark-uhc-dictionary=true
> kylin.engine.spark.rdd-partition-cut-mb=10
> kylin.engine.spark.min-partition=1
> kylin.engine.spark.max-partition=5000
> kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
> kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
> kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
> kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
> kylin.engine.spark-conf.spark.driver.memory=2G
> kylin.engine.spark-conf.spark.executor.memory=4G
> kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
> kylin.engine.spark-conf.spark.executor.cores=1
> kylin.engine.spark-conf.spark.network.timeout=600
> kylin.engine.spark-conf.spark.shuffle.service.enabled=true
> kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
> kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
> kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
> kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
> kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
> kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
> kylin.engine.spark-conf.spark.sql.hive.metastore.version=3.1.2
> kylin.engine.spark-conf.spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*
> kylin.query.spark-conf.spark.sql.hive.metastore.version=3.1.2
> kylin.query.spark-conf.spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*
> kylin.server.cluster-name=kylin_metadata
> kylin.log.spark-executor-properties-file=/usr/local/kylin/conf/spark-executor-log4j.properties
> kylin.metadata.url.identifier=kylin_metadata
>
> Thank you for your assistance,
>
> Gabe
>
> --
> Gabe Michael
> Principal Data Engineer
> Disney Streaming Services
