Yaqian Zhang, thank you for the suggestion. I configured "kylin.query.spark-conf.spark.yarn.stagingDir=hdfs://my-cluster-hostname:8020/tmp/spark-staging" (after first creating the directory on HDFS with "hdfs dfs -mkdir -p /tmp/spark-staging"), and now the file uploads go to HDFS, the Sparder Spark job runs successfully, and I receive query results!
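For anyone who hits the same problem later, here is the working fix from this thread distilled into two steps ("my-cluster-hostname" stands in for the actual NameNode address):

# 1. Create the staging directory on HDFS first
hdfs dfs -mkdir -p /tmp/spark-staging

# 2. In kylin.properties, point the Sparder (query) engine's YARN staging dir at HDFS
kylin.query.spark-conf.spark.yarn.stagingDir=hdfs://my-cluster-hostname:8020/tmp/spark-staging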
From: Yaqian Zhang <[email protected]>
Date: Monday, September 13, 2021 at 22:43
To: [email protected] <[email protected]>
Subject: Re: Kylin v4.0.0 GA on EMR 6.3.0 fail to start Sparder due to YARN staging files missing

Hi Gabe:

You can try configuring 'kylin.query.spark-conf.spark.yarn.stagingDir' in kylin.properties so that this setting takes effect in Kylin.

On September 13, 2021, at 9:56 PM, Michael, Gabe <[email protected]> wrote:

Thank you for your reply.

HADOOP_CONF_DIR is set correctly to /usr/local/kylin/hadoop_conf, and fs.defaultFS in /usr/local/kylin/hadoop_conf/core-site.xml is set to hdfs://xxxxx:8020 (domain name omitted).

I also tested submitting a simple Spark app from the command line with spark-submit, and it succeeds. According to the log messages, it uploads the files to HDFS when I submit directly with spark-submit:

21/09/13 13:49:19 INFO Client: Preparing resources for our AM container
21/09/13 13:49:19 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/09/13 13:49:23 INFO Client: Uploading resource file:/mnt/tmp/spark-7256648b-ffe0-4455-8a80-d56f1a7fd707/__spark_libs__3285017367714177339.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/__spark_libs__3285017367714177339.zip
21/09/13 13:49:25 INFO Client: Uploading resource file:/usr/local/kylin/spark/python/lib/pyspark.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/pyspark.zip
21/09/13 13:49:25 INFO Client: Uploading resource file:/usr/local/kylin/spark/python/lib/py4j-0.10.9-src.zip -> hdfs://xxxxx:8020/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/py4j-0.10.9-src.zip
21/09/13 13:49:25 INFO Client: Uploading resource file:/mnt/tmp/spark-7256648b-ffe0-4455-8a80-d56f1a7fd707/__spark_conf__6717448128964414860.zip -> hdfs://xxxxx/tmp/spark-staging/hadoop/.sparkStaging/application_1631282030708_2987/__spark_conf__.zip

However, I can reproduce the same problem I encounter with Kylin by setting the spark.yarn.stagingDir configuration:

spark-submit --master yarn --conf spark.yarn.stagingDir=file:///home/hadoop --deploy-mode client /home/hadoop/foo.py

It will try to upload to a local destination "file:/home/hadoop/.sparkStaging/application_1631282030708_2945/…" and the application will fail. I am able to set spark.yarn.stagingDir to an HDFS location in /usr/local/kylin/spark/conf/spark-defaults.conf and spark-submit succeeds.
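For reference, the spark-defaults.conf entry described above would look something like this (the hostname, port, and staging path are placeholders for the real cluster values):

# /usr/local/kylin/spark/conf/spark-defaults.conf
spark.yarn.stagingDir    hdfs://xxxxx:8020/tmp/spark-staging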
However, it seems Kylin ignores the value of spark.yarn.stagingDir set in spark-defaults.conf? If I can set spark.yarn.stagingDir correctly, I think it would work.

Thank you for your assistance,
Gabe

From: Yaqian Zhang <[email protected]>
Date: Sunday, September 12, 2021 at 22:45
To: [email protected] <[email protected]>
Subject: Re: Kylin v4.0.0 GA on EMR 6.3.0 fail to start Sparder due to YARN staging files missing

Hi:

I noticed this in your kylin.log:

"Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_libs__7584573487901234438.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip
2021-09-10 18:45:51,487 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/lib/kylin-parquet-job-4.0.0.jar -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/kylin-parquet-job-4.0.0.jar
2021-09-10 18:45:51,597 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/conf/spark-executor-log4j.properties -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/spark-executor-log4j.properties
2021-09-10 18:45:51,718 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_conf__5546014978595262008.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_conf__.zip"

This does not look normal. When a Spark application is submitted, these libs need to be uploaded to HDFS or S3, but the paths here show they were uploaded to a local directory on the node running the driver, so the other nodes cannot find them. I'm not sure what caused these libs not to be uploaded to the correct path, but you can check whether the configuration 'HADOOP_CONF_DIR' appears on the front page of Kylin, as shown in the following figure:

<image001.png>

If so, check whether 'fs.defaultFS' in core-site.xml under that path is configured to the correct filesystem.

By the way, the configuration 'kylin.query.spark-conf.spark.executor.extraJavaOptions' in kylin.properties does not need to be manually modified by the user; Kylin fills in those variables automatically at runtime.
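To make that check concrete, here is a sketch of the fs.defaultFS entry a correctly pointed core-site.xml would contain, plus a way to print the value the client configuration actually resolves to (the hostname and port are placeholders):

<!-- $HADOOP_CONF_DIR/core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://xxxxx:8020</value>
</property>

# Print the default filesystem the Hadoop client config resolves to
hdfs getconf -confKey fs.defaultFS

If fs.defaultFS resolves to the local filesystem, or spark.yarn.stagingDir is forced to a file:// URI, the YARN staging files end up on the driver's local disk and the NodeManagers cannot localize them, which matches the failure quoted below.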
On September 11, 2021, at 2:57 AM, Michael, Gabe <[email protected]> wrote:

Hello,

When running Kylin 4.0.0 on AWS EMR 6.3.0, I am able to build a cube successfully, but when I try to query it, the Sparder application cannot start. Kylin uploads some files to a local directory, and then the Spark job fails because it cannot read the files from that directory:

2021-09-10 18:45:47,407 INFO [Thread-9] yarn.Client:57 : Preparing resources for our AM container
2021-09-10 18:45:47,428 WARN [Thread-9] yarn.Client:69 : Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2021-09-10 18:45:50,861 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_libs__7584573487901234438.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip
2021-09-10 18:45:51,487 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/lib/kylin-parquet-job-4.0.0.jar -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/kylin-parquet-job-4.0.0.jar
2021-09-10 18:45:51,597 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/conf/spark-executor-log4j.properties -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/spark-executor-log4j.properties
2021-09-10 18:45:51,718 INFO [Thread-9] yarn.Client:57 : Uploading resource file:/usr/local/kylin/tomcat/temp/spark-8ec4dae7-5f3c-477e-bda3-4c4f00978586/__spark_conf__5546014978595262008.zip -> file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_conf__.zip
2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : Changing view acls to: hadoop
2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : Changing modify acls to: hadoop
2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : Changing view acls groups to:
2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : Changing modify acls groups to:
2021-09-10 18:45:51,780 INFO [Thread-9] spark.SecurityManager:57 : SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
2021-09-10 18:45:51,814 INFO [Thread-9] yarn.Client:57 : Submitting application application_1631282030708_2863 to ResourceManager
2021-09-10 18:45:51,861 INFO [Thread-9] impl.YarnClientImpl:329 : Submitted application application_1631282030708_2863
2021-09-10 18:45:52,863 INFO [Thread-9] yarn.Client:57 : Application report for application_1631282030708_2863 (state: FAILED)
2021-09-10 18:45:52,866 INFO [Thread-9] yarn.Client:57 :
     client token: N/A
     diagnostics: Application application_1631282030708_2863 failed 2 times due to AM Container for appattempt_1631282030708_2863_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2021-09-10 18:45:52.033]File file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip does not exist
java.io.FileNotFoundException: File file:/home/hadoop/.sparkStaging/application_1631282030708_2863/__spark_libs__7584573487901234438.zip does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:671)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:992)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:464)
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
For more detailed output, check the application tracking page: http://ip-10-240-102-189.bamtech.test.us-east-1.bamgrid.net:8088/cluster/app/application_1631282030708_2863 Then click on links to logs of each attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1631299551829
     final status: FAILED
     tracking URL: http://ip-10-240-102-189.bamtech.test.us-east-1.bamgrid.net:8088/cluster/app/application_1631282030708_2863
     user: hadoop
2021-09-10 18:45:52,941 INFO [Thread-9] yarn.Client:57 : Deleted staging directory file:/home/hadoop/.sparkStaging/application_1631282030708_2863
2021-09-10 18:45:52,942 ERROR [Thread-9] cluster.YarnClientSchedulerBackend:73 : The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
2021-09-10 18:45:52,943 ERROR [Thread-9] spark.SparkContext:94 : Error initializing SparkContext.
Here are my kylin.properties with irrelevant/sensitive values removed:

kylin.env.hdfs-working-dir=s3a://XXXXX/qa/kylin/hdfs/
kylin.env=QA
kylin.server.mode=all
kylin.server.cluster-servers=localhost:7070
kylin.engine.default=6
kylin.storage.default=4
kylin.server.external-acl-provider=
kylin.source.hive.database-for-flat-table=default
kylin.web.default-time-filter=1
kylin.storage.clean-after-delete-operation=false
kylin.job.retry=1
kylin.job.max-concurrent-jobs=1
kylin.job.sampling-percentage=100
kylin.job.scheduler.provider.100=org.apache.kylin.job.impl.curator.CuratorScheduler
kylin.job.scheduler.default=2
kylin.spark-conf.auto.prior=true
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=client
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs:///kylin/spark-history
kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=${hdfs.working.dir} -Dkylin.metadata.identifier=kylin_metadata -Dkylin.spark.category=job -Dkylin.spark.project=${job.project} -Dkylin.spark.identifier=${job.id} -Dkylin.spark.jobName=${job.stepId} -Duser.timezone=${user.timezone}
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-XX:+CrashOnOutOfMemoryError
kylin.query.auto-sparder-context-enabled-enabled=false
kylin.query.spark-conf.spark.master=yarn
kylin.query.spark-conf.spark.driver.cores=1
kylin.query.spark-conf.spark.driver.memory=4G
kylin.query.spark-conf.spark.driver.memoryOverhead=1G
kylin.query.spark-conf.spark.executor.cores=1
kylin.query.spark-conf.spark.executor.instances=1
kylin.query.spark-conf.spark.executor.memory=4G
kylin.query.spark-conf.spark.executor.memoryOverhead=1G
kylin.query.spark-conf.spark.serializer=org.apache.spark.serializer.JavaSerializer
kylin.query.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
kylin.query.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current -Dlog4j.configuration=spark-executor-log4j.properties -Dlog4j.debug -Dkylin.hdfs.working.dir=s3a://dataeng-data-test/qa/kylin/hdfs/ -Dkylin.metadata.identifier=kylin_metadata -Dkylin.spark.category=sparder -Dkylin.spark.identifier={{APP_ID}}
kylin.source.hive.redistribute-flat-table=false
kylin.metadata.jdbc.dialect=mysql
kylin.metadata.jdbc.json-always-small-cell=true
kylin.job.lock=org.apache.kylin.storage.hbase.util.ZookeeperDistributedJobLock
kylin.web.set-config-enable=true
kylin.job.allow-empty-segment=false
kylin.env.hadoop-conf-dir=/etc/hadoop/conf
kylin.query.lazy-query-enabled=true
kylin.query.cache-signature-enabled=true
kylin.query.segment-cache-enabled=false
kylin.engine.spark-fact-distinct=true
kylin.engine.spark-dimension-dictionary=false
kylin.engine.spark-uhc-dictionary=true
kylin.engine.spark.rdd-partition-cut-mb=10
kylin.engine.spark.min-partition=1
kylin.engine.spark.max-partition=5000
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
kylin.engine.spark-conf.spark.sql.hive.metastore.version=3.1.2
kylin.engine.spark-conf.spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*
kylin.query.spark-conf.spark.sql.hive.metastore.version=3.1.2
kylin.query.spark-conf.spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*
kylin.server.cluster-name=kylin_metadata
kylin.log.spark-executor-properties-file=/usr/local/kylin/conf/spark-executor-log4j.properties
kylin.metadata.url.identifier=kylin_metadata

Thank you for your assistance,
Gabe

--
Gabe Michael
Principal Data Engineer
Disney Streaming Services
