I was finally able to get a successful build using the following settings.
There was a SlideShare presentation on some performance settings: https://www.slideshare.net/ShiShaoFeng1/spark-tunning-in-apache-kylin

Below is the #LOCAL TUNING section, which uses settings from that presentation. I -think- the most meaningful one for me is max-partition=500, which came from the presentation; after adding it, the previously failing step completed, and I'm re-running everything now.

The hardware is 3 nodes, dual CPU, 128GB RAM each (old Dell R710s), and the data is ~4B records, 5 measures, 6 dimensions with low cardinality.

------------------------------------------

## Spark conf (default is in spark/conf/spark-defaults.conf)
#kylin.engine.spark-conf.spark.master=yarn
#kylin.engine.spark-conf.spark.submit.deployMode=cluster
#kylin.engine.spark-conf.spark.yarn.queue=default
#kylin.engine.spark-conf.spark.driver.memory=2G
#kylin.engine.spark-conf.spark.executor.memory=4G
#kylin.engine.spark-conf.spark.executor.instances=40
#kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
#kylin.engine.spark-conf.spark.shuffle.service.enabled=true
#kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
#kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
kylin.engine.spark-conf.spark.driver.extraClassPath=/opt/spark/jars/snappy*.jar
kylin.engine.spark-conf.spark.driver.extraLibraryPath=/opt/hadoop/lib/native
kylin.engine.spark-conf.spark.executor.extraLibraryPath=/opt/hadoop/lib/native
#
#### Spark conf for specific job
#kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
#kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
#
## manually upload the spark-assembly jar to HDFS and then set this property to avoid repeatedly uploading the jar at runtime (see the upload sketch after this block)
##kylin.engine.spark-conf.spark.yarn.archive=hdfs://namenode:8020/kylin/spark/spark-libs.jar
##kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec

#LOCAL TUNING
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.dynamicAllocation.schedulerBacklogTimeout=1
kylin.engine.spark-conf.spark.max-partition=500
kylin.engine.spark-conf.spark.driver.memory=8G
kylin.engine.spark-conf.spark.executor.memory=8G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=snappy
kylin.engine.spark-conf.spark.local.dir=/opt/volume/disk1/tmp
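As referenced in the config comment above, this is roughly how the one-time spark-libs.jar upload works, per the cube_spark tutorial JiaTao links later in this thread. A sketch only, not something I've verified on my cluster; substitute your own namenode and paths:

  # package the jars shipped with Kylin's Spark build, stored uncompressed
  jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
  # upload once to HDFS
  hadoop fs -mkdir -p /kylin/spark/
  hadoop fs -put spark-libs.jar /kylin/spark/

Then point kylin.engine.spark-conf.spark.yarn.archive at hdfs://namenode:8020/kylin/spark/spark-libs.jar, as in the commented-out line above.
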
On Mon, Dec 17, 2018 at 8:23 PM Chao Long <[email protected]> wrote:

> Hi J,
>   There is a slide about Spark tuning in Apache Kylin (author shaofengshi):
> https://www.slideshare.net/ShiShaoFeng1/spark-tunning-in-apache-kylin
>
> About Step 3 (Extract Fact Table Distinct Columns) OOM, you can try
> setting the parameter "kylin.engine.mr.uhc-reducer-count" to a larger
> value (default 1).
>
> ------------------
> Best Regards,
> Chao Long
>
> ------------------ Original Message ------------------
> *From:* "Jon Shoberg"<[email protected]>;
> *Sent:* Tuesday, December 18, 2018, 11:16 AM
> *To:* "user"<[email protected]>;
> *Subject:* Re: Spark tuning within Kylin? Article? Resource?
>
> Greatly appreciate the response.
>
> I started there, but after OOM errors I began working on the settings for
> my test lab. After minimal success, I thought I'd ask whether there was
> something more in-depth for tuning that other Kylin users have found
> successful.
>
> Right now I've gone back to a very basic configuration with dynamic
> allocation to see if I can avoid the late-stage OOM errors.
>
> J
>
> On Mon, Dec 17, 2018 at 7:44 PM JiaTao Tao <[email protected]> wrote:
>
>> Hope this may help: http://kylin.apache.org/docs/tutorial/cube_spark.html
>>
>> Jon Shoberg <[email protected]> wrote on Tue, Dec 18, 2018 at 2:34 AM:
>>
>>> Is there a good/favorite article for tuning Spark settings within Kylin?
>>>
>>> I finally have Spark (2.1.3 as distributed with Kylin 2.5.2) running on
>>> my systems.
>>>
>>> My small data set (35M records) runs well with the default settings.
>>>
>>> My medium data set (4B records, 40GB compressed source file, 5 measures,
>>> 6 dimensions with low cardinality) often dies at Step 3 (Extract Fact
>>> Table Distinct Columns) with out-of-memory errors.
>>>
>>> After using exceptionally large memory settings the job completed, but
>>> I'm trying to see if an optimization is possible.
>>>
>>> Any suggestions or ideas? I've searched/read on Spark tuning in general,
>>> but otherwise I feel I'm not making much progress on optimizing with the
>>> settings I've tried.
>>>
>>> Thanks! J
>>
>>
>> --
>>
>> Regards!
>>
>> Aron Tao
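
PS: For anyone hitting the same Step 3 OOM, the parameter Chao mentions above goes in kylin.properties. Something like the following; the value 100 is purely illustrative, not something I've benchmarked:

  kylin.engine.mr.uhc-reducer-count=100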
