I was finally able to get a successful build using the following settings
....

There was a SlideShare presentation with some performance settings:

https://www.slideshare.net/ShiShaoFeng1/spark-tunning-in-apache-kylin

Below is the #LOCAL TUNING section, which uses settings from the
presentation above.

I -think- the most meaningful one for me is max-partition=500, which
came from the presentation.
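
A side note, and this is my own reading rather than something confirmed in
this thread: Kylin also has an engine-level partition cap that it reads
directly instead of passing through to Spark. If the spark-conf form below
doesn't take effect, this may be the one to set (worth double-checking
against your Kylin version):

kylin.engine.spark.max-partition=500
# engine-level cap on partitions when Kylin repartitions the cuboid RDD
# (default 5000)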

After adding this, the failing step completed, and I'm re-running
everything now.

The hardware is a 3-node cluster, dual CPU, 128GB RAM each (old Dell
R710s), and the data is ~4B records, 5 measures, 6 dimensions, low
cardinality.
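
For rough sizing against those settings (my own back-of-envelope, assuming
YARN is given on the order of 100GB per node via
yarn.nodemanager.resource.memory-mb):

# container size ~= spark.executor.memory + memoryOverhead = 8G + 1G = 9G
# executors per node ~= floor(100G / 9G) = 11, so ~33 across the 3 nodes
# dynamicAllocation.maxExecutors=1000 is therefore effectively unbounded here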


------------------------------------------

## Spark conf (default is in spark/conf/spark-defaults.conf)
#kylin.engine.spark-conf.spark.master=yarn
#kylin.engine.spark-conf.spark.submit.deployMode=cluster
#kylin.engine.spark-conf.spark.yarn.queue=default
#kylin.engine.spark-conf.spark.driver.memory=2G
#kylin.engine.spark-conf.spark.executor.memory=4G
#kylin.engine.spark-conf.spark.executor.instances=40
#kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
#kylin.engine.spark-conf.spark.shuffle.service.enabled=true
#kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
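# note: the spark-history directory must already exist on HDFS, or jobs can
# fail at submit time (create it once: hadoop fs -mkdir -p /kylin/spark-history)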
#kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false

kylin.engine.spark-conf.spark.driver.extraClassPath=/opt/spark/jars/snappy*.jar
kylin.engine.spark-conf.spark.driver.extraLibraryPath=/opt/hadoop/lib/native
kylin.engine.spark-conf.spark.executor.extraLibraryPath=/opt/hadoop/lib/native
#
#### Spark conf for specific job
#kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
#kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
#
## manually upload the spark-assembly jar to HDFS and then set this property
## to avoid repeatedly uploading the jar at runtime
##kylin.engine.spark-conf.spark.yarn.archive=hdfs://namenode:8020/kylin/spark/spark-libs.jar
##kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
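#
## for the spark.yarn.archive option just above, the steps from the Kylin
## cube_spark tutorial are roughly the following (paths are examples, adjust
## to your layout):
##   jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
##   hadoop fs -mkdir -p /kylin/spark/
##   hadoop fs -put spark-libs.jar /kylin/spark/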

#LOCAL TUNING
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.max-partition=500
kylin.engine.spark-conf.spark.driver.memory=8G
kylin.engine.spark-conf.spark.executor.memory=8G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=snappy
kylin.engine.spark-conf.spark.local.dir=/opt/volume/disk1/tmp
kylin.engine.spark-conf.spark.dynamicAllocation.schedulerBacklogTimeout=1
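
One dependency worth calling out: dynamic allocation only works when the
Spark external shuffle service is running on every NodeManager, i.e.
spark.shuffle.service.enabled=true above plus yarn-site.xml entries along
these lines (shown as key=value for brevity):

yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle
yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService
# the spark-<version>-yarn-shuffle.jar must also be on the NodeManager classpath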


On Mon, Dec 17, 2018 at 8:23 PM Chao Long <[email protected]> wrote:

> Hi J,
> There is a slide deck about Spark tuning in Apache Kylin (author: shaofengshi):
> https://www.slideshare.net/ShiShaoFeng1/spark-tunning-in-apache-kylin
>
> About the Step 3 (Extract Fact Table Distinct Columns) OOM, you can try
> setting the parameter "kylin.engine.mr.uhc-reducer-count" to a larger
> value (default 1).
>
> ------------------
> Best Regards,
> Chao Long
>
> ------------------ Original Message ------------------
> *From:* "Jon Shoberg"<[email protected]>;
> *Date:* Tuesday, December 18, 2018, 11:16 AM
> *To:* "user"<[email protected]>;
> *Subject:* Re: Spark tuning within Kylin? Article? Resource?
>
> Greatly appreciate the response.
>
> I started there, but after OOM errors I began working on the settings for
> my test lab. After minimal success, I thought to ask whether there was
> something more in-depth on tuning that other Kylin users found successful.
>
> Right now I've gone to a very basic configuration with dynamic allocation
> to see if I can avoid the late-stage OOM errors.
>
> J
>
> On Mon, Dec 17, 2018 at 7:44 PM JiaTao Tao <[email protected]> wrote:
>
>> Hope this may help: http://kylin.apache.org/docs/tutorial/cube_spark.html
>>
>> Jon Shoberg <[email protected]> wrote on Tuesday, December 18, 2018 at 2:34 AM:
>>
>>> Is there a good/favorite article for tuning spark settings within Kylin?
>>>
>>> I finally have Spark (2.1.3 as distributed with Kylin 2.5.2) running on
>>> my systems.
>>>
>>> My small data set (35M records) runs well with the default settings.
>>>
>>> My medium data set (4B records, 40GB compressed source file, 5 measures,
>>> 6 dimensions with low cardinality) often dies at Step 3 (Extract Fact Table
>>> Distinct Columns) with out of memory errors.
>>>
>>> After using exceptionally large memory settings, the job completed, but
>>> I'm trying to see if an optimization is possible.
>>>
>>> Any suggestions or ideas? I've searched/read on Spark tuning in general,
>>> but I feel I'm not making much progress on optimizing with the settings
>>> I've tried.
>>>
>>> Thanks! J
>>>
>>
>>
>> --
>>
>>
>> Regards!
>>
>> Aron Tao
>>
>
