Re: Out of Memory while generating ORC Splits

Rajesh Balamohan Wed, 13 Sep 2017 03:14:41 -0700

>> Also, this parameter is set to false by default should this be set to
true ?
This is set to false for predictable query performance with multiple runs.
If set to true, there is a chance of footer getting evicted during GC and
get into unpredictable performance.


>> I see that the number of mappers generated by setting this parameter to
true is less than the number of mappers generated by setting the
split.strategy=BI. Therefore, I am hoping that using this parameter along
with HYBRID is better than using BI split strategy. Can you please comment
on this?
BI does not carry out any optimization at split gen phase. It would be
equal to the number of files being included in the split gen. ETL can do
additional optimizations based on the information present in footer. Hybrid
would combine BI and ETL strategy on need basis. Hence, the number of
splits generated by HYBRID/ETL can be different than that of BI.
"hive.orc.cache.use.soft.references=true" needs to be used along with
HYBRID/ETL strategy. HYBRID is the default strategy.

On Wed, Sep 13, 2017 at 3:23 PM, Jay <[email protected]> wrote:

> Thanks Rajesh, this helped solve the OOM issue. I was going through the
> wiki documentation for this parameter and was not able to understand it
> clearly, can you please explain the significance of this ?
>
> I have 2 questions:-
> 1. Also, this parameter is set to false by default should this be set to
> true ?
> 2. I see that the number of mappers generated by setting this parameter to
> true is less than the number of mappers generated by setting the
> split.strategy=BI. Therefore, I am hoping that using this parameter along
> with HYBRID is better than using BI split strategy. Can you please comment
> on this?
>
> Thanks,
> Jayadeep
>
> On Wed, Sep 13, 2017 at 3:14 PM, Rajesh Balamohan <[email protected]>
> wrote:
>
>> With "HYBRID" can you try with "hive.orc.cache.use.soft.references=true"?
>> That should help in preventing OOM with Hybrid strategy.
>>
>> ~Rajesh.B
>>
>> On Wed, Sep 13, 2017 at 2:54 PM, Jay <[email protected]>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am running a simple select query as below
>>>
>>> select distinct vehicle_no from 
>>> rmd.gets_dw_eoa_eng_rec_dtl_orc_ext_concat_final_eng3
>>> where incident_dt = '2999-01-01';
>>>
>>> The table is a 2 level partitioned table as shown below
>>>
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2010-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2011-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:35
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2012-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2013-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-01-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-02-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-03-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:36
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-04-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:34
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-05-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:33
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-06-01
>>> drwx------   - gpadmin hdfs          0 2017-09-12 14:33
>>> /apps/hive/warehouse/rmd.db/gets_dw_eoa_eng_rec_dtl_orc_ext_
>>> concat_final_eng3/source_type_cd=ENG3/incident_dt=2014-07-01
>>>
>>>
>>> The ORC files have been created with a rough size of 2 GB and have ZLIB
>>> compression.
>>>
>>> When the hive.exec.orc.split.strategy is set to HYBRID in our HDP 2.6.1
>>> cluster the MAP phase is stuck in the INITIALIZATION phases and after about
>>> 20 minutes it fails with OOM.
>>>
>>> When I change hive.exec.orc.split.strategy to BI the SQL runs fine
>>> without any issues.
>>>
>>> My question is what parameter controls the memory assigned while
>>> Hive/Tez generates the splits?
>>>
>>> the hive container size is set to 8GB
>>>
>>> Thanks,
>>> Jayadeep
>>>
>>
>>
>

Re: Out of Memory while generating ORC Splits

Reply via email to