Prasanth -

This is easily the best and most complete explanation I've received to any
question I've ever posted online.  I know that sounds like an overstatement,
but this answer is awesome.  :)  I really appreciate your insight on this.
 My only follow-up is asking how the memory.pool percentage plays a role in
my success vs. failure.  That is, in my data, when I got down to 16k but had
the default memory pool of 0.50, it failed; when I scaled that back to 0.25,
it succeeded at 16k.  Thoughts?
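
For reference, the memory pool change is just a session setting I flip
between runs (a sketch, using the value from my successful 16k case):

  -- fraction of the heap that the ORC writers may share (default 0.50)
  SET hive.exec.orc.memory.pool=0.25;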

Thanks again for your research on this.




On Sun, Apr 27, 2014 at 11:07 PM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Hi John
>
> I prepared a presentation earlier that explains the impact of changing the
> compression buffer size on the overall size of an ORC file. It should help
> answer all of the questions you had.
>
> In Hive 0.13, a new optimization was added that should avoid this OOM
> issue: https://issues.apache.org/jira/browse/HIVE-6455
> Unfortunately, Hive 0.12 does not support this optimization, so reducing
> the compression buffer size is the only option. As you can see from the
> PPT, reducing the compression buffer size does not have a significant
> impact on file size or query execution time.
>
>
> Thanks
> Prasanth Jayachandran
>
> On Apr 27, 2014, at 3:06 PM, John Omernik <j...@omernik.com> wrote:
>
> So one more follow-up:
>
> The 16-0.25-Success turns into a fail if I throw more data (and hence more
> partitions) at the problem. Could there be some sort of issue that rears
> its head based on the number of output dynamic partitions?
>
> Thanks all!
>
>
>
>
> On Sun, Apr 27, 2014 at 3:33 PM, John Omernik <j...@omernik.com> wrote:
>
>> Here is some testing. I focused on two variables (not really
>> understanding what they do):
>> orc.compress.size (256k by default)
>> hive.exec.orc.memory.pool (0.50 by default).
>>
>> The job I am running is an admittedly complex one that runs through a
>> Python TRANSFORM script.  However, as noted above, RCFile writes have NO
>> issues.  Another point... the results of this job end up being LOTS of
>> dynamic partitions.  I am not sure if that plays a role here, or could
>> help in troubleshooting.
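>>
>> For context, the shape of the job is roughly like this (a heavily
>> simplified sketch; the table, column, and script names are placeholders,
>> not my real schema, which is ~60 columns wide):
>>
>>   ADD FILE my_transform.py;
>>   SET hive.exec.dynamic.partition=true;
>>   SET hive.exec.dynamic.partition.mode=nonstrict;
>>
>>   -- dynamic partition column (day) comes last in the transform output
>>   INSERT OVERWRITE TABLE wide_orc_table PARTITION (day)
>>   SELECT TRANSFORM (raw_line)
>>     USING 'python my_transform.py'
>>     AS (id STRING, payload STRING, day STRING)
>>   FROM raw_source;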
>>
>> So for these two I ran a bunch of tests; the results are in the format
>> (compress.size in k - memory.pool - Success/Fail):
>> 256-0.50-Fail
>> 128-0.50-Fail
>>  64-0.50-Fail
>>  32-0.50-Fail
>>  16-0.50-Fail
>>  16-0.25-Success
>>  32-0.25-Fail
>>  16-0.35-Success
>>  16-0.45-Success
>>
>>
>> So after doing this I have questions:
>> 1. On the memory.pool, what is happening when I change this? Does it
>> affect the written files on subsequent reads?
>> 2. Does the Hive memory pool change the speed of things? (I'll take a
>> slower speed if it "works".)
>> 3. On the compress.size, do I hurt subsequent reads with the smaller
>> compress size?
>> 4. These two variables, changed by themselves, do not fix the problem,
>> but together they seem to... lucky? Or are they related?
>> 5. Is there a better approach I can take on this?
>> 6. Any other variables I could look at?
>>
>>
>> On Sun, Apr 27, 2014 at 11:56 AM, John Omernik <j...@omernik.com> wrote:
>>
>>> Hello all,
>>>
>>> I am working with Hive 0.12 right now on YARN.  When I write a table
>>> that is admittedly quite "wide" (lots of columns, near 60, including one
>>> binary field that can get quite large), some tasks fail on the ORC file
>>> write with Java heap space issues.
>>>
>>> I have confirmed that using RCFiles on the same data produces no
>>> failures.
>>>
>>> This led me down the path of experimenting with the table properties.
>>> Obviously, living on the cutting edge here means there is not a ton of
>>> documentation on what these settings do. I have lots of slide decks
>>> showing me the settings that can be used to tune ORC, but not what they
>>> do or what the ramifications may be.
>>>
>>> For example, I've gone ahead and reduced the orc.compress.size to 64k.
>>> This seems to address lots of the failures (all other things being
>>> unchanged). But what does that mean for me in the long run? Larger files?
>>> More files? How is this negatively affecting me from a file perspective?
>>>
>>> In addition, would this be a good time to try SNAPPY over ZLIB as my
>>> default compression? I tried to find some direct memory comparisons but
>>> didn't see anything.
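>>>
>>> If I do try it, I assume it would just be a table property along these
>>> lines (a sketch; the table name and columns are made up):
>>>
>>>   CREATE TABLE orc_snappy_test (id STRING, payload STRING)
>>>   STORED AS ORC
>>>   TBLPROPERTIES ("orc.compress"="SNAPPY");  -- default is ZLIB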
>>>
>>> So, given my data and the issues on write for my wide table, how would
>>> you recommend I address this? Is the compress.size the way to go?  What
>>> are the long-term effects of this?  Any thoughts would be welcome.
>>>
>>> Thanks!
>>>
>>> John
>>>
>>
>>
>
>
>
