Prasanth - This is easily the best and most complete explanation I've ever received to a question posted online. I know that sounds like an overstatement, but this answer is awesome. :) I really appreciate your insight on this. My only follow-up is asking how the memory.pool percentage plays a role in my success vs. failure. I.e., in my data, when I got down to 16k but had the default memory pool of 0.50, it failed; when I scaled that back to 0.25, it succeeded at 16k. Thoughts?
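For reference, here is roughly how I have been applying the two settings between runs. This is just a sketch: the table name and columns are stand-ins for my real ones, and the values are the 16k / 0.25 combination that happened to work in my tests, not numbers I'm confident are "right". (I've also pasted a rough sketch of the dynamic-partition insert itself at the very bottom of the thread.)

-- session-level setting for the ORC writer memory pool
SET hive.exec.orc.memory.pool=0.25;

-- table-level ORC settings; wide_orc_table and its columns are placeholders
CREATE TABLE wide_orc_table (
  event_time STRING,
  payload    BINARY
)
PARTITIONED BY (day STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="ZLIB",
  "orc.compress.size"="16384"
);
-- 16384 = 16k compression buffer, down from the 262144 (256k) default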
Thanks again for your research on this.

On Sun, Apr 27, 2014 at 11:07 PM, Prasanth Jayachandran
<pjayachand...@hortonworks.com> wrote:

> Hi John
>
> I prepared a presentation earlier that explains the impact of changing
> the compression buffer size on the overall size of an ORC file. It should
> help you understand all the questions that you had.
>
> In Hive 0.13, a new optimization was added that should avoid this OOM
> issue: https://issues.apache.org/jira/browse/HIVE-6455
> Unfortunately, Hive 0.12 does not support this optimization. Hence,
> reducing the compression size is the only option. As you can see from the
> PPT, reducing the compression buffer size does not have a significant
> impact on file size or query execution time.
>
> Thanks
> Prasanth Jayachandran
>
> On Apr 27, 2014, at 3:06 PM, John Omernik <j...@omernik.com> wrote:
>
> So one more follow-up:
>
> The 16-0.25-Success turns into a fail if I throw more data (and hence
> more partitions) at the problem. Could there be some sort of issue that
> rears its head based on the number of output dynamic partitions?
>
> Thanks all!
>
>
> On Sun, Apr 27, 2014 at 3:33 PM, John Omernik <j...@omernik.com> wrote:
>
>> Here is some testing. I focused on two variables (not really
>> understanding what they do):
>> orc.compress.size (256k by default)
>> hive.exec.orc.memory.pool (0.50 by default)
>>
>> The job I am running is an admittedly complex job running through a
>> Python TRANSFORM script. However, as noted above, RCFile writes have NO
>> issues. Another point... the results of this job end up being LOTS of
>> dynamic partitions. I am not sure if that plays a role here, or could
>> help in troubleshooting.
>>
>> So for these two I ran a bunch of tests; the results are in the format
>> (compress.size in k - memory.pool - Success/Fail):
>> 256-0.50-Fail
>> 128-0.50-Fail
>> 64-0.50-Fail
>> 32-0.50-Fail
>> 16-0.50-Fail
>> 16-0.25-Success
>> 32-0.25-Fail
>> 16-0.35-Success
>> 16-0.45-Success
>>
>> So after doing this I have questions:
>> 1. On the memory.pool, what is happening when I change this? Does it
>> affect the written files on subsequent reads?
>> 2. Does the Hive memory pool change the speed of things? (I'll take
>> slower speed if it "works".)
>> 3. On the compress.size, do I hurt subsequent reads with the smaller
>> compress size?
>> 4. These two variables, changed by themselves, do not fix the problem,
>> but together they seem to... lucky? Or are they related?
>> 5. Is there a better approach I can take on this?
>> 6. Any other variables I could look at?
>>
>> On Sun, Apr 27, 2014 at 11:56 AM, John Omernik <j...@omernik.com> wrote:
>>
>>> Hello all,
>>>
>>> I am working with Hive 0.12 right now on YARN. When I am writing a
>>> table that is admittedly quite "wide" (there are lots of columns, near
>>> 60, including one binary field that can get quite large),
>>> some tasks will fail on the ORC file write with Java heap space issues.
>>>
>>> I have confirmed that using RCFiles on the same data produces no
>>> failures.
>>>
>>> This led me down the path of experimenting with the table properties.
>>> Obviously, living on the cutting edge here means there is not a ton of
>>> documentation on what these settings do. I have lots of slide shows
>>> showing me the settings that can be used to tune ORC, but not what they
>>> do or what the ramifications may be.
>>>
>>> For example, I've gone ahead and reduced the orc.compress.size to 64k.
>>> This seems to address lots of the failures (all other things being
>>> unchanged). But what does that mean for me in the long run? Larger
>>> files? More files? How is this negatively affecting me from a file
>>> perspective?
>>>
>>> In addition, would this be a good time to try SNAPPY over ZLIB as my
>>> default compression? I tried to find some direct memory comparisons but
>>> didn't see anything.
>>>
>>> So, given my data and the issues on write for my wide table, how would
>>> you recommend I address this? Is the compress.size the way to go? What
>>> are the long-term effects of this? Any thoughts would be welcome.
>>>
>>> Thanks!
>>>
>>> John
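(Sketch referenced above: roughly the shape of the insert I am running. The table, column, path, and script names are placeholders for my real ones, not something from this thread.)

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.orc.memory.pool=0.25;

-- make the transform script available to the tasks
ADD FILE /path/to/my_transform.py;

-- wide_orc_table is the ORC table from the DDL sketch further up the thread;
-- the last column returned by the transform feeds the dynamic partition (day)
INSERT OVERWRITE TABLE wide_orc_table PARTITION (day)
SELECT TRANSFORM (raw_line)
  USING 'python my_transform.py'
  AS (event_time STRING, payload BINARY, day STRING)
FROM raw_events;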