Thanks, that should help moving forward.

On Sep 3, 2015 10:38 AM, "Prasanth Jayachandran" <pjayachand...@hortonworks.com> wrote:
>
> On Sep 2, 2015, at 10:57 PM, David Capwell <dcapw...@gmail.com> wrote:
> >
> > So, very quickly looked at the JIRA and I had the following question:
> > if you have a pool per thread rather than global, then assuming 50% of
> > the heap will cause the writer to OOM with multiple threads, which is
> > different from the older (0.14) ORC, correct?
> >
> > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcConf.java#L83
> > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L94
> > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java#L226
> >
> > So orc.memory.pool=0.5 only seems to make sense if single threaded. If
> > you are writing with multiple threads, then I assume the value should
> > be (0.5 / #threads), so with 50 threads the value should be 0.01?
>
> Yes, you are correct. Since Hive's operator pipeline is single threaded,
> there were uncontested locks causing a slowdown; hence the change.
> I will create a JIRA to update the docs and the config description to
> reflect this. If multiple threads are writing, then you might need to
> share the heap between the writers.
>
> > If this is true, I can't find any documentation about this; all the
> > docs make it sound global.
>
> Noted. Since this is an unreleased version, I will create a JIRA to make
> sure this gets reflected in the docs.
>
> > On Wed, Sep 2, 2015 at 7:34 PM, David Capwell <dcapw...@gmail.com> wrote:
> >> Thanks for the jira, will see if that works for us.
> >>
> >> On Sep 2, 2015 7:11 PM, "Prasanth Jayachandran"
> >> <pjayachand...@hortonworks.com> wrote:
> >>>
> >>> The memory manager is made thread local:
> >>> https://issues.apache.org/jira/browse/HIVE-10191
> >>>
> >>> Can you try the patch from HIVE-10191 and see if that helps?
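[The per-thread arithmetic discussed above can be sketched as follows. `perThreadPoolFraction` is a hypothetical helper written for illustration, not part of the ORC or Hive API.]

```java
// Hypothetical helper illustrating the arithmetic from the thread: when the
// memory pool is per thread rather than global, the global fraction
// (orc.memory.pool, default 0.5) must be divided by the number of
// concurrent writer threads to keep total usage within the intended budget.
public class OrcPoolFraction {

    static double perThreadPoolFraction(double globalFraction, int writerThreads) {
        if (writerThreads <= 0) {
            throw new IllegalArgumentException("writerThreads must be positive");
        }
        return globalFraction / writerThreads;
    }

    public static void main(String[] args) {
        // With 50 writer threads sharing 50% of the heap, each thread's
        // pool fraction is 0.01, matching the value suggested above.
        System.out.println(perThreadPoolFraction(0.5, 50));
    }
}
```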
> >>>
> >>> On Sep 2, 2015, at 8:58 PM, David Capwell <dcapw...@gmail.com> wrote:
> >>>
> >>> I'll try that out and see if it goes away (haven't seen this in the
> >>> past 24 hours, no code change).
> >>>
> >>> Doing this now means that I can't share the memory, so I will
> >>> probably go with a thread local and allocate a fixed size to the pool
> >>> per thread (50% heap / 50 threads). It will most likely be a while
> >>> before I can report back (unless it fails fast in testing).
> >>>
> >>> On Sep 2, 2015 2:11 PM, "Owen O'Malley" <omal...@apache.org> wrote:
> >>>>
> >>>> (Dropping dev)
> >>>>
> >>>> Well, that explains the non-determinism, because the MemoryManager
> >>>> will be shared across threads and thus the stripes will get flushed
> >>>> at effectively random times.
> >>>>
> >>>> Can you try giving each writer a unique MemoryManager? You'll need
> >>>> to put a class into the org.apache.hadoop.hive.ql.io.orc package to
> >>>> get access to the necessary class (MemoryManager) and method
> >>>> (OrcFile.WriterOptions.memory). We may be missing a synchronization
> >>>> on the MemoryManager somewhere and thus be getting a race condition.
> >>>>
> >>>> Thanks,
> >>>> Owen
> >>>>
> >>>> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell <dcapw...@gmail.com> wrote:
> >>>>>
> >>>>> We have multiple threads writing, but each thread works on one
> >>>>> file, so each ORC writer is only touched by one thread (never
> >>>>> across threads).
> >>>>>
> >>>>> On Sep 2, 2015 11:18 AM, "Owen O'Malley" <omal...@apache.org> wrote:
> >>>>>>
> >>>>>> I don't see how it would get there. That implies that minimum was
> >>>>>> null, but the count was non-zero.
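[The state described just above, minimum null while the count is non-zero, can only arise through unsynchronized access. Below is a minimal, deterministic sketch of that torn state using a toy stand-in class (`ToyStringStats`, invented for illustration), not the real `ColumnStatisticsImpl`.]

```java
import java.util.Objects;

// Toy stand-in for the string statistics object, used only to illustrate
// the hypothesized race: if another thread clears the stats between the
// count check and the minimum read, serialize sees count != 0 but a null
// minimum, and the protobuf-style null check throws an NPE.
class ToyStringStats {
    String minimum;   // unsynchronized, mirroring the suspected bug
    long count;

    void add(String value) {
        count++;
        if (minimum == null || value.compareTo(minimum) < 0) {
            minimum = value;
        }
    }

    // Models StringStatistics.Builder.setMinimum, which rejects null.
    String serializeMinimum() {
        if (count != 0) {
            return Objects.requireNonNull(minimum, "minimum");
        }
        return null;
    }
}

public class TornStatsDemo {
    public static void main(String[] args) {
        ToyStringStats stats = new ToyStringStats();
        stats.add("a");
        // Simulate the torn state a concurrent reset could produce:
        // minimum cleared while count still reads as non-zero.
        stats.minimum = null;
        try {
            stats.serializeMinimum();
        } catch (NullPointerException expected) {
            System.out.println("NPE, as in the reported stack trace");
        }
    }
}
```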
> >>>>>>
> >>>>>> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
> >>>>>>
> >>>>>>   @Override
> >>>>>>   OrcProto.ColumnStatistics.Builder serialize() {
> >>>>>>     OrcProto.ColumnStatistics.Builder result = super.serialize();
> >>>>>>     OrcProto.StringStatistics.Builder str =
> >>>>>>         OrcProto.StringStatistics.newBuilder();
> >>>>>>     if (getNumberOfValues() != 0) {
> >>>>>>       str.setMinimum(getMinimum());
> >>>>>>       str.setMaximum(getMaximum());
> >>>>>>       str.setSum(sum);
> >>>>>>     }
> >>>>>>     result.setStringStatistics(str);
> >>>>>>     return result;
> >>>>>>   }
> >>>>>>
> >>>>>> and thus shouldn't call down to setMinimum unless the column had
> >>>>>> at least some non-null values.
> >>>>>>
> >>>>>> Do you have multiple threads working? There isn't anything that
> >>>>>> should be introducing non-determinism, so for the same input it
> >>>>>> would fail at the same point.
> >>>>>>
> >>>>>> .. Owen
> >>>>>>
> >>>>>> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell <dcapw...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> We are writing ORC files in our application for Hive to consume.
> >>>>>>> Given enough time, we have noticed that writing causes an NPE
> >>>>>>> when working with a string column's stats. Not sure what's
> >>>>>>> causing it on our side yet, since replaying the same data is just
> >>>>>>> fine; it seems more like this just happens over time (different
> >>>>>>> data sources will hit this around the same time in the same JVM).
> >>>>>>>
> >>>>>>> Here is the code in question, and below is the exception:
> >>>>>>>
> >>>>>>>   final Writer writer = OrcFile.createWriter(path,
> >>>>>>>       OrcFile.writerOptions(conf).inspector(oi));
> >>>>>>>   try {
> >>>>>>>     for (Data row : rows) {
> >>>>>>>       List<Object> struct = Orc.struct(row, inspector);
> >>>>>>>       writer.addRow(struct);
> >>>>>>>     }
> >>>>>>>   } finally {
> >>>>>>>     writer.close();
> >>>>>>>   }
> >>>>>>>
> >>>>>>> Here is the exception:
> >>>>>>>
> >>>>>>> java.lang.NullPointerException: null
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276) ~[hive-exec-0.14.0.jar:
> >>>>>>>
> >>>>>>> Versions:
> >>>>>>>
> >>>>>>> Hadoop: Apache 2.2.0
> >>>>>>> Hive: Apache 0.14.0
> >>>>>>> Java: 1.7
> >>>>>>>
> >>>>>>> Thanks for your time reading this email.
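[A sketch of the unique-MemoryManager workaround suggested in the thread, assuming Hive 0.14's package-private MemoryManager(Configuration) constructor and the OrcFile.WriterOptions.memory method mentioned above. It must live in the org.apache.hadoop.hive.ql.io.orc package and needs hive-exec on the classpath, so this is an untested outline rather than a drop-in class; the class name is invented.]

```java
// Untested sketch: placed in the org.apache.hadoop.hive.ql.io.orc package
// so it can reach the package-private MemoryManager constructor.
package org.apache.hadoop.hive.ql.io.orc;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public final class PrivateMemoryOrcWriters {

    private PrivateMemoryOrcWriters() {}

    // Each call builds a writer with its own MemoryManager, so stripe
    // flushes in one thread can no longer be triggered by another.
    public static Writer createWriter(Path path, Configuration conf,
                                      ObjectInspector inspector) throws IOException {
        return OrcFile.createWriter(path,
            OrcFile.writerOptions(conf)
                   .inspector(inspector)
                   .memory(new MemoryManager(conf)));
    }
}
```

Whether this is preferable to the thread-local manager from HIVE-10191 depends on the release in use; on 0.14 it is the only way to avoid sharing the global manager without patching.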