Thanks, that should help moving forward.

On Sep 3, 2015 10:38 AM, "Prasanth Jayachandran" <pjayachand...@hortonworks.com> wrote:
>
> On Sep 2, 2015, at 10:57 PM, David Capwell <dcapw...@gmail.com> wrote:
> >
> > So, very quickly looked at the JIRA and I had the following question:
> > if you have a pool per thread rather than global, then assuming 50% of
> > the heap will cause the writer to OOM with multiple threads, which is
> > different from the older (0.14) ORC, correct?
> >
> > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcConf.java#L83
> > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L94
> > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java#L226
> >
> > So orc.memory.pool=0.5 only seems to make sense if single threaded. If
> > you are writing with multiple threads, then I assume the value should
> > be (0.5 / #threads), so with 50 threads the value should be 0.01?
>
> Yes, you are correct. Since Hive's operator pipeline is single threaded,
> there were uncontested locks causing a slowdown; hence the change.
> I will create a JIRA to update the docs and the config description to
> reflect this. If multiple threads are writing, then you might need to
> share the heap between the writers.
>
> > If this is true, I can't find any documentation about this; all the
> > docs make it sound global.
>
> Noted. Since this is an unreleased version, I will create a JIRA to make
> sure this gets reflected in the docs.
>
> > On Wed, Sep 2, 2015 at 7:34 PM, David Capwell <dcapw...@gmail.com> wrote:
> >> Thanks for the jira, will see if that works for us.
> >>
> >> On Sep 2, 2015 7:11 PM, "Prasanth Jayachandran"
> >> <pjayachand...@hortonworks.com> wrote:
> >>>
> >>> The memory manager is made thread local:
> >>> https://issues.apache.org/jira/browse/HIVE-10191
> >>>
> >>> Can you try the patch from HIVE-10191 and see if that helps?
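[The per-thread arithmetic discussed above can be sketched as follows. `perThreadPoolFraction` is a hypothetical helper written for illustration, not part of the ORC or Hive API.]

```java
// Hypothetical helper illustrating the arithmetic from the thread: when the
// memory pool is per thread rather than global, the global fraction
// (orc.memory.pool, default 0.5) must be divided by the number of
// concurrent writer threads to keep total usage within the intended budget.
public class OrcPoolFraction {

    static double perThreadPoolFraction(double globalFraction, int writerThreads) {
        if (writerThreads <= 0) {
            throw new IllegalArgumentException("writerThreads must be positive");
        }
        return globalFraction / writerThreads;
    }

    public static void main(String[] args) {
        // With 50 writer threads sharing 50% of the heap, each thread's
        // pool fraction is 0.01, matching the value suggested above.
        System.out.println(perThreadPoolFraction(0.5, 50));
    }
}
```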
> >>>
> >>> On Sep 2, 2015, at 8:58 PM, David Capwell <dcapw...@gmail.com> wrote:
> >>>
> >>> I'll try that out and see if it goes away (haven't seen this in the
> >>> past 24 hours, no code change).
> >>>
> >>> Doing this now means that I can't share the memory, so I will
> >>> probably go with a thread local and allocate a fixed size to the pool
> >>> per thread (50% heap / 50 threads). It will most likely be a while
> >>> before I can report back (unless it fails fast in testing).
> >>>
> >>> On Sep 2, 2015 2:11 PM, "Owen O'Malley" <omal...@apache.org> wrote:
> >>>>
> >>>> (Dropping dev)
> >>>>
> >>>> Well, that explains the non-determinism, because the MemoryManager
> >>>> will be shared across threads and thus the stripes will get flushed
> >>>> at effectively random times.
> >>>>
> >>>> Can you try giving each writer a unique MemoryManager? You'll need
> >>>> to put a class into the org.apache.hadoop.hive.ql.io.orc package to
> >>>> get access to the necessary class (MemoryManager) and method
> >>>> (OrcFile.WriterOptions.memory). We may be missing a synchronization
> >>>> on the MemoryManager somewhere and thus be getting a race condition.
> >>>>
> >>>> Thanks,
> >>>> Owen
> >>>>
> >>>> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell <dcapw...@gmail.com> wrote:
> >>>>>
> >>>>> We have multiple threads writing, but each thread works on one
> >>>>> file, so each ORC writer is only touched by one thread (never
> >>>>> across threads).
> >>>>>
> >>>>> On Sep 2, 2015 11:18 AM, "Owen O'Malley" <omal...@apache.org> wrote:
> >>>>>>
> >>>>>> I don't see how it would get there. That implies that minimum was
> >>>>>> null, but the count was non-zero.
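[The state described just above, minimum null while the count is non-zero, can only arise through unsynchronized access. Below is a minimal, deterministic sketch of that torn state using a toy stand-in class (`ToyStringStats`, invented for illustration), not the real `ColumnStatisticsImpl`.]

```java
import java.util.Objects;

// Toy stand-in for the string statistics object, used only to illustrate
// the hypothesized race: if another thread clears the stats between the
// count check and the minimum read, serialize sees count != 0 but a null
// minimum, and the protobuf-style null check throws an NPE.
class ToyStringStats {
    String minimum;   // unsynchronized, mirroring the suspected bug
    long count;

    void add(String value) {
        count++;
        if (minimum == null || value.compareTo(minimum) < 0) {
            minimum = value;
        }
    }

    // Models StringStatistics.Builder.setMinimum, which rejects null.
    String serializeMinimum() {
        if (count != 0) {
            return Objects.requireNonNull(minimum, "minimum");
        }
        return null;
    }
}

public class TornStatsDemo {
    public static void main(String[] args) {
        ToyStringStats stats = new ToyStringStats();
        stats.add("a");
        // Simulate the torn state a concurrent reset could produce:
        // minimum cleared while count still reads as non-zero.
        stats.minimum = null;
        try {
            stats.serializeMinimum();
        } catch (NullPointerException expected) {
            System.out.println("NPE, as in the reported stack trace");
        }
    }
}
```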
> >>>>>>
> >>>>>> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
> >>>>>>
> >>>>>>   @Override
> >>>>>>   OrcProto.ColumnStatistics.Builder serialize() {
> >>>>>>     OrcProto.ColumnStatistics.Builder result = super.serialize();
> >>>>>>     OrcProto.StringStatistics.Builder str =
> >>>>>>         OrcProto.StringStatistics.newBuilder();
> >>>>>>     if (getNumberOfValues() != 0) {
> >>>>>>       str.setMinimum(getMinimum());
> >>>>>>       str.setMaximum(getMaximum());
> >>>>>>       str.setSum(sum);
> >>>>>>     }
> >>>>>>     result.setStringStatistics(str);
> >>>>>>     return result;
> >>>>>>   }
> >>>>>>
> >>>>>> and thus shouldn't call down to setMinimum unless the column had
> >>>>>> at least some non-null values.
> >>>>>>
> >>>>>> Do you have multiple threads working? There isn't anything that
> >>>>>> should be introducing non-determinism, so for the same input it
> >>>>>> would fail at the same point.
> >>>>>>
> >>>>>> .. Owen
> >>>>>>
> >>>>>> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell <dcapw...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> We are writing ORC files in our application for Hive to consume.
> >>>>>>> Given enough time, we have noticed that writing causes an NPE
> >>>>>>> when working with a string column's stats. Not sure what's
> >>>>>>> causing it on our side yet, since replaying the same data is just
> >>>>>>> fine; it seems more like this just happens over time (different
> >>>>>>> data sources will hit this around the same time in the same JVM).
> >>>>>>>
> >>>>>>> Here is the code in question, and below is the exception:
> >>>>>>>
> >>>>>>>   final Writer writer = OrcFile.createWriter(path,
> >>>>>>>       OrcFile.writerOptions(conf).inspector(oi));
> >>>>>>>   try {
> >>>>>>>     for (Data row : rows) {
> >>>>>>>       List<Object> struct = Orc.struct(row, inspector);
> >>>>>>>       writer.addRow(struct);
> >>>>>>>     }
> >>>>>>>   } finally {
> >>>>>>>     writer.close();
> >>>>>>>   }
> >>>>>>>
> >>>>>>> Here is the exception:
> >>>>>>>
> >>>>>>> java.lang.NullPointerException: null
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) ~[hive-exec-0.14.0.jar:0.14.0]
> >>>>>>>   at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276) ~[hive-exec-0.14.0.jar:
> >>>>>>>
> >>>>>>> Versions:
> >>>>>>>
> >>>>>>> Hadoop: Apache 2.2.0
> >>>>>>> Hive: Apache 0.14.0
> >>>>>>> Java: 1.7
> >>>>>>>
> >>>>>>> Thanks for your time reading this email.
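[A sketch of the unique-MemoryManager workaround suggested in the thread, assuming Hive 0.14's package-private MemoryManager(Configuration) constructor and the OrcFile.WriterOptions.memory method mentioned above. It must live in the org.apache.hadoop.hive.ql.io.orc package and needs hive-exec on the classpath, so this is an untested outline rather than a drop-in class; the class name is invented.]

```java
// Untested sketch: placed in the org.apache.hadoop.hive.ql.io.orc package
// so it can reach the package-private MemoryManager constructor.
package org.apache.hadoop.hive.ql.io.orc;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public final class PrivateMemoryOrcWriters {

    private PrivateMemoryOrcWriters() {}

    // Each call builds a writer with its own MemoryManager, so stripe
    // flushes in one thread can no longer be triggered by another.
    public static Writer createWriter(Path path, Configuration conf,
                                      ObjectInspector inspector) throws IOException {
        return OrcFile.createWriter(path,
            OrcFile.writerOptions(conf)
                   .inspector(inspector)
                   .memory(new MemoryManager(conf)));
    }
}
```

Whether this is preferable to the thread-local manager from HIVE-10191 depends on the release in use; on 0.14 it is the only way to avoid sharing the global manager without patching.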