I'll try that out and see if it goes away (not seen this in the past 24 hours, no code change).
Doing this now means that I can't share the memory, so I will probably go
with a thread-local and allocate fixed sizes to the pool per thread (50% of
heap / 50 threads). It will most likely be a while before I can report back
(unless it fails fast in testing).

On Sep 2, 2015 2:11 PM, "Owen O'Malley" <omal...@apache.org> wrote:

> (Dropping dev)
>
> Well, that explains the non-determinism, because the MemoryManager will be
> shared across threads and thus the stripes will get flushed at effectively
> random times.
>
> Can you try giving each writer a unique MemoryManager? You'll need to put
> a class into the org.apache.hadoop.hive.ql.io.orc package to get access to
> the necessary class (MemoryManager) and method
> (OrcFile.WriterOptions.memory). We may be missing a synchronization on the
> MemoryManager somewhere and thus be getting a race condition.
>
> Thanks,
> Owen
>
> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell <dcapw...@gmail.com> wrote:
>
>> We have multiple threads writing, but each thread works on one file, so
>> the ORC writer is only touched by one thread (never across threads).
>>
>> On Sep 2, 2015 11:18 AM, "Owen O'Malley" <omal...@apache.org> wrote:
>>
>>> I don't see how it would get there. That implies that the minimum was
>>> null, but the count was non-zero.
>>>
>>> ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
>>>
>>>     @Override
>>>     OrcProto.ColumnStatistics.Builder serialize() {
>>>       OrcProto.ColumnStatistics.Builder result = super.serialize();
>>>       OrcProto.StringStatistics.Builder str =
>>>           OrcProto.StringStatistics.newBuilder();
>>>       if (getNumberOfValues() != 0) {
>>>         str.setMinimum(getMinimum());
>>>         str.setMaximum(getMaximum());
>>>         str.setSum(sum);
>>>       }
>>>       result.setStringStatistics(str);
>>>       return result;
>>>     }
>>>
>>> and thus it shouldn't call down to setMinimum unless the column had at
>>> least some non-null values.
>>>
>>> Do you have multiple threads working? There isn't anything that should
>>> be introducing non-determinism, so for the same input it should fail at
>>> the same point.
>>>
>>> .. Owen
>>>
>>> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell <dcapw...@gmail.com>
>>> wrote:
>>>
>>>> We are writing ORC files in our application for Hive to consume.
>>>> Given enough time, we have noticed that writing causes an NPE when
>>>> working with a string column's stats. Not sure what's causing it on
>>>> our side yet, since replaying the same data is just fine; it seems
>>>> more like this just happens over time (different data sources will hit
>>>> this around the same time in the same JVM).
>>>>
>>>> Here is the code in question, and below is the exception:
>>>>
>>>>     final Writer writer = OrcFile.createWriter(path,
>>>>         OrcFile.writerOptions(conf).inspector(oi));
>>>>     try {
>>>>       for (Data row : rows) {
>>>>         List<Object> struct = Orc.struct(row, inspector);
>>>>         writer.addRow(struct);
>>>>       }
>>>>     } finally {
>>>>       writer.close();
>>>>     }
>>>>
>>>> Here is the exception:
>>>>
>>>>     java.lang.NullPointerException: null
>>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) ~[hive-exec-0.14.0.jar:0.14.0]
>>>>     at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276) ~[hive-exec-0.14.0.jar:
>>>>
>>>> Versions:
>>>>
>>>> Hadoop: Apache 2.2.0
>>>> Hive: Apache 0.14.0
>>>> Java: 1.7
>>>>
>>>> Thanks for your time reading this email.
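The per-thread plan from the reply above (one MemoryManager per thread, 50% of heap split across 50 threads) can be sketched roughly as follows. This is a minimal stand-alone sketch, not Hive code: `PoolMemoryManager` is a hypothetical stand-in for org.apache.hadoop.hive.ql.io.orc.MemoryManager (which is package-private to construct, so a real adapter class would have to live in that package), and the 50-thread constant is the assumption from the email, not a Hive default. It only demonstrates the isolation pattern: same thread reuses one manager, different threads never share one.

```java
// Hypothetical stand-in for org.apache.hadoop.hive.ql.io.orc.MemoryManager.
// The real class tracks writer allocations against the heap; this one only
// carries the per-thread pool size so the ThreadLocal pattern can be shown.
class PoolMemoryManager {
    private final long poolBytes;

    PoolMemoryManager(long poolBytes) {
        this.poolBytes = poolBytes;
    }

    long getPoolBytes() {
        return poolBytes;
    }
}

public class ThreadLocalMemoryDemo {
    // Assumed split from the email: half the heap divided across an
    // expected 50 writer threads.
    static final int ASSUMED_WRITER_THREADS = 50;

    // One manager per thread, so writers on different threads never share
    // MemoryManager state (and thus never flush each other's stripes).
    static final ThreadLocal<PoolMemoryManager> MANAGER =
        new ThreadLocal<PoolMemoryManager>() {
            @Override
            protected PoolMemoryManager initialValue() {
                long perThread =
                    Runtime.getRuntime().maxMemory() / 2 / ASSUMED_WRITER_THREADS;
                return new PoolMemoryManager(perThread);
            }
        };

    public static void main(String[] args) throws InterruptedException {
        final PoolMemoryManager mine = MANAGER.get();
        // Repeated calls on the same thread return the same instance.
        System.out.println("same thread shares manager: "
            + (mine == MANAGER.get()));

        final PoolMemoryManager[] other = new PoolMemoryManager[1];
        Thread t = new Thread(new Runnable() {
            public void run() {
                other[0] = MANAGER.get();
            }
        });
        t.start();
        t.join();
        // A different thread gets its own, separate manager.
        System.out.println("threads isolated: " + (mine != other[0]));
    }
}
```

In the real application, each thread's writer would then be created with its thread's manager passed through the package-private OrcFile.WriterOptions.memory option (per Owen's suggestion), instead of the default shared MemoryManager; everything else about writer creation stays as in the snippet quoted above.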