Hive - Serializing Query Plans

2015-09-02 Thread Raajay
Hi,

From the documents and code, I realize that after Semantic Analysis,
QueryPlan.java can be serialized to disk using the Thrift-based
toBinaryString() method.

Now, if I want to execute the serialized query plan (say, on Tez), what
should I do? By de-serializing the string, I can get back the api.Query
object. For execution, I need the QueryPlan.java object.

How do I go from api.Query (Thrift-generated) to QueryPlan.java?
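
For reference, the Thrift half of that round trip looks roughly like this
(a sketch using the stock Thrift Java API; it stops exactly where the
question does, at the api.Query object):

import org.apache.hadoop.hive.ql.plan.api.Query;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class QueryRoundTrip {
  // Round-trips the Thrift-generated api.Query; 'bytes' stands in for
  // whatever toBinaryString() wrote to disk.
  public static Query roundTrip(Query query) throws TException {
    byte[] bytes =
        new TSerializer(new TBinaryProtocol.Factory()).serialize(query);
    Query restored = new Query();
    new TDeserializer(new TBinaryProtocol.Factory()).deserialize(restored, bytes);
    return restored;
  }
}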

Thanks
Raajay


RE: can we add column type in where clause in a hive query?

2015-09-02 Thread Ryan Harris
The fact that you have other data in the column (like letters) implies that
the column is stored as a string, so use a regex (the table name below is a
placeholder, and the pattern matches optionally signed integers):

SELECT CAST(mycol AS BIGINT) FROM mytable WHERE mycol RLIKE '^-?[0-9]+$'
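
An alternative that skips the regex, assuming Hive's behavior of returning
NULL when a string fails to cast to a numeric type (the table name is again
a placeholder):

SELECT * FROM mytable WHERE CAST(column1 AS BIGINT) IS NOT NULL;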

From: Mohit Durgapal [mailto:durgapalmo...@gmail.com]
Sent: Wednesday, September 02, 2015 5:09 AM
To: user@hive.apache.org
Subject: can we add column type in where clause in a hive query?

I would like to query a Hive table only for those rows that have column1 as
an integer. Due to some data corruption, without this check I am getting a
lot of junk data (a mix of integers and letters). I would like to get rid of
that data by applying something like a "where column1 is INT" kind of
condition, but I couldn't find anything like that in Hive. Could anyone
suggest how I could do it?



Testing a HiveStoragePredicateHandler

2015-09-02 Thread Luke Lovett
I'm writing a HiveStoragePredicateHandler, and I'm trying to figure
out the most appropriate way to write unit tests for the
decomposePredicate method. I'm seeking advice for the best way to do
this.

The way I see it, there seem to be two obvious approaches:
1. Write a query as a string. Run it through the parser and optimizer,
then pass it into decomposePredicate. Assert something about the
return value.
2. Run an "explain" on a query and assert that the TableScan
"filterExpr" is what I expect, and that the Filter Operator's
predicate is what I expect.

I don't know how to do (1), because I'm not sure what the appropriate
classes/methods are to parse and optimize a query, or how to figure
out this information. (2) sounds a lot easier, but I'm not sure that
this is the right approach, because I'm not completely sure how my
HiveStoragePredicateHandler directly affects the explain output.

Can anyone confirm that (2) is the right/wrong way to test a
HiveStoragePredicateHandler? Any advice on trying to test using method
(1)?
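
For what it's worth, one way to exercise (1) without running the full
parser is to build the predicate expression tree by hand and call
decomposePredicate directly. A minimal sketch, assuming your handler can
ignore the Deserializer argument (MyPredicateHandler and the column/table
names are hypothetical):

import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;

import java.util.Arrays;

import org.apache.hadoop.hive.ql.metadata.HiveStoragePredicateHandler.DecomposedPredicate;
import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;
import org.apache.hadoop.hive.ql.plan.ExprNodeConstantDesc;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.mapred.JobConf;
import org.junit.Test;

public class DecomposePredicateTest {
  @Test
  public void decomposesSimpleEquality() throws Exception {
    // Expression tree for: mycol = 5
    ExprNodeDesc column = new ExprNodeColumnDesc(
        TypeInfoFactory.intTypeInfo, "mycol", "mytable", false);
    ExprNodeDesc constant = new ExprNodeConstantDesc(
        TypeInfoFactory.intTypeInfo, 5);
    ExprNodeGenericFuncDesc predicate = new ExprNodeGenericFuncDesc(
        TypeInfoFactory.booleanTypeInfo, new GenericUDFOPEqual(),
        Arrays.<ExprNodeDesc>asList(column, constant));

    MyPredicateHandler handler = new MyPredicateHandler();  // hypothetical
    DecomposedPredicate result =
        handler.decomposePredicate(new JobConf(), null, predicate);

    // Assert what was pushed down and what Hive must still evaluate.
    assertNotNull(result.pushedPredicate);
    assertNull(result.residualPredicate);
  }
}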

Many thanks in advance.


Re: ORC NPE while writing stats

2015-09-02 Thread David Capwell
Also, the data put in are primitives, structs (list), and arrays (list); we
don't use any of the boxed writables (like Text).
On Sep 2, 2015 12:57 PM, "David Capwell"  wrote:

> We have multiple threads writing, but each thread works on one file, so
> orc writer is only touched by one thread (never cross threads)
> On Sep 2, 2015 11:18 AM, "Owen O'Malley"  wrote:
>
>> I don't see how it would get there. That implies that minimum was null,
>> but the count was non-zero.
>>
>> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
>>
>> @Override
>> OrcProto.ColumnStatistics.Builder serialize() {
>>   OrcProto.ColumnStatistics.Builder result = super.serialize();
>>   OrcProto.StringStatistics.Builder str =
>> OrcProto.StringStatistics.newBuilder();
>>   if (getNumberOfValues() != 0) {
>> str.setMinimum(getMinimum());
>> str.setMaximum(getMaximum());
>> str.setSum(sum);
>>   }
>>   result.setStringStatistics(str);
>>   return result;
>> }
>>
>> and thus shouldn't call down to setMinimum unless it had at least some 
>> non-null values in the column.
>>
>> Do you have multiple threads working? There isn't anything that should be 
>> introducing non-determinism so for the same input it would fail at the same 
>> point.
>>
>> .. Owen
>>
>>
>>
>>
>> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell 
>> wrote:
>>
>>> We are writing ORC files in our application for Hive to consume.
>>> Given enough time, we have noticed that writing causes an NPE when
>>> working with a string column's stats. Not sure what's causing it on
>>> our side yet since replaying the same data is just fine; it seems more
>>> like this just happens over time (different data sources will hit this
>>> around the same time in the same JVM).
>>>
>>> Here is the code in question, and below is the exception:
>>>
>>> final Writer writer = OrcFile.createWriter(path,
>>>     OrcFile.writerOptions(conf).inspector(oi));
>>> try {
>>>   for (Data row : rows) {
>>>     List struct = Orc.struct(row, inspector);
>>>     writer.addRow(struct);
>>>   }
>>> } finally {
>>>   writer.close();
>>> }
>>>
>>>
>>> Here is the exception:
>>>
>>> java.lang.NullPointerException: null
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276)
>>> ~[hive-exec-0.14.0.jar:
>>>
>>>
>>> Versions:
>>>
>>> Hadoop: apache 2.2.0
>>> Hive Apache: 0.14.0
>>> Java 1.7
>>>
>>>
>>> Thanks for your time reading this email.
>>>
>>
>>


Re: Wrong results from join query in Hive 0.13 and also 1.0 with reproduce.

2015-09-02 Thread Jim Green
Thanks Ashutosh.
Actually, for this kind of query, if I put the two filters in the WHERE
clause instead of the ON clause, the query result is correct.

Do you suggest we put all filters into the WHERE clause or the ON clause?
And why?

On Wed, Sep 2, 2015 at 3:13 PM, Ashutosh Chauhan 
wrote:

> It indeed is. The title of the bug describes a symptom and doesn't
> accurately describe the underlying problem. The bug will be triggered if
> the following conditions are all met:
>
> the query contains 3 or more joins
> AND
> the joins are merged (i.e., tables participating in two of those joins are
> joined on the same keys)
> AND
> these merged joins are not consecutive in the query
> AND
> there is a filter, in the WHERE clause (not as a join condition), on one
> of the tables that participated in a merged join;
> then said filter will be dropped.
>
> The query you posted meets all these criteria. You can avoid this bug by
> rewriting your query so that it violates one of the requirements (listed
> above) for triggering the bug.
>
> Ashutosh
>
>
> On Wed, Sep 2, 2015 at 10:19 AM, Jim Green  wrote:
>
>> Hi Ashutosh,
>>
>> Is HIVE-10841 related? From the title of that JIRA, it says “where col is
>> not null” caused the issue; however, the reproduction above does not have
>> that clause.
>>
>>
>>
>> On Wed, Sep 2, 2015 at 2:24 AM, Ashutosh Chauhan 
>> wrote:
>>
>>> https://issues.apache.org/jira/browse/HIVE-10841
>>>
>>> Thanks,
>>> Ashutosh
>>>
>>> On Tue, Sep 1, 2015 at 6:00 PM, Jim Green  wrote:
>>>
 It seems Hive 1.2 fixed this issue, but I'm not sure which JIRA is related
 or whether the fix can be backported to Hive 0.13.


 On Tue, Sep 1, 2015 at 5:35 PM, Jim Green  wrote:

> Hi Team,
>
> Below is a minimal reproduction of wrong results in Hive 0.13:
>
> *1. Create 4 tables*
> CREATE EXTERNAL TABLE testjoin1( joincol string );
> CREATE EXTERNAL TABLE testjoin2(
>anothercol string ,
>joincol string);
>
> CREATE EXTERNAL TABLE testjoin3( anothercol string);
>
> CREATE EXTERNAL TABLE testjoin4(
>   joincol string,
>   wherecol string ,
>   wherecol2 string);
>
> *2. Insert sample data *
> (Note: Make sure you firstly create the dual table which only contains
> 1 row)
>
> insert into table testjoin1 select '1' from dual;
> insert into table testjoin2 select 'another','1' from dual;
> insert into table testjoin3 select 'another' from dual;
> insert into table testjoin4 select '1','I_AM_MISSING','201501' from
> dual;
> insert into table testjoin4 select
> '1','I_Shouldnot_be_in_output','201501' from
> dual;
>
> hive> select * from testjoin1;
> OK
> 1
> Time taken: 0.04 seconds, Fetched: 1 row(s)
>
> hive> select * from testjoin2;
> OK
> another   1
> Time taken: 0.039 seconds, Fetched: 1 row(s)
>
> hive> select * from testjoin3;
> OK
> another
> Time taken: 0.038 seconds, Fetched: 1 row(s)
>
> hive> select * from testjoin4;
> OK
> 1   I_AM_MISSING   201501
> 1   I_Shouldnot_be_in_output   201501
> Time taken: 0.04 seconds, Fetched: 2 row(s)
>
> *3. SQL1 is returning wrong results.*
>
> Select testjoin4.* From
> testjoin1
> JOIN testjoin2
>   ON (testjoin2.joincol = testjoin1.joincol)
> JOIN testjoin3
>   ON (testjoin3.anothercol= testjoin2.anothercol)
> JOIN testjoin4
>   ON (testjoin4.joincol = testjoin1.joincol AND
> testjoin4.wherecol2='201501')
> WHERE (testjoin4.wherecol='I_AM_MISSING');
>
> 1   I_AM_MISSING   201501
> 1   I_Shouldnot_be_in_output   201501
> Time taken: 21.702 seconds, Fetched: 2 row(s)
>
>
> *4. SQL2 is returning good result(If we move the both filters to WHERE
> clause )*
>
> Select testjoin4.* From
> testjoin1
> JOIN testjoin2
>   ON (testjoin2.joincol = testjoin1.joincol)
> JOIN testjoin3
>   ON (testjoin3.anothercol= testjoin2.anothercol)
> JOIN testjoin4
>   ON (testjoin4.joincol = testjoin1.joincol)
> WHERE (testjoin4.wherecol='I_AM_MISSING' and
> testjoin4.wherecol2='201501');
>
> 1   I_AM_MISSING   201501
> Time taken: 20.393 seconds, Fetched: 1 row(s)
> —
> *Another test is done in Hive 1.0 and found both SQL1 and SQL2 are
> returning wrong results….*
>
> 1 I_AM_MISSING 201501
> 1 I_AM_MISSING 201501
> Time taken: 13.983 seconds, Fetched: 2 row(s)
>
> *Anybody knows any related JIRAs?*
>
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>



 --
 Thanks,
 www.openkb.info
 (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)

>>>
>>>
>>
>>
>> --
>> Thanks,
>> www.openkb.info
>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>>
>
>



Re: ORC NPE while writing stats

2015-09-02 Thread Prasanth Jayachandran
Memory manager is made thread local
https://issues.apache.org/jira/browse/HIVE-10191

Can you try the patch from HIVE-10191 and see if that helps?
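
The change boils down to something like the following (a paraphrase of the
idea behind the patch, not the actual HIVE-10191 diff; the class name is
ours, and MemoryManager is only visible from inside its own package):

package org.apache.hadoop.hive.ql.io.orc;  // MemoryManager is package-private

import org.apache.hadoop.conf.Configuration;

class ThreadLocalOrcMemory {
  // Each thread that creates ORC writers gets its own MemoryManager, so no
  // stripe-flush bookkeeping is shared across threads.
  static final ThreadLocal<MemoryManager> MEMORY =
      new ThreadLocal<MemoryManager>() {
        @Override
        protected MemoryManager initialValue() {
          return new MemoryManager(new Configuration());
        }
      };
}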

On Sep 2, 2015, at 8:58 PM, David Capwell wrote:


I'll try that out and see if it goes away (not seen this in the past 24 hours, 
no code change).

Doing this now means that I can't share the memory, so will prob go with a 
thread local and allocate fixed sizes to the pool per thread (50% heap / 50 
threads).  Will most likely be awhile before I can report back (unless it fails 
fast in testing)

On Sep 2, 2015 2:11 PM, "Owen O'Malley" wrote:
(Dropping dev)

Well, that explains the non-determinism, because the MemoryManager will be 
shared across threads and thus the stripes will get flushed at effectively 
random times.

Can you try giving each writer a unique MemoryManager? You'll need to put a 
class into the org.apache.hadoop.hive.ql.io.orc package to get access to the 
necessary class (MemoryManager) and method (OrcFile.WriterOptions.memory). We 
may be missing a synchronization on the MemoryManager somewhere and thus be 
getting a race condition.

Thanks,
   Owen

On Wed, Sep 2, 2015 at 12:57 PM, David Capwell wrote:

We have multiple threads writing, but each thread works on one file, so orc 
writer is only touched by one thread (never cross threads)

On Sep 2, 2015 11:18 AM, "Owen O'Malley" wrote:
I don't see how it would get there. That implies that minimum was null, but the 
count was non-zero.

The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:


@Override
OrcProto.ColumnStatistics.Builder serialize() {
  OrcProto.ColumnStatistics.Builder result = super.serialize();
  OrcProto.StringStatistics.Builder str =
      OrcProto.StringStatistics.newBuilder();
  if (getNumberOfValues() != 0) {
    str.setMinimum(getMinimum());
    str.setMaximum(getMaximum());
    str.setSum(sum);
  }
  result.setStringStatistics(str);
  return result;
}


and thus shouldn't call down to setMinimum unless it had at least some non-null 
values in the column.

Do you have multiple threads working? There isn't anything that should be 
introducing non-determinism so for the same input it would fail at the same 
point.

.. Owen



On Tue, Sep 1, 2015 at 10:51 PM, David Capwell wrote:
We are writing ORC files in our application for Hive to consume.
Given enough time, we have noticed that writing causes an NPE when
working with a string column's stats. Not sure what's causing it on
our side yet since replaying the same data is just fine; it seems more
like this just happens over time (different data sources will hit this
around the same time in the same JVM).

Here is the code in question, and below is the exception:

final Writer writer = OrcFile.createWriter(path,
    OrcFile.writerOptions(conf).inspector(oi));
try {
  for (Data row : rows) {
    List struct = Orc.struct(row, inspector);
    writer.addRow(struct);
  }
} finally {
  writer.close();
}


Here is the exception:

java.lang.NullPointerException: null
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
~[hive-exec-0.14.0.jar:0.14.0]
at 
org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276)
~[hive-exec-0.14.0.jar:


Versions:

Hadoop: apache 2.2.0
Hive Apache: 0.14.0
Java 1.7


Thanks for your time reading this email.





Re: ORC NPE while writing stats

2015-09-02 Thread David Capwell
So, I very quickly looked at the JIRA and I had the following question:
if you have a pool per thread rather than a global one, then assuming 50%
of heap per pool will cause the writers to OOM with multiple threads, which
is different from the older (0.14) ORC, correct?

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcConf.java#L83
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L94
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java#L226

So with orc.memory.pool=0.5, this value only seems to make sense if
single threaded, so if you are writing with multiple threads, then I
assume the value should be (0.5 / #threads), so if 50 threads then
0.01 should be the value?

If this is true, I can't find any documentation about it; all the docs
make the setting sound global.

On Wed, Sep 2, 2015 at 7:34 PM, David Capwell  wrote:
> Thanks for the jira, will see if that works for us.
>
> On Sep 2, 2015 7:11 PM, "Prasanth Jayachandran"
>  wrote:
>>
>> Memory manager is made thread local
>> https://issues.apache.org/jira/browse/HIVE-10191
>>
>> Can you try the patch from HIVE-10191 and see if that helps?
>>
>> On Sep 2, 2015, at 8:58 PM, David Capwell  wrote:
>>
>> I'll try that out and see if it goes away (not seen this in the past 24
>> hours, no code change).
>>
>> Doing this now means that I can't share the memory, so will prob go with a
>> thread local and allocate fixed sizes to the pool per thread (50% heap / 50
>> threads).  Will most likely be awhile before I can report back (unless it
>> fails fast in testing)
>>
>> On Sep 2, 2015 2:11 PM, "Owen O'Malley"  wrote:
>>>
>>> (Dropping dev)
>>>
>>> Well, that explains the non-determinism, because the MemoryManager will
>>> be shared across threads and thus the stripes will get flushed at
>>> effectively random times.
>>>
>>> Can you try giving each writer a unique MemoryManager? You'll need to put
>>> a class into the org.apache.hadoop.hive.ql.io.orc package to get access to
>>> the necessary class (MemoryManager) and method
>>> (OrcFile.WriterOptions.memory). We may be missing a synchronization on the
>>> MemoryManager somewhere and thus be getting a race condition.
>>>
>>> Thanks,
>>>Owen
>>>
>>> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell 
>>> wrote:

 We have multiple threads writing, but each thread works on one file, so
 orc writer is only touched by one thread (never cross threads)

 On Sep 2, 2015 11:18 AM, "Owen O'Malley"  wrote:
>
> I don't see how it would get there. That implies that minimum was null,
> but the count was non-zero.
>
> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
>
> @Override
> OrcProto.ColumnStatistics.Builder serialize() {
>   OrcProto.ColumnStatistics.Builder result = super.serialize();
>   OrcProto.StringStatistics.Builder str =
> OrcProto.StringStatistics.newBuilder();
>   if (getNumberOfValues() != 0) {
> str.setMinimum(getMinimum());
> str.setMaximum(getMaximum());
> str.setSum(sum);
>   }
>   result.setStringStatistics(str);
>   return result;
> }
>
> and thus shouldn't call down to setMinimum unless it had at least some
> non-null values in the column.
>
> Do you have multiple threads working? There isn't anything that should
> be introducing non-determinism so for the same input it would fail at the
> same point.
>
> .. Owen
>
>
>
>
> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell 
> wrote:
>>
>> We are writing ORC files in our application for Hive to consume.
>> Given enough time, we have noticed that writing causes an NPE when
>> working with a string column's stats. Not sure what's causing it on
>> our side yet since replaying the same data is just fine; it seems more
>> like this just happens over time (different data sources will hit this
>> around the same time in the same JVM).
>>
>> Here is the code in question, and below is the exception:
>>
>> final Writer writer = OrcFile.createWriter(path,
>>     OrcFile.writerOptions(conf).inspector(oi));
>> try {
>>   for (Data row : rows) {
>>     List struct = Orc.struct(row, inspector);
>>     writer.addRow(struct);
>>   }
>> } finally {
>>   writer.close();
>> }
>>
>>
>> Here is the exception:
>>
>> java.lang.NullPointerException: null
>> at
>> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803)
>> ~[hive-exec-0.14.0.jar:0.14.0]
>> at
>> 

Re: ORC NPE while writing stats

2015-09-02 Thread David Capwell
Walking the MemoryManager, and I have a few questions:

# statements

Every time you create a writer for a given thread (assuming the thread
local version), you just update MemoryManager with the stripe size.
The scale is just %heap / (#writer * stripe (assuming equal stripe
size)).

Periodically, ORC checks whether the estimated amount of data >
stripe*scale. If so, it flushes the stripe right away. When the flush
happens, it checks how close it is to the end of a block and
scales the next stripe based on this.

# question assuming statements are correct

So, for me, I only have one writer per thread at any point in time, so
if MM is partitioned based on thread, do I really care about the
% set for the pool size? Since ORC appears to flush a stripe early,
wouldn't it make sense to figure out how many concurrent writers I
have and how much memory I want to allocate, then set the stripe size to
this?

So for 50 threads and a stripe size of 64 MB, 3,200 MB would be
required? So, as long as I make sure the rest of my application gives
enough room for ORC, then I can just leave the value as default so it
just does stripe size...

So, if I'm right, MM doesn't really do anything for me, so there's no
issue with sharding it and not configuring it?
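
A rough paraphrase of the bookkeeping described above (field names are
ours, not the exact Hive source):

class MemoryScaleSketch {
  double poolBytes;      // heap size * orc.memory.pool (e.g. 0.5)
  long totalAllocation;  // sum of the registered writers' stripe sizes

  // Each writer flushes once its in-memory stripe estimate exceeds
  // stripeSize * scale, so more concurrent writers mean earlier flushes.
  double scale() {
    return totalAllocation <= poolBytes ? 1.0 : poolBytes / totalAllocation;
  }
}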


Thanks for your time reading this email!

On Wed, Sep 2, 2015 at 8:57 PM, David Capwell  wrote:
> So, I very quickly looked at the JIRA and I had the following question:
> if you have a pool per thread rather than a global one, then assuming 50%
> of heap per pool will cause the writers to OOM with multiple threads, which
> is different from the older (0.14) ORC, correct?
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcConf.java#L83
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L94
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java#L226
>
> So with orc.memory.pool=0.5, this value only seems to make sense if
> single threaded, so if you are writing with multiple threads, then I
> assume the value should be (0.5 / #threads), so if 50 threads then
> 0.01 should be the value?
>
> If this is true, I can't find any documentation about it; all the docs
> make the setting sound global.
>
> On Wed, Sep 2, 2015 at 7:34 PM, David Capwell  wrote:
>> Thanks for the jira, will see if that works for us.
>>
>> On Sep 2, 2015 7:11 PM, "Prasanth Jayachandran"
>>  wrote:
>>>
>>> Memory manager is made thread local
>>> https://issues.apache.org/jira/browse/HIVE-10191
>>>
>>> Can you try the patch from HIVE-10191 and see if that helps?
>>>
>>> On Sep 2, 2015, at 8:58 PM, David Capwell  wrote:
>>>
>>> I'll try that out and see if it goes away (not seen this in the past 24
>>> hours, no code change).
>>>
>>> Doing this now means that I can't share the memory, so will prob go with a
>>> thread local and allocate fixed sizes to the pool per thread (50% heap / 50
>>> threads).  Will most likely be awhile before I can report back (unless it
>>> fails fast in testing)
>>>
>>> On Sep 2, 2015 2:11 PM, "Owen O'Malley"  wrote:

 (Dropping dev)

 Well, that explains the non-determinism, because the MemoryManager will
 be shared across threads and thus the stripes will get flushed at
 effectively random times.

 Can you try giving each writer a unique MemoryManager? You'll need to put
 a class into the org.apache.hadoop.hive.ql.io.orc package to get access to
 the necessary class (MemoryManager) and method
 (OrcFile.WriterOptions.memory). We may be missing a synchronization on the
 MemoryManager somewhere and thus be getting a race condition.

 Thanks,
Owen

 On Wed, Sep 2, 2015 at 12:57 PM, David Capwell 
 wrote:
>
> We have multiple threads writing, but each thread works on one file, so
> orc writer is only touched by one thread (never cross threads)
>
> On Sep 2, 2015 11:18 AM, "Owen O'Malley"  wrote:
>>
>> I don't see how it would get there. That implies that minimum was null,
>> but the count was non-zero.
>>
>> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
>>
>> @Override
>> OrcProto.ColumnStatistics.Builder serialize() {
>>   OrcProto.ColumnStatistics.Builder result = super.serialize();
>>   OrcProto.StringStatistics.Builder str =
>> OrcProto.StringStatistics.newBuilder();
>>   if (getNumberOfValues() != 0) {
>> str.setMinimum(getMinimum());
>> str.setMaximum(getMaximum());
>> str.setSum(sum);
>>   }
>>   result.setStringStatistics(str);
>>   return result;
>> }
>>
>> and thus shouldn't call down to setMinimum unless it had at least some
>> non-null values in the column.
>>
>> Do you have multiple threads working? 

Re: ORC NPE while writing stats

2015-09-02 Thread David Capwell
I'll try that out and see if it goes away (not seen this in the past 24
hours, no code change).

Doing this now means that I can't share the memory, so will prob go with a
thread local and allocate fixed sizes to the pool per thread (50% heap / 50
threads).  Will most likely be awhile before I can report back (unless it
fails fast in testing)
On Sep 2, 2015 2:11 PM, "Owen O'Malley"  wrote:

> (Dropping dev)
>
> Well, that explains the non-determinism, because the MemoryManager will be
> shared across threads and thus the stripes will get flushed at effectively
> random times.
>
> Can you try giving each writer a unique MemoryManager? You'll need to put
> a class into the org.apache.hadoop.hive.ql.io.orc package to get access to
> the necessary class (MemoryManager) and method
> (OrcFile.WriterOptions.memory). We may be missing a synchronization on the
> MemoryManager somewhere and thus be getting a race condition.
>
> Thanks,
>Owen
>
> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell  wrote:
>
>> We have multiple threads writing, but each thread works on one file, so
>> orc writer is only touched by one thread (never cross threads)
>> On Sep 2, 2015 11:18 AM, "Owen O'Malley"  wrote:
>>
>>> I don't see how it would get there. That implies that minimum was null,
>>> but the count was non-zero.
>>>
>>> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
>>>
>>> @Override
>>> OrcProto.ColumnStatistics.Builder serialize() {
>>>   OrcProto.ColumnStatistics.Builder result = super.serialize();
>>>   OrcProto.StringStatistics.Builder str =
>>> OrcProto.StringStatistics.newBuilder();
>>>   if (getNumberOfValues() != 0) {
>>> str.setMinimum(getMinimum());
>>> str.setMaximum(getMaximum());
>>> str.setSum(sum);
>>>   }
>>>   result.setStringStatistics(str);
>>>   return result;
>>> }
>>>
>>> and thus shouldn't call down to setMinimum unless it had at least some 
>>> non-null values in the column.
>>>
>>> Do you have multiple threads working? There isn't anything that should be 
>>> introducing non-determinism so for the same input it would fail at the same 
>>> point.
>>>
>>> .. Owen
>>>
>>>
>>>
>>>
>>> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell 
>>> wrote:
>>>
 We are writing ORC files in our application for Hive to consume.
 Given enough time, we have noticed that writing causes an NPE when
 working with a string column's stats. Not sure what's causing it on
 our side yet since replaying the same data is just fine; it seems more
 like this just happens over time (different data sources will hit this
 around the same time in the same JVM).

 Here is the code in question, and below is the exception:

 final Writer writer = OrcFile.createWriter(path,
     OrcFile.writerOptions(conf).inspector(oi));
 try {
   for (Data row : rows) {
     List struct = Orc.struct(row, inspector);
     writer.addRow(struct);
   }
 } finally {
   writer.close();
 }


 Here is the exception:

 java.lang.NullPointerException: null
 at
 org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
 ~[hive-exec-0.14.0.jar:0.14.0]
 at
 org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276)
 ~[hive-exec-0.14.0.jar:


 Versions:

 Hadoop: apache 2.2.0
 Hive Apache: 0.14.0
 Java 1.7


 Thanks for your time reading this email.

>>>
>>>
>


Re: ORC NPE while writing stats

2015-09-02 Thread David Capwell
Thanks for the jira, will see if that works for us.
On Sep 2, 2015 7:11 PM, "Prasanth Jayachandran" <
pjayachand...@hortonworks.com> wrote:

> Memory manager is made thread local
> https://issues.apache.org/jira/browse/HIVE-10191
>
> Can you try the patch from HIVE-10191 and see if that helps?
>
> On Sep 2, 2015, at 8:58 PM, David Capwell  wrote:
>
> I'll try that out and see if it goes away (not seen this in the past 24
> hours, no code change).
>
> Doing this now means that I can't share the memory, so will prob go with a
> thread local and allocate fixed sizes to the pool per thread (50% heap / 50
> threads).  Will most likely be awhile before I can report back (unless it
> fails fast in testing)
> On Sep 2, 2015 2:11 PM, "Owen O'Malley"  wrote:
>
>> (Dropping dev)
>>
>> Well, that explains the non-determinism, because the MemoryManager will
>> be shared across threads and thus the stripes will get flushed at
>> effectively random times.
>>
>> Can you try giving each writer a unique MemoryManager? You'll need to put
>> a class into the org.apache.hadoop.hive.ql.io.orc package to get access to
>> the necessary class (MemoryManager) and method
>> (OrcFile.WriterOptions.memory). We may be missing a synchronization on the
>> MemoryManager somewhere and thus be getting a race condition.
>>
>> Thanks,
>>Owen
>>
>> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell 
>> wrote:
>>
>>> We have multiple threads writing, but each thread works on one file, so
>>> orc writer is only touched by one thread (never cross threads)
>>> On Sep 2, 2015 11:18 AM, "Owen O'Malley"  wrote:
>>>
 I don't see how it would get there. That implies that minimum was null,
 but the count was non-zero.

 The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:

 @Override
 OrcProto.ColumnStatistics.Builder serialize() {
   OrcProto.ColumnStatistics.Builder result = super.serialize();
   OrcProto.StringStatistics.Builder str =
 OrcProto.StringStatistics.newBuilder();
   if (getNumberOfValues() != 0) {
 str.setMinimum(getMinimum());
 str.setMaximum(getMaximum());
 str.setSum(sum);
   }
   result.setStringStatistics(str);
   return result;
 }

 and thus shouldn't call down to setMinimum unless it had at least some 
 non-null values in the column.

 Do you have multiple threads working? There isn't anything that should be 
 introducing non-determinism so for the same input it would fail at the 
 same point.

 .. Owen




 On Tue, Sep 1, 2015 at 10:51 PM, David Capwell 
 wrote:

> We are writing ORC files in our application for Hive to consume.
> Given enough time, we have noticed that writing causes an NPE when
> working with a string column's stats. Not sure what's causing it on
> our side yet since replaying the same data is just fine; it seems more
> like this just happens over time (different data sources will hit this
> around the same time in the same JVM).
>
> Here is the code in question, and below is the exception:
>
> final Writer writer = OrcFile.createWriter(path,
>     OrcFile.writerOptions(conf).inspector(oi));
> try {
>   for (Data row : rows) {
>     List struct = Orc.struct(row, inspector);
>     writer.addRow(struct);
>   }
> } finally {
>   writer.close();
> }
>
>
> Here is the exception:
>
> java.lang.NullPointerException: null
> at
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
> ~[hive-exec-0.14.0.jar:0.14.0]

Re: ORC NPE while writing stats

2015-09-02 Thread David Capwell
Also, if I am walking this correctly

writer.addRow(struct) may trigger my current thread to flush all the
state for other writers running in different threads. This state
isn't updated under the same lock, so my thread won't see a consistent
view of it, which would explain the NPE. Another issue is that
estimateStripeSize won't always give the correct value, since my thread
is the one calling it...

With everything ThreadLocal, the only writers would be the ones in the
same thread, so should be better.


On Wed, Sep 2, 2015 at 9:47 PM, David Capwell  wrote:
> Walking the MemoryManager, and I have a few questions:
>
> # statements
>
> Every time you create a writer for a given thread (assuming the thread
> local version), you just update MemoryManager with the stripe size.
> The scale is just %heap / (#writer * stripe (assuming equal stripe
> size)).
>
> Periodically, ORC checks whether the estimated amount of data >
> stripe*scale. If so, it flushes the stripe right away. When the flush
> happens, it checks how close it is to the end of a block and
> scales the next stripe based on this.
>
> # question assuming statements are correct
>
> So, for me, I only have one writer per thread at any point in time, so
> if MM is partitioned based on thread, do I really care about the
> % set for the pool size? Since ORC appears to flush a stripe early,
> wouldn't it make sense to figure out how many concurrent writers I
> have and how much memory I want to allocate, then set the stripe size to
> this?
>
> So for 50 threads and a stripe size of 64 MB, 3,200 MB would be
> required? So, as long as I make sure the rest of my application gives
> enough room for ORC, then I can just leave the value as default so it
> just does stripe size...
>
> So, if I'm right, MM doesn't really do anything for me, so there's no
> issue with sharding it and not configuring it?
>
>
> Thanks for your time reading this email!
>
> On Wed, Sep 2, 2015 at 8:57 PM, David Capwell  wrote:
>> So, I very quickly looked at the JIRA and I had the following question:
>> if you have a pool per thread rather than a global one, then assuming 50%
>> of heap per pool will cause the writers to OOM with multiple threads, which
>> is different from the older (0.14) ORC, correct?
>>
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcConf.java#L83
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L94
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java#L226
>>
>> So with orc.memory.pool=0.5, this value only seems to make sense if
>> single threaded, so if you are writing with multiple threads, then I
>> assume the value should be (0.5 / #threads), so if 50 threads then
>> 0.01 should be the value?
>>
>> If this is true, I can't find any documentation about it; all the docs
>> make the setting sound global.
>>
>> On Wed, Sep 2, 2015 at 7:34 PM, David Capwell  wrote:
>>> Thanks for the jira, will see if that works for us.
>>>
>>> On Sep 2, 2015 7:11 PM, "Prasanth Jayachandran"
>>>  wrote:

 Memory manager is made thread local
 https://issues.apache.org/jira/browse/HIVE-10191

 Can you try the patch from HIVE-10191 and see if that helps?

 On Sep 2, 2015, at 8:58 PM, David Capwell  wrote:

 I'll try that out and see if it goes away (not seen this in the past 24
 hours, no code change).

 Doing this now means that I can't share the memory, so will prob go with a
 thread local and allocate fixed sizes to the pool per thread (50% heap / 50
 threads).  Will most likely be awhile before I can report back (unless it
 fails fast in testing)

 On Sep 2, 2015 2:11 PM, "Owen O'Malley"  wrote:
>
> (Dropping dev)
>
> Well, that explains the non-determinism, because the MemoryManager will
> be shared across threads and thus the stripes will get flushed at
> effectively random times.
>
> Can you try giving each writer a unique MemoryManager? You'll need to put
> a class into the org.apache.hadoop.hive.ql.io.orc package to get access to
> the necessary class (MemoryManager) and method
> (OrcFile.WriterOptions.memory). We may be missing a synchronization on the
> MemoryManager somewhere and thus be getting a race condition.
>
> Thanks,
>Owen
>
> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell 
> wrote:
>>
>> We have multiple threads writing, but each thread works on one file, so
>> orc writer is only touched by one thread (never cross threads)
>>
>> On Sep 2, 2015 11:18 AM, "Owen O'Malley"  wrote:
>>>
>>> I don't see how it would get there. That implies that minimum was null,
>>> but the count was non-zero.
>>>
>>> The 

Re: ORC NPE while writing stats

2015-09-02 Thread Owen O'Malley
I don't see how it would get there. That implies that minimum was null, but
the count was non-zero.

The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:

@Override
OrcProto.ColumnStatistics.Builder serialize() {
  OrcProto.ColumnStatistics.Builder result = super.serialize();
  OrcProto.StringStatistics.Builder str =
      OrcProto.StringStatistics.newBuilder();
  if (getNumberOfValues() != 0) {
    str.setMinimum(getMinimum());
    str.setMaximum(getMaximum());
    str.setSum(sum);
  }
  result.setStringStatistics(str);
  return result;
}

and thus shouldn't call down to setMinimum unless it had at least some
non-null values in the column.

Do you have multiple threads working? There isn't anything that should
be introducing non-determinism so for the same input it would fail at
the same point.

.. Owen




On Tue, Sep 1, 2015 at 10:51 PM, David Capwell  wrote:

> We are writing ORC files in our application for Hive to consume.
> Given enough time, we have noticed that writing causes an NPE when
> working with a string column's stats. Not sure what's causing it on
> our side yet since replaying the same data is just fine; it seems more
> like this just happens over time (different data sources will hit this
> around the same time in the same JVM).
>
> Here is the code in question, and below is the exception:
>
> final Writer writer = OrcFile.createWriter(path,
>     OrcFile.writerOptions(conf).inspector(oi));
> try {
>   for (Data row : rows) {
>     List struct = Orc.struct(row, inspector);
>     writer.addRow(struct);
>   }
> } finally {
>   writer.close();
> }
>
>
> Here is the exception:
>
> java.lang.NullPointerException: null
> at
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
> ~[hive-exec-0.14.0.jar:0.14.0]
> at
> org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276)
> ~[hive-exec-0.14.0.jar:
>
>
> Versions:
>
> Hadoop: apache 2.2.0
> Hive Apache: 0.14.0
> Java 1.7
>
>
> Thanks for your time reading this email.
>


Re: ORC NPE while writing stats

2015-09-02 Thread Owen O'Malley
(Dropping dev)

Well, that explains the non-determinism, because the MemoryManager will be
shared across threads and thus the stripes will get flushed at effectively
random times.

Can you try giving each writer a unique MemoryManager? You'll need to put a
class into the org.apache.hadoop.hive.ql.io.orc package to get access to
the necessary class (MemoryManager) and method
(OrcFile.WriterOptions.memory). We may be missing a synchronization on the
MemoryManager somewhere and thus be getting a race condition.
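
A minimal sketch of that workaround against hive-exec 0.14 (the wrapper
class name is ours; it assumes the MemoryManager(Configuration) constructor
and OrcFile.WriterOptions.memory(), which are only reachable from inside
that package):

package org.apache.hadoop.hive.ql.io.orc;  // needed for package-private access

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public final class PrivateMemoryOrcWriters {
  // One MemoryManager per writer: no flush bookkeeping is shared across
  // threads, at the cost of losing global memory coordination.
  public static Writer create(Path path, Configuration conf,
                              ObjectInspector inspector) throws IOException {
    return OrcFile.createWriter(path,
        OrcFile.writerOptions(conf)
            .inspector(inspector)
            .memory(new MemoryManager(conf)));
  }
}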

Thanks,
   Owen

On Wed, Sep 2, 2015 at 12:57 PM, David Capwell  wrote:

> We have multiple threads writing, but each thread works on one file, so
> orc writer is only touched by one thread (never cross threads)
> On Sep 2, 2015 11:18 AM, "Owen O'Malley"  wrote:
>
>> I don't see how it would get there. That implies that minimum was null,
>> but the count was non-zero.
>>
>> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
>>
>> @Override
>> OrcProto.ColumnStatistics.Builder serialize() {
>>   OrcProto.ColumnStatistics.Builder result = super.serialize();
>>   OrcProto.StringStatistics.Builder str =
>> OrcProto.StringStatistics.newBuilder();
>>   if (getNumberOfValues() != 0) {
>> str.setMinimum(getMinimum());
>> str.setMaximum(getMaximum());
>> str.setSum(sum);
>>   }
>>   result.setStringStatistics(str);
>>   return result;
>> }
>>
>> and thus shouldn't call down to setMinimum unless it had at least some 
>> non-null values in the column.
>>
>> Do you have multiple threads working? There isn't anything that should be 
>> introducing non-determinism so for the same input it would fail at the same 
>> point.
>>
>> .. Owen
>>
>>
>>
>>
>> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell 
>> wrote:
>>
>>> We are writing ORC files in our application for Hive to consume.
>>> Given enough time, we have noticed that writing causes an NPE when
>>> working with a string column's stats. Not sure what's causing it on
>>> our side yet since replaying the same data is just fine; it seems more
>>> like this just happens over time (different data sources will hit this
>>> around the same time in the same JVM).
>>>
>>> Here is the code in question, and below is the exception:
>>>
>>> final Writer writer = OrcFile.createWriter(path,
>>>     OrcFile.writerOptions(conf).inspector(oi));
>>> try {
>>>   for (Data row : rows) {
>>>     List struct = Orc.struct(row, inspector);
>>>     writer.addRow(struct);
>>>   }
>>> } finally {
>>>   writer.close();
>>> }
>>>
>>>
>>> Here is the exception:
>>>
>>> java.lang.NullPointerException: null
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
>>> ~[hive-exec-0.14.0.jar:0.14.0]
>>> at
>>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276)
>>> ~[hive-exec-0.14.0.jar:
>>>
>>>
>>> Versions:
>>>
>>> Hadoop: apache 2.2.0
>>> Hive Apache: 0.14.0
>>> Java 1.7
>>>
>>>
>>> Thanks for your time reading this email.
>>>
>>
>>


Re: Wrong results from join query in Hive 0.13 and also 1.0 with reproduce.

2015-09-02 Thread Ashutosh Chauhan
It indeed is. The title of the bug describes a symptom and doesn't
accurately describe the underlying problem. The bug will be triggered if the
following conditions are all met:

the query contains 3 or more joins
AND
the joins are merged (i.e., tables participating in two of those joins are
joined on the same keys)
AND
these merged joins are not consecutive in the query
AND
there is a filter, in the WHERE clause (not as a join condition), on one of
the tables that participated in a merged join;
then said filter will be dropped.

The query you posted meets all these criteria. You can avoid this bug by
rewriting your query so that it violates one of the requirements (listed
above) for triggering the bug.

Ashutosh


On Wed, Sep 2, 2015 at 10:19 AM, Jim Green  wrote:

> Hi Ashutosh,
>
> Is HIVE-10841 related? From the title of that JIRA, it says “where col is
> not null” caused the issue; however, the reproduction above does not have
> that clause.
>
>
>
> On Wed, Sep 2, 2015 at 2:24 AM, Ashutosh Chauhan 
> wrote:
>
>> https://issues.apache.org/jira/browse/HIVE-10841
>>
>> Thanks,
>> Ashutosh
>>
>> On Tue, Sep 1, 2015 at 6:00 PM, Jim Green  wrote:
>>
>>> It seems Hive 1.2 fixed this issue, but I'm not sure which JIRA is related
>>> or whether the fix can be backported to Hive 0.13.
>>>
>>>
>>> On Tue, Sep 1, 2015 at 5:35 PM, Jim Green  wrote:
>>>
 Hi Team,

 Below is a minimal reproduction of wrong results in Hive 0.13:

 *1. Create 4 tables*
 CREATE EXTERNAL TABLE testjoin1( joincol string );
 CREATE EXTERNAL TABLE testjoin2(
anothercol string ,
joincol string);

 CREATE EXTERNAL TABLE testjoin3( anothercol string);

 CREATE EXTERNAL TABLE testjoin4(
   joincol string,
   wherecol string ,
   wherecol2 string);

 *2. Insert sample data *
 (Note: Make sure you firstly create the dual table which only contains
 1 row)

 insert into table testjoin1 select '1' from dual;
 insert into table testjoin2 select 'another','1' from dual;
 insert into table testjoin3 select 'another' from dual;
 insert into table testjoin4 select '1','I_AM_MISSING','201501' from
 dual;
 insert into table testjoin4 select
 '1','I_Shouldnot_be_in_output','201501' from
 dual;

 hive> select * from testjoin1;
 OK
 1
 Time taken: 0.04 seconds, Fetched: 1 row(s)

 hive> select * from testjoin2;
 OK
 another   1
 Time taken: 0.039 seconds, Fetched: 1 row(s)

 hive> select * from testjoin3;
 OK
 another
 Time taken: 0.038 seconds, Fetched: 1 row(s)

 hive> select * from testjoin4;
 OK
 1   I_AM_MISSING   201501
 1   I_Shouldnot_be_in_output   201501
 Time taken: 0.04 seconds, Fetched: 2 row(s)

 *3. SQL1 is returning wrong results.*

 Select testjoin4.* From
 testjoin1
 JOIN testjoin2
   ON (testjoin2.joincol = testjoin1.joincol)
 JOIN testjoin3
   ON (testjoin3.anothercol= testjoin2.anothercol)
 JOIN testjoin4
   ON (testjoin4.joincol = testjoin1.joincol AND
 testjoin4.wherecol2='201501')
 WHERE (testjoin4.wherecol='I_AM_MISSING');

 1   I_AM_MISSING   201501
 1   I_Shouldnot_be_in_output   201501
 Time taken: 21.702 seconds, Fetched: 2 row(s)


 *4. SQL2 is returning good result(If we move the both filters to WHERE
 clause )*

 Select testjoin4.* From
 testjoin1
 JOIN testjoin2
   ON (testjoin2.joincol = testjoin1.joincol)
 JOIN testjoin3
   ON (testjoin3.anothercol= testjoin2.anothercol)
 JOIN testjoin4
   ON (testjoin4.joincol = testjoin1.joincol)
 WHERE (testjoin4.wherecol='I_AM_MISSING' and
 testjoin4.wherecol2='201501');

 1   I_AM_MISSING   201501
 Time taken: 20.393 seconds, Fetched: 1 row(s)
 —
 *Another test is done in Hive 1.0 and found both SQL1 and SQL2 are
 returning wrong results….*

 1 I_AM_MISSING 201501
 1 I_AM_MISSING 201501
 Time taken: 13.983 seconds, Fetched: 2 row(s)

 *Anybody knows any related JIRAs?*

 --
 Thanks,
 www.openkb.info
 (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)

>>>
>>>
>>>
>>> --
>>> Thanks,
>>> www.openkb.info
>>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>>>
>>
>>
>
>
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>


Request for write access to the Hive wiki

2015-09-02 Thread Aswathy C.S
Hi,

I would like to get write access to the Hive wiki. My Confluence username is
asreekumar.

thanks
Aswathy


Disabling local mode optimization

2015-09-02 Thread Daniel Haviv
Hi,
I would like to disable the optimization where a query that just selects
data runs without MapReduce (local mode).

hive.exec.mode.local.auto is set to false, but Hive still runs in local
mode for some queries.


How can I disable local mode completely?
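
For what it's worth, simple SELECTs that run with no job at all are usually
Hive's fetch-task conversion rather than local-mode execution; a hedged
sketch of the relevant settings (verify the supported values against your
Hive version):

set hive.fetch.task.conversion=none;   -- values: none / minimal / more
set hive.exec.mode.local.auto=false;   -- keep auto local mode off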


Thank you.

Daniel


Re: Wrong results from join query in Hive 0.13 and also 1.0 with reproduce.

2015-09-02 Thread Ashutosh Chauhan
https://issues.apache.org/jira/browse/HIVE-10841

Thanks,
Ashutosh

On Tue, Sep 1, 2015 at 6:00 PM, Jim Green  wrote:

> It seems Hive 1.2 fixed this issue, but I'm not sure which JIRA is related
> or whether the fix can be backported to Hive 0.13.
>
>
> On Tue, Sep 1, 2015 at 5:35 PM, Jim Green  wrote:
>
>> Hi Team,
>>
>> Below is a minimal reproduction of wrong results in Hive 0.13:
>>
>> *1. Create 4 tables*
>> CREATE EXTERNAL TABLE testjoin1( joincol string );
>> CREATE EXTERNAL TABLE testjoin2(
>>anothercol string ,
>>joincol string);
>>
>> CREATE EXTERNAL TABLE testjoin3( anothercol string);
>>
>> CREATE EXTERNAL TABLE testjoin4(
>>   joincol string,
>>   wherecol string ,
>>   wherecol2 string);
>>
>> *2. Insert sample data *
>> (Note: Make sure you firstly create the dual table which only contains 1
>> row)
>>
>> insert into table testjoin1 select '1' from dual;
>> insert into table testjoin2 select 'another','1' from dual;
>> insert into table testjoin3 select 'another' from dual;
>> insert into table testjoin4 select '1','I_AM_MISSING','201501' from dual;
>> insert into table testjoin4 select
>> '1','I_Shouldnot_be_in_output','201501' from
>> dual;
>>
>> hive> select * from testjoin1;
>> OK
>> 1
>> Time taken: 0.04 seconds, Fetched: 1 row(s)
>>
>> hive> select * from testjoin2;
>> OK
>> another   1
>> Time taken: 0.039 seconds, Fetched: 1 row(s)
>>
>> hive> select * from testjoin3;
>> OK
>> another
>> Time taken: 0.038 seconds, Fetched: 1 row(s)
>>
>> hive> select * from testjoin4;
>> OK
>> 1   I_AM_MISSING   201501
>> 1   I_Shouldnot_be_in_output   201501
>> Time taken: 0.04 seconds, Fetched: 2 row(s)
>>
>> *3. SQL1 is returning wrong results.*
>>
>> Select testjoin4.* From
>> testjoin1
>> JOIN testjoin2
>>   ON (testjoin2.joincol = testjoin1.joincol)
>> JOIN testjoin3
>>   ON (testjoin3.anothercol= testjoin2.anothercol)
>> JOIN testjoin4
>>   ON (testjoin4.joincol = testjoin1.joincol AND
>> testjoin4.wherecol2='201501')
>> WHERE (testjoin4.wherecol='I_AM_MISSING');
>>
>> 1   I_AM_MISSING   201501
>> 1   I_Shouldnot_be_in_output   201501
>> Time taken: 21.702 seconds, Fetched: 2 row(s)
>>
>>
>> *4. SQL2 is returning good result(If we move the both filters to WHERE
>> clause )*
>>
>> Select testjoin4.* From
>> testjoin1
>> JOIN testjoin2
>>   ON (testjoin2.joincol = testjoin1.joincol)
>> JOIN testjoin3
>>   ON (testjoin3.anothercol= testjoin2.anothercol)
>> JOIN testjoin4
>>   ON (testjoin4.joincol = testjoin1.joincol)
>> WHERE (testjoin4.wherecol='I_AM_MISSING' and
>> testjoin4.wherecol2='201501');
>>
>> 1   I_AM_MISSING   201501
>> Time taken: 20.393 seconds, Fetched: 1 row(s)
>> —
>> *Another test is done in Hive 1.0 and found both SQL1 and SQL2 are
>> returning wrong results….*
>>
>> 1 I_AM_MISSING 201501
>> 1 I_AM_MISSING 201501
>> Time taken: 13.983 seconds, Fetched: 2 row(s)
>>
>> *Anybody knows any related JIRAs?*
>>
>> --
>> Thanks,
>> www.openkb.info
>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>>
>
>
>
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>