Hive - Serializing Query Plans
Hi, From the documents and code I understand that, after semantic analysis, the QueryPlan.java object can be serialized to disk with Thrift via the toBinaryString() method. Now, if I want to execute the serialized query plan (say, on Tez), what should I do? By de-serializing the string I can get back the api.Query object, but for execution I need the QueryPlan.java object. How do I go from api.Query (Thrift-generated) to QueryPlan.java? Thanks Raajay
RE: can we add column type in where clause in a hive query?
The fact that you have other data in the column (like letters) implies that the column is stored as a string, so use a regex: SELECT CAST(mycol AS BIGINT) WHERE mycol RLIKE '^-?[0-9.]+$' From: Mohit Durgapal [mailto:durgapalmo...@gmail.com] Sent: Wednesday, September 02, 2015 5:09 AM To: user@hive.apache.org Subject: can we add column type in where clause in a hive query? I would like to query a hive table only for those rows that have column1 as integer only. Due to some data corruption, without this check I am getting a lot of junk data (a mix of integers and letters). I would like to get rid of that data by applying something like a "where column1 is INT" condition, but I couldn't find anything like that in hive. Could anyone suggest how I could do it?
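A quick, self-contained check of how the suggested pattern behaves (Hive's RLIKE uses Java regex semantics; the class name and sample values below are illustrative, not from the thread):

```java
import java.util.regex.Pattern;

public class RlikeIntCheck {
    // Same pattern as the suggested RLIKE filter; Hive's RLIKE uses Java regex syntax.
    static final Pattern NUMERIC = Pattern.compile("^-?[0-9.]+$");

    static boolean looksNumeric(String s) {
        return s != null && NUMERIC.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Rows that would survive the WHERE clause:
        System.out.println(looksNumeric("42"));    // true
        System.out.println(looksNumeric("-17"));   // true
        // Corrupt rows that would be filtered out:
        System.out.println(looksNumeric("12ab"));  // false
        System.out.println(looksNumeric(""));      // false
        // Caveat: the character class also admits strings like "1.2.3",
        // which CAST(... AS BIGINT) would turn into NULL rather than reject.
        System.out.println(looksNumeric("1.2.3")); // true
    }
}
```

Note the caveat in the comments: if you need strict integers only, `'^-?[0-9]+$'` (no dot) is the tighter pattern.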
Testing a HiveStoragePredicateHandler
I'm writing a HiveStoragePredicateHandler, and I'm trying to figure out the most appropriate way to write unit tests for the decomposePredicate method. I'm seeking advice on the best way to do this. The way I see it, there are two obvious approaches: 1. Write a query as a string, run it through the parser and optimizer, pass the result into decomposePredicate, and assert something about the return value. 2. Run an "explain" on a query and assert that the TableScan's "filterExpr" and the Filter Operator's predicate are what I expect. I don't know how to do (1), because I'm not sure which classes/methods to use to parse and optimize a query, or how to find this out. (2) sounds a lot easier, but I'm not sure it's the right approach, because I'm not completely sure how my HiveStoragePredicateHandler affects the explain output. Can anyone confirm whether (2) is the right or wrong way to test a HiveStoragePredicateHandler? Any advice on testing via method (1)? Many thanks in advance.
Re: ORC NPE while writing stats
Also, the data put in are primitives, structs (list), and arrays (list); we don't use any of the boxed writables (like text). On Sep 2, 2015 12:57 PM, "David Capwell" wrote: > We have multiple threads writing, but each thread works on one file, so > orc writer is only touched by one thread (never cross threads) > On Sep 2, 2015 11:18 AM, "Owen O'Malley" wrote: > >> I don't see how it would get there. That implies that minimum was null, >> but the count was non-zero. >> >> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like: >> >> @Override >> OrcProto.ColumnStatistics.Builder serialize() { >> OrcProto.ColumnStatistics.Builder result = super.serialize(); >> OrcProto.StringStatistics.Builder str = >> OrcProto.StringStatistics.newBuilder(); >> if (getNumberOfValues() != 0) { >> str.setMinimum(getMinimum()); >> str.setMaximum(getMaximum()); >> str.setSum(sum); >> } >> result.setStringStatistics(str); >> return result; >> } >> >> and thus shouldn't call down to setMinimum unless it had at least some >> non-null values in the column. >> >> Do you have multiple threads working? There isn't anything that should be >> introducing non-determinism so for the same input it would fail at the same >> point. >> >> .. Owen >> >> >> >> >> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell >> wrote: >> >>> We are writing ORC files in our application for hive to consume. >>> Given enough time, we have noticed that writing causes a NPE when >>> working with a string column's stats. Not sure whats causing it on >>> our side yet since replaying the same data is just fine, it seems more >>> like this just happens over time (different data sources will hit this >>> around the same time in the same JVM). 
>>> >>> Here is the code in question, and below is the exception: >>> >>> final Writer writer = OrcFile.createWriter(path, >>> OrcFile.writerOptions(conf).inspector(oi)); >>> try { >>> for (Data row : rows) { >>>List struct = Orc.struct(row, inspector); >>>writer.addRow(struct); >>> } >>> } finally { >>>writer.close(); >>> } >>> >>> >>> Here is the exception: >>> >>> java.lang.NullPointerException: null >>> at >>> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276) >>> ~[hive-exec-0.14.0.jar: >>> >>> >>> Versions: >>> >>> Hadoop: apache 2.2.0 >>> Hive Apache: 0.14.0 >>> Java 1.7 >>> >>> >>> 
Thanks for your time reading this email. >>> >> >>
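Owen's observation above (the NPE implies minimum was null while the count was non-zero) can be reproduced with a minimal stand-in. The classes below are hypothetical mocks, not Hive/ORC classes; they model only the protobuf-generated builder's rejection of null and the count-guarded serialize path:

```java
// Hypothetical stand-ins for OrcProto's builder and the string column stats;
// only the null-rejection behavior of protobuf setters is modeled.
public class StatsNpeSketch {
    static class StringStatsBuilder {
        String minimum;
        // Protobuf-generated setters throw NullPointerException on null input.
        StringStatsBuilder setMinimum(String value) {
            if (value == null) throw new NullPointerException();
            minimum = value;
            return this;
        }
    }

    static class StringStats {
        long count;      // number of non-null values seen
        String minimum;  // expected to be non-null whenever count > 0

        void serialize(StringStatsBuilder b) {
            // Mirrors the guard in ColumnStatisticsImpl$StringStatisticsImpl.serialize():
            // setMinimum is only reached when count != 0.
            if (count != 0) {
                b.setMinimum(minimum);
            }
        }
    }

    public static void main(String[] args) {
        StringStats stats = new StringStats();
        // An unsynchronized cross-thread reset could leave count > 0 while
        // minimum has been cleared - the inconsistent state that would make
        // serialize() NPE exactly where the reported stack trace does.
        stats.count = 10;
        stats.minimum = null;
        try {
            stats.serialize(new StringStatsBuilder());
        } catch (NullPointerException e) {
            System.out.println("NPE from setMinimum, as in the reported trace");
        }
    }
}
```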
Re: Wrong results from join query in Hive 0.13 and also 1.0 with reproduce.
Thanks Ashutosh. Actually for this kind of query, if I put the 2 filters in the WHERE clause instead of the ON clause, the query result is correct. Do you suggest we put all filters into the WHERE or the ON clause? And why? On Wed, Sep 2, 2015 at 3:13 PM, Ashutosh Chauhan wrote: > It indeed is. The title of the bug is a symptom of the problem and > doesn't accurately describe it. The bug will be triggered if the following > conditions are met: > > the query contains 3 or more joins > AND > joins are merged (i.e. tables participating in two of those joins are > joined on the same keys) > AND > these merged joins are not consecutive in the query > AND > there is a filter, in the WHERE clause (not as a join condition), on one of > the tables that participated in a merged join > then said filter will be dropped. > > The query you posted meets all these criteria. You can avoid this bug if you > rewrite your query such that it violates one of the requirements (listed > above) to trigger the bug. > > Ashutosh > > > On Wed, Sep 2, 2015 at 10:19 AM, Jim Green wrote: > >> Hi Ashutosh, >> >> Is HIVE-10841 related? From the title of that jira, it says “where col is >> not null” caused the issue; however, the reproduce above does not have that clause. >> >> >> >> On Wed, Sep 2, 2015 at 2:24 AM, Ashutosh Chauhan >> wrote: >> >>> https://issues.apache.org/jira/browse/HIVE-10841 >>> >>> Thanks, >>> Ashutosh >>> >>> On Tue, Sep 1, 2015 at 6:00 PM, Jim Green wrote: Seems Hive 1.2 fixed this issue. But not sure which JIRA is related, and is it possible to backport this fix into Hive 0.13? On Tue, Sep 1, 2015 at 5:35 PM, Jim Green wrote: > Hi Team, > > Below is the minimum reproduce of wrong results in Hive 0.13: > > *1. Create 4 tables* > CREATE EXTERNAL TABLE testjoin1( joincol string ); > CREATE EXTERNAL TABLE testjoin2( > anothercol string , > joincol string); > > CREATE EXTERNAL TABLE testjoin3( anothercol string); > > CREATE EXTERNAL TABLE testjoin4( > joincol string, > wherecol string , > wherecol2 string); > > *2. 
Insert sample data * > (Note: Make sure you first create the dual table, which contains only > 1 row) > > insert into table testjoin1 select '1' from dual; > insert into table testjoin2 select 'another','1' from dual; > insert into table testjoin3 select 'another' from dual; > insert into table testjoin4 select '1','I_AM_MISSING','201501' from > dual; > insert into table testjoin4 select > '1','I_Shouldnot_be_in_output','201501' from > dual; > > hive> select * from testjoin1; > OK > 1 > Time taken: 0.04 seconds, Fetched: 1 row(s) > > hive> select * from testjoin2; > OK > another  1 > Time taken: 0.039 seconds, Fetched: 1 row(s) > > hive> select * from testjoin3; > OK > another > Time taken: 0.038 seconds, Fetched: 1 row(s) > > hive> select * from testjoin4; > OK > 1  I_AM_MISSING  201501 > 1  I_Shouldnot_be_in_output  201501 > Time taken: 0.04 seconds, Fetched: 2 row(s) > > *3. SQL1 is returning wrong results.* > > Select testjoin4.* From > testjoin1 > JOIN testjoin2 > ON (testjoin2.joincol = testjoin1.joincol) > JOIN testjoin3 > ON (testjoin3.anothercol = testjoin2.anothercol) > JOIN testjoin4 > ON (testjoin4.joincol = testjoin1.joincol AND > testjoin4.wherecol2='201501') > WHERE (testjoin4.wherecol='I_AM_MISSING'); > > 1  I_AM_MISSING  201501 > 1  I_Shouldnot_be_in_output  201501 > Time taken: 21.702 seconds, Fetched: 2 row(s) > > > *4. 
SQL2 is returning a good result (if we move both filters to the WHERE > clause)* > > Select testjoin4.* From > testjoin1 > JOIN testjoin2 > ON (testjoin2.joincol = testjoin1.joincol) > JOIN testjoin3 > ON (testjoin3.anothercol = testjoin2.anothercol) > JOIN testjoin4 > ON (testjoin4.joincol = testjoin1.joincol) > WHERE (testjoin4.wherecol='I_AM_MISSING' and > testjoin4.wherecol2='201501'); > > 1  I_AM_MISSING  201501 > Time taken: 20.393 seconds, Fetched: 1 row(s) > — > *Another test was done in Hive 1.0, and both SQL1 and SQL2 are > returning wrong results….* > > 1 I_AM_MISSING 201501 > 1 I_AM_MISSING 201501 > Time taken: 13.983 seconds, Fetched: 2 row(s) > > *Anybody know any related JIRAs?* > > -- > Thanks, > www.openkb.info > (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
Re: ORC NPE while writing stats
Memory manager is made thread local: https://issues.apache.org/jira/browse/HIVE-10191 Can you try the patch from HIVE-10191 and see if that helps?
Re: ORC NPE while writing stats
So, I very quickly looked at the JIRA and had the following question: if you have a pool per thread rather than a global one, then assuming 50% of heap will cause the writer to OOM with multiple threads, which is different from older (0.14) ORC, correct? https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcConf.java#L83 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L94 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java#L226 So with orc.memory.pool=0.5, this value only seems to make sense if single-threaded; if you are writing with multiple threads, then I assume the value should be (0.5 / #threads), so with 50 threads 0.01 should be the value? If this is true, I can't find any documentation about it; all docs make it sound global.
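The per-thread pool arithmetic in the question above can be written out as a sketch (the 0.5 fraction and 50 threads are the email's assumed figures, not measured values):

```java
public class PoolFractionSketch {
    public static void main(String[] args) {
        // Assumed figures from the question above.
        double memoryPool = 0.5;   // orc.memory.pool: fraction of heap given to ORC
        int writerThreads = 50;    // one ORC writer per thread

        // If the pool fraction is applied per thread instead of globally,
        // the setting must be divided by the number of writer threads to
        // keep total usage at the intended 50% of heap.
        double perThreadFraction = memoryPool / writerThreads;
        System.out.println(perThreadFraction); // 0.01
    }
}
```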
Re: ORC NPE while writing stats
Walking the MemoryManager, I have a few questions. # statements Every time you create a writer for a given thread (assuming the thread-local version), you just update the MemoryManager with the stripe size. The scale is just %heap / (#writers * stripe) (assuming equal stripe sizes). Periodically ORC checks whether the estimated amount of data > stripe*scale; if so, it flushes the stripe right away. When the flush happens, it checks how close it is to the end of a block and scales the next stripe based on this. # questions, assuming the statements are correct So, for me, I only have one writer per thread at any point in time, so if the MM is partitioned by thread, do I really care about the % set for the pool size? Since ORC appears to flush a stripe early, wouldn't it make sense to figure out how many concurrent writers I have and how much memory I want to allocate, then set the stripe size from that? So for 50 threads and a stripe size of 64mb, 3,200mb would be required? As long as I make sure the rest of my application gives enough room for ORC, I can just leave the value as default so it just does stripe size... So, if that's right, the MM doesn't really do anything for me, so there is no issue sharding it and not configuring it? Thanks for your time reading this email!
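The stripe-budget arithmetic above (50 writers, 64 MB stripes) checks out; as a sketch, with the figures taken as the email's assumptions:

```java
public class StripeBudgetSketch {
    public static void main(String[] args) {
        // Assumed figures from the question above.
        int threads = 50;              // one ORC writer per thread
        long stripeBytes = 64L << 20;  // 64 MB stripe size

        // If each writer can buffer up to one full stripe before flushing,
        // the worst-case ORC buffer footprint is one stripe per writer:
        long totalBytes = threads * stripeBytes;
        System.out.println(totalBytes / (1 << 20) + " MB"); // 3200 MB
    }
}
```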
Re: ORC NPE while writing stats
I'll try that out and see if it goes away (not seen this in the past 24 hours, no code change). Doing this now means that I can't share the memory, so will prob go with a thread local and allocate fixed sizes to the pool per thread (50% heap / 50 threads). Will most likely be awhile before I can report back (unless it fails fast in testing) On Sep 2, 2015 2:11 PM, "Owen O'Malley" wrote: > (Dropping dev) > > Well, that explains the non-determinism, because the MemoryManager will be > shared across threads and thus the stripes will get flushed at effectively > random times. > > Can you try giving each writer a unique MemoryManager? You'll need to put > a class into the org.apache.hadoop.hive.ql.io.orc package to get access to > the necessary class (MemoryManager) and method > (OrcFile.WriterOptions.memory). We may be missing a synchronization on the > MemoryManager somewhere and thus be getting a race condition. > > Thanks, > Owen
Re: ORC NPE while writing stats
Thanks for the jira, will see if that works for us.
Re: ORC NPE while writing stats
Also, if I am walking this correctly, writer.addRow(struct) may trigger my current thread to flush all the state for other writers running in different threads. This state isn't updated under the same lock, so my thread won't see the same state, which would explain the NPE. Another issue is that estimateStripeSize won't always give the correct value, since my thread is the one calling it... With everything ThreadLocal, the only writers would be the ones in the same thread, so it should be better.

On Wed, Sep 2, 2015 at 9:47 PM, David Capwell wrote:
> Walking the MemoryManager, I have a few questions:
>
> # statements
>
> Every time you create a writer for a given thread (assuming the
> thread-local version), you just update MemoryManager with the stripe
> size. The scale is just %heap / (#writers * stripe size) (assuming
> equal stripe sizes).
>
> Periodically, ORC checks whether the estimated amount of data exceeds
> stripe * scale. If so, it flushes the stripe right away. When the flush
> happens, it checks how close it is to the end of a block and scales the
> next stripe based off this.
>
> # questions, assuming the statements above are correct
>
> For me, I only have one writer per thread at any point in time, so if
> MM is partitioned based off thread, do I really care about the % set
> for the pool size? Since ORC appears to flush a stripe early, wouldn't
> it make sense to figure out how many concurrent writers I have and how
> much memory I want to allocate, then set the stripe size from that?
>
> So for 50 threads and a stripe size of 64mb, 3,200mb would be required?
> As long as I make sure the rest of my application gives enough room for
> ORC, I can just leave the value as default so it just uses the stripe
> size...
>
> So, if that's right, MM doesn't really do anything for me, so there's
> no issue sharding it and not configuring it?
>
> Thanks for your time reading this email!
> On Wed, Sep 2, 2015 at 8:57 PM, David Capwell wrote:
>> So, very quickly looked at the JIRA and I had the following question;
>> if you have a pool per thread rather than global, then assuming 50%
>> heap will cause the writer to OOM with multiple threads, which is
>> different than older (0.14) ORC, correct?
>>
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcConf.java#L83
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java#L94
>> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java#L226
>>
>> So with orc.memory.pool=0.5, this value only seems to make sense if
>> single-threaded, so if you are writing with multiple threads, then I
>> assume the value should be (0.5 / #threads), so if 50 threads then
>> 0.01 should be the value?
>>
>> If this is true, I can't find any documentation about this; all docs
>> make it sound global.
>>
>> On Wed, Sep 2, 2015 at 7:34 PM, David Capwell wrote:
>>> Thanks for the jira, will see if that works for us.
>>>
>>> On Sep 2, 2015 7:11 PM, "Prasanth Jayachandran" wrote:
>>>> Memory manager is made thread local:
>>>> https://issues.apache.org/jira/browse/HIVE-10191
>>>> Can you try the patch from HIVE-10191 and see if that helps?
>>>>
>>>> On Sep 2, 2015, at 8:58 PM, David Capwell wrote:
>>>>> I'll try that out and see if it goes away (not seen this in the past
>>>>> 24 hours, no code change). Doing this now means that I can't share
>>>>> the memory, so will prob go with a thread local and allocate fixed
>>>>> sizes to the pool per thread (50% heap / 50 threads). Will most
>>>>> likely be awhile before I can report back (unless it fails fast in
>>>>> testing)
>>>>>
>>>>> On Sep 2, 2015 2:11 PM, "Owen O'Malley" wrote:
>>>>>> (Dropping dev)
>>>>>>
>>>>>> Well, that explains the non-determinism, because the MemoryManager
>>>>>> will be shared across threads and thus the stripes will get flushed
>>>>>> at effectively random times.
>>>>>>
>>>>>> Can you try giving each writer a unique MemoryManager? You'll need
>>>>>> to put a class into the org.apache.hadoop.hive.ql.io.orc package to
>>>>>> get access to the necessary class (MemoryManager) and method
>>>>>> (OrcFile.WriterOptions.memory). We may be missing a synchronization
>>>>>> on the MemoryManager somewhere and thus be getting a race condition.
>>>>>>
>>>>>> Thanks,
>>>>>>    Owen
>>>>>>
>>>>>> On Wed, Sep 2, 2015 at 12:57 PM, David Capwell wrote:
>>>>>>> We have multiple threads writing, but each thread works on one
>>>>>>> file, so orc writer is only touched by one thread (never cross
>>>>>>> threads)
>>>>>>>
>>>>>>> On Sep 2, 2015 11:18 AM, "Owen O'Malley" wrote:
>>>>>>>> I don't see how it would get there. That implies that minimum was
>>>>>>>> null, but the count was non-zero.
>>>>>>>>
>>>>>>>> The
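The back-of-envelope numbers discussed in this thread (50 threads at a 64 MB stripe needing 3,200 MB, and splitting orc.memory.pool evenly across threads) work out as plain arithmetic. This is an illustrative sketch by the editor, not Hive code:

```java
// Illustrative arithmetic only (not a Hive API): estimates the heap
// headroom ORC needs when each thread owns exactly one writer, following
// the reasoning in the thread above.
public class OrcMemoryEstimate {
    // Bytes needed if every concurrent writer may buffer a full stripe.
    static long requiredBytes(int concurrentWriters, long stripeSizeBytes) {
        return concurrentWriters * stripeSizeBytes;
    }

    // Per-thread pool fraction if orc.memory.pool were divided evenly,
    // as speculated in the thread (0.5 of the heap across N threads).
    static double perThreadPoolFraction(double globalFraction, int threads) {
        return globalFraction / threads;
    }

    public static void main(String[] args) {
        long needed = requiredBytes(50, 64L * 1024 * 1024);
        System.out.println(needed / (1024 * 1024) + " MB");     // 3200 MB
        System.out.println(perThreadPoolFraction(0.5, 50));     // 0.01
    }
}
```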
Re: ORC NPE while writing stats
I don't see how it would get there. That implies that minimum was null, but the count was non-zero.

The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:

@Override
OrcProto.ColumnStatistics.Builder serialize() {
  OrcProto.ColumnStatistics.Builder result = super.serialize();
  OrcProto.StringStatistics.Builder str =
      OrcProto.StringStatistics.newBuilder();
  if (getNumberOfValues() != 0) {
    str.setMinimum(getMinimum());
    str.setMaximum(getMaximum());
    str.setSum(sum);
  }
  result.setStringStatistics(str);
  return result;
}

and thus shouldn't call down to setMinimum unless it had at least some non-null values in the column.

Do you have multiple threads working? There isn't anything that should be introducing non-determinism, so for the same input it would fail at the same point.

.. Owen

On Tue, Sep 1, 2015 at 10:51 PM, David Capwell wrote:
> We are writing ORC files in our application for hive to consume.
> Given enough time, we have noticed that writing causes a NPE when
> working with a string column's stats. Not sure whats causing it on
> our side yet since replaying the same data is just fine, it seems more
> like this just happens over time (different data sources will hit this
> around the same time in the same JVM).
> > Here is the code in question, and below is the exception: > > final Writer writer = OrcFile.createWriter(path, > OrcFile.writerOptions(conf).inspector(oi)); > try { > for (Data row : rows) { >List struct = Orc.struct(row, inspector); >writer.addRow(struct); > } > } finally { >writer.close(); > } > > > Here is the exception: > > java.lang.NullPointerException: null > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) > ~[hive-exec-0.14.0.jar:0.14.0] > at > org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276) > ~[hive-exec-0.14.0.jar: > > > Versions: > > Hadoop: apache 2.2.0 > Hive Apache: 0.14.0 > Java 1.7 > > > Thanks for your time reading this email. >
Re: ORC NPE while writing stats
(Dropping dev)

Well, that explains the non-determinism, because the MemoryManager will be shared across threads and thus the stripes will get flushed at effectively random times.

Can you try giving each writer a unique MemoryManager? You'll need to put a class into the org.apache.hadoop.hive.ql.io.orc package to get access to the necessary class (MemoryManager) and method (OrcFile.WriterOptions.memory). We may be missing a synchronization on the MemoryManager somewhere and thus be getting a race condition.

Thanks,
   Owen

On Wed, Sep 2, 2015 at 12:57 PM, David Capwell wrote:
> We have multiple threads writing, but each thread works on one file, so
> orc writer is only touched by one thread (never cross threads)
>
> On Sep 2, 2015 11:18 AM, "Owen O'Malley" wrote:
>
>> I don't see how it would get there. That implies that minimum was null,
>> but the count was non-zero.
>>
>> The ColumnStatisticsImpl$StringStatisticsImpl.serialize looks like:
>>
>> @Override
>> OrcProto.ColumnStatistics.Builder serialize() {
>>   OrcProto.ColumnStatistics.Builder result = super.serialize();
>>   OrcProto.StringStatistics.Builder str =
>>       OrcProto.StringStatistics.newBuilder();
>>   if (getNumberOfValues() != 0) {
>>     str.setMinimum(getMinimum());
>>     str.setMaximum(getMaximum());
>>     str.setSum(sum);
>>   }
>>   result.setStringStatistics(str);
>>   return result;
>> }
>>
>> and thus shouldn't call down to setMinimum unless it had at least some
>> non-null values in the column.
>>
>> Do you have multiple threads working? There isn't anything that should be
>> introducing non-determinism so for the same input it would fail at the same
>> point.
>>
>> .. Owen
>>
>> On Tue, Sep 1, 2015 at 10:51 PM, David Capwell
>> wrote:
>>
>>> We are writing ORC files in our application for hive to consume.
>>> Given enough time, we have noticed that writing causes a NPE when
>>> working with a string column's stats.
Not sure whats causing it on >>> our side yet since replaying the same data is just fine, it seems more >>> like this just happens over time (different data sources will hit this >>> around the same time in the same JVM). >>> >>> Here is the code in question, and below is the exception: >>> >>> final Writer writer = OrcFile.createWriter(path, >>> OrcFile.writerOptions(conf).inspector(oi)); >>> try { >>> for (Data row : rows) { >>>List struct = Orc.struct(row, inspector); >>>writer.addRow(struct); >>> } >>> } finally { >>>writer.close(); >>> } >>> >>> >>> Here is the exception: >>> >>> java.lang.NullPointerException: null >>> at >>> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$Builder.setMinimum(OrcProto.java:1803) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.ColumnStatisticsImpl$StringStatisticsImpl.serialize(ColumnStatisticsImpl.java:411) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.createRowIndexEntry(WriterImpl.java:1255) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.createRowIndexEntry(WriterImpl.java:775) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.createRowIndexEntry(WriterImpl.java:1978) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1985) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:322) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168) >>> ~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157) >>> 
~[hive-exec-0.14.0.jar:0.14.0] >>> at >>> org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2276) >>> ~[hive-exec-0.14.0.jar: >>> >>> >>> Versions: >>> >>> Hadoop: apache 2.2.0 >>> Hive Apache: 0.14.0 >>> Java 1.7 >>> >>> >>> Thanks for your time reading this email. >>> >> >>
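The thread-local fix referenced above (HIVE-10191) boils down to giving each thread its own memory bookkeeping so flush decisions never depend on state written by other threads. A minimal standalone sketch of that pattern, using a stand-in MemoryTracker class rather than Hive's actual MemoryManager:

```java
// Sketch of the ThreadLocal approach from HIVE-10191. MemoryTracker is a
// stand-in for Hive's MemoryManager bookkeeping, not the real class: the
// point is only that per-thread state removes cross-thread visibility
// and race issues.
public class ThreadLocalMemoryDemo {
    static final class MemoryTracker {
        long totalAllocation = 0;
        void addWriter(long stripeSizeBytes) { totalAllocation += stripeSizeBytes; }
    }

    // One tracker per thread, mirroring a memory manager per thread.
    static final ThreadLocal<MemoryTracker> TRACKER =
        ThreadLocal.withInitial(MemoryTracker::new);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            TRACKER.get().addWriter(64L * 1024 * 1024);
            // Each thread sees only its own writer's allocation (67108864),
            // never another thread's.
            System.out.println(Thread.currentThread().getName() + ": "
                + TRACKER.get().totalAllocation);
        };
        Thread a = new Thread(task, "writer-a");
        Thread b = new Thread(task, "writer-b");
        a.start(); b.start();
        a.join(); b.join();
    }
}
```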
Re: Wrong results from join query in Hive 0.13 and also 1.0 with reproduce.
It indeed is. The title of the bug is a symptom of the problem and doesn't accurately describe it. The bug will be triggered if the following conditions are met: the query contains 3 or more joins, AND joins are merged (i.e. tables participating in two of those joins are joined on the same keys), AND these merged joins are not consecutive in the query, AND there is a filter on one of the tables that participated in a merged join which is in the WHERE clause (not as a join condition). Then said filter will be dropped. The query you posted meets all these criteria. You can avoid this bug if you rewrite your query such that it violates one of the requirements (listed above) to trigger the bug.

Ashutosh

On Wed, Sep 2, 2015 at 10:19 AM, Jim Green wrote:
> Hi Ashutosh,
>
> Is HIVE-10841 related? From the title of that jira, it says "where col is
> not null" caused the issue; however, the above reproduce did not have that
> clause.
>
> On Wed, Sep 2, 2015 at 2:24 AM, Ashutosh Chauhan wrote:
>
>> https://issues.apache.org/jira/browse/HIVE-10841
>>
>> Thanks,
>> Ashutosh
>>
>> On Tue, Sep 1, 2015 at 6:00 PM, Jim Green wrote:
>>
>>> Seems Hive 1.2 fixed this issue. But not sure what is the JIRA related
>>> and the possibility to backport this fix into Hive 0.13?
>>>
>>> On Tue, Sep 1, 2015 at 5:35 PM, Jim Green wrote:
>>>> Hi Team,
>>>>
>>>> Below is the minimum reproduce of wrong results in Hive 0.13:
>>>>
>>>> *1. Create 4 tables*
>>>> CREATE EXTERNAL TABLE testjoin1( joincol string );
>>>> CREATE EXTERNAL TABLE testjoin2(
>>>>   anothercol string,
>>>>   joincol string);
>>>>
>>>> CREATE EXTERNAL TABLE testjoin3( anothercol string);
>>>>
>>>> CREATE EXTERNAL TABLE testjoin4(
>>>>   joincol string,
>>>>   wherecol string,
>>>>   wherecol2 string);
>>>>
>>>> *2. Insert sample data*
>>>> (Note: Make sure you first create the dual table, which contains only
>>>> 1 row)
>>>>
>>>> insert into table testjoin1 select '1' from dual;
>>>> insert into table testjoin2 select 'another','1' from dual;
>>>> insert into table testjoin3 select 'another' from dual;
>>>> insert into table testjoin4 select '1','I_AM_MISSING','201501' from dual;
>>>> insert into table testjoin4 select '1','I_Shouldnot_be_in_output','201501' from dual;
>>>>
>>>> hive> select * from testjoin1;
>>>> OK
>>>> 1
>>>> Time taken: 0.04 seconds, Fetched: 1 row(s)
>>>>
>>>> hive> select * from testjoin2;
>>>> OK
>>>> another  1
>>>> Time taken: 0.039 seconds, Fetched: 1 row(s)
>>>>
>>>> hive> select * from testjoin3;
>>>> OK
>>>> another
>>>> Time taken: 0.038 seconds, Fetched: 1 row(s)
>>>>
>>>> hive> select * from testjoin4;
>>>> OK
>>>> 1  I_AM_MISSING  201501
>>>> 1  I_Shouldnot_be_in_output  201501
>>>> Time taken: 0.04 seconds, Fetched: 2 row(s)
>>>>
>>>> *3. SQL1 is returning wrong results.*
>>>>
>>>> Select testjoin4.* From
>>>> testjoin1
>>>> JOIN testjoin2
>>>> ON (testjoin2.joincol = testjoin1.joincol)
>>>> JOIN testjoin3
>>>> ON (testjoin3.anothercol = testjoin2.anothercol)
>>>> JOIN testjoin4
>>>> ON (testjoin4.joincol = testjoin1.joincol AND
>>>> testjoin4.wherecol2='201501')
>>>> WHERE (testjoin4.wherecol='I_AM_MISSING');
>>>>
>>>> 1  I_AM_MISSING  201501
>>>> 1  I_Shouldnot_be_in_output  201501
>>>> Time taken: 21.702 seconds, Fetched: 2 row(s)
>>>>
>>>> *4. SQL2 is returning a good result (if we move both filters to the
>>>> WHERE clause)*
>>>>
>>>> Select testjoin4.* From
>>>> testjoin1
>>>> JOIN testjoin2
>>>> ON (testjoin2.joincol = testjoin1.joincol)
>>>> JOIN testjoin3
>>>> ON (testjoin3.anothercol = testjoin2.anothercol)
>>>> JOIN testjoin4
>>>> ON (testjoin4.joincol = testjoin1.joincol)
>>>> WHERE (testjoin4.wherecol='I_AM_MISSING' and
>>>> testjoin4.wherecol2='201501');
>>>>
>>>> 1  I_AM_MISSING  201501
>>>> Time taken: 20.393 seconds, Fetched: 1 row(s)
>>>> —
>>>> *Another test is done in Hive 1.0 and found both SQL1 and SQL2 are
>>>> returning wrong results….*
>>>>
>>>> 1 I_AM_MISSING 201501
>>>> 1 I_AM_MISSING 201501
>>>> Time taken: 13.983 seconds, Fetched: 2 row(s)
>>>>
>>>> *Anybody knows any related JIRAs?*
>>>>
>>>> --
>>>> Thanks,
>>>> www.openkb.info
>>>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
Request for write access to the Hive wiki
Hi, I would like to get write access to the Hive wiki. My Confluence username is asreekumar. Thanks, Aswathy
Disabling local mode optimization
Hi, I would like to disable the optimization where a query that just selects data runs without MapReduce (local mode). hive.exec.mode.local.auto is set to false, but Hive still runs some queries in local mode. How can I disable local mode completely? Thank you. Daniel
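One likely explanation for the behavior described above, added here as an editor's note: simple SELECT queries can be converted to a local fetch task, which is governed by hive.fetch.task.conversion rather than hive.exec.mode.local.auto. A hedged config sketch; valid values vary by Hive version ('minimal'/'more', with 'none' added in later releases), so verify against your version:

```sql
-- Hedged suggestion: plain SELECTs may run as a fetch task without
-- MapReduce even when hive.exec.mode.local.auto=false. Disabling
-- fetch-task conversion forces those queries through MapReduce.
SET hive.fetch.task.conversion=none;
```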
Re: Wrong results from join query in Hive 0.13 and also 1.0 with reproduce.
https://issues.apache.org/jira/browse/HIVE-10841

Thanks,
Ashutosh

On Tue, Sep 1, 2015 at 6:00 PM, Jim Green wrote:
> Seems Hive 1.2 fixed this issue. But not sure what is the JIRA related and
> the possibility to backport this fix into Hive 0.13?
>
> On Tue, Sep 1, 2015 at 5:35 PM, Jim Green wrote:
>
>> Hi Team,
>>
>> Below is the minimum reproduce of wrong results in Hive 0.13:
>>
>> *1. Create 4 tables*
>> CREATE EXTERNAL TABLE testjoin1( joincol string );
>> CREATE EXTERNAL TABLE testjoin2(
>>   anothercol string,
>>   joincol string);
>>
>> CREATE EXTERNAL TABLE testjoin3( anothercol string);
>>
>> CREATE EXTERNAL TABLE testjoin4(
>>   joincol string,
>>   wherecol string,
>>   wherecol2 string);
>>
>> *2. Insert sample data*
>> (Note: Make sure you first create the dual table, which contains only 1
>> row)
>>
>> insert into table testjoin1 select '1' from dual;
>> insert into table testjoin2 select 'another','1' from dual;
>> insert into table testjoin3 select 'another' from dual;
>> insert into table testjoin4 select '1','I_AM_MISSING','201501' from dual;
>> insert into table testjoin4 select
>> '1','I_Shouldnot_be_in_output','201501' from dual;
>>
>> hive> select * from testjoin1;
>> OK
>> 1
>> Time taken: 0.04 seconds, Fetched: 1 row(s)
>>
>> hive> select * from testjoin2;
>> OK
>> another  1
>> Time taken: 0.039 seconds, Fetched: 1 row(s)
>>
>> hive> select * from testjoin3;
>> OK
>> another
>> Time taken: 0.038 seconds, Fetched: 1 row(s)
>>
>> hive> select * from testjoin4;
>> OK
>> 1  I_AM_MISSING  201501
>> 1  I_Shouldnot_be_in_output  201501
>> Time taken: 0.04 seconds, Fetched: 2 row(s)
>>
>> *3. SQL1 is returning wrong results.*
>>
>> Select testjoin4.* From
>> testjoin1
>> JOIN testjoin2
>> ON (testjoin2.joincol = testjoin1.joincol)
>> JOIN testjoin3
>> ON (testjoin3.anothercol = testjoin2.anothercol)
>> JOIN testjoin4
>> ON (testjoin4.joincol = testjoin1.joincol AND
>> testjoin4.wherecol2='201501')
>> WHERE (testjoin4.wherecol='I_AM_MISSING');
>>
>> 1  I_AM_MISSING  201501
>> 1  I_Shouldnot_be_in_output  201501
>> Time taken: 21.702 seconds, Fetched: 2 row(s)
>>
>> *4. SQL2 is returning a good result (if we move both filters to the WHERE
>> clause)*
>>
>> Select testjoin4.* From
>> testjoin1
>> JOIN testjoin2
>> ON (testjoin2.joincol = testjoin1.joincol)
>> JOIN testjoin3
>> ON (testjoin3.anothercol = testjoin2.anothercol)
>> JOIN testjoin4
>> ON (testjoin4.joincol = testjoin1.joincol)
>> WHERE (testjoin4.wherecol='I_AM_MISSING' and
>> testjoin4.wherecol2='201501');
>>
>> 1  I_AM_MISSING  201501
>> Time taken: 20.393 seconds, Fetched: 1 row(s)
>> —
>> *Another test is done in Hive 1.0 and found both SQL1 and SQL2 are
>> returning wrong results….*
>>
>> 1 I_AM_MISSING 201501
>> 1 I_AM_MISSING 201501
>> Time taken: 13.983 seconds, Fetched: 2 row(s)
>>
>> *Anybody knows any related JIRAs?*
>>
>> --
>> Thanks,
>> www.openkb.info
>> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
>
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)