Re: Review Request 33806: Add Tree traversal tools to ParseUtil class that allow for checking node structures with general predicate

2015-05-18 Thread Sergio Pena

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33806/#review84156
---

Ship it!


Ship It!

- Sergio Pena


On May 11, 2015, 6:38 p.m., Reuben Kuhnert wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33806/
 ---
 
 (Updated May 11, 2015, 6:38 p.m.)
 
 
 Review request for hive, Gopal V, John Pullokkaran, and Sergio Pena.
 
 
 Bugs: HIVE-10190
 https://issues.apache.org/jira/browse/HIVE-10190
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 HIVE-10190: CBO: AST mode checks for TABLESAMPLE with 
 AST.toString().contains(TOK_TABLESPLITSAMPLE)
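 The idea of the change, replacing the brittle substring check with a predicate-driven tree walk, can be sketched roughly as below. This is illustrative Java only; `Node`, `contains`, and the field names are made up and are not Hive's actual ParseUtils/HiveCalciteUtil API.

```java
// Hypothetical sketch (names are illustrative, not Hive's actual
// ParseUtils/HiveCalciteUtil API): walk the AST with a general predicate
// and match token types exactly, instead of the fragile
// AST.toString().contains("TOK_TABLESPLITSAMPLE") substring check.
import java.util.List;
import java.util.function.Predicate;

public class TreeCheck {
    // Minimal stand-in for an AST node.
    static class Node {
        final String token;
        final List<Node> children;
        Node(String token, List<Node> children) {
            this.token = token;
            this.children = children;
        }
    }

    // True if any node in the tree satisfies the predicate.
    static boolean contains(Node node, Predicate<Node> pred) {
        if (node == null) return false;
        if (pred.test(node)) return true;
        for (Node child : node.children) {
            if (contains(child, pred)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Node sample = new Node("TOK_TABLESPLITSAMPLE", List.of());
        Node root = new Node("TOK_QUERY", List.of(new Node("TOK_FROM", List.of(sample))));
        // Exact token match rather than a substring search over toString().
        System.out.println(contains(root, n -> "TOK_TABLESPLITSAMPLE".equals(n.token))); // prints true
    }
}
```

 The advantage over the string check is precision: a token name that merely appears as a substring of another token, or inside a literal, can no longer produce a false positive.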
 
 
 Diffs
 -
 
   
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java 372c93d9af01608538b2e2e5a50c45188acb04f9 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ParseUtils.java 373429cbf666f1b19828c532aea3c07f08f95e1a 
 
 Diff: https://reviews.apache.org/r/33806/diff/
 
 
 Testing
 ---
 
 Tested locally
 
 
 Thanks,
 
 Reuben Kuhnert
 




[jira] [Created] (HIVE-10738) Beeline does not respect hive.cli.print.current.db

2015-05-18 Thread Reuben Kuhnert (JIRA)
Reuben Kuhnert created HIVE-10738:
-

 Summary: Beeline does not respect hive.cli.print.current.db
 Key: HIVE-10738
 URL: https://issues.apache.org/jira/browse/HIVE-10738
 Project: Hive
  Issue Type: Bug
Reporter: Reuben Kuhnert
Assignee: Reuben Kuhnert
Priority: Minor



Hive CLI (shows default database):
{code}
hive> set hive.cli.print.current.db=true;
set hive.cli.print.current.db=true;
hive (default)> 
{code}

Beeline (no change):
{code}
0: jdbc:hive2://localhost:1> set hive.cli.print.current.db=true;
set hive.cli.print.current.db=true;
No rows affected (3.016 seconds)
0: jdbc:hive2://localhost:1> 
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-18 Thread Edward Capriolo
Up until recently Hive supported numerous versions of the Hadoop code base with
a simple shim layer. I would rather we stick to the shim layer. I think
this was easily the best part about Hive: a single release worked
well regardless of your Hadoop version. It was also a key element to Hive's
success. I do not want to see us have multiple branches.

On Sat, May 16, 2015 at 1:29 AM, Xuefu Zhang xzh...@cloudera.com wrote:

 Thanks for the explanation, Alan!

 While I now understand the proposal better, I actually see more problems
 than the confusion of two lines of releases. Essentially, this proposal
 forces a user to make a hard choice between a stabler, legacy-aware release
 line and an adventurous, pioneering release line. And once the choice is
 made, there is no easy way back or forward.

 Here is my interpretation. Let's say we have two main branches as
 proposed. I develop a new feature which I think is useful for both branches.
 So, I commit it to both branches. My feature requires additional schema
 support, so I provide upgrade scripts for both branches. The scripts are
 different because the two branches have already diverged in schema.

 Now the two branches evolve in a diverging fashion like this. This is all
 good as long as a user stays in his line. The moment the user considers a
 switch, most likely from branch-1 to branch-2, he is stuck. Why? Because
 there is no upgrade path from a release in branch-1 to a release in
 branch-2!

 If we want to provide an upgrade path, then there will be MxN paths, where
 M and N are the number of releases in the two branches, respectively. This
 is going to be next to a nightmare, not only for users, but also for us.
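 To make the M×N combinatorics above concrete, here is a toy sketch (the release counts are made up, and this is editorial illustration, not part of the original thread):

```java
// Toy arithmetic for the upgrade-path debate (release counts are made up):
// arbitrary cross-branch upgrades need M*N scripts, while supporting only
// "each 1.x release -> current head of branch-2" needs just M.
public class UpgradePaths {
    static int allPairs(int m, int n) { return m * n; } // every (1.x, 2.y) pair
    static int headOnly(int m)        { return m; }     // each 1.x -> latest 2.y

    public static void main(String[] args) {
        int m = 4, n = 5; // e.g. four 1.x releases, five 2.y releases
        System.out.println(allPairs(m, n)); // prints 20
        System.out.println(headOnly(m));    // prints 4
    }
}
```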

 Also, the proposal will require two sets of things that Hive provides:
 double documentation, double feature tracking, double build/test
 infrastructures, etc.

 This approach can also potentially cause the problem we saw in hadoop
 releases, where 0.23 release was greater than 1.0 release.

 To me, the problem we are trying to solve is deprecating old things such
 as hadoop-1, Hive CLI, etc. This is a valid problem to be solved. As I see,
 however, we approached the problem in less favorable ways.

 First, it seemed we wanted to deprecate something just for the sake of
 deprecation, and it's not based on a rationale that supports the desire.
 Devs might write code that accidentally breaks the hadoop-1 build. However, this
 is more a build infrastructure problem than a burden of supporting
 hadoop-1. If our build could catch it at precommit test, then I would think
 the accident can be well avoided. Most of the time, fixing the build is
 trivial. And we have already addressed the build infrastructure problem.

 Secondly, if we do have a strong reason to deprecate something, we should
 have a deprecation plan rather than declaring on the spot that the current
 release is the last one supporting X. I think Microsoft did a better job in
 terms of product deprecation. For instance, they announced the end of
 support for Windows XP long before its last day. In my opinion, we should
 have a similar vision, giving users and distributions enough time to adjust
 rather than shocking them with breaking news.

 In summary, I do see the need for deprecation in Hive, but I am afraid the
 approach we take, including the proposal here, isn't going to nicely solve the
 problem. On the contrary, I foresee a spectrum of confusion, frustration,
 and burden for users as well as for developers.

 Thanks,
 Xuefu

 On Fri, May 15, 2015 at 8:19 PM, Alan Gates alanfga...@gmail.com wrote:



   Xuefu Zhang xzh...@cloudera.com
  May 15, 2015 at 17:31

 Just make sure that I understand the proposal correctly: we are going to
 have two main branches, one for hadoop-1 and one for hadoop-2.

  We shouldn't tie this to hadoop-1 and 2.  It's about Hive not Hadoop.
 It will be some time before Hive's branch-2 is stable, while Hadoop-2 is
 already well established.

  New features
 are only merged to branch-2. That essentially says we stop development for
 hadoop-1, right?

  If developers want to keep contributing patches to branch-1 then
 there's no need for it to stop.  We would want to avoid putting new
 features only on branch-1, unless they only made sense in that context.
 But I assume we'll see people contributing to branch-1 for some time.

  Are we also making two lines of releases: one for branch-1
 and one for branch-2? Won't that be confusing and also burdensome if we
 release say 1.3, 2.0, 2.1, 1.4...

  I'm asserting that it will be less confusing than the alternatives.  We
 need some way to make early releases of many of the new features.  I
 believe that this proposal is less confusing than if we start putting the
 new features in 1.x branches.  This is particularly true because it would
 help us to start being able to drop older functionality like Hadoop-1 and
 MapReduce, which is very hard to do in the 1.x line without stranding users.

  Please note that we will 

[jira] [Created] (HIVE-10740) RpcServer should be restarted if related configuration is changed [Spark Branch]

2015-05-18 Thread Jimmy Xiang (JIRA)
Jimmy Xiang created HIVE-10740:
--

 Summary: RpcServer should be restarted if related configuration is 
changed [Spark Branch]
 Key: HIVE-10740
 URL: https://issues.apache.org/jira/browse/HIVE-10740
 Project: Hive
  Issue Type: Bug
  Components: Spark
Reporter: Jimmy Xiang


While reviewing the patch for HIVE-10721, Chengxiang pointed out an existing issue 
with HoS: the RpcServer is never restarted even if related configurations are 
changed, as is done for SparkSession. We should monitor related configurations 
and restart the RpcServer if any of them is changed. It should be restarted only 
while there is no active SparkSession.
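A rough sketch of the proposed behavior follows. All class, field, and method names here are hypothetical, not the actual Hive-on-Spark RpcServer code: the idea is simply to remember the configuration the server was started with, and restart it only when the monitored configuration differs and no session is active.

```java
// Hypothetical sketch only: names are made up and are NOT Hive's actual
// RpcServer code. Track the configuration the server was started with, and
// bounce the server when it changes while no SparkSession is active.
import java.util.Map;
import java.util.Objects;

public class RpcServerManager {
    Map<String, String> activeConf;   // configuration the server was started with
    int activeSessions;               // number of live SparkSessions

    RpcServerManager(Map<String, String> conf) {
        this.activeConf = conf;
    }

    // Called when a session wants the server: restart it first if the
    // relevant configuration changed and nothing is currently using it.
    synchronized void ensureServer(Map<String, String> newConf) {
        if (!Objects.equals(activeConf, newConf) && activeSessions == 0) {
            restart(newConf);
        }
        activeSessions++;
    }

    synchronized void releaseSession() {
        activeSessions--;
    }

    private void restart(Map<String, String> newConf) {
        // Placeholder: stop the old server, then start one with the new conf.
        activeConf = newConf;
    }

    public static void main(String[] args) {
        RpcServerManager mgr = new RpcServerManager(Map.of("rpc.port", "1000"));
        mgr.ensureServer(Map.of("rpc.port", "2000")); // idle -> restarted
        mgr.ensureServer(Map.of("rpc.port", "3000")); // session active -> no restart
        System.out.println(mgr.activeConf.get("rpc.port")); // prints 2000
    }
}
```

Deferring the restart while sessions are active is the key design point: it avoids yanking the server out from under a running job.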





[jira] [Created] (HIVE-10739) Hiveserver2 Memory leak in ObjectInspectorFactory cache

2015-05-18 Thread Binglin Chang (JIRA)
Binglin Chang created HIVE-10739:


 Summary: Hiveserver2 Memory leak in ObjectInspectorFactory cache
 Key: HIVE-10739
 URL: https://issues.apache.org/jira/browse/HIVE-10739
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: Binglin Chang


A user issues multiple "add jar" commands to add Thrift classes to the classpath, 
then creates tables or runs queries using those Thrift serdes. 
After the session is closed, classes and ObjectInspector instances remain live 
in the cache, so the classloader for the class, and all the other referenced 
classes and static fields, cannot be freed.
We may need to provide an option to create an inspector without putting it in the 
cache.
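The leak mechanism, and the proposed uncached option, can be sketched as below. This is illustrative only; it is not Hive's actual ObjectInspectorFactory code, and the names are made up.

```java
// Illustrative sketch, not Hive's actual ObjectInspectorFactory: a static
// cache with strong Class keys never releases its entries, so every class
// (and therefore its classloader and that loader's static state) added by a
// session is pinned forever. An uncached creation path avoids the pin.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class InspectorCache {
    // Strong keys and values: entries live until the JVM exits.
    private static final Map<Class<?>, Object> CACHE = new ConcurrentHashMap<>();

    // Current behavior: create once, keep forever.
    static Object getCached(Class<?> clazz) {
        return CACHE.computeIfAbsent(clazz, InspectorCache::createInspector);
    }

    // Proposed option: build an inspector without touching the cache, so
    // nothing outlives the session's classloader.
    static Object getUncached(Class<?> clazz) {
        return createInspector(clazz);
    }

    private static Object createInspector(Class<?> clazz) {
        return new Object(); // placeholder for real ObjectInspector construction
    }

    static int cacheSize() {
        return CACHE.size();
    }

    public static void main(String[] args) {
        Object a = getCached(String.class);
        Object b = getCached(String.class);
        System.out.println(a == b);      // prints true: same pinned instance
        getUncached(String.class);
        System.out.println(cacheSize()); // prints 1: uncached path adds nothing
    }
}
```

An alternative design would be weak keys/values (à la WeakHashMap or Guava's CacheBuilder.weakKeys()), which lets entries go when the classloader is otherwise unreachable.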





Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-18 Thread Edward Capriolo
This concept of experimental features basically translates to "I do not
have the time to care about people not using my version." I do not see it
as good. We have seen what happened to upstream Hadoop: there was this gap
between 0.21 and ??.??. No one was clear what the API was (mapred or the
new mapreduce); no one knew what to link off of: cdh? vanilla? the yahoo
distribution?

IMHO, this is just going to increase fragmentation.


Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-18 Thread Alan Gates




Xuefu Zhang xzh...@cloudera.com
May 15, 2015 at 22:29
Thanks for the explanation, Alan!

While I now understand the proposal better, I actually see more 
problems than the confusion of two lines of releases. Essentially, 
this proposal forces a user to make a hard choice between a stabler, 
legacy-aware release line and an adventurous, pioneering release line. 
And once the choice is made, there is no easy way back or forward.


Here is my interpretation. Let's say we have two main branches as 
proposed. I develop a new feature which I think is useful for both 
branches. So, I commit it to both branches. My feature requires 
additional schema support, so I provide upgrade scripts for both 
branches. The scripts are different because the two branches have 
already diverged in schema.


Now the two branches evolve in a diverging fashion like this. This is 
all good as long as a user stays in his line. The moment the user 
considers a switch, most likely from branch-1 to branch-2, he is 
stuck. Why? Because there is no upgrade path from a release in 
branch-1 to a release in branch-2!


If we want to provide an upgrade path, then there will be MxN paths, 
where M and N are the number of releases in the two branches, 
respectively. This is going to be next to a nightmare, not only for 
users, but also for us.
MxN would indeed be bad, but there is no reason to approach it that 
way.  It's highly unlikely that users will want to migrate from 2.x to 
1.y.  And for a given 1.x release, we can assume that users will want to 
be able to migrate to the current head of branch-2.  So this means we 
would need two upgrade scripts from each 1.x release.  This is extra 
effort but it is not that bad.


Also, the proposal will require two sets of things that Hive provides: 
double documentation, double feature tracking, double build/test 
infrastructures, etc.
Our documentation already handles the fact that certain features are 
only supported in certain releases.  Our test and build infrastructure 
can already be made to work on multiple branches.  I'm not sure what you 
mean by double feature tracking.


This approach can also potentially cause the problem we saw in hadoop 
releases, where 0.23 release was greater than 1.0 release.
I'm sorry, I don't follow what you're saying here.  You mean the numbers 
are just bigger (like 23 > 1)?  We already have that problem, and this 
doesn't make it worse.


To me, the problem we are trying to solve is deprecating old things 
such as hadoop-1, Hive CLI, etc. This is a valid problem to be solved. As 
I see, however, we approached the problem in less favorable ways.
That is only one of the two problems.  The other is to provide a 
mechanism for experimental features.


First, it seemed we wanted to deprecate something just for the sake of 
deprecation, and it's not based on a rationale that supports the 
desire. Devs might write code that accidentally breaks the hadoop-1 build. 
However, this is more a build infrastructure problem than a 
burden of supporting hadoop-1. If our build could catch it at 
precommit test, then I would think the accident can be well avoided. 
Most of the time, fixing the build is trivial. And we have already 
addressed the build infrastructure problem.


Secondly, if we do have a strong reason to deprecate something, we 
should have a deprecation plan rather than declaring on the spot that 
the current release is the last one supporting X. I think Microsoft 
did a better job in terms of product deprecation. For instance, they 
announced the end of support for Windows XP long before its last day. In 
my opinion, we should have a similar vision, giving users and 
distributions enough time to adjust rather than shocking them with 
breaking news.


In summary, I do see the need for deprecation in Hive, but I am afraid 
the approach we take, including the proposal here, isn't going to nicely 
solve the problem. On the contrary, I foresee a spectrum of confusion, 
frustration, and burden for users as well as for developers.


Thanks,
Xuefu


Xuefu Zhang xzh...@cloudera.com
May 15, 2015 at 17:31
Just make sure that I understand the proposal correctly: we are going to
have two main branches, one for hadoop-1 and one for hadoop-2. New features
are only merged to branch-2. That essentially says we stop development for
hadoop-1, right? Are we also making two lines of releases: one for branch-1
and one for branch-2? Won't that be confusing and also burdensome if we
release say 1.3, 2.0, 2.1, 1.4...

Please note that we will have hadoop 3 soon. What's the story there?

Thanks,
Xuefu



On Fri, May 15, 2015 at 4:43 PM, Vaibhav Gumashta vgumas...@hortonworks.com
wrote:



  +1 on the new branch. I think it’ll help in faster dev time for these
important changes.

  —Vaibhav

   From: Alan Gates alanfga...@gmail.com
Reply-To: dev@hive.apache.org
Date: Friday, May 15, 2015 at 4:11 PM
To: dev@hive.apache.org
Subject: Re: [DISCUSS] 

Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-18 Thread Alan Gates




Edward Capriolo edlinuxg...@gmail.com
May 18, 2015 at 10:14
This concept of experimental features basically translates to "I do not
have the time to care about people not using my version."
No, it does not.  Continuing to support old features is a cost/benefit 
trade off, both for developers and users.  The cost for developers is 
continuing to work around older code; the cost for users is that they get 
fewer new features, fewer performance improvements, and fewer stability 
improvements, because developers are spending time working around the old 
code.


At some point in the cost/benefit analysis the costs are high enough 
that it makes sense to stop supporting it.  I am asserting that we are 
at that point.


Caring about people not on the latest version is an important part of 
what I am proposing.  There are still many users using Hive either on 
Hadoop 1 or for more traditional Hive workloads (batch, ETL).  It is 
important to give these users a good path forward.  My assertion is that 
a branch-1 is the best way to do this.


So to continue in the cost/benefit paradigm, what I have proposed does 
have an additional cost for developers.  As I have said in my responses 
to Xuefu, I don't think these are too bad, and I assert that they are 
less than continuing to carry forward older functionality ad infinitum.  
My intent is that for users who are not interested in new features or 
workloads the cost is at or near zero.  Customers interested in newer 
functionality will continue to have to pay the cost of upgrades, but that 
is true anyway.


Alan.



[jira] [Created] (HIVE-10741) count distinct rewrite is not firing

2015-05-18 Thread Ashutosh Chauhan (JIRA)
Ashutosh Chauhan created HIVE-10741:
---

 Summary: count distinct rewrite is not firing
 Key: HIVE-10741
 URL: https://issues.apache.org/jira/browse/HIVE-10741
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Affects Versions: 1.2.0
Reporter: Mostafa Mokhtar
Assignee: Ashutosh Chauhan


Rewrite introduced in HIVE-10568 is not effective outside of test environment





Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-18 Thread Owen O'Malley
I think that it is past time for Hive to have a stable and a next branch.
Every release from Hive 0.11 to Hive 1.2 has been a major release in terms
of changes and functionality. Part of what we've been missing is a way of
making stable releases that don't move as fast and support customers
with minor new features, but no big sweeping changes. That will be a win
for users.

I'm +1 on Alan's plan of making a new release branch.

.. Owen


Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-18 Thread Sergey Shelukhin
I think we need some path for deprecating old Hadoop versions, the same
way we deprecate old Java version support or old RDBMS version support.
At some point the cost of supporting Hadoop 1 exceeds the benefit. The same
goes for stuff like MR; supporting it, esp. for perf work, becomes a
burden, and it’s outdated, with two alternatives, one of which has been
around for two releases.
The branches are a graceful way to get rid of the legacy burden.

Alternatively, when sweeping changes are made, we can do what HBase did
(which is not pretty imho), where the 0.94 version had ~30 dot releases
because people cannot upgrade to the 0.96 “singularity” release.


I posit that people who run Hadoop 1 and MR at this day and age (and more
so as time passes) are people who care about stability rather than perf and
new features; so a stability-focused branch would be perfect to
support them.


Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-18 Thread Sergey Shelukhin
Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some
people are set in their ways or have practical considerations and don’t
care for new shiny stuff.

On 15/5/18, 11:46, Sergey Shelukhin ser...@hortonworks.com wrote:

I think we need some path for deprecating old Hadoop versions, the same
way we deprecate old Java version support or old RDBMS version support.
At some point the cost of supporting Hadoop 1 exceeds the benefit. Same
goes for stuff like MR; supporting it, esp. for perf work, becomes a
burden, and it’s outdated with 2 alternatives, one of which has been
around for 2 releases.
The branches are a graceful way to get rid of the legacy burden.

Alternatively, when sweeping changes are made, we can do what Hbase did
(which is not pretty imho), where 0.94 version had ~30 dot releases
because people cannot upgrade to 0.96 “singularity” release.


I posit that people who run Hadoop 1 and MR at this day and age (and more
so as time passes) are people who either don’t care about perf and new
features, only stability; so, stability-focused branch would be perfect to
support them.


On 15/5/18, 10:04, Edward Capriolo edlinuxg...@gmail.com wrote:

Up until recently Hive supported numerous versions of Hadoop code base
with
a simple shim layer. I would rather we stick to the shim layer. I think
this was easily the best part about hive was that a single release worked
well regardless of your hadoop version. It was also a key element to
hive's
success. I do not want to see us have multiple branches.

On Sat, May 16, 2015 at 1:29 AM, Xuefu Zhang xzh...@cloudera.com wrote:

 Thanks for the explanation, Alan!

  While I have understood more of the proposal, I actually see more problems
  than just the confusion of two lines of releases. Essentially, this proposal
  forces a user to make a hard choice between a more stable, legacy-aware
  release line and an adventurous, pioneering release line. And once the
  choice is made, there is no easy way back or forward.

  Here is my interpretation. Let's say we have two main branches as
  proposed. I develop a new feature which I think is useful for both
  branches. So, I commit it to both branches. My feature requires additional
  schema support, so I provide upgrade scripts for both branches. The scripts
  are different because the two branches have already diverged in schema.

  Now the two branches evolve in a diverging fashion like this. This is all
  good as long as a user stays in his line. The moment the user considers a
  switch, most likely from branch-1 to branch-2, he is stuck. Why? Because
  there is no upgrade path from a release in branch-1 to a release in
  branch-2!

 If we want to provide an upgrade path, then there will be MxN paths,
where
 M and N are the number of releases in the two branches, respectively.
This
 is going to be next to a nightmare, not only for users, but also for
us.

 Also, the proposal will require two sets of things that Hive provides:
 double documentation, double feature tracking, double build/test
 infrastructures, etc.

  This approach can also potentially cause the problem we saw in hadoop
  releases, where the 0.23 release was effectively newer than the 1.0 release.

  To me, the problem we are trying to solve is deprecating old things such
  as hadoop-1, Hive CLI, etc. This is a valid problem to be solved. As I see
  it, however, we have approached the problem in less favorable ways.

  First, it seemed we wanted to deprecate something just for the sake of
  deprecation, not based on a rationale that supports the desire.
  Devs might write code that accidentally breaks the hadoop-1 build. However,
  this is more a build infrastructure problem than a burden of supporting
  hadoop-1. If our build could catch it at precommit test, then I would think
  the accident can be well avoided. Most of the time, fixing the build is
  trivial. And we have already addressed the build infrastructure problem.

  Secondly, if we do have a strong reason to deprecate something, we should
  have a deprecation plan rather than declaring on the spot that the current
  release is the last one supporting X. I think Microsoft did a better job in
  terms of product deprecation. For instance, they announced long before the
  last day of support for Windows XP. In my opinion, we should have a similar
  vision, giving users and distributions enough time to adjust rather than
  shocking them with breaking news.

  In summary, I do see the need for deprecation in Hive, but I am afraid the
  approach we are taking, including the proposal here, isn't going to nicely
  solve the problem. On the contrary, I foresee a spectrum of confusion,
  frustration, and burden for users as well as for developers.

 Thanks,
 Xuefu

 On Fri, May 15, 2015 at 8:19 PM, Alan Gates alanfga...@gmail.com
wrote:



   Xuefu Zhang xzh...@cloudera.com
  May 15, 2015 at 17:31

  Just to make sure that I understand the proposal correctly: we are going to
  have two main branches, one for hadoop-1 and one for hadoop-2.

  We shouldn't tie 

Review Request 34368: HIVE-10550: Dynamic RDD caching optimization for HoS.[Spark Branch]

2015-05-18 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34368/
---

Review request for hive and chengxiang li.


Bugs: HIVE-10550
https://issues.apache.org/jira/browse/HIVE-10550


Repository: hive-git


Description
---

See jira description.


Diffs
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc 
  ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java d5ea96a 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 
19d3fee 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 26cfebd 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 
8b15099 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
3f240f5 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c 
  
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobStatus.java
 5d62596 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java
 8e56263 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java
 PRE-CREATION 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSkewJoinProcFactory.java
 5990d17 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SplitSparkWorkResolver.java
 fb20080 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70 
  ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 
  ql/src/test/results/clientpositive/spark/ppd_outer_join3.q.out 6a0654a 
  spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java 
af6332e 
  spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java 
beed8a3 
  spark-client/src/main/java/org/apache/hive/spark/client/MonitorCallback.java 
e1e899e 
  spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java 
b77c9e8 
  spark-client/src/test/java/org/apache/hive/spark/client/TestSparkClient.java 
d33ad7e 

Diff: https://reviews.apache.org/r/34368/diff/


Testing
---


Thanks,

Xuefu Zhang



GenericUDF.getConstantLongValue

2015-05-18 Thread Alexander Pivovarov
Hello Everyone

There is a bug in GenericUDF.getConstantLongValue.

There are 2 patches available:
1. fix the bug https://issues.apache.org/jira/browse/HIVE-10580

2. delete the method because it's not used
https://issues.apache.org/jira/browse/HIVE-10710

Can any committer +1 one or the other solution? I'm fine with either.

Thank you
Alex


[ANNOUNCE] Apache Hive 1.2.0 Released

2015-05-18 Thread Sushanth Sowmyan

The Apache Hive team is proud to announce the release of Apache Hive 
version 1.2.0.

The Apache Hive (TM) data warehouse software facilitates querying and managing 
large datasets residing in distributed storage. Built on top of Apache Hadoop 
(TM), it provides:

* Tools to enable easy data extract/transform/load (ETL)

* A mechanism to impose structure on a variety of data formats

* Access to files stored either directly in Apache HDFS (TM) or in other data 
storage systems such as Apache HBase (TM)

* Query execution via Apache Hadoop MapReduce, Apache Tez or Apache Spark 
frameworks.

For Hive release details and downloads, please visit: 
https://hive.apache.org/downloads.html

Hive 1.2.0 Release Notes are available here: 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12329345&styleName=Text&projectId=12310843

We would like to thank the many contributors who made this release possible.

Regards,

The Apache Hive Team



[jira] [Created] (HIVE-10742) rename_table_location.q test fails

2015-05-18 Thread Vikram Dixit K (JIRA)
Vikram Dixit K created HIVE-10742:
-

 Summary: rename_table_location.q test fails
 Key: HIVE-10742
 URL: https://issues.apache.org/jira/browse/HIVE-10742
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.2.0, 1.3.0
Reporter: Vikram Dixit K
Assignee: Sushanth Sowmyan


The test rename_table_location.q fails all the time but is not being caught by 
HiveQA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 33968: HIVE-10644 create SHA2 UDF

2015-05-18 Thread Jason Dere

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33968/#review84217
---



ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java
https://reviews.apache.org/r/33968/#comment135357

In retrospect, I wish these parameter utility methods had been put into a 
utility class rather than in GenericUDF - I feel like we are adding a lot of 
clutter to a class that users subclass.

Not sure if it's too late to do something about this - I see on HIVE-10580 
there is some discussion about whether this can be removed.

Can you either create a new UDF params utility class for these, or just add 
these methods directly to GenericUDFSha2 for now?
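
Whichever class the parameter helpers end up in, the hashing core of a SHA2
UDF itself is small, since java.security.MessageDigest already implements the
SHA-2 family. A hedged sketch of just that core (class and method names are
hypothetical, not taken from the actual HIVE-10644 patch):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the digest logic behind a SHA2 UDF: delegate to
// MessageDigest and hex-encode the result. Not the actual patch.
public class Sha2Sketch {
    public static String sha2Hex(String input, int bitLength) {
        try {
            // MessageDigest supports SHA-224/256/384/512 out of the box
            MessageDigest md = MessageDigest.getInstance("SHA-" + bitLength);
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // mask to an unsigned value before formatting as two hex digits
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // an unsupported bit length yields NULL rather than an error,
            // mirroring the lenient convention many Hive UDFs follow
            return null;
        }
    }
}
```

For example, `sha2Hex("abc", 256)` produces the standard SHA-256 test vector
`ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad`.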


- Jason Dere


On May 13, 2015, 5:48 a.m., Alexander Pivovarov wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33968/
 ---
 
 (Updated May 13, 2015, 5:48 a.m.)
 
 
 Review request for hive and Jason Dere.
 
 
 Bugs: HIVE-10644
 https://issues.apache.org/jira/browse/HIVE-10644
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 HIVE-10644 create SHA2 UDF
 
 
 Diffs
 -
 
   ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 
 02a604ff0a4ed92dfd94b199e8b539f636b66f77 
   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java 
 b043bdc882af7c0b83787526a5a55c9dc29c6681 
   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSha2.java 
 PRE-CREATION 
   ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSha2.java 
 PRE-CREATION 
   ql/src/test/queries/clientpositive/udf_sha2.q PRE-CREATION 
   ql/src/test/results/clientpositive/show_functions.q.out 
 a422760400c62d026324dd667e4a632bfbe01b82 
   ql/src/test/results/clientpositive/udf_sha2.q.out PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/33968/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Alexander Pivovarov
 




Re: Review Request 33968: HIVE-10644 create SHA2 UDF

2015-05-18 Thread Alexander Pivovarov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33968/
---

(Updated May 18, 2015, 10:24 p.m.)


Review request for hive and Jason Dere.


Changes
---

added GenericUDFParamUtils


Bugs: HIVE-10644
https://issues.apache.org/jira/browse/HIVE-10644


Repository: hive-git


Description
---

HIVE-10644 create SHA2 UDF


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 
02a604ff0a4ed92dfd94b199e8b539f636b66f77 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFParamUtils.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSha2.java 
PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFSha2.java 
PRE-CREATION 
  ql/src/test/queries/clientpositive/udf_sha2.q PRE-CREATION 
  ql/src/test/results/clientpositive/show_functions.q.out 
a422760400c62d026324dd667e4a632bfbe01b82 
  ql/src/test/results/clientpositive/udf_sha2.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/33968/diff/


Testing
---


Thanks,

Alexander Pivovarov



Re: [ANNOUNCE] Apache Hive 1.2.0 Released

2015-05-18 Thread Thejas Nair
Thanks for driving this Sushanth!


On Mon, May 18, 2015 at 2:25 PM, Sushanth Sowmyan khorg...@apache.org wrote:

 The Apache Hive team is proud to announce the release of Apache Hive 
 version 1.2.0.

 The Apache Hive (TM) data warehouse software facilitates querying and 
 managing large datasets residing in distributed storage. Built on top of 
 Apache Hadoop (TM), it provides:

 * Tools to enable easy data extract/transform/load (ETL)

 * A mechanism to impose structure on a variety of data formats

 * Access to files stored either directly in Apache HDFS (TM) or in other data 
 storage systems such as Apache HBase (TM)

 * Query execution via Apache Hadoop MapReduce, Apache Tez or Apache Spark 
 frameworks.

 For Hive release details and downloads, please visit: 
 https://hive.apache.org/downloads.html

 Hive 1.2.0 Release Notes are available here: 
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12329345&styleName=Text&projectId=12310843

 We would like to thank the many contributors who made this release possible.

 Regards,

 The Apache Hive Team



[jira] [Created] (HIVE-10747) enable the cleanup side effect for Encryption related qfile test

2015-05-18 Thread Ferdinand Xu (JIRA)
Ferdinand Xu created HIVE-10747:
---

 Summary: enable the cleanup side effect for Encryption related 
qfile test
 Key: HIVE-10747
 URL: https://issues.apache.org/jira/browse/HIVE-10747
 Project: Hive
  Issue Type: Sub-task
  Components: Testing Infrastructure
Reporter: Ferdinand Xu
Assignee: Ferdinand Xu


The hive conf is not reset in the clearTestSideEffects method, which was 
introduced in HIVE-8900. This will pollute other qfiles' settings when they are 
run by TestEncryptedHDFSCliDriver.





Review Request 34393: HIVE-10427 - collect_list() and collect_set() should accept struct types as argument

2015-05-18 Thread Chao Sun

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34393/
---

Review request for hive.


Bugs: HIVE-10427
https://issues.apache.org/jira/browse/HIVE-10427


Repository: hive-git


Description
---

Currently for collect_list() and collect_set(), only primitive types are 
supported. This patch adds support for struct and map types as well.

It turned out that all I needed to do was loosen the type checking.
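
Loosening the type check means struct-valued rows flow through the same
collection evaluator that primitives already used; what changes for users is
only which values collect_list keeps versus collect_set dedupes. A minimal,
illustrative-only model of those semantics, with structs modeled as lists of
fields (hypothetical names, not Hive's GenericUDAFMkCollectionEvaluator):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Illustrative-only model of collect_list vs collect_set semantics once
// non-primitive (struct-like) rows are allowed through the type check.
public class CollectSemantics {
    // collect_list keeps every aggregated value, duplicates included
    public static List<List<Object>> collectList(List<List<Object>> rows) {
        return new ArrayList<>(rows);
    }

    // collect_set drops duplicate structs, using element-wise equality
    public static List<List<Object>> collectSet(List<List<Object>> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }
}
```

In HiveQL terms, collect_set(struct(c1, c2)) would then return one struct per
distinct (c1, c2) pair, while collect_list(struct(c1, c2)) returns one per row.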


Diffs
-

  data/files/customers.txt PRE-CREATION 
  data/files/nested_orders.txt PRE-CREATION 
  data/files/orders.txt PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCollectList.java 
536c4a7 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCollectSet.java 
6dc424a 
  
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java
 efcc8f5 
  ql/src/test/queries/clientpositive/udaf_collect_list_set_nested.q 
PRE-CREATION 
  ql/src/test/results/clientpositive/udaf_collect_list_set_nested.q.out 
PRE-CREATION 

Diff: https://reviews.apache.org/r/34393/diff/


Testing
---

All but one test (which seems unrelated) are passing.
I also added a test: udaf_collect_list_set_nested.q


Thanks,

Chao Sun



Re: Review Request 34393: HIVE-10427 - collect_list() and collect_set() should accept struct types as argument

2015-05-18 Thread Lenni Kuff

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34393/#review84260
---



ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCollectSet.java
https://reviews.apache.org/r/34393/#comment135437

should we also support arrays and unions?



ql/src/test/queries/clientpositive/udaf_collect_list_set_nested.q
https://reviews.apache.org/r/34393/#comment135438

add a negative test to validate unsupported types?


- Lenni Kuff


On May 19, 2015, 4:47 a.m., Chao Sun wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34393/
 ---
 
 (Updated May 19, 2015, 4:47 a.m.)
 
 
 Review request for hive.
 
 
 Bugs: HIVE-10427
 https://issues.apache.org/jira/browse/HIVE-10427
 
 
 Repository: hive-git
 
 
 Description
 ---
 
 Currently for collect_list() and collect_set(), only primitive types are 
 supported. This patch adds support for struct and map types as well.
 
  It turned out that all I needed to do was loosen the type checking.
 
 
 Diffs
 -
 
   data/files/customers.txt PRE-CREATION 
   data/files/nested_orders.txt PRE-CREATION 
   data/files/orders.txt PRE-CREATION 
   
 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCollectList.java 
 536c4a7 
   
 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCollectSet.java 
 6dc424a 
   
 ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java
  efcc8f5 
   ql/src/test/queries/clientpositive/udaf_collect_list_set_nested.q 
 PRE-CREATION 
   ql/src/test/results/clientpositive/udaf_collect_list_set_nested.q.out 
 PRE-CREATION 
 
 Diff: https://reviews.apache.org/r/34393/diff/
 
 
 Testing
 ---
 
 All but one test (which seems unrelated) are passing.
 I also added a test: udaf_collect_list_set_nested.q
 
 
 Thanks,
 
 Chao Sun
 




[jira] [Created] (HIVE-10745) Better null handling by Vectorizer

2015-05-18 Thread Ashutosh Chauhan (JIRA)
Ashutosh Chauhan created HIVE-10745:
---

 Summary: Better null handling by Vectorizer
 Key: HIVE-10745
 URL: https://issues.apache.org/jira/browse/HIVE-10745
 Project: Hive
  Issue Type: Bug
  Components: Vectorization
Affects Versions: 1.2.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan


Minor refactoring around null handling in Vectorization.





Review Request 34385: Better null handling by Vectorizer

2015-05-18 Thread Ashutosh Chauhan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34385/
---

Review request for hive and Gopal V.


Bugs: HIVE-10745
https://issues.apache.org/jira/browse/HIVE-10745


Repository: hive-git


Description
---

Better null handling by Vectorizer


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeEvaluatorFactory.java 
f08321c 
  ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizationContext.java 
48f34a9 

Diff: https://reviews.apache.org/r/34385/diff/


Testing
---


Thanks,

Ashutosh Chauhan



[jira] [Created] (HIVE-10746) Hive 0.14.x and Hive 1.2.0 w/ Tez 0.5.3/Tez 0.6.0 Slow group by/order by

2015-05-18 Thread Greg Senia (JIRA)
Greg Senia created HIVE-10746:
-

 Summary: Hive 0.14.x and Hive 1.2.0 w/ Tez 0.5.3/Tez 0.6.0 Slow 
group by/order by
 Key: HIVE-10746
 URL: https://issues.apache.org/jira/browse/HIVE-10746
 Project: Hive
  Issue Type: Bug
  Components: Hive, Tez
Affects Versions: 1.2.0, 0.14.0, 0.14.1, 1.1.0, 1.1.1
Reporter: Greg Senia
Priority: Critical


The following query: SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount 
FROM adw.crc_arsn GROUP BY appl_user_id, arsn_cd ORDER BY appl_user_id; runs 
consistently fast in Spark and MapReduce on Hive 1.2.0. When run with Tez as 
the execution engine, it consistently takes 300-500 seconds, which seems 
extremely long. This is a basic external table, delimited by tabs, consisting 
of a single file in a folder. In Hive 0.13 this query runs fast with Tez; I 
tested Hive 0.14, 0.14.1/1.0.0, and now Hive 1.2.0, and something is clearly 
going awry with Hive on Tez as an execution engine for single- or small-file 
tables. I can attach further logs if someone needs them for deeper analysis.

HDFS Output:
hadoop fs -ls /example_dw/crc/arsn
Found 2 items
-rwxr-x---   6 loaduser hadoopusers  0 2015-05-17 20:03 
/example_dw/crc/arsn/_SUCCESS
-rwxr-x---   6 loaduser hadoopusers3883880 2015-05-17 20:03 
/example_dw/crc/arsn/part-m-0


Hive Table Describe:
hive describe formatted crc_arsn;
OK
# col_name  data_type   comment 
 
arsn_cd string  
clmlvl_cd   string  
arclss_cd   string  
arclssg_cd  string  
arsn_prcsr_rmk_ind  string  
arsn_mbr_rspns_ind  string  
savtyp_cd   string  
arsn_eff_dt string  
arsn_exp_dt string  
arsn_pstd_dts   string  
arsn_lstupd_dts string  
arsn_updrsn_txt string  
appl_user_idstring  
arsntyp_cd  string  
pre_d_indicator string  
arsn_display_txtstring  
arstat_cd   string  
arsn_tracking_nostring  
arsn_cstspcfc_ind   string  
arsn_mstr_rcrd_ind  string  
state_specific_ind  string  
region_specific_in  string  
arsn_dpndnt_cd  string  
unit_adjustment_in  string  
arsn_mbr_only_ind   string  
arsn_qrmb_ind   string  
 
# Detailed Table Information 
Database:   adw  
Owner:  loadu...@exa.example.com   
CreateTime: Mon Apr 28 13:28:05 EDT 2014 
LastAccessTime: UNKNOWN  
Protect Mode:   None 
Retention:  0
Location:   hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn   
 
Table Type: EXTERNAL_TABLE   
Table Parameters:
EXTERNALTRUE
transient_lastDdlTime   1398706085  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
 
InputFormat:org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed: No   
Num Buckets:-1   
Bucket Columns: []   
Sort Columns:   []   
Storage Desc Params: 
field.delim \t  
line.delim  \n  
serialization.format\t  
Time taken: 1.245 seconds, Fetched: 54 row(s)




Explain Hive 1.2.0 w/Tez:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Tez
  Edges:
Reducer 2 - Map 1 (SIMPLE_EDGE)

[jira] [Created] (HIVE-10743) LLAP: rare NPE in IO

2015-05-18 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HIVE-10743:
---

 Summary: LLAP: rare NPE in IO
 Key: HIVE-10743
 URL: https://issues.apache.org/jira/browse/HIVE-10743
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin


{noformat}
2015-05-18 15:37:33,702 
[TezTaskRunner_attempt_1431919257083_0116_1_00_09_0(container_1_0116_01_10_sershe_20150518153700_b3649675-c035-4d9a-8dfb-2818b0173022:1_Map
 1_9_0)] INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: 
Processing file 
hdfs://cn041-10.l42scl.hortonworks.com:8020/apps/hive/warehouse/tpch_orc_snappy_1000.db/lineitem/93_0
2015-05-18 15:37:33,743 
[IO-Elevator-Thread-9(container_1_0116_01_10_sershe_20150518153700_b3649675-c035-4d9a-8dfb-2818b0173022:1_Map
 1_9_0)] INFO org.apache.hadoop.hive.ql.io.orc.EncodedReaderImpl: Resulting 
disk ranges to read (file 7895017): [{range start: 28153685 end: 70814209}]
2015-05-18 15:37:33,743 
[IO-Elevator-Thread-9(container_1_0116_01_10_sershe_20150518153700_b3649675-c035-4d9a-8dfb-2818b0173022:1_Map
 1_9_0)] INFO org.apache.hadoop.hive.ql.io.orc.EncodedReaderImpl: Disk ranges 
after cache (file 7895017, base offset 3): [{range start: 28153685 end: 
70814209}]
2015-05-18 15:37:33,791 
[IO-Elevator-Thread-9(container_1_0116_01_10_sershe_20150518153700_b3649675-c035-4d9a-8dfb-2818b0173022:1_Map
 1_9_0)] INFO org.apache.hadoop.hive.ql.io.orc.EncodedReaderImpl: Disk ranges 
after disk read (file 7895017, base offset 3): [{data range [28153685, 
70814209), size: 42660524 type: direct}]
2015-05-18 15:37:33,804 
[IO-Elevator-Thread-9(container_1_0116_01_10_sershe_20150518153700_b3649675-c035-4d9a-8dfb-2818b0173022:1_Map
 1_9_0)] INFO org.apache.hadoop.hive.llap.io.api.impl.LlapIoImpl: setError 
called; closed false, done false, err null, pending 0
...
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.orc.InStream.readEncodedStream(InStream.java:763)
at 
org.apache.hadoop.hive.ql.io.orc.EncodedReaderImpl.readEncodedColumns(EncodedReaderImpl.java:445)
at 
org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.callInternal(OrcEncodedDataReader.java:294)
at 
org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.callInternal(OrcEncodedDataReader.java:56)
at 
org.apache.hadoop.hive.common.CallableWithNdc.call(CallableWithNdc.java:37)
... 4 more
{noformat}

Not sure yet how this happened. May add some logging or look more if I see it 
again.





[jira] [Created] (HIVE-10744) LLAP: dags get stuck in yet another way

2015-05-18 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HIVE-10744:
---

 Summary: LLAP: dags get stuck in yet another way
 Key: HIVE-10744
 URL: https://issues.apache.org/jira/browse/HIVE-10744
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth


The DAG gets stuck when a number of tasks that is a multiple of the number of 
containers on the machine (6, 12, ... in my case) fails to finish at the end 
of the stage (I am running a job with 500-1000 maps). It happened twice on the 
3rd DAG with a 1000-map job (TPCH Q1); then, when I reduced to 500 maps, it 
happened on the 7th DAG so far. [~sseth] has the details.


