[jira] [Created] (HIVE-16099) java.lang.NoClassDefFoundError: Lorg/apache/hive/spark/counter/SparkCounters;

2017-03-02 Thread balaji krishnan (JIRA)
balaji krishnan created HIVE-16099:
--

 Summary: java.lang.NoClassDefFoundError: 
Lorg/apache/hive/spark/counter/SparkCounters;
 Key: HIVE-16099
 URL: https://issues.apache.org/jira/browse/HIVE-16099
 Project: Hive
  Issue Type: Bug
  Components: Hive, Spark
Affects Versions: 2.0.1
 Environment: Hadoop Version -- 2.6.2 
Spark Version -- 1.6.2 
Hive Version -- 2.0.1
Reporter: balaji krishnan
Priority: Critical


Hello,
I am trying to run a simple select count(*) from a table in Hive. However, I get
an error after the state changes to STARTED.

The flow is something like this:

hive> select count(*) from customers;
Query ID = hadoop_20170302215442_6493f265-3121-4eaf-b2d6-eea5b08ae591
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Spark Job = b22ec8e6-4b11-49fe-9fdd-001aa7248084
state = SENT
state = SENT
state = SENT
state = SENT
state = SENT
state = SENT
state = SENT
state = STARTED
state = STARTED
state = STARTED

Query Hive on Spark job[0] stages:
0
1

Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-
FailedTasksCount)/TotalTasksCount [StageCost]
2017-03-02 21:55:02,526 Stage-0_0: 0(+1)/1  Stage-1_0: 0/1
state = STARTED
2017-03-02 21:55:03,537 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
state = FAILED
Status: Failed
FAILED: Execution Error, return code 3 from 
org.apache.hadoop.hive.ql.exec.spark.SparkTask

The error I see in the YARN logs is the following:

17/03/03 05:55:02 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, slcag034.us.oracle.com): java.lang.NoClassDefFoundError: Lorg/apache/hive/spark/counter/SparkCounters;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
at java.lang.Class.getDeclaredField(Class.java:2068)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at ...

[DISCUSS] Looking to the future hivemall graduation

2017-03-02 Thread Edward Capriolo
Hivemall in the incubator has a fairly impressive set of features that do
machine learning directly from Hive.

http://hivemall.incubator.apache.org/overview.html
https://github.com/myui/hivemall/wiki/Logistic-regression-dataset-generation

While we cannot put the cart before the horse, I can imagine that upon
graduation Hivemall would be a natural fit to become part of Hive (maybe as
a sub-project).

I could imagine a setup like we had for HCatalog, where we make a subtree
and give commit rights to that tree, eventually converting those interested
in other parts of Hive into Hive committers as well.

In any case, Hivemall devs, amazing work!

Thanks,
Edward


[jira] [Created] (HIVE-16098) Describe table doesn't show stats for partitioned tables

2017-03-02 Thread Ashutosh Chauhan (JIRA)
Ashutosh Chauhan created HIVE-16098:
---

 Summary: Describe table doesn't show stats for partitioned tables
 Key: HIVE-16098
 URL: https://issues.apache.org/jira/browse/HIVE-16098
 Project: Hive
  Issue Type: Improvement
  Components: Diagnosability
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16097) minor fixes to metrics and logs in LlapTaskScheduler

2017-03-02 Thread Siddharth Seth (JIRA)
Siddharth Seth created HIVE-16097:
-

 Summary: minor fixes to metrics and logs in LlapTaskScheduler
 Key: HIVE-16097
 URL: https://issues.apache.org/jira/browse/HIVE-16097
 Project: Hive
  Issue Type: Bug
  Components: llap
Reporter: Siddharth Seth
Assignee: Siddharth Seth






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16096) Predicate `__time` In ("date", "date") is not pushed

2017-03-02 Thread slim bouguerra (JIRA)
slim bouguerra created HIVE-16096:
-

 Summary: Predicate `__time` In ("date", "date") is not pushed
 Key: HIVE-16096
 URL: https://issues.apache.org/jira/browse/HIVE-16096
 Project: Hive
  Issue Type: Bug
Reporter: slim bouguerra


{code}
 explain select * from login_druid where `__time` in ("2003-1-1", "2004-1-1" );
OK
Plan optimized by CBO.

Stage-0
  Fetch Operator
limit:-1
Select Operator [SEL_2]
  Output:["_col0","_col1","_col2"]
  Filter Operator [FIL_4]
predicate:(__time) IN ('2003-1-1', '2004-1-1')
TableScan [TS_0]
  Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"}

{code}

Note that the IN predicate on `__time` remains a Hive-side Filter Operator and the generated druid.query.json carries no time filter: the intervals still span 1900-01-01/3000-01-01.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16095) Filter generation is not taking into account the column type.

2017-03-02 Thread slim bouguerra (JIRA)
slim bouguerra created HIVE-16095:
-

 Summary: Filter generation is not taking into account the column 
type.
 Key: HIVE-16095
 URL: https://issues.apache.org/jira/browse/HIVE-16095
 Project: Hive
  Issue Type: Bug
Reporter: slim bouguerra


We are supposed to get an alphanumeric comparison when we have a cast to a 
numeric type. This looks like a Calcite issue.
{code}
hive> explain select * from login_druid where userid < 2
> ;
OK
Plan optimized by CBO.

Stage-0
  Fetch Operator
limit:-1
Select Operator [SEL_1]
  Output:["_col0","_col1","_col2"]
TableScan [TS_0]
  Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"filter\":{\"type\":\"bound\",\"dimension\":\"userid\",\"upper\":\"2\",\"upperStrict\":true,\"alphaNumeric\":false},\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"}

Time taken: 1.548 seconds, Fetched: 10 row(s)
hive> explain select * from login_druid where cast (userid as int) < 2;
OK
Plan optimized by CBO.

Stage-0
  Fetch Operator
limit:-1
Select Operator [SEL_1]
  Output:["_col0","_col1","_col2"]
TableScan [TS_0]
  Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"filter\":{\"type\":\"bound\",\"dimension\":\"userid\",\"upper\":\"2\",\"upperStrict\":true,\"alphaNumeric\":false},\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"}

Time taken: 0.27 seconds, Fetched: 10 row(s)
{code}

Note that both plans produce the same druid.query.json with "alphaNumeric":false; the explicit cast to int in the second query does not change the generated bound filter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16094) queued containers may timeout if they don't get to run for a long time

2017-03-02 Thread Siddharth Seth (JIRA)
Siddharth Seth created HIVE-16094:
-

 Summary: queued containers may timeout if they don't get to run 
for a long time
 Key: HIVE-16094
 URL: https://issues.apache.org/jira/browse/HIVE-16094
 Project: Hive
  Issue Type: Bug
Affects Versions: 2.2.0
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Priority: Critical


I believe this happened after HIVE-15958 - since we end up keeping amNodeInfo 
in knownAppMasters, and that can result in the callable not being scheduled on 
new task registration.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16093) LLAP: Slider placement policy gets ignored since we destroy application

2017-03-02 Thread Prasanth Jayachandran (JIRA)
Prasanth Jayachandran created HIVE-16093:


 Summary: LLAP: Slider placement policy gets ignored since we 
destroy application
 Key: HIVE-16093
 URL: https://issues.apache.org/jira/browse/HIVE-16093
 Project: Hive
  Issue Type: Bug
  Components: llap
Affects Versions: 2.2.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran


Slider placement '1' (host affinity) does not work since the launch script is 
invoking destroy on the LLAP instance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Review Request 56810: Compute table stats when user computes column stats

2017-03-02 Thread pengcheng xiong

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56810/
---

(Updated March 2, 2017, 10:37 p.m.)


Review request for hive and Ashutosh Chauhan.


Repository: hive-git


Description
---

HIVE-15903


Diffs (updated)
-

  itests/src/test/resources/testconfiguration.properties b01ebd8 
  ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 3e749eb 
  ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java 7f5fdff 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ProcessAnalyzeTable.java c13a404 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 0872e53 
  ql/src/test/queries/clientpositive/column_table_stats.q PRE-CREATION 
  ql/src/test/queries/clientpositive/column_table_stats_orc.q PRE-CREATION 
  ql/src/test/results/clientpositive/llap/column_table_stats.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/llap/column_table_stats_orc.q.out 
PRE-CREATION 
  ql/src/test/results/clientpositive/perf/query14.q.out 9821180 
  ql/src/test/results/clientpositive/tez/explainanalyze_3.q.out 20c330a 
  ql/src/test/results/clientpositive/tez/explainanalyze_5.q.out ee9affb 
  ql/src/test/results/clientpositive/tez/explainuser_3.q.out 74e4693 


Diff: https://reviews.apache.org/r/56810/diff/5/

Changes: https://reviews.apache.org/r/56810/diff/4-5/


Testing
---


Thanks,

pengcheng xiong



[jira] [Created] (HIVE-16092) Generate and use universal mmId instead of per db/table

2017-03-02 Thread Wei Zheng (JIRA)
Wei Zheng created HIVE-16092:


 Summary: Generate and use universal mmId instead of per db/table
 Key: HIVE-16092
 URL: https://issues.apache.org/jira/browse/HIVE-16092
 Project: Hive
  Issue Type: Sub-task
Reporter: Wei Zheng
Assignee: Wei Zheng


To facilitate its later replacement with txnId



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: [DISCUSS] Spark's fork of hive

2017-03-02 Thread Alan Gates
I think the issues you point out fall into a couple of different buckets:

1) Spark forking Hive code.  As you point out, based on the license, there's 
nothing wrong with this (see naming concerns below).  I agree it's crappy 
technical practice because in the long term Hive and Spark will diverge and the 
Spark community will either give up on interoperability or spend more and more 
time maintaining it.  But if their MO is "we take the best of whatever you 
write and include it in Spark", then I think all we can do about it is 1) 
remember that imitation is the sincerest form of flattery; and 2) see what of 
theirs we can incorporate into Hive.

2) I agree that they should not call Hive what they incorporate into Spark.  In 
particular shipping maven jars with org.apache.hive that do not contain the 
same functionality as ours seems problematic.  IIRC the Hive community raised 
concerns about this before with the Spark community.  I don't recall the 
outcome.  But it would make sense to me to approach the Spark community and ask 
that they not do this.

As for them dissing on us in benchmarks, we all know you can set up Hive to run 
like a mule (use MR on text files) and people do it all the time to make their 
stuff look good.  I'm not sure what to do about that other than publish our own 
benchmarks showing what Hive can do.

Alan.

> On Mar 2, 2017, at 6:55 AM, Edward Capriolo  wrote:
> 
> All,
> 
> I have compiled a short (non exhaustive) list of items related to Spark's
> forking of Apache Hive code and usage of Apache Hive trademarks.
> 
> 1)
> 
> The original spark proposal repeatedly claims that Spark "inter operates"
> with hive.
> 
> https://wiki.apache.org/incubator/SparkProposal
> 
> "Finally, Shark (a higher layer framework built on Spark) inter-operates
> with Apache Hive."
> 
> (EC note: Originally spark may have linked to hive, but now the situation
> is much different.)
> -
> 
> 2)
> --
> Spark distributes jar files to maven repositories carrying the hive name.
> 
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec
> 
> (EC note These are not simple "ports" features are added/missing/broken in
> artifacts named "hive")
> ---
> 
> 3)
> -
> Spark carries forked and modified copies of hive source code
> 
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionHookContextImpl.java
> 
> 
> 4
> ---
> Spark has "imported" and modified components of hive
> 
> 
> https://issues.apache.org/jira/browse/SPARK-12572
> 
> (EC note: Further discussions of the code make little no reference to it's
> origins in propaganda)
> -
> 
> 5
> 
> Databricks, a company heaving involved in spark development, uses the Hive
> trademark to make claims
> 
> https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html
> 
> "The Databricks platform provides a fully managed Hive Metastore that
> allows users to share a data catalog across multiple Spark clusters."
> 
> 
> This blog defining hadoop (draft) is clear on this:
> https://wiki.apache.org/hadoop/Defining%20Hadoop
> 
> "Products that are derivative works of Apache Hadoop are not Apache Hadoop,
> and may not call themselves versions of Apache Hadoop, nor Distributions of
> Apache Hadoop."
> 
> 
> 
> 6
> --
> https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html
> 
> "Apache Spark supports multiple versions of Hive, from 0.12 up to 1.2.1. "
> 
> Apache spark can NOT support multiple versions of Hive because they are
> working with a fork, and there is no standard body for "supporting hive"
> 
> Some products have been released that have been described as "compatible"
> with Hadoop, even though parts of the Hadoop codebase have either been
> changed or replaced. The Apache™ Hadoop® developer team are not a standards
> body: they do not qualify such (derivative) works as compatible. Nor do
> they feel constrained by the requirements of external entities when
> changing the behavior of Apache Hadoop software or related Apache software.
> ---
> 
> 7
> -
> The spark committers openly use the word "take" during the process of
> "importing" hive code.
> 
> https://github.com/apache/spark/pull/10583/files
> "are there unit tests from Hive that we can take?"
> 
> Apache foundation will not take a hostile fork for a proposal. Had the
> original Spark proposal implied they wished to fork portions of the hive
> code base, I would have considered it a hostile fork. (this is open to
> 

Re: [DISCUSS] Spark's fork of hive

2017-03-02 Thread Edward Capriolo
On Thu, Mar 2, 2017 at 12:35 PM, Gopal Vijayaraghavan wrote:

>
> > Had the original Spark proposal implied they wished to fork portions of
> the hive
> > code base, I would have considered it a hostile fork. (this is open to
> interpretation).
>
> FYI, I did ask bluntly whether Spark intends to cut-paste Hive code into
> their repos previously & got an affirmative answer from rxin.
>
> http://grokbase.com/t/hive/dev/15cjb3kjvn/using-the-hive-
> sql-parser-in-spark
>
> > People have the right to fork it via the licence. We can not stop that.
>
> Later, I did get a response that they never made a release with the said
> copy-paste & they deprecated the "HiveContext" object in Spark 2.0.
>
> > than what Hive could handle at peak."
> >
> >  (EC Note: How is this statement verifiable?)
>
> Reading about Hive at Facebook, I feel like we've already solved those
> problems that were due to FB Corona + Hadoop-1 (or, 0.20 *shudder*)
> limitations.
>
> Spark does not need be limited by Corona and the version of Hive being
> compared might not have YARN or Tez on its side.
>
> Cheers,
> Gopal
>
> On 3/2/17, 8:25 PM, "Edward Capriolo"  wrote:
>
> All,
>
> I have compiled a short (non exhaustive) list of items related to
> Spark's
> forking of Apache Hive code and usage of Apache Hive trademarks.
>
> 1)
> 
> The original spark proposal repeatedly claims that Spark "inter
> operates"
> with hive.
>
> https://wiki.apache.org/incubator/SparkProposal
>
> "Finally, Shark (a higher layer framework built on Spark)
> inter-operates
> with Apache Hive."
>
> (EC note: Originally spark may have linked to hive, but now the
> situation
> is much different.)
> -
>
> 2)
> --
> Spark distributes jar files to maven repositories carrying the hive
> name.
>
> https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec
>
> (EC note These are not simple "ports" features are
> added/missing/broken in
> artifacts named "hive")
> ---
>
> 3)
> -
> Spark carries forked and modified copies of hive source code
>
> https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a
> 1da3ee84cc/sql/hive-thriftserver/src/main/java/
> org/apache/hive/service/cli/session/HiveSessionHookContextImpl.java
> 
>
> 4
> ---
> Spark has "imported" and modified components of hive
>
>
> https://issues.apache.org/jira/browse/SPARK-12572
>
> (EC note: Further discussions of the code make little no reference to
> it's
> origins in propaganda)
> -
>
> 5
> 
> Databricks, a company heaving involved in spark development, uses the
> Hive
> trademark to make claims
>
> https://databricks.com/blog/2017/01/30/integrating-
> central-hive-metastore-apache-spark-databricks.html
>
> "The Databricks platform provides a fully managed Hive Metastore that
> allows users to share a data catalog across multiple Spark clusters."
>
>
> This blog defining hadoop (draft) is clear on this:
> https://wiki.apache.org/hadoop/Defining%20Hadoop
>
> "Products that are derivative works of Apache Hadoop are not Apache
> Hadoop,
> and may not call themselves versions of Apache Hadoop, nor
> Distributions of
> Apache Hadoop."
>
> 
>
> 6
> --
> https://databricks.com/blog/2017/01/30/integrating-
> central-hive-metastore-apache-spark-databricks.html
>
> "Apache Spark supports multiple versions of Hive, from 0.12 up to
> 1.2.1. "
>
> Apache spark can NOT support multiple versions of Hive because they are
> working with a fork, and there is no standard body for "supporting
> hive"
>
> Some products have been released that have been described as
> "compatible"
> with Hadoop, even though parts of the Hadoop codebase have either been
> changed or replaced. The Apache™ Hadoop® developer team are not a
> standards
> body: they do not qualify such (derivative) works as compatible. Nor do
> they feel constrained by the requirements of external entities when
> changing the behavior of Apache Hadoop software or related Apache
> software.
> ---
>
> 7
> -
> The spark committers openly use the word "take" during the process of
> "importing" hive code.
>
> https://github.com/apache/spark/pull/10583/files
> "are there unit tests from Hive that we can take?"
>
> Apache foundation will not take a hostile fork for a proposal. Had the
> original Spark proposal implied they wished to fork portions of the
> hive
> code base, I would have considered it 

[jira] [Created] (HIVE-16091) Support scalar subqueries in project/select

2017-03-02 Thread Vineet Garg (JIRA)
Vineet Garg created HIVE-16091:
--

 Summary: Support scalar subqueries in project/select
 Key: HIVE-16091
 URL: https://issues.apache.org/jira/browse/HIVE-16091
 Project: Hive
  Issue Type: Sub-task
Reporter: Vineet Garg
Assignee: Vineet Garg


Currently scalar subqueries are supported in filter only (WHERE/HAVING). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16090) TestHiveMetastoreChecker should use METASTORE_FS_HANDLER_THREADS_COUNT instead of HIVE_MOVE_FILES_THREAD_COUNT

2017-03-02 Thread Vihang Karajgaonkar (JIRA)
Vihang Karajgaonkar created HIVE-16090:
--

 Summary: TestHiveMetastoreChecker should use 
METASTORE_FS_HANDLER_THREADS_COUNT instead of HIVE_MOVE_FILES_THREAD_COUNT
 Key: HIVE-16090
 URL: https://issues.apache.org/jira/browse/HIVE-16090
 Project: Hive
  Issue Type: Task
  Components: Hive
Reporter: Vihang Karajgaonkar
Assignee: Vihang Karajgaonkar
Priority: Minor


HIVE-16014 changed the HiveMetastoreChecker to use 
{{METASTORE_FS_HANDLER_THREADS_COUNT}} for the pool size. Some of the tests in 
HiveMetastoreChecker still use {{HIVE_MOVE_FILES_THREAD_COUNT}}, which leads to 
incorrect test behavior.
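
For reference, a hedged sketch of the intended test configuration through the HiveConf API (the enum names come from this issue; the thread count of 15 is an arbitrary illustration):

{code}
import org.apache.hadoop.hive.conf.HiveConf;

public class CheckerPoolConfSketch {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Size the checker's pool via the metastore FS handler setting,
    // not the move-files setting, per this issue. 15 is arbitrary.
    conf.setIntVar(HiveConf.ConfVars.METASTORE_FS_HANDLER_THREADS_COUNT, 15);
    System.out.println(conf.getIntVar(HiveConf.ConfVars.METASTORE_FS_HANDLER_THREADS_COUNT));
  }
}
{code}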



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: [DISCUSS] Spark's fork of hive

2017-03-02 Thread Gopal Vijayaraghavan

> Had the original Spark proposal implied they wished to fork portions of the 
> hive
> code base, I would have considered it a hostile fork. (this is open to 
> interpretation).

FYI, I previously asked bluntly whether Spark intends to cut-and-paste Hive code into 
their repos & got an affirmative answer from rxin.

http://grokbase.com/t/hive/dev/15cjb3kjvn/using-the-hive-sql-parser-in-spark

> People have the right to fork it via the licence. We can not stop that.

Later, I did get a response that they never made a release with the said 
copy-paste & they deprecated the "HiveContext" object in Spark 2.0.

> than what Hive could handle at peak."
> 
>  (EC Note: How is this statement verifiable?)

Reading about Hive at Facebook, I feel like we've already solved those problems 
that were due to FB Corona + Hadoop-1 (or, 0.20 *shudder*) limitations.

Spark does not need to be limited by Corona, and the version of Hive being compared 
might not have YARN or Tez on its side.

Cheers,
Gopal

On 3/2/17, 8:25 PM, "Edward Capriolo"  wrote:

All,

I have compiled a short (non exhaustive) list of items related to Spark's
forking of Apache Hive code and usage of Apache Hive trademarks.

1)

The original spark proposal repeatedly claims that Spark "inter operates"
with hive.

https://wiki.apache.org/incubator/SparkProposal

"Finally, Shark (a higher layer framework built on Spark) inter-operates
with Apache Hive."

(EC note: Originally spark may have linked to hive, but now the situation
is much different.)
-

2)
--
Spark distributes jar files to maven repositories carrying the hive name.

https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec

(EC note These are not simple "ports" features are added/missing/broken in
artifacts named "hive")
---

3)
-
Spark carries forked and modified copies of hive source code


https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionHookContextImpl.java


4
---
Spark has "imported" and modified components of hive


https://issues.apache.org/jira/browse/SPARK-12572

(EC note: Further discussions of the code make little no reference to it's
origins in propaganda)
-

5

Databricks, a company heaving involved in spark development, uses the Hive
trademark to make claims


https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"The Databricks platform provides a fully managed Hive Metastore that
allows users to share a data catalog across multiple Spark clusters."


This blog defining hadoop (draft) is clear on this:
https://wiki.apache.org/hadoop/Defining%20Hadoop

"Products that are derivative works of Apache Hadoop are not Apache Hadoop,
and may not call themselves versions of Apache Hadoop, nor Distributions of
Apache Hadoop."



6
--

https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"Apache Spark supports multiple versions of Hive, from 0.12 up to 1.2.1. "

Apache spark can NOT support multiple versions of Hive because they are
working with a fork, and there is no standard body for "supporting hive"

Some products have been released that have been described as "compatible"
with Hadoop, even though parts of the Hadoop codebase have either been
changed or replaced. The Apache™ Hadoop® developer team are not a standards
body: they do not qualify such (derivative) works as compatible. Nor do
they feel constrained by the requirements of external entities when
changing the behavior of Apache Hadoop software or related Apache software.
---

7
-
The spark committers openly use the word "take" during the process of
"importing" hive code.

https://github.com/apache/spark/pull/10583/files
"are there unit tests from Hive that we can take?"

Apache foundation will not take a hostile fork for a proposal. Had the
original Spark proposal implied they wished to fork portions of the hive
code base, I would have considered it a hostile fork. (this is open to
interpretation).

(EC Note: Is this the Apache way? How can we build communities? How would
small projects feel if for example hive "imported" copying code while they
   

Re: Review Request 54065: HIVE-15282: Different modification times are used when an index is built and when its staleness is checked

2017-03-02 Thread Marta Kuczora


> On Jan. 27, 2017, 4:18 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> > Line 967 (original), 978 (patched)
> > 
> >
> > Agree with Peter. Should we use the dataLocation variable instead of 
> > calling the method?

Agree too. :) Fixed it.


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54065/#review163281
---


On Dec. 12, 2016, 1:04 p.m., Marta Kuczora wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54065/
> ---
> 
> (Updated Dec. 12, 2016, 1:04 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Chaoyu Tang, Peter Vary, and Sergio Pena.
> 
> 
> Bugs: HIVE-15282
> https://issues.apache.org/jira/browse/HIVE-15282
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Changed the way how the modification time is determined for partitions in the 
> DDLTask.alterIndex method to be the same as when the index staleness is 
> checked. Instead of using the modification date of the partition folder, go 
> through the files in the folder and use the highest modification time and 
> save it as index property. With this we can avoid the issue when the folder 
> and the file is created when the second turns. So the modification time of 
> the folder is in the previous second compared to the modification time of the 
> file.
> If the partition folder doesn't contain any files, then use the folder's 
> modification time, just as before.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java cfece77 
> 
> 
> Diff: https://reviews.apache.org/r/54065/diff/1/
> 
> 
> Testing
> ---
> 
> Ran the index_auto_mult_tables_compact and index_auto_mult_tables q tests 
> multiple times, with hard-coded delay with which the test failure described 
> in HIVE-15282 could be reproduced. With the patch, the tests were always 
> successful.
> Also ran all index related q tests.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>



Re: Review Request 54065: HIVE-15282: Different modification times are used when an index is built and when its staleness is checked

2017-03-02 Thread Marta Kuczora


> On Nov. 24, 2016, 5:15 p.m., Barna Zsombor Klara wrote:
> > Thanks for the patch.
> > I haven't looked into it closely, but I guess there is no way to write a 
> > junit test to cover it?

I tried to figure out how this could be unit tested. The problem was that a lot 
of things had to be mocked and I ended up with a huge, complicated unit test.


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54065/#review156858
---


On Dec. 12, 2016, 1:04 p.m., Marta Kuczora wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54065/
> ---
> 
> (Updated Dec. 12, 2016, 1:04 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Chaoyu Tang, Peter Vary, and Sergio Pena.
> 
> 
> Bugs: HIVE-15282
> https://issues.apache.org/jira/browse/HIVE-15282
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Changed the way how the modification time is determined for partitions in the 
> DDLTask.alterIndex method to be the same as when the index staleness is 
> checked. Instead of using the modification date of the partition folder, go 
> through the files in the folder and use the highest modification time and 
> save it as index property. With this we can avoid the issue when the folder 
> and the file is created when the second turns. So the modification time of 
> the folder is in the previous second compared to the modification time of the 
> file.
> If the partition folder doesn't contain any files, then use the folder's 
> modification time, just as before.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java cfece77 
> 
> 
> Diff: https://reviews.apache.org/r/54065/diff/1/
> 
> 
> Testing
> ---
> 
> Ran the index_auto_mult_tables_compact and index_auto_mult_tables q tests 
> multiple times, with hard-coded delay with which the test failure described 
> in HIVE-15282 could be reproduced. With the patch, the tests were always 
> successful.
> Also ran all index related q tests.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>



Re: Review Request 54065: HIVE-15282: Different modification times are used when an index is built and when its staleness is checked

2017-03-02 Thread Marta Kuczora

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54065/
---

(Updated March 2, 2017, 3:32 p.m.)


Review request for hive, Aihua Xu, Chaoyu Tang, Peter Vary, and Sergio Pena.


Changes
---

Fixed the patch according to the review.


Bugs: HIVE-15282
https://issues.apache.org/jira/browse/HIVE-15282


Repository: hive-git


Description
---

Changed the way the modification time is determined for partitions in the 
DDLTask.alterIndex method to match how it is determined when the index staleness 
is checked. Instead of using the modification date of the partition folder, go 
through the files in the folder, take the highest modification time, and save it 
as an index property. With this we can avoid the issue where the folder and the 
file are created across a second boundary, so that the modification time of the 
folder falls in the second before the modification time of the file.
If the partition folder doesn't contain any files, then use the folder's 
modification time, just as before.
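
A minimal sketch of the described approach, assuming Hadoop's FileSystem API (an illustration, not the actual patch):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexMtimeSketch {
  // Start from the folder's own modification time, then take the highest
  // modification time among the files directly inside it, so a file created
  // just after a second boundary wins over the folder's timestamp.
  static long partitionModificationTime(Path dataLocation, Configuration conf)
      throws IOException {
    FileSystem fs = dataLocation.getFileSystem(conf);
    long mtime = fs.getFileStatus(dataLocation).getModificationTime();
    for (FileStatus status : fs.listStatus(dataLocation)) {
      if (status.isFile()) {
        mtime = Math.max(mtime, status.getModificationTime());
      }
    }
    return mtime; // saved as the index property
  }
}
{code}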


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 641e9d9 


Diff: https://reviews.apache.org/r/54065/diff/2/

Changes: https://reviews.apache.org/r/54065/diff/1-2/


Testing
---

Ran the index_auto_mult_tables_compact and index_auto_mult_tables q tests 
multiple times, with hard-coded delay with which the test failure described in 
HIVE-15282 could be reproduced. With the patch, the tests were always 
successful.
Also ran all index related q tests.


Thanks,

Marta Kuczora



Re: Review Request 54065: HIVE-15282: Different modification times are used when an index is built and when its staleness is checked

2017-03-02 Thread Marta Kuczora


> On Jan. 27, 2017, 4:18 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> > Lines 972 (patched)
> > 
> >
> > If this condition does not happen ever, then lastModificationTime will 
> > end up being null, and basePartTs will contain the null value. Should we 
> > use the dataLocation timestamp in case of this condition is never called?

Yes, you are right! This issue is fixed by fixing the previous one.


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54065/#review163281
---


On Dec. 12, 2016, 1:04 p.m., Marta Kuczora wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54065/
> ---
> 
> (Updated Dec. 12, 2016, 1:04 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Chaoyu Tang, Peter Vary, and Sergio Pena.
> 
> 
> Bugs: HIVE-15282
> https://issues.apache.org/jira/browse/HIVE-15282
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Changed the way how the modification time is determined for partitions in the 
> DDLTask.alterIndex method to be the same as when the index staleness is 
> checked. Instead of using the modification date of the partition folder, go 
> through the files in the folder and use the highest modification time and 
> save it as index property. With this we can avoid the issue when the folder 
> and the file is created when the second turns. So the modification time of 
> the folder is in the previous second compared to the modification time of the 
> file.
> If the partition folder doesn't contain any files, then use the folder's 
> modification time, just as before.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java cfece77 
> 
> 
> Diff: https://reviews.apache.org/r/54065/diff/1/
> 
> 
> Testing
> ---
> 
> Ran the index_auto_mult_tables_compact and index_auto_mult_tables q tests 
> multiple times, with hard-coded delay with which the test failure described 
> in HIVE-15282 could be reproduced. With the patch, the tests were always 
> successful.
> Also ran all index related q tests.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>



Re: Review Request 54065: HIVE-15282: Different modification times are used when an index is built and when its staleness is checked

2017-03-02 Thread Marta Kuczora


> On Jan. 27, 2017, 4:18 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> > Lines 968 (patched)
> > 
> >
> > shouldn't be easier if we set the dataLocation modification time first 
> > to lastModificationTime, and then compare this value with the rest of the 
> > partitions found? This way we could avoid the null value and the Long 
> > object, and use long instead.

Thanks a lot for the review!
Yeah, you are right, I fixed it.


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54065/#review163281
---


On Dec. 12, 2016, 1:04 p.m., Marta Kuczora wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54065/
> ---
> 
> (Updated Dec. 12, 2016, 1:04 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Chaoyu Tang, Peter Vary, and Sergio Pena.
> 
> 
> Bugs: HIVE-15282
> https://issues.apache.org/jira/browse/HIVE-15282
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Changed the way how the modification time is determined for partitions in the 
> DDLTask.alterIndex method to be the same as when the index staleness is 
> checked. Instead of using the modification date of the partition folder, go 
> through the files in the folder and use the highest modification time and 
> save it as index property. With this we can avoid the issue when the folder 
> and the file is created when the second turns. So the modification time of 
> the folder is in the previous second compared to the modification time of the 
> file.
> If the partition folder doesn't contain any files, then use the folder's 
> modification time, just as before.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java cfece77 
> 
> 
> Diff: https://reviews.apache.org/r/54065/diff/1/
> 
> 
> Testing
> ---
> 
> Ran the index_auto_mult_tables_compact and index_auto_mult_tables q tests 
> multiple times, with hard-coded delay with which the test failure described 
> in HIVE-15282 could be reproduced. With the patch, the tests were always 
> successful.
> Also ran all index related q tests.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>



Re: Review Request 54065: HIVE-15282: Different modification times are used when an index is built and when its staleness is checked

2017-03-02 Thread Marta Kuczora


> On Nov. 24, 2016, 5:15 p.m., Barna Zsombor Klara wrote:
> > Thanks for the patch.
> > I haven't looked into it closely, but I guess there is no way to write a 
> > junit test to cover it?
> 
> Marta Kuczora wrote:
> I tried to figure out how this could be unit tested. The problem was that 
> a lot of things had to be mocked and I ended up with a huge, complicated unit 
> test.

Thanks a lot for the review by the way! :)


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54065/#review156858
---


On Dec. 12, 2016, 1:04 p.m., Marta Kuczora wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54065/
> ---
> 
> (Updated Dec. 12, 2016, 1:04 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Chaoyu Tang, Peter Vary, and Sergio Pena.
> 
> 
> Bugs: HIVE-15282
> https://issues.apache.org/jira/browse/HIVE-15282
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Changed the way how the modification time is determined for partitions in the 
> DDLTask.alterIndex method to be the same as when the index staleness is 
> checked. Instead of using the modification date of the partition folder, go 
> through the files in the folder and use the highest modification time and 
> save it as index property. With this we can avoid the issue when the folder 
> and the file is created when the second turns. So the modification time of 
> the folder is in the previous second compared to the modification time of the 
> file.
> If the partition folder doesn't contain any files, then use the folder's 
> modification time, just as before.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java cfece77 
> 
> 
> Diff: https://reviews.apache.org/r/54065/diff/1/
> 
> 
> Testing
> ---
> 
> Ran the index_auto_mult_tables_compact and index_auto_mult_tables q tests 
> multiple times, with hard-coded delay with which the test failure described 
> in HIVE-15282 could be reproduced. With the patch, the tests were always 
> successful.
> Also ran all index related q tests.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>



Re: Review Request 54065: HIVE-15282: Different modification times are used when an index is built and when its staleness is checked

2017-03-02 Thread Marta Kuczora


> On Nov. 28, 2016, 11:10 a.m., Peter Vary wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
> > Line 967 (original), 978 (patched)
> > 
> >
> > We might want to use the dataLocation local variable here

Thanks a lot for the review!
You are right, I fixed it.


- Marta


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54065/#review157014
---


On Dec. 12, 2016, 1:04 p.m., Marta Kuczora wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54065/
> ---
> 
> (Updated Dec. 12, 2016, 1:04 p.m.)
> 
> 
> Review request for hive, Aihua Xu, Chaoyu Tang, Peter Vary, and Sergio Pena.
> 
> 
> Bugs: HIVE-15282
> https://issues.apache.org/jira/browse/HIVE-15282
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Changed the way how the modification time is determined for partitions in the 
> DDLTask.alterIndex method to be the same as when the index staleness is 
> checked. Instead of using the modification date of the partition folder, go 
> through the files in the folder and use the highest modification time and 
> save it as index property. With this we can avoid the issue when the folder 
> and the file is created when the second turns. So the modification time of 
> the folder is in the previous second compared to the modification time of the 
> file.
> If the partition folder doesn't contain any files, then use the folder's 
> modification time, just as before.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java cfece77 
> 
> 
> Diff: https://reviews.apache.org/r/54065/diff/1/
> 
> 
> Testing
> ---
> 
> Ran the index_auto_mult_tables_compact and index_auto_mult_tables q tests 
> multiple times, with hard-coded delay with which the test failure described 
> in HIVE-15282 could be reproduced. With the patch, the tests were always 
> successful.
> Also ran all index related q tests.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>



Re: Review Request 53845: 'like any' and 'like all' operators in hive

2017-03-02 Thread Simanchal Das

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53845/
---

(Updated March 2, 2017, 3:26 p.m.)


Review request for hive, Carl Steinbach and Vineet Garg.


Repository: hive-git


Description
---

https://issues.apache.org/jira/browse/HIVE-15229


In Teradata, the 'like any' and 'like all' operators are mostly used when 
matching a text field against a number of patterns.
'like any' and 'like all' are equivalent to multiple like conditions, as in the 
examples below.
--like any
select col1 from table1 where col2 like any ('%accountant%', '%accounting%', 
'%retail%', '%bank%', '%insurance%');

--Can be written using multiple like condition 
select col1 from table1 where col2 like '%accountant%' or col2 like 
'%accounting%' or col2 like '%retail%' or col2 like '%bank%' or col2 like 
'%insurance%' ;

--like all
select col1 from table1 where col2 like all ('%accountant%', '%accounting%', 
'%retail%', '%bank%', '%insurance%');

--Can be written using multiple like operator 
select col1 from table1 where col2 like '%accountant%' and col2 like 
'%accounting%' and col2 like '%retail%' and col2 like '%bank%' and col2 like 
'%insurance%' ;
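
For illustration, a plain-Java sketch of these semantics (this is not the patch's GenericUDF code; the NULL handling follows the behavior documented later in this review, where a NULL left-hand side or a NULL pattern yields NULL):

{code}
import java.util.regex.Pattern;

public class LikeAnyAllSketch {
  // Translate a SQL LIKE pattern (% = any run of chars, _ = any one char)
  // into an anchored Java regex.
  static Pattern toRegex(String like) {
    return Pattern.compile(
        Pattern.quote(like).replace("%", "\\E.*\\Q").replace("_", "\\E.\\Q"));
  }

  // LIKE ANY: OR over the patterns; any NULL operand or pattern yields NULL.
  static Boolean likeAny(String s, String... patterns) {
    if (s == null) return null;
    for (String p : patterns) if (p == null) return null;
    for (String p : patterns) if (toRegex(p).matcher(s).matches()) return true;
    return false;
  }

  // LIKE ALL: AND over the patterns; any NULL operand or pattern yields NULL.
  static Boolean likeAll(String s, String... patterns) {
    if (s == null) return null;
    for (String p : patterns) if (p == null) return null;
    for (String p : patterns) if (!toRegex(p).matcher(s).matches()) return false;
    return true;
  }

  public static void main(String[] args) {
    System.out.println(likeAny("senior accountant", "%accountant%", "%bank%")); // true
    System.out.println(likeAll("senior accountant", "%accountant%", "%bank%")); // false
  }
}
{code}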

Problem statement:

Nowadays many data warehouse projects are being migrated from Teradata to 
Hive.
Data engineers and business analysts regularly look for these two operators.
If we introduce these two operators in Hive, then many scripts can be migrated 
smoothly instead of having these operators converted to multiple like 
conditions.


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java aaf2399 
  ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g ad61f83 
  ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g 61778f6 
  ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g 81efadc 
  ql/src/java/org/apache/hadoop/hive/ql/parse/TypeCheckProcFactory.java f979c14 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLikeAll.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLikeAny.java 
PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFLikeAll.java 
PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFLikeAny.java 
PRE-CREATION 
  ql/src/test/queries/clientnegative/udf_likeall_wrong1.q PRE-CREATION 
  ql/src/test/queries/clientnegative/udf_likeany_wrong1.q PRE-CREATION 
  ql/src/test/queries/clientpositive/udf_likeall.q PRE-CREATION 
  ql/src/test/queries/clientpositive/udf_likeany.q PRE-CREATION 
  ql/src/test/results/clientnegative/udf_likeall_wrong1.q.out PRE-CREATION 
  ql/src/test/results/clientnegative/udf_likeany_wrong1.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/show_functions.q.out 3c9bb4a 
  ql/src/test/results/clientpositive/udf_likeall.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/udf_likeany.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/53845/diff/8/

Changes: https://reviews.apache.org/r/53845/diff/7-8/


Testing
---

Junit test cases and query.q files are attached


Thanks,

Simanchal Das



Re: Review Request 53845: 'like any' and 'like all' operators in hive

2017-03-02 Thread Simanchal Das


> On March 2, 2017, 4:57 a.m., Carl Steinbach wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g
> > Line 2 (original), 2 (patched)
> > 
> >
> > Please make whitespace fixes in a separate patch. Including them here 
> > pollutes the diff and draws the attention of reviewers away from the 
> > important parts. I saw the same problem in several other files as well.

Added a new patch after fixing the whitespace.


> On March 2, 2017, 4:57 a.m., Carl Steinbach wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLikeAll.java
> > Lines 51 (patched)
> > 
> >
> > Please change to "Returns NULL if the expression on the left hand side 
> > is NULL or if one of the patterns in the list is NULL."

Added this message.


- Simanchal


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53845/#review167648
---


On Feb. 15, 2017, 6:47 a.m., Simanchal Das wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53845/
> ---
> 
> (Updated Feb. 15, 2017, 6:47 a.m.)
> 
> 
> Review request for hive, Carl Steinbach and Vineet Garg.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> https://issues.apache.org/jira/browse/HIVE-15229
> 
> 
> In Teradata 'like any' and 'like all' operators are mostly used when we are 
> matching a text field with numbers of patterns.
> 'like any' and 'like all' operator are equivalents of multiple like operator 
> like example below.
> --like any
> select col1 from table1 where col2 like any ('%accountant%', '%accounting%', 
> '%retail%', '%bank%', '%insurance%');
> 
> --Can be written using multiple like condition 
> select col1 from table1 where col2 like '%accountant%' or col2 like 
> '%accounting%' or col2 like '%retail%' or col2 like '%bank%' or col2 like 
> '%insurance%' ;
> 
> --like all
> select col1 from table1 where col2 like all ('%accountant%', '%accounting%', 
> '%retail%', '%bank%', '%insurance%');
> 
> --Can be written using multiple like operator 
> select col1 from table1 where col2 like '%accountant%' and col2 like 
> '%accounting%' and col2 like '%retail%' and col2 like '%bank%' and col2 like 
> '%insurance%' ;
> 
> Problem statement:
> 
> Now a days so many data warehouse projects are being migrated from Teradata 
> to Hive.
> Always Data engineer and Business analyst are searching for these two 
> operator.
> If we introduce these two operator in hive then so many scripts will be 
> migrated smoothly instead of converting these operators to multiple like 
> operators.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 0f05160 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g f80642b 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g eb81393 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g 81efadc 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/TypeCheckProcFactory.java 
> f979c14 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLikeAll.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLikeAny.java 
> PRE-CREATION 
>   
> ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFLikeAll.java 
> PRE-CREATION 
>   
> ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFLikeAny.java 
> PRE-CREATION 
>   ql/src/test/queries/clientnegative/udf_likeall_wrong1.q PRE-CREATION 
>   ql/src/test/queries/clientnegative/udf_likeany_wrong1.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/udf_likeall.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/udf_likeany.q PRE-CREATION 
>   ql/src/test/results/clientnegative/udf_likeall_wrong1.q.out PRE-CREATION 
>   ql/src/test/results/clientnegative/udf_likeany_wrong1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/show_functions.q.out 3c9bb4a 
>   ql/src/test/results/clientpositive/udf_likeall.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/udf_likeany.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/53845/diff/7/
> 
> 
> Testing
> ---
> 
> Junit test cases and query.q files are attached
> 
> 
> Thanks,
> 
> Simanchal Das
> 
>



[DISCUSS] Spark's fork of hive

2017-03-02 Thread Edward Capriolo
All,

I have compiled a short (non-exhaustive) list of items related to Spark's
forking of Apache Hive code and usage of Apache Hive trademarks.

1)

The original spark proposal repeatedly claims that Spark "inter-operates"
with hive.

https://wiki.apache.org/incubator/SparkProposal

"Finally, Shark (a higher layer framework built on Spark) inter-operates
with Apache Hive."

(EC note: Originally spark may have linked to hive, but now the situation
is much different.)
-

2)
--
Spark distributes jar files to maven repositories carrying the hive name.

https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec

(EC note: These are not simple "ports"; features are added/missing/broken in
artifacts named "hive".)
---

3)
-
Spark carries forked and modified copies of hive source code

https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionHookContextImpl.java


4
---
Spark has "imported" and modified components of hive


https://issues.apache.org/jira/browse/SPARK-12572

(EC note: Further discussion of the code makes little to no reference to its
origins in the propaganda.)
-

5

Databricks, a company heavily involved in Spark development, uses the Hive
trademark to make claims

https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"The Databricks platform provides a fully managed Hive Metastore that
allows users to share a data catalog across multiple Spark clusters."


This draft wiki page defining Hadoop is clear on this:
https://wiki.apache.org/hadoop/Defining%20Hadoop

"Products that are derivative works of Apache Hadoop are not Apache Hadoop,
and may not call themselves versions of Apache Hadoop, nor Distributions of
Apache Hadoop."



6
--
https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"Apache Spark supports multiple versions of Hive, from 0.12 up to 1.2.1. "

Apache Spark can NOT support multiple versions of Hive because they are
working with a fork, and there is no standards body for "supporting hive".

Some products have been released that have been described as "compatible"
with Hadoop, even though parts of the Hadoop codebase have either been
changed or replaced. The Apache™ Hadoop® developer team are not a standards
body: they do not qualify such (derivative) works as compatible. Nor do
they feel constrained by the requirements of external entities when
changing the behavior of Apache Hadoop software or related Apache software.
---

7
-
The spark committers openly use the word "take" during the process of
"importing" hive code.

https://github.com/apache/spark/pull/10583/files
"are there unit tests from Hive that we can take?"

The Apache foundation will not accept a hostile fork as a proposal. Had the
original Spark proposal implied they wished to fork portions of the Hive
code base, I would have considered it a hostile fork. (This is open to
interpretation.)

(EC Note: Is this the Apache way? How can we build communities? How would
small projects feel if, for example, Hive "imported" their code by copying it
while they sat in incubation?)
--

8

Databricks (after borrowing slabs of Hive code, using our trademarks, etc.)
makes disparaging comments about the performance of Hive.

https://databricks.com/blog/2017/02/28/voice-facebook-using-apache-spark-large-scale-language-model-training.html

"Spark-based pipelines can scale comfortably to process many times more
input data than what Hive could handle at peak. "

(EC Note: How is this statement verifiable?)
---

9
--
https://issues.apache.org/jira/browse/SPARK-10793

It's easily enough added to the code; there's just the risk of the fork
diverging more from ASF Hive.

(EC Note: Even those responsible for this admit the code is diverging and
will diverge more as a result of their actions.)


10
--

My opinion of all of this:
The above points are hurtful to Hive. First, we are robbed of community.
People could be improving Hive by making it more modular, but instead they
are improving Spark's fork of Hive. Next, our code base is subject to
continued "poaching". Apache Spark "imports", copies, alters, and claims
compatibility with/from Hive (I pointed out above why the compatibility
claims should not be made). Finally, we are subject to unfair performance
comparisons ("x is faster than hive") by software (Spark) that is
essentially

*POWERED BY Hive (via the 

[jira] [Created] (HIVE-16089) "trustStorePassword" is logged as part of jdbc connection url

2017-03-02 Thread Sebastian (JIRA)
Sebastian created HIVE-16089:


 Summary: "trustStorePassword" is logged as part of jdbc connection 
url
 Key: HIVE-16089
 URL: https://issues.apache.org/jira/browse/HIVE-16089
 Project: Hive
  Issue Type: Bug
  Components: JDBC
Affects Versions: 1.1.0
Reporter: Sebastian


h5. General Story
The use case is to connect via the Apache Hive JDBC driver to a Hive instance 
where SSL encryption is enabled.
It was required to set the SSL trust store password property 
{{trustStorePassword}} in the JDBC connection URL.

If the property is passed via the "properties" parameter into {{Driver.connect(url, 
properties)}}, it is not recognized.

h5. Suggested Behavior
The property {{trustStorePassword}} could be part of the "properties" 
parameter. This way the password is not part of the JDBC connection URL.

h5. Acceptance Criteria
The SSL trust store password should not be logged as part of the JDBC 
connection string.
Support the trust store password via the properties parameter within connect.
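
A minimal sketch of the suggested usage (the host, trust store path, and environment variable are hypothetical; per this report, the Properties route is not recognized today, which is exactly what the issue asks to support):

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class HiveSslConnectSketch {
  public static void main(String[] args) throws Exception {
    // Only non-secret SSL settings stay in the (logged) URL.
    String url = "jdbc:hive2://example-host:10000/default;ssl=true;"
        + "sslTrustStore=/path/to/truststore.jks";

    Properties props = new Properties();
    props.setProperty("user", "hive");
    // Suggested behavior: pass the password here so it never appears in the URL.
    // Assumes TRUSTSTORE_PASSWORD is set in the environment.
    props.setProperty("trustStorePassword", System.getenv("TRUSTSTORE_PASSWORD"));

    try (Connection conn = DriverManager.getConnection(url, props)) {
      System.out.println("Connected: " + !conn.isClosed());
    }
  }
}
{code}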



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16088) Fix hive conf property name introduced in HIVE-12767

2017-03-02 Thread Barna Zsombor Klara (JIRA)
Barna Zsombor Klara created HIVE-16088:
--

 Summary: Fix hive conf property name introduced in HIVE-12767
 Key: HIVE-16088
 URL: https://issues.apache.org/jira/browse/HIVE-16088
 Project: Hive
  Issue Type: Bug
Reporter: Barna Zsombor Klara
Assignee: Barna Zsombor Klara
Priority: Trivial


The configuration property {{parquet.mr.int96.enable.utc.write.zone}} should be 
called {{hive.parquet.mr.int96.enable.utc.write.zone}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HIVE-16087) Remove multi append of log4j.configurationFile in hive script

2017-03-02 Thread Prasanth Jayachandran (JIRA)
Prasanth Jayachandran created HIVE-16087:


 Summary: Remove multi append of log4j.configurationFile in hive 
script
 Key: HIVE-16087
 URL: https://issues.apache.org/jira/browse/HIVE-16087
 Project: Hive
  Issue Type: Bug
  Components: Logging
Affects Versions: 2.2.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
Priority: Trivial


The hive script appends -Dlog4j.configurationFile twice. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)