[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Attachment: CombinedPartitioner.txt
tez-0.8.5.txt

> Enable Hive on Tez to provide globally sorted clustered table
> -------------------------------------------------------------
>
> Key: HIVE-18049
> URL: https://issues.apache.org/jira/browse/HIVE-18049
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Reporter: LingXiao Lan
> Fix For: 2.1.1
>
> Attachments: CombinedPartitioner.txt, HIVE-18049.1.patch, 
> tez-0.8.5.txt
>
>
> {code:sql}
> CREATE TABLE `test`(
>`time` int,
>`userid` bigint)
>  CLUSTERED BY (
>userid)
>  SORTED BY (
>userid ASC)
>  INTO 4 BUCKETS
>  ;
> {code}
> When data is inserted into this table, it is automatically sorted into 4 
> buckets. But because Hive uses a hash partitioner by default, the data is 
> only sorted within each bucket, not across buckets. Sometimes we need the 
> data to be globally sorted, for example to optimize indexing.
> If we sample the table first and use TotalOrderPartitioner, this can be 
> done. The difficulty is deciding automatically when to use 
> TotalOrderPartitioner and when not to, because an insertion query can be 
> complex, which results in a complex DAG in Tez.
> I have implemented a preliminary version. It uses a custom partitioner that 
> combines the hash partitioner and the total-order partitioner. A physical 
> optimizer is added to Hive to decide which partitioner to choose. However, 
> to reduce the workload, this version also modifies the Tez source code, 
> which should not really be necessary.
> I'm wondering if we can implement a more general version that addresses 
> this issue.
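The approach described above (hash partitioning for per-bucket sort, total-order partitioning over sampled cut points for a global sort) can be sketched as a small standalone class. This is a hypothetical illustration only: it is not the attached CombinedPartitioner, and it does not use the real Hadoop Partitioner API; all names and signatures here are invented for the sketch.

```java
import java.util.Arrays;

// Simplified sketch of a partitioner that either hashes keys (Hive's default
// bucketing behavior) or routes them by sampled cut points (total-order style).
// A real implementation would extend Hadoop's Partitioner and read its mode
// and cut points from the job configuration.
class CombinedPartitionerSketch {
    private final long[] cutPoints; // sorted sample quantiles; null => hash mode

    CombinedPartitionerSketch(long[] cutPoints) {
        this.cutPoints = cutPoints;
    }

    int getPartition(long key, int numPartitions) {
        if (cutPoints == null) {
            // Default behavior: hash the key into a bucket.
            return (int) ((key & Long.MAX_VALUE) % numPartitions);
        }
        // Total-order behavior: binary-search the cut points so that
        // partition i receives keys in the range (cutPoints[i-1], cutPoints[i]].
        int idx = Arrays.binarySearch(cutPoints, key);
        if (idx < 0) {
            idx = -idx - 1; // not found: use the insertion point
        }
        return Math.min(idx, numPartitions - 1);
    }
}
```

In total-order mode every reducer's key range is strictly above the previous reducer's range, so concatenating bucket files in order yields a globally sorted table; in hash mode the sketch degenerates to ordinary bucketing.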



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Attachment: (was: HIVE-18049.3.patch)



[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Attachment: (was: HIVE-18049.2.patch)



[jira] [Commented] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249162#comment-16249162
 ] 

Hive QA commented on HIVE-18049:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12897286/HIVE-18049.2.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7784/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7784/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7784/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2017-11-13 06:53:32.575
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-7784/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-11-13 06:53:32.578
+ cd apache-github-source-source
+ git fetch origin
From https://github.com/apache/hive
   67888cf..25a6f4c  master -> origin/master
+ git reset --hard HEAD
HEAD is now at 67888cf HIVE-17995 Run checkstyle on standalone-metastore module 
with proper configuration (Adam Szita via Alan Gates)
+ git clean -f -d
Removing ${project.basedir}/
Removing 
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/BaseVectorizedColumnReader.java
Removing 
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedListColumnReader.java
+ git checkout master
Already on 'master'
Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)
+ git reset --hard origin/master
HEAD is now at 25a6f4c HIVE-17615: Task.executeTask has to be thread safe for 
parallel execution (Anishek Agarwal reviewed by Daniel Dai)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-11-13 06:53:37.768
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh 
/data/hiveptest/working/scratch/build.patch
fatal: git diff header lacks filename information when removing 0 leading 
pathname components (line 41)
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12897286 - PreCommit-HIVE-Build


[jira] [Issue Comment Deleted] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Comment: was deleted

(was: Tez-0.8.4 source)


[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Attachment: HIVE-18049.3.patch



[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Status: Patch Available  (was: Open)



[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Attachment: HIVE-18049.2.patch

Tez-0.8.4 source



[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Attachment: HIVE-18049.1.patch



[jira] [Commented] (HIVE-17976) HoS: don't set output collector if there's no data to process

2017-11-12 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249135#comment-16249135
 ] 

Rui Li commented on HIVE-17976:
---

[~xuefuz], the empty row is generated in Operator.close(). RS relies on the 
output collector to emit the row; if we don't set an output collector, the 
row is simply discarded by RS.

> HoS: don't set output collector if there's no data to process
> -------------------------------------------------------------
>
> Key: HIVE-17976
> URL: https://issues.apache.org/jira/browse/HIVE-17976
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Minor
> Attachments: HIVE-17976.1.patch, HIVE-17976.2.patch
>
>
> MR doesn't set an output collector if no row is processed, i.e. 
> {{ExecMapper::map}} is never called. Let's investigate whether Spark should 
> do the same.
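The behavior discussed above (a final row produced in close() that is silently dropped when no output collector was set) can be illustrated with a minimal stand-in for the operator/collector pattern. This is a hedged sketch: GroupBySketch and its Collector interface are invented names, not Hive's actual Operator or ReduceSinkOperator API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for the operator/collector pattern: the operator keeps
// an aggregate and emits one final row on close(), but only if a collector
// was ever set.
class GroupBySketch {
    interface Collector { void collect(long row); }

    private Collector out;   // stays null when no input was ever processed
    private long count = 0;  // toy aggregate, e.g. COUNT(*)

    void setCollector(Collector c) { out = c; }

    void process(long row) { count++; }

    // An aggregation with no input still produces one row on close()
    // (e.g. COUNT(*) == 0); with no collector that row is silently discarded.
    void close() {
        if (out != null) {
            out.collect(count);
        }
    }
}
```

With a collector set, closing with zero processed rows still emits the empty-input row; with no collector, nothing is emitted, which is the discard described in the comment above.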





[jira] [Commented] (HIVE-17976) HoS: don't set output collector if there's no data to process

2017-11-12 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249133#comment-16249133
 ] 

Xuefu Zhang commented on HIVE-17976:


Patch looks good to me. +1

[~lirui] Do you know why setting the output collector at init time generates 
an empty row when there are no input rows? Is it due to Operator.close()? 



[jira] [Updated] (HIVE-17615) Task.executeTask has to be thread safe for parallel execution

2017-11-12 Thread anishek (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-17615:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Task.executeTask has to be thread safe for parallel execution
> -------------------------------------------------------------
>
> Key: HIVE-17615
> URL: https://issues.apache.org/jira/browse/HIVE-17615
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.0.0
>Reporter: anishek
>Assignee: anishek
>  Labels: pull-request-available
> Fix For: 3.0.0
>
> Attachments: HIVE-17615.0.patch
>
>
> With parallel execution enabled, we should make sure that 
> {{Task.executeTask}} is thread safe, which is not currently the case with 
> the hiveHistory object.
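The fix direction (guarding a shared, non-thread-safe object once tasks run in parallel) can be shown with a generic sketch. HistoryLog is a made-up stand-in for the hiveHistory object, not Hive's HiveHistory API.

```java
import java.util.ArrayList;
import java.util.List;

// A log shared by parallel tasks. Concurrent appends to a plain ArrayList
// can lose entries or throw; marking each accessor synchronized serializes
// the mutations so parallel tasks can share one instance safely.
class HistoryLog {
    private final List<String> events = new ArrayList<>();

    // synchronized: only one task mutates the list at a time
    synchronized void log(String event) { events.add(event); }

    synchronized int size() { return events.size(); }
}
```

The same effect can be had with an explicit lock or a concurrent collection; the point is that the shared object, not each caller, owns the synchronization.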





[jira] [Commented] (HIVE-17615) Task.executeTask has to be thread safe for parallel execution

2017-11-12 Thread anishek (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249110#comment-16249110
 ] 

anishek commented on HIVE-17615:


Patch committed to master, Thanks [~daijy] for the review.



[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Description: 

{code:sql}

{code}
CREATE TABLE `test`(
   `time` int,
   `userid` bigint)
 CLUSTERED BY (
   userid)
 SORTED BY (
   userid ASC)
 INTO 4 BUCKETS
 ;
When data is inserted into this table, it is automatically sorted into 4 
buckets. But because Hive uses a hash partitioner by default, the data is 
only sorted within each bucket, not across buckets. Sometimes we need the 
data to be globally sorted, for example to optimize indexing.

If we sample the table first and use TotalOrderPartitioner, this can be 
done. The difficulty is deciding automatically when to use 
TotalOrderPartitioner and when not to, because an insertion query can be 
complex, which results in a complex DAG in Tez.

I have implemented a preliminary version. It uses a custom partitioner that 
combines the hash partitioner and the total-order partitioner. A physical 
optimizer is added to Hive to decide which partitioner to choose. However, 
to reduce the workload, this version also modifies the Tez source code, 
which should not really be necessary.

I'm wondering if we can implement a more general version that addresses this 
issue.





[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Description: 
{code:sql}
CREATE TABLE `test`(
   `time` int,
   `userid` bigint)
 CLUSTERED BY (
   userid)
 SORTED BY (
   userid ASC)
 INTO 4 BUCKETS
 ;
{code}

When data is inserted into this table, it is automatically sorted into 4 
buckets. But because Hive uses a hash partitioner by default, the data is 
only sorted within each bucket, not across buckets. Sometimes we need the 
data to be globally sorted, for example to optimize indexing.

If we sample the table first and use TotalOrderPartitioner, this can be 
done. The difficulty is deciding automatically when to use 
TotalOrderPartitioner and when not to, because an insertion query can be 
complex, which results in a complex DAG in Tez.

I have implemented a preliminary version. It uses a custom partitioner that 
combines the hash partitioner and the total-order partitioner. A physical 
optimizer is added to Hive to decide which partitioner to choose. However, 
to reduce the workload, this version also modifies the Tez source code, 
which should not really be necessary.

I'm wondering if we can implement a more general version that addresses this 
issue.





[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Status: Open  (was: Patch Available)

> Enable Hive on Tez to provide globally sorted clustered table
> -
>
> Key: HIVE-18049
> URL: https://issues.apache.org/jira/browse/HIVE-18049
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Reporter: LingXiao Lan
> Fix For: 2.1.1
>
>
> CREATE TABLE `test`(
>`time` int,
>`userid` bigint)
>  CLUSTERED BY (
>userid)
>  SORTED BY (
>userid ASC)
>  INTO 4 BUCKETS
>  ;
> When data is inserted into this table, it is automatically sorted into 4 
> buckets. But because Hive uses a hash partitioner by default, the data is 
> only sorted within each bucket, not across buckets. Sometimes we need the 
> data to be globally sorted, for example to optimize indexing.
> If we sample the table first and use TotalOrderPartitioner, this can be 
> achieved. The difficulty is deciding automatically when to use 
> TotalOrderPartitioner and when not to, because an insert query can be 
> complex, resulting in a complex DAG in Tez.
> I have implemented a temporary version. It uses a custom partitioner that 
> combines a hash partitioner with TotalOrderPartitioner, and a physical 
> optimizer added to Hive decides which partitioner to use. However, to 
> reduce the workload, this version modifies Tez source code, which should 
> not actually be necessary.
> I'm wondering whether we can implement a more general version that 
> addresses this issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18049) Enable Hive on Tez to provide globally sorted clustered table

2017-11-12 Thread LingXiao Lan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LingXiao Lan updated HIVE-18049:

Status: Patch Available  (was: Open)

> Enable Hive on Tez to provide globally sorted clustered table
> -
>
> Key: HIVE-18049
> URL: https://issues.apache.org/jira/browse/HIVE-18049
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Reporter: LingXiao Lan
> Fix For: 2.1.1
>
>
> CREATE TABLE `test`(
>`time` int,
>`userid` bigint)
>  CLUSTERED BY (
>userid)
>  SORTED BY (
>userid ASC)
>  INTO 4 BUCKETS
>  ;
> When data is inserted into this table, it is automatically sorted into 4 
> buckets. But because Hive uses a hash partitioner by default, the data is 
> only sorted within each bucket, not across buckets. Sometimes we need the 
> data to be globally sorted, for example to optimize indexing.
> If we sample the table first and use TotalOrderPartitioner, this can be 
> achieved. The difficulty is deciding automatically when to use 
> TotalOrderPartitioner and when not to, because an insert query can be 
> complex, resulting in a complex DAG in Tez.
> I have implemented a temporary version. It uses a custom partitioner that 
> combines a hash partitioner with TotalOrderPartitioner, and a physical 
> optimizer added to Hive decides which partitioner to use. However, to 
> reduce the workload, this version modifies Tez source code, which should 
> not actually be necessary.
> I'm wondering whether we can implement a more general version that 
> addresses this issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17931) Implement Parquet vectorization reader for Array type

2017-11-12 Thread Colin Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249050#comment-16249050
 ] 

Colin Ma commented on HIVE-17931:
-

[~Ferd], the failed tests are not related to the patch; do you have further 
comments? I'll add the qtests after HIVE-18043 is resolved.

> Implement Parquet vectorization reader for Array type
> -
>
> Key: HIVE-17931
> URL: https://issues.apache.org/jira/browse/HIVE-17931
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Colin Ma
>Assignee: Colin Ma
> Attachments: HIVE-17931.001.patch, HIVE-17931.002.patch, 
> HIVE-17931.003.patch, HIVE-17931.004.patch
>
>
> The Parquet vectorized reader can't support the Array type; it should be 
> supported to improve performance for queries involving Array columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17931) Implement Parquet vectorization reader for Array type

2017-11-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249049#comment-16249049
 ] 

Hive QA commented on HIVE-17931:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12896995/HIVE-17931.004.patch

{color:green}SUCCESS:{color} +1 due to 3 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 12 failed/errored test(s), 11376 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] 
(batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_ppd_decimal] 
(batchId=9)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] 
(batchId=146)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=156)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[ct_noperm_loc]
 (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] 
(batchId=111)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut 
(batchId=206)
org.apache.hadoop.hive.ql.exec.tez.TestWorkloadManager.testApplyPlanQpChanges 
(batchId=281)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=223)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testHttpRetryOnServerIdleTimeout 
(batchId=233)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7783/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7783/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7783/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 12 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12896995 - PreCommit-HIVE-Build

> Implement Parquet vectorization reader for Array type
> -
>
> Key: HIVE-17931
> URL: https://issues.apache.org/jira/browse/HIVE-17931
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Colin Ma
>Assignee: Colin Ma
> Attachments: HIVE-17931.001.patch, HIVE-17931.002.patch, 
> HIVE-17931.003.patch, HIVE-17931.004.patch
>
>
> The Parquet vectorized reader can't support the Array type; it should be 
> supported to improve performance for queries involving Array columns.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18042) Correlation Optimizer lead to NPE when there's multi subquery(select distinct) union all operation after join

2017-11-12 Thread Hengyu Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hengyu Dai updated HIVE-18042:
--
Summary: Correlation Optimizer lead to NPE when there's multi 
subquery(select distinct) union all operation after join   (was: Correlation 
Optimizer lead to NPE when there is multi union all operation after join )

> Correlation Optimizer lead to NPE when there's multi subquery(select 
> distinct) union all operation after join 
> --
>
> Key: HIVE-18042
> URL: https://issues.apache.org/jira/browse/HIVE-18042
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 2.1.1
> Environment: 
>Reporter: Hengyu Dai
>
> test sql:
> {code:sql}
> SELECT DISTINCT a.logday AS push_day, a.mtype, a.t, If(b.msgid IS NULL, 'no', 
> 'yes') AS isnotdaoda, a.platform
> , a.uid, a.dt
> FROM (SELECT DISTINCT If(tokentype = '7', msgid, If(tokentype = '6', 
> regexp_extract(sendpushresult, 'msgId":"([^"]+)', 1), 
> regexp_extract(sendpushresult, 'msgId=(.+?),', 1))) AS msgid, logday, If(vid 
> LIKE '60%', 'adr', If(vid LIKE '8%', 'ios', 'other')) AS platform, mtype, t
> , If(vid LIKE '8%', uid, gid) AS uid, concat(substr(logday, 1, 4), 
> '-', substr(logday, 5, 2), '-', substr(logday, 7, 2)) AS dt
> FROM wirelessdata.orig_push_client
> ) a
> LEFT JOIN (SELECT DISTINCT msgid
> FROM (
> SELECT DISTINCT msgid
> FROM wirelessdata.orig_push_return
> UNION ALL
> SELECT DISTINCT msgid
> FROM wirelessdata.orig_push_return_xiaomi
> UNION ALL
> SELECT DISTINCT regexp_extract(action, '"id":"([^"]+)', 1) AS 
> msgid
> FROM wirelessdata.ods_client_behavior_hour4spark
> ) bb
> ) b ON lower(a.msgid) = lower(b.msgid)
> {code}
> the error stack
> {code:java}
> 2017-11-10T16:01:21,123 ERROR [9b7d82f5-dfc8-43ac-8d6f-a019d8677392 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.optimizer.GenMapRedUtils.setUnionPlan(GenMapRedUtils.java:230)
>   at 
> org.apache.hadoop.hive.ql.optimizer.GenMapRedUtils.joinUnionPlan(GenMapRedUtils.java:287)
>   at 
> org.apache.hadoop.hive.ql.optimizer.GenMRRedSink3.process(GenMRRedSink3.java:100)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:54)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.parse.GenMapRedWalker.walk(GenMapRedWalker.java:65)
>   at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
>   at 
> org.apache.hadoop.hive.ql.parse.MapReduceCompiler.generateTaskTree(MapReduceCompiler.java:323)
>   at 
> org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:267)
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11008)
>   at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10547)
>   at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:250)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:483)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1254)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1396)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1181)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1170)
>   at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:229)
>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:180)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:396)
>   at 
> 

[jira] [Assigned] (HIVE-18048) Add qtests for Struct type with vectorization

2017-11-12 Thread Colin Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Ma reassigned HIVE-18048:
---


> Add qtests for Struct type with vectorization
> -
>
> Key: HIVE-18048
> URL: https://issues.apache.org/jira/browse/HIVE-18048
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Colin Ma
>Assignee: Colin Ma
>
> Struct type is supported in vectorization, but there are no qtests covering 
> this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16406) Remove unwanted interning when creating PartitionDesc

2017-11-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249010#comment-16249010
 ] 

Hive QA commented on HIVE-16406:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12896974/HIVE-16406.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 12 failed/errored test(s), 11374 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] 
(batchId=77)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] 
(batchId=146)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[ppd_union_view]
 (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=156)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=102)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[ct_noperm_loc]
 (batchId=94)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] 
(batchId=111)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut 
(batchId=206)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=223)
org.apache.hive.hcatalog.templeton.TestConcurrentJobRequestsThreadsAndTimeout.ConcurrentListJobsVerifyExceptions
 (batchId=185)
org.apache.hive.jdbc.TestJdbcWithLocalClusterSpark.testTempTable (batchId=233)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/7782/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/7782/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-7782/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 12 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12896974 - PreCommit-HIVE-Build

> Remove unwanted interning when creating PartitionDesc
> -
>
> Key: HIVE-16406
> URL: https://issues.apache.org/jira/browse/HIVE-16406
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Attachments: HIVE-16406.1.patch, HIVE-16406.2.patch, 
> HIVE-16406.3.patch, HIVE-16406.profiler.png
>
>
> {{PartitionDesc::getTableDesc}} interns all table description properties by 
> default. But the table description properties are already interned and need 
> not be interned again. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17714) move custom SerDe schema considerations into metastore from QL

2017-11-12 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248970#comment-16248970
 ] 

Vihang Karajgaonkar commented on HIVE-17714:


I looked into this a bit more and followed the history of the SerDe changes 
related to it. Initially, I thought of moving the Serializer, Deserializer, 
and AbstractSerDe classes to storage-api. This turned out to be pretty 
straightforward, with no backward-compatibility implications since the package 
names of the moved classes remain the same.

However, this may not solve the problem entirely, because it still means that 
the standalone metastore JVM will need these jars on its classpath to 
instantiate a Deserializer and get the schema from it at runtime. SerDe 
implementations are spread all over the code base, and I am afraid that 
pulling in one jar will pull in the rest of the world in terms of 
dependencies. This is probably not an issue in the embedded mode of the 
metastore, since the metastore resides in the HS2 process and has access to 
all the Hive jars anyway, but for a remote standalone metastore it doesn't 
make sense to add all these jars to the classpath at runtime.

I was also a bit confused by this [line of code here | 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L980
 ] in {{Table.java}}, which says that any SerDe that is a subclass of 
AbstractSerDe should store the fields information in the metastore, while 
{{AbstractSerDe}} itself returns {{false}} in {{shouldStoreFieldsInMetastore}} 
— which is contradictory.

Based on what I have seen so far, there is no easy way for this and 
HIVE-17580 to be solved consistently for all the use cases without breaking 
backward compatibility. I propose we make the following changes:

1. Change {{AbstractSerDe:shouldStoreFieldsInMetastore}} to return {{true}}. 
It already behaves as if it were true, based on what we see in Table.java 
above, so we claim that all SerDe implementations extending AbstractSerDe 
store their schema in the metastore unless they explicitly override the 
method to return false. This should cover all the SerDes in the Hive source 
code, since HIVE-15167 moved them to subclass AbstractSerDe instead of 
implementing the interfaces directly.
2. Move the Serializer, Deserializer, and AbstractSerDe classes to 
storage-api.
This enables the metastore to consume them without a compile-time dependency 
on Hive.
3. Declare that users who implement the Serializer/Deserializer interfaces 
directly and still want the metastore to store a schema for them must make 
sure their jar can be added to the classpath of the standalone metastore; the 
metastore will then use the existing mechanism to load the SerDe class and 
deserialize with it.
4. Add a check in {{HiveMetaStoreUtils.getFieldsFromDeserializer}} that 
throws an exception, before trying to use the deserializer to get the schema, 
if the implementation of {{shouldStoreFieldsInMetastore}} returns false. I 
don't think the metastore can ever be 100% sure when a SerDe declares that 
fields are not supposed to be stored in the metastore.
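Points 1 and 4 can be sketched in plain Java. Everything below is a hypothetical, self-contained stand-in — the class and method names mirror but are not the real `hive-serde`/`hive-metastore` classes, and the schema strings are placeholders:

```java
// Point 1: the base class defaults to storing fields in the metastore,
// matching the behavior Table.java already assumes for its subclasses.
abstract class AbstractSerDeSketch {
    public boolean shouldStoreFieldsInMetastore() { return true; }
    public abstract String deserializeSchema();
}

// A SerDe that manages its own external schema explicitly opts out.
class AvroLikeSerDe extends AbstractSerDeSketch {
    @Override public boolean shouldStoreFieldsInMetastore() { return false; }
    @Override public String deserializeSchema() { return "schema-from-avro-url"; }
}

// An ordinary SerDe inherits the new default and needs no override.
class SimpleSerDe extends AbstractSerDeSketch {
    @Override public String deserializeSchema() { return "schema-via-deserializer"; }
}

final class MetaStoreUtilsSketch {
    // Point 4, as proposed above: fail fast instead of silently consulting
    // the deserializer when shouldStoreFieldsInMetastore() returns false.
    static String getFieldsFromDeserializer(AbstractSerDeSketch serde) {
        if (!serde.shouldStoreFieldsInMetastore()) {
            throw new IllegalStateException(
                "SerDe declares fields are not stored in the metastore");
        }
        return serde.deserializeSchema();
    }
}
```

The design point is that the opt-out becomes an explicit, queryable contract on the SerDe itself, rather than behavior inferred in QL.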

[~sershe] and [~alangates] What do you guys think about these suggestions?

> move custom SerDe schema considerations into metastore from QL
> --
>
> Key: HIVE-17714
> URL: https://issues.apache.org/jira/browse/HIVE-17714
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Alan Gates
>
> Columns in metastore for tables that use external schema don't have the type 
> information (since HIVE-11985) and may be entirely inconsistent (since 
> forever, due to issues like HIVE-17713; or for SerDes that allow a URL for 
> the schema, due to a change in the underlying file).
> Currently, if you trace the usage of ConfVars.SERDESUSINGMETASTOREFORSCHEMA 
> down to MetaStoreUtils.getFieldsFromDeserializer, you'd see that the code in 
> QL handles this in Hive. So, for the most part, the metastore just returns 
> whatever is stored for the columns in the database.
> One exception appears to be get_fields_with_environment_context, which is 
> interesting... so getTable will return incorrect columns (potentially), but 
> get_fields/get_schema will return correct ones from SerDe as far as I can 
> tell.
> As part of separating the metastore, we should make sure all the APIs return 
> the correct schema for the columns; it's not a good idea to have everyone 
> reimplement getFieldsFromDeserializer.
> Note: this should also remove a flag introduced in HIVE-17731



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-18047) Support dynamic service discovery for HiveMetaStore

2017-11-12 Thread Bing Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bing Li reassigned HIVE-18047:
--


> Support dynamic service discovery for HiveMetaStore
> ---
>
> Key: HIVE-18047
> URL: https://issues.apache.org/jira/browse/HIVE-18047
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Reporter: Bing Li
>Assignee: Bing Li
>
> Similar to what Hive does for HiveServer2 (HIVE-7935), a HiveMetaStore 
> client can dynamically resolve a HiveMetaStore service to connect to via 
> ZooKeeper.
> *High Level Design:*
> Whether dynamic service discovery is supported can be configured by setting 
> HIVE_METASTORE_SUPPORT_DYNAMIC_SERVICE_DISCOVERY.
> * This property should ONLY take effect when the HiveMetaStore service is in 
> remote mode.
> * When an instance of HiveMetaStore comes up, it adds itself as a znode to 
> Zookeeper under a configurable namespace (HIVE_METASTORE_ZOOKEEPER_NAMESPACE, 
> e.g. hivemetastore).
> * A thrift client specifies the ZooKeeper ensemble in its connection string, 
> instead of pointing to a specific HiveMetaStore instance. The ZooKeeper 
> ensemble will pick an instance of HiveMetaStore to connect for the session.
> * When an instance is removed from ZooKeeper, the existing client sessions 
> continue till completion. When the last client session completes, the 
> instance shuts down.
> * All new client connections pick one of the available HiveMetaStore URIs 
> from ZooKeeper.
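The lifecycle in the quoted design can be simulated without a ZooKeeper ensemble. The sketch below is a hypothetical in-memory stand-in for the registration namespace — a real implementation would create an ephemeral znode under the configured namespace and let ZooKeeper remove it on session loss; none of these names come from the actual patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical in-memory stand-in for the ZooKeeper namespace: instances
// register on startup, deregister on removal, and each new client
// connection picks one of the currently registered URIs.
class MetastoreRegistrySketch {
    private final List<String> uris = new ArrayList<>();
    private final Random rnd;

    MetastoreRegistrySketch(long seed) { this.rnd = new Random(seed); }

    // Instance startup: add itself under the namespace (znode creation).
    void register(String uri) { uris.add(uri); }

    // Instance removal: the znode disappears. Existing sessions are
    // unaffected here because each session captured its URI at connect time.
    void deregister(String uri) { uris.remove(uri); }

    // New client connection: pick one of the available instances.
    String connect() {
        if (uris.isEmpty()) {
            throw new IllegalStateException("no metastore instances registered");
        }
        return uris.get(rnd.nextInt(uris.size()));
    }
}
```

Using ephemeral znodes for registration is what makes the "existing sessions drain, no new connections arrive" shutdown behavior possible: the znode vanishes the moment the instance's ZooKeeper session ends, so new clients simply stop seeing it.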



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-15212) merge branch into master

2017-11-12 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248816#comment-16248816
 ] 

Lefty Leverenz commented on HIVE-15212:
---

Okay, thanks Sergey.

So far I've only found one configuration parameter added to master by this 
merge (*hive.mm.avoid.s3.globstatus* in HIVE-14953) but there may be a few more.

> merge branch into master
> 
>
> Key: HIVE-15212
> URL: https://issues.apache.org/jira/browse/HIVE-15212
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Fix For: 3.0.0
>
> Attachments: HIVE-15212.01.patch, HIVE-15212.02.patch, 
> HIVE-15212.03.patch, HIVE-15212.04.patch, HIVE-15212.05.patch, 
> HIVE-15212.06.patch, HIVE-15212.07.patch, HIVE-15212.08.patch, 
> HIVE-15212.09.patch, HIVE-15212.10.patch, HIVE-15212.11.patch, 
> HIVE-15212.12.patch, HIVE-15212.12.patch, HIVE-15212.13.patch, 
> HIVE-15212.13.patch, HIVE-15212.14.patch, HIVE-15212.15.patch, 
> HIVE-15212.16.patch, HIVE-15212.17.patch, HIVE-15212.18.patch, 
> HIVE-15212.19.patch, HIVE-15212.20.patch, HIVE-15212.21.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)