[jira] [Commented] (HIVE-15848) count or sum distinct incorrect when hive.optimize.reducededuplication set to true

2017-02-23 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882170#comment-15882170
 ] 

Zoltan Haindrich commented on HIVE-15848:
-

This bug is present on the current master branch.

> count or sum distinct incorrect when hive.optimize.reducededuplication set to 
> true
> --
>
> Key: HIVE-15848
> URL: https://issues.apache.org/jira/browse/HIVE-15848
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Biao Wu
>Priority: Critical
>
> Test Table:
> {code:sql}
> create table count_distinct_test(id int,key int,name int);
> {code}
> Data:
> ||id||key||name||
> |1|1|2|
> |1|2|3|
> |1|3|2|
> |1|4|2|
> |1|5|3|
> Test SQL1:
> {code:sql}
> select id,count(Distinct key),count(Distinct name)
> from (select id,key,name from count_distinct_test group by id,key,name)m
> group by id;
> {code}
> result:
> |1|5|4|
> expect:
> |1|5|2|
> Test SQL2:
> {code:sql}
> select id,count(Distinct name),count(Distinct key)
> from (select id,key,name from count_distinct_test group by id,name,key)m
> group by id;
> {code}
> result:
> |1|2|5|
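A possible session-level workaround (inferred from the issue title, untested here) is to disable the optimization before running the query:

{code:sql}
set hive.optimize.reducededuplication=false;

select id, count(distinct key), count(distinct name)
from (select id, key, name from count_distinct_test group by id, key, name) m
group by id;
{code}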





[jira] [Assigned] (HIVE-15848) count or sum distinct incorrect when hive.optimize.reducededuplication set to true

2017-02-23 Thread Zoltan Haindrich (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Haindrich reassigned HIVE-15848:
---

Assignee: Zoltan Haindrich

> count or sum distinct incorrect when hive.optimize.reducededuplication set to 
> true
> --
>
> Key: HIVE-15848
> URL: https://issues.apache.org/jira/browse/HIVE-15848
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Biao Wu
>Assignee: Zoltan Haindrich
>Priority: Critical
>
> Test Table:
> {code:sql}
> create table count_distinct_test(id int,key int,name int);
> {code}
> Data:
> ||id||key||name||
> |1|1|2|
> |1|2|3|
> |1|3|2|
> |1|4|2|
> |1|5|3|
> Test SQL1:
> {code:sql}
> select id,count(Distinct key),count(Distinct name)
> from (select id,key,name from count_distinct_test group by id,key,name)m
> group by id;
> {code}
> result:
> |1|5|4|
> expect:
> |1|5|2|
> Test SQL2:
> {code:sql}
> select id,count(Distinct name),count(Distinct key)
> from (select id,key,name from count_distinct_test group by id,name,key)m
> group by id;
> {code}
> result:
> |1|2|5|





[jira] [Commented] (HIVE-15882) HS2 generating high memory pressure with many partitions and concurrent queries

2017-02-23 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882157#comment-15882157
 ] 

Rui Li commented on HIVE-15882:
---

The patch looks good to me overall. I left some minor comments on the RB.

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> ---
>
> Key: HIVE-15882
> URL: https://issues.apache.org/jira/browse/HIVE-15882
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
> Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:1 -n admin -p 
> admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in an HS2 
> server with -Xmx200m, and with 50 queries in one with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be greatly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or have just 1 element. By modifying the code that creates and 
> populates these collections, we may likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. These 
> objects appear to be highly duplicated, since for each Partition each 
> concurrently running query creates its own copy of Partition, PartitionDesc 
> and Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
> Properties in memory. By interning/deduplicating these objects we may be able 
> to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.
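A hypothetical illustration of point 3 (invented class and method names, not the attached patch): canonicalize equal Properties objects so the ~100,000 per-query copies collapse to one shared instance per distinct content. Note this is only safe if callers treat the shared instance as read-only.

{code:java}
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Canonicalizing cache: equal Properties map to one shared instance. */
final class PropertiesInterner {
  private final ConcurrentHashMap<Properties, Properties> cache = new ConcurrentHashMap<>();

  /** Returns a canonical instance equal to {@code p}; callers must not mutate it. */
  Properties intern(Properties p) {
    if (p == null) {
      return null;
    }
    // Properties inherits equals()/hashCode() from Hashtable, so equal
    // key/value sets collapse to whichever instance was cached first.
    Properties existing = cache.putIfAbsent(p, p);
    return existing != null ? existing : p;
  }
}
{code}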





[jira] [Commented] (HIVE-6590) Hive does not work properly with boolean partition columns (wrong results and inserts to incorrect HDFS path)

2017-02-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882155#comment-15882155
 ] 

Ashutosh Chauhan commented on HIVE-6590:


I looked at the spec. Section 6.13 clearly says only strings are allowed to be 
cast into boolean. All other types (which Hive supports), if attempted to be 
cast to boolean, need to raise an error. Further, item 20) in that section 
states that only valid boolean literals are allowed in strings, which implies 
the string "true" should parse as boolean true and the string "false" as 
boolean false. All other strings need to raise an error.

So, if we want to be spec compliant, this behavior needs to change. Perhaps in 
a major version: 3.0?

> Hive does not work properly with boolean partition columns (wrong results and 
> inserts to incorrect HDFS path)
> -
>
> Key: HIVE-6590
> URL: https://issues.apache.org/jira/browse/HIVE-6590
> Project: Hive
>  Issue Type: Bug
>  Components: Database/Schema, Metastore
>Affects Versions: 0.10.0
>Reporter: Lenni Kuff
>Assignee: Zoltan Haindrich
> Attachments: HIVE-6590.1.patch, HIVE-6590.2.patch, HIVE-6590.3.patch
>
>
> Hive does not work properly with boolean partition columns. Queries return 
> wrong results and also insert to incorrect HDFS paths.
> {code}
> create table bool_part(int_col int) partitioned by(bool_col boolean);
> # This works, creating 3 unique partitions!
> ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE);
> ALTER TABLE bool_table ADD PARTITION (bool_col=false);
> ALTER TABLE bool_table ADD PARTITION (bool_col=False);
> {code}
> The first problem is that Hive cannot filter on a bool partition key column. 
> "select * from bool_part" returns the correct results, but if you apply a 
> filter on the bool partition key column hive won't return any results.
> The second problem is that Hive seems to just call "toString()" on the 
> boolean literal value. This means you can end up with multiple partition 
> directories (FALSE, false, FaLSE, etc.) for the same logical value. For 
> example, you can add three partitions in Hive for the same logical value 
> "false":
> ALTER TABLE bool_table ADD PARTITION (bool_col=FALSE) -> 
> /test-warehouse/bool_table/bool_col=FALSE/
> ALTER TABLE bool_table ADD PARTITION (bool_col=false) -> 
> /test-warehouse/bool_table/bool_col=false/
> ALTER TABLE bool_table ADD PARTITION (bool_col=False) -> 
> /test-warehouse/bool_table/bool_col=False/
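A minimal sketch of one possible fix (assumed method name, not a committed patch): normalize the boolean literal once, before it is used for partition paths or pruning, and reject non-boolean literals per the spec discussion above.

{code:java}
/** Canonicalizes a BOOLEAN partition literal; FALSE/false/FaLSE all map to "false". */
static String canonicalBooleanPartitionValue(String literal) {
  if ("true".equalsIgnoreCase(literal)) {
    return "true";
  }
  if ("false".equalsIgnoreCase(literal)) {
    return "false";
  }
  // SQL spec section 6.13: only valid boolean literals may be cast to BOOLEAN.
  throw new IllegalArgumentException("Not a valid BOOLEAN literal: " + literal);
}
{code}

With this in place, all three ALTER statements above would resolve to the single path /test-warehouse/bool_table/bool_col=false/.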





[jira] [Commented] (HIVE-14731) Use Tez cartesian product edge in Hive (unpartitioned case only)

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882138#comment-15882138
 ] 

Hive QA commented on HIVE-14731:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854292/HIVE-14731.14.patch

{color:green}SUCCESS:{color} +1 due to 4 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 10 failed/errored test(s), 10231 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=143)

[acid_vectorization_missing_cols.q,orc_merge9.q,vector_acid3.q,delete_where_no_match.q,vector_reduce1.q,vector_join_nulls.q,stats_only_null.q,vectorization_part_project.q,vectorization_6.q,count.q,tez_vector_dynpart_hashjoin_2.q,parallel.q,delete_all_non_partitioned.q,delete_all_partitioned.q,vectorization_10.q,insert1.q,custom_input_output_format.q,vectorized_bucketmapjoin1.q,cbo_rp_windowing_2.q,vector_reduce3.q,cte_mat_3.q,smb_cache.q,hybridgrace_hashjoin_1.q,vector_count_distinct.q,vector_decimal_round_2.q,hybridgrace_hashjoin_2.q,parquet_predicate_pushdown.q,vector_varchar_mapjoin1.q,quotedid_smb.q,vector_bucket.q]
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[leftsemijoin]
 (batchId=147)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_exists]
 (batchId=146)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in]
 (batchId=150)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_multi]
 (batchId=142)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_null_agg]
 (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar]
 (batchId=146)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel 
(batchId=211)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3744/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3744/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3744/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 10 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854292 - PreCommit-HIVE-Build

> Use Tez cartesian product edge in Hive (unpartitioned case only)
> 
>
> Key: HIVE-14731
> URL: https://issues.apache.org/jira/browse/HIVE-14731
> Project: Hive
>  Issue Type: Bug
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: HIVE-14731.10.patch, HIVE-14731.11.patch, 
> HIVE-14731.12.patch, HIVE-14731.13.patch, HIVE-14731.14.patch, 
> HIVE-14731.1.patch, HIVE-14731.2.patch, HIVE-14731.3.patch, 
> HIVE-14731.4.patch, HIVE-14731.5.patch, HIVE-14731.6.patch, 
> HIVE-14731.7.patch, HIVE-14731.8.patch, HIVE-14731.9.patch
>
>
> Given that the cartesian product edge is available in Tez now (see TEZ-3230), 
> let's integrate it into Hive on Tez. This allows us to have more than one 
> reducer in cross-product queries.





[jira] [Updated] (HIVE-15708) Upgrade calcite version to 1.12

2017-02-23 Thread Remus Rusanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Remus Rusanu updated HIVE-15708:

Attachment: HIVE-15708.10.patch

10.patch consumes the CALCITE-1653 {{RexExecutor}} changes.

> Upgrade calcite version to 1.12
> ---
>
> Key: HIVE-15708
> URL: https://issues.apache.org/jira/browse/HIVE-15708
> Project: Hive
>  Issue Type: Task
>  Components: CBO, Logical Optimizer
>Affects Versions: 2.2.0
>Reporter: Ashutosh Chauhan
>Assignee: Remus Rusanu
> Attachments: HIVE-15708.01.patch, HIVE-15708.02.patch, 
> HIVE-15708.03.patch, HIVE-15708.04.patch, HIVE-15708.05.patch, 
> HIVE-15708.06.patch, HIVE-15708.07.patch, HIVE-15708.08.patch, 
> HIVE-15708.09.patch, HIVE-15708.10.patch
>
>
> Currently we are on 1.10; need to upgrade the Calcite version to 1.12.





[jira] [Commented] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882053#comment-15882053
 ] 

Hive QA commented on HIVE-16014:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854275/HIVE-16014.02.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10258 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=223)
org.apache.hadoop.hive.thrift.TestHadoopAuthBridge23.testSaslWithHiveMetaStore 
(batchId=220)
org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel 
(batchId=211)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3743/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3743/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3743/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854275 - PreCommit-HIVE-Build

> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of 
> hive.mv.files.thread for pool size
> --
>
> Key: HIVE-16014
> URL: https://issues.apache.org/jira/browse/HIVE-16014
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-16014.01.patch, HIVE-16014.02.patch
>
>
> HiveMetastoreChecker uses the hive.mv.files.thread configuration value for 
> determining the pool size as below:
> {noformat}
> private void checkPartitionDirs(Path basePath, Set<Path> allDirs,
>     int maxDepth) throws IOException, HiveException {
>   ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
>   basePaths.add(basePath);
>   Set<Path> dirSet = Collections.newSetFromMap(new ConcurrentHashMap<Path, Boolean>());
>   // Here we just reuse the THREAD_COUNT configuration for
>   // HIVE_MOVE_FILES_THREAD_COUNT
>   int poolSize = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 15);
>   // Check if too low config is provided for move files. 2x CPU is reasonable max count.
>   poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
>       Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add the missing partitions for a table from the 
> filesystem. In such a case, different pool sizes for HMSHandler and 
> HiveMetastoreChecker can affect the performance. E.g., if 
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
> {{hive.mv.files.thread}} is much higher like 100, or vice versa, the smaller 
> pool will become the bottleneck. It would be good to use 
> {{hive.metastore.fshandler.threads}} to size the pool for 
> HiveMetastoreChecker, since the number of missing partitions and the number 
> of partitions to be added will most likely be the same. In such a case the 
> performance of the query will be optimal when both pool sizes are the same. 
> Since it is possible to tune both configs individually, it is very likely 
> that they will differ. But since there is a strong correlation between the 
> amount of work done by HiveMetastoreChecker and the 
> HiveMetastore.add_partitions call, it might be a good idea to use 
> {{hive.metastore.fshandler.threads}} for the pool size instead of 
> {{hive.mv.files.thread}}
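A minimal sketch of the proposed change (a fragment, not the attached patch; it reuses conf from the snippet above and reads the config by its literal key): size the checker's pool from hive.metastore.fshandler.threads so it matches the metastore handler pool.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

int poolSize = conf.getInt("hive.metastore.fshandler.threads", 15);
ExecutorService pool = poolSize > 0
    ? Executors.newFixedThreadPool(poolSize)  // matches add_partitions parallelism
    : null;                                   // 0 falls back to single-threaded listing
{code}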





[jira] [Commented] (HIVE-15958) LLAP: IPC connections are not being reused for umbilical protocol

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881998#comment-15881998
 ] 

Hive QA commented on HIVE-15958:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854319/HIVE-15958.4.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3742/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3742/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3742/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2017-02-24 05:30:55.487
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-3742/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-02-24 05:30:55.490
+ cd apache-github-source-source
+ git fetch origin
From https://github.com/apache/hive
   4f18acd..338a7ee  master -> origin/master
+ git reset --hard HEAD
HEAD is now at 4f18acd HIVE-15668 : change REPL DUMP syntax to use "LIMIT" 
instead of "BATCH" keyword
+ git clean -f -d
+ git checkout master
Already on 'master'
Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)
+ git reset --hard origin/master
HEAD is now at 338a7ee HIVE-15993 : Hive REPL STATUS is not returning last 
event ID (Sankar Hariappan, reviewed by Sushanth Sowmyan)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-02-24 05:30:58.390
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh 
/data/hiveptest/working/scratch/build.patch
error: 
a/llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/AMReporter.java: 
No such file or directory
error: 
a/llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/ContainerRunnerImpl.java:
 No such file or directory
error: 
a/llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/QueryTracker.java:
 No such file or directory
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854319 - PreCommit-HIVE-Build

> LLAP: IPC connections are not being reused for umbilical protocol
> -
>
> Key: HIVE-15958
> URL: https://issues.apache.org/jira/browse/HIVE-15958
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Rajesh Balamohan
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15958.1.patch, HIVE-15958.2.patch, 
> HIVE-15958.3.patch, HIVE-15958.4.patch
>
>
> During concurrency testing, we observed thousands of IPC thread creations. 
> Ideally, connections to the same hosts should be reused.





[jira] [Commented] (HIVE-15879) Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881994#comment-15881994
 ] 

Hive QA commented on HIVE-15879:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854272/HIVE-15879.01.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 12 failed/errored test(s), 10194 tests 
executed
*Failed tests:*
{noformat}
TestContext - did not produce a TEST-*.xml file (likely timed out) (batchId=258)
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
TestHiveCopyFiles - did not produce a TEST-*.xml file (likely timed out) 
(batchId=258)
TestHiveCredentialProviders - did not produce a TEST-*.xml file (likely timed 
out) (batchId=258)
TestHiveMetaStoreChecker - did not produce a TEST-*.xml file (likely timed out) 
(batchId=258)
TestLog4j2Appenders - did not produce a TEST-*.xml file (likely timed out) 
(batchId=258)
TestOperators - did not produce a TEST-*.xml file (likely timed out) 
(batchId=258)
TestTableIterable - did not produce a TEST-*.xml file (likely timed out) 
(batchId=258)
TestTxnCommands2 - did not produce a TEST-*.xml file (likely timed out) 
(batchId=258)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=223)
org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel 
(batchId=211)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3741/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3741/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3741/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 12 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854272 - PreCommit-HIVE-Build

> Fix HiveMetaStoreChecker.checkPartitionDirs method
> --
>
> Key: HIVE-15879
> URL: https://issues.apache.org/jira/browse/HIVE-15879
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-15879.01.patch
>
>
> HIVE-15803 fixes the msck hang issue in 
> HiveMetaStoreChecker.checkPartitionDirs method by adding a check to see if 
> the Threadpool has any spare threads. If not it uses single threaded listing 
> of the files.
> {noformat}
> if (pool != null) {
>   synchronized (pool) {
> // In case of recursive calls, it is possible to deadlock with TP. 
> Check TP usage here.
> if (pool.getActiveCount() < pool.getMaximumPoolSize()) {
>   useThreadPool = true;
> }
> if (!useThreadPool) {
>   if (LOG.isDebugEnabled()) {
> LOG.debug("Not using threadPool as active count:" + 
> pool.getActiveCount()
> + ", max:" + pool.getMaximumPoolSize());
>   }
> }
>   }
> }
> {noformat}
> Based on the javadoc of getActiveCount() below,
> bq. Returns the approximate number of threads that are actively executing 
> tasks.
> it returns only an approximate number of threads, and it cannot be guaranteed 
> that it always returns the exact number of active threads. This still exposes 
> the method implementation to the msck hang bug in rare corner cases.
> We could either:
> 1. Use an atomic counter to track exactly how many threads are actively running
> 2. Relook at the method itself to make it much simpler. E.g., look into 
> the possibility of changing the recursive implementation to an iterative 
> implementation where worker threads pick tasks from a queue until the queue 
> is empty.
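One way to realize option 2, as a fragment reusing pool, fs, basePath and allDirs from the snippet above (listSubDirs() is hypothetical): walk the tree level by level so only the coordinator blocks on futures; pool workers never wait on tasks they submitted, so the hang cannot occur.

{code:java}
List<Path> current = new ArrayList<>();
current.add(basePath);
while (!current.isEmpty()) {
  List<Future<List<Path>>> futures = new ArrayList<>();
  for (Path dir : current) {
    allDirs.add(dir);                                      // record every visited dir
    futures.add(pool.submit(() -> listSubDirs(fs, dir)));  // list one level in parallel
  }
  List<Path> next = new ArrayList<>();
  for (Future<List<Path>> f : futures) {
    next.addAll(f.get());  // only the coordinator blocks here, never a pool worker
  }
  current = next;          // descend one level
}
{code}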





[jira] [Comment Edited] (HIVE-16006) Incremental REPL LOAD doesn't operate on the target database if name differs from source database.

2017-02-23 Thread Sushanth Sowmyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881988#comment-15881988
 ] 

Sushanth Sowmyan edited comment on HIVE-16006 at 2/24/17 5:24 AM:
--

[~sankarh], good catch on the issue for the target name difference. Your change 
to ReplicationSemanticAnalyzer.java is very much on-point. 

However, the other change, to ImportSemanticAnalyzer will break current replv1, 
as the expectation of repl-import is very much a replace, not an insert-into. 
This is a case for expanding ReplicationSpec with an additional .isInsert() 
semantic that defaults to false, but in the cases of INSERT events, can be 
true, and when we instantiate the ReplicationSpec object to pass in, we'll set 
that.

Also, could you please expand on the tests in TestReplicationScenarios when 
fixing these bugs? That helps us make sure we don't regress on this in the 
future due to some other edit.



was (Author: sushanth):
[~sankarh], good catch on the issue for the target name difference. Your change 
to ReplicationSemanticAnalyzer.java is very much on-point. 

However, the other change, to ImportSemanticAnalyzer will break current replv1, 
as the expectation of repl-import is very much a replace, not an insert-into. 
This is a case for expanding ReplicationSpec with an additional .isInsert() 
semantic that defaults to false, but in the cases of INSERT events, can be true.

Also, could you please expand on the tests in TestReplicationScenarios when 
fixing these bugs? That helps us make sure we don't regress on this in the 
future due to some other edit.


> Incremental REPL LOAD doesn't operate on the target database if name differs 
> from source database.
> --
>
> Key: HIVE-16006
> URL: https://issues.apache.org/jira/browse/HIVE-16006
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
> Attachments: HIVE-16006.01.patch
>
>
> During "Incremental Load", it is not considering the database name input in 
> the command line. Hence load doesn't happen. At the same time, database with 
> original name is getting modified.
> Steps:
> 1. REPL DUMP default FROM 52;
> 2. REPL LOAD replDb FROM '/tmp/dump/1487588522621';
> – This step modifies the default Db instead of replDb.





[jira] [Commented] (HIVE-16006) Incremental REPL LOAD doesn't operate on the target database if name differs from source database.

2017-02-23 Thread Sushanth Sowmyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881988#comment-15881988
 ] 

Sushanth Sowmyan commented on HIVE-16006:
-

[~sankarh], good catch on the issue for the target name difference. Your change 
to ReplicationSemanticAnalyzer.java is very much on-point. 

However, the other change, to ImportSemanticAnalyzer will break current replv1, 
as the expectation of repl-import is very much a replace, not an insert-into. 
This is a case for expanding ReplicationSpec with an additional .isInsert() 
semantic that defaults to false, but in the cases of INSERT events, can be true.

Also, could you please expand on the tests in TestReplicationScenarios when 
fixing these bugs? That helps us make sure we don't regress on this in the 
future due to some other edit.
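A sketch of the suggested ReplicationSpec extension (field and setter names assumed, not the committed change): default to replace semantics so replv1 repl-import is unaffected, and opt into insert-into only for INSERT events.

{code:java}
public class ReplicationSpec {
  // Defaults to false: repl-import remains a replace, as replv1 expects.
  private boolean isInsert = false;

  public boolean isInsert() {
    return isInsert;
  }

  // Set to true only when replaying an INSERT event during incremental load.
  public void setIsInsert(boolean isInsert) {
    this.isInsert = isInsert;
  }
}
{code}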


> Incremental REPL LOAD doesn't operate on the target database if name differs 
> from source database.
> --
>
> Key: HIVE-16006
> URL: https://issues.apache.org/jira/browse/HIVE-16006
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
> Attachments: HIVE-16006.01.patch
>
>
> During "Incremental Load", it is not considering the database name input in 
> the command line. Hence load doesn't happen. At the same time, database with 
> original name is getting modified.
> Steps:
> 1. REPL DUMP default FROM 52;
> 2. REPL LOAD replDb FROM '/tmp/dump/1487588522621';
> – This step modifies the default Db instead of replDb.





[jira] [Commented] (HIVE-15993) Hive REPL STATUS is not returning last event ID

2017-02-23 Thread Sankar Hariappan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881987#comment-15881987
 ] 

Sankar Hariappan commented on HIVE-15993:
-

Thanks a lot [~sushanth]!

> Hive REPL STATUS is not returning last event ID
> ---
>
> Key: HIVE-15993
> URL: https://issues.apache.org/jira/browse/HIVE-15993
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
> Fix For: 2.2.0
>
> Attachments: HIVE-15993.01.patch
>
>
> While running "REPL STATUS" on target to get last event ID for DB, it returns 
> zero rows.
> 0: jdbc:hive2://localhost:10001/repl> REPL status repl;
> No rows affected (932.167 seconds)





[jira] [Updated] (HIVE-16006) Incremental REPL LOAD doesn't operate on the target database if name differs from source database.

2017-02-23 Thread Sushanth Sowmyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushanth Sowmyan updated HIVE-16006:

Issue Type: Sub-task  (was: Bug)
Parent: HIVE-14841

> Incremental REPL LOAD doesn't operate on the target database if name differs 
> from source database.
> --
>
> Key: HIVE-16006
> URL: https://issues.apache.org/jira/browse/HIVE-16006
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
> Attachments: HIVE-16006.01.patch
>
>
> During "Incremental Load", it is not considering the database name input in 
> the command line. Hence load doesn't happen. At the same time, database with 
> original name is getting modified.
> Steps:
> 1. REPL DUMP default FROM 52;
> 2. REPL LOAD replDb FROM '/tmp/dump/1487588522621';
> – This step modifies the default Db instead of replDb.





[jira] [Updated] (HIVE-15993) Hive REPL STATUS is not returning last event ID

2017-02-23 Thread Sushanth Sowmyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushanth Sowmyan updated HIVE-15993:

   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Committed to master.

> Hive REPL STATUS is not returning last event ID
> ---
>
> Key: HIVE-15993
> URL: https://issues.apache.org/jira/browse/HIVE-15993
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
> Fix For: 2.2.0
>
> Attachments: HIVE-15993.01.patch
>
>
> While running "REPL STATUS" on target to get last event ID for DB, it returns 
> zero rows.
> 0: jdbc:hive2://localhost:10001/repl> REPL status repl;
> No rows affected (932.167 seconds)





[jira] [Commented] (HIVE-15993) Hive REPL STATUS is not returning last event ID

2017-02-23 Thread Sushanth Sowmyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881962#comment-15881962
 ] 

Sushanth Sowmyan commented on HIVE-15993:
-

Verified: HIVE-15333 fixed the same issue for REPL DUMP. HS2 depends on a 
FetchTask being created, even if we set the stream output.

+1, committing. Thanks, [~sankarh]

> Hive REPL STATUS is not returning last event ID
> ---
>
> Key: HIVE-15993
> URL: https://issues.apache.org/jira/browse/HIVE-15993
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
> Attachments: HIVE-15993.01.patch
>
>
> While running "REPL STATUS" on target to get last event ID for DB, it returns 
> zero rows.
> 0: jdbc:hive2://localhost:10001/repl> REPL status repl;
> No rows affected (932.167 seconds)





[jira] [Updated] (HIVE-15993) Hive REPL STATUS is not returning last event ID

2017-02-23 Thread Sushanth Sowmyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushanth Sowmyan updated HIVE-15993:

Issue Type: Sub-task  (was: Bug)
Parent: HIVE-14841

> Hive REPL STATUS is not returning last event ID
> ---
>
> Key: HIVE-15993
> URL: https://issues.apache.org/jira/browse/HIVE-15993
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
> Attachments: HIVE-15993.01.patch
>
>
> While running "REPL STATUS" on target to get last event ID for DB, it returns 
> zero rows.
> 0: jdbc:hive2://localhost:10001/repl> REPL status repl;
> No rows affected (932.167 seconds)





[jira] [Updated] (HIVE-16033) LLAP: Use PrintGCDateStamps for gc logging

2017-02-23 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-16033:
-
Status: Open  (was: Patch Available)

No unit tests are required for this

> LLAP: Use PrintGCDateStamps for gc logging
> --
>
> Key: HIVE-16033
> URL: https://issues.apache.org/jira/browse/HIVE-16033
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-16033.1.patch
>
>
> This prints human-readable timestamps instead of timestamps relative to JVM 
> startup.
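For reference, a typical JDK 8 GC-logging flag set with date stamps enabled (the log path is illustrative):

{noformat}
-Xloggc:/var/log/hive/llap-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
{noformat}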





[jira] [Updated] (HIVE-16033) LLAP: Use PrintGCDateStamps for gc logging

2017-02-23 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-16033:
-
Status: Patch Available  (was: Open)

> LLAP: Use PrintGCDateStamps for gc logging
> --
>
> Key: HIVE-16033
> URL: https://issues.apache.org/jira/browse/HIVE-16033
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-16033.1.patch
>
>
> This prints human-readable timestamps instead of timestamps relative to JVM 
> startup.





[jira] [Updated] (HIVE-16033) LLAP: Use PrintGCDateStamps for gc logging

2017-02-23 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-16033:
-
Attachment: HIVE-16033.1.patch

[~sseth] can you please take a look?

> LLAP: Use PrintGCDateStamps for gc logging
> --
>
> Key: HIVE-16033
> URL: https://issues.apache.org/jira/browse/HIVE-16033
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-16033.1.patch
>
>
> This prints human-readable timestamps instead of timestamps relative to JVM 
> startup.





[jira] [Assigned] (HIVE-16033) LLAP: Use PrintGCDateStamps for gc logging

2017-02-23 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran reassigned HIVE-16033:



> LLAP: Use PrintGCDateStamps for gc logging
> --
>
> Key: HIVE-16033
> URL: https://issues.apache.org/jira/browse/HIVE-16033
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-16033.1.patch
>
>
> This prints human-readable timestamps instead of timestamps relative to JVM 
> startup.





[jira] [Commented] (HIVE-16028) Fail UPDATE/DELETE/MERGE queries when Ranger authorization manager is used

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881901#comment-15881901
 ] 

Hive QA commented on HIVE-16028:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854300/HIVE-16028.2.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 10258 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=223)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3739/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3739/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3739/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854300 - PreCommit-HIVE-Build

> Fail UPDATE/DELETE/MERGE queries when Ranger authorization manager is used
> --
>
> Key: HIVE-16028
> URL: https://issues.apache.org/jira/browse/HIVE-16028
> Project: Hive
>  Issue Type: Bug
>  Components: Authorization, Transactions
>Affects Versions: 2.2.0
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-16028.1.patch, HIVE-16028.2.patch
>
>
> This is a followup of HIVE-15891. In that jira, error-out logic was added, 
> but the assumption that a non-empty list of tables returned by 
> applyRowFilterAndColumnMasking means row filtering/column masking is needed 
> is wrong: on the Ranger side, 
> RangerHiveAuthorizer#applyRowFilterAndColumnMasking unconditionally returns 
> a list of tables regardless of whether row filtering/column masking is 
> applicable to them.
> The fix for Hive for now will be to move the error-out logic after we figure 
> out there's no replacement text for the query. But ideally we should consider 
> modifying Ranger logic to only return tables that need to be masked.
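A minimal sketch of the described fix (the isUpdateDeleteOrMerge flag is hypothetical; the HivePrivilegeObject accessors are from the authorization plugin API): error out only when a returned object actually carries a row filter or cell-value transformers, i.e., when masking would really rewrite the query.

{code:java}
boolean needsMasking = false;
for (HivePrivilegeObject po :
    authorizer.applyRowFilterAndColumnMasking(authzContext, privObjects)) {
  if (po.getRowFilterExpression() != null
      || (po.getCellValueTransformers() != null
          && !po.getCellValueTransformers().isEmpty())) {
    needsMasking = true;  // a real replacement text would be generated
    break;
  }
}
if (needsMasking && isUpdateDeleteOrMerge) {  // hypothetical flag for the ACID ops
  throw new SemanticException(
      "Row filtering/column masking cannot be combined with UPDATE/DELETE/MERGE");
}
{code}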





[jira] [Commented] (HIVE-15903) Compute table stats when user computes column stats

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881849#comment-15881849
 ] 

Hive QA commented on HIVE-15903:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854260/HIVE-15903.03.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10244 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=125)
org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel 
(batchId=211)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3738/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3738/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3738/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854260 - PreCommit-HIVE-Build

> Compute table stats when user computes column stats
> ---
>
> Key: HIVE-15903
> URL: https://issues.apache.org/jira/browse/HIVE-15903
> Project: Hive
>  Issue Type: Bug
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-15903.01.patch, HIVE-15903.02.patch, 
> HIVE-15903.03.patch
>
>






[jira] [Commented] (HIVE-16032) MM tables: encrypted/(minimr?) CLI driver + fetch optimizer => no results

2017-02-23 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881846#comment-15881846
 ] 

Sergey Shelukhin commented on HIVE-16032:
-

Interestingly, even on master, runTasks when executing the Driver for the 
select produces no logging output and takes 0 ms.
I thought it was a problem with the branch, but master still produces the 
correct result somehow.

> MM tables: encrypted/(minimr?) CLI driver + fetch optimizer => no results
> -
>
> Key: HIVE-16032
> URL: https://issues.apache.org/jira/browse/HIVE-16032
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>
> The repro does not require encryption, but it doesn't happen on CliDriver.
> The easiest way to repro (the query returns results with 
> hive.fetch.task.conversion=none, and no results with =more, the default):
> {noformat}
> DROP TABLE IF EXISTS encrypted_table PURGE;
> CREATE TABLE encrypted_table (key INT, value STRING) LOCATION 
> '${hiveconf:hive.metastore.warehouse.dir}/default/encrypted_table';
> INSERT INTO encrypted_table values(1,'foo'),(2,'bar');
> set hive.fetch.task.conversion=none;
> select * from encrypted_table;
> set hive.fetch.task.conversion=more;
> select * from encrypted_table;
> {noformat}





[jira] [Assigned] (HIVE-16032) MM tables: encrypted/(minimr?) CLI driver + fetch optimizer => no results

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin reassigned HIVE-16032:
---


> MM tables: encrypted/(minimr?) CLI driver + fetch optimizer => no results
> -
>
> Key: HIVE-16032
> URL: https://issues.apache.org/jira/browse/HIVE-16032
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>
> The repro does not require encryption, but it doesn't happen on CliDriver.
> The easiest way to repro (the query returns results with 
> hive.fetch.task.conversion=none, and no results with =more, the default):
> {noformat}
> DROP TABLE IF EXISTS encrypted_table PURGE;
> CREATE TABLE encrypted_table (key INT, value STRING) LOCATION 
> '${hiveconf:hive.metastore.warehouse.dir}/default/encrypted_table';
> INSERT INTO encrypted_table values(1,'foo'),(2,'bar');
> set hive.fetch.task.conversion=none;
> select * from encrypted_table;
> set hive.fetch.task.conversion=more;
> select * from encrypted_table;
> {noformat}





[jira] [Updated] (HIVE-15668) change REPL DUMP syntax to use "LIMIT" instead of "BATCH" keyword

2017-02-23 Thread Sushanth Sowmyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushanth Sowmyan updated HIVE-15668:

   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

> change REPL DUMP syntax to use "LIMIT" instead of "BATCH" keyword
> -
>
> Key: HIVE-15668
> URL: https://issues.apache.org/jira/browse/HIVE-15668
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Fix For: 2.2.0
>
> Attachments: HIVE-15668.2.patch, HIVE-15668.patch
>
>
> Currently, REPL DUMP syntax goes:
> {noformat}
> REPL DUMP <dbname>[[.]<tablename>] [FROM <init-evid> [BATCH <batch-size>]]
> {noformat}
> The BATCH directive says that when doing an event dump, to not dump out more 
> than _batchSize_ number of events. However, there is a clearer keyword for 
> the same effect, and that is LIMIT. Thus, rephrasing the syntax as follows 
> makes it clearer:
> {noformat}
> REPL DUMP <dbname>[[.]<tablename>] [FROM <init-evid> [LIMIT <batch-size>]]
> {noformat}





[jira] [Commented] (HIVE-15668) change REPL DUMP syntax to use "LIMIT" instead of "BATCH" keyword

2017-02-23 Thread Sushanth Sowmyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881835#comment-15881835
 ] 

Sushanth Sowmyan commented on HIVE-15668:
-

Thanks, Thejas. Committed to master.

> change REPL DUMP syntax to use "LIMIT" instead of "BATCH" keyword
> -
>
> Key: HIVE-15668
> URL: https://issues.apache.org/jira/browse/HIVE-15668
> Project: Hive
>  Issue Type: Sub-task
>  Components: repl
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Attachments: HIVE-15668.2.patch, HIVE-15668.patch
>
>
> Currently, REPL DUMP syntax goes:
> {noformat}
> REPL DUMP <dbname>[[.]<tablename>] [FROM <init-evid> [BATCH <batch-size>]]
> {noformat}
> The BATCH directive says that when doing an event dump, to not dump out more 
> than _batchSize_ number of events. However, there is a clearer keyword for 
> the same effect, and that is LIMIT. Thus, rephrasing the syntax as follows 
> makes it clearer:
> {noformat}
> REPL DUMP <dbname>[[.]<tablename>] [FROM <init-evid> [LIMIT <batch-size>]]
> {noformat}





[jira] [Comment Edited] (HIVE-15879) Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-23 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881827#comment-15881827
 ] 

Rajesh Balamohan edited comment on HIVE-15879 at 2/24/17 2:53 AM:
--

Thanks for sharing the details [~vihangk1]. I have a different point of view 
here.

I agree that ThreadPoolExecutor.getActiveCount() is approximate. It is 
approximate because, by the time {{getActiveCount()}} iterates over the 
running threads in the worker list, it is possible that some of the threads 
which were executing have completed. 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/concurrent/ThreadPoolExecutor.java#l1818.
So the reported number could be slightly higher than the number of actually 
running threads. But it would never be less, as a new Worker in 
ThreadPoolExecutor is added under {{mainLock}}. 

In the context of the MSCK logic, this approximation should not be a problem. 

This is due to the check "(pool.getActiveCount() < pool.getMaximumPoolSize())". 
If the executor reports an approximate value (i.e., higher than the actual 
number of threads), the thread pool will not be used per the current logic. So 
in corner cases there can be instances wherein the thread pool could have been 
used, but wasn't, due to the approximate (higher) values reported by 
ThreadPoolExecutor. 


was (Author: rajesh.balamohan):
Thanks for sharing the details [~vihangk1]. I have a different point of view 
here.

I agree that ThreadPoolExecutor.getActiveCount() is approximate. It is 
approximate because, by the time {{getActiveCount()}} iterates over the 
running threads in the worker list, it is possible that some of the threads 
which were executing have completed. 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/concurrent/ThreadPoolExecutor.java#l1818.
So the reported number could be slightly higher than the number of actually 
running threads. But it would never be less, as a new Worker in 
ThreadPoolExecutor is added under {{mainLock}}. 

In the context of the MSCK logic, this approximation should not be a problem. 
This is due to the check "(pool.getActiveCount() < pool.getMaximumPoolSize())". 
If the executor reports an approximate value (i.e., higher than the actual 
number of threads), the thread pool will not be used per the current logic. So 
in corner cases there can be instances wherein the thread pool could have been 
used, but wasn't, due to the approximate (higher) values reported by 
ThreadPoolExecutor. 

> Fix HiveMetaStoreChecker.checkPartitionDirs method
> --
>
> Key: HIVE-15879
> URL: https://issues.apache.org/jira/browse/HIVE-15879
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-15879.01.patch
>
>
> HIVE-15803 fixes the msck hang issue in 
> HiveMetaStoreChecker.checkPartitionDirs method by adding a check to see if 
> the Threadpool has any spare threads. If not it uses single threaded listing 
> of the files.
> {noformat}
> if (pool != null) {
>   synchronized (pool) {
> // In case of recursive calls, it is possible to deadlock with TP. 
> Check TP usage here.
> if (pool.getActiveCount() < pool.getMaximumPoolSize()) {
>   useThreadPool = true;
> }
> if (!useThreadPool) {
>   if (LOG.isDebugEnabled()) {
> LOG.debug("Not using threadPool as active count:" + 
> pool.getActiveCount()
> + ", max:" + pool.getMaximumPoolSize());
>   }
> }
>   }
> }
> {noformat}
> Based on the javadoc of getActiveCount() below,
> bq. Returns the approximate number of threads that are actively executing 
> tasks.
> it returns only an approximate number of threads, and it cannot be guaranteed 
> that it always returns the exact number of active threads. This still exposes 
> the method implementation to the msck hang bug in rare corner cases.
> We could either:
> 1. Use an atomic counter to track exactly how many threads are actively running
> 2. Relook at the method itself to make it much simpler. E.g., look into 
> the possibility of changing the recursive implementation to an iterative 
> implementation where worker threads pick tasks from a queue until the queue 
> is empty.





[jira] [Commented] (HIVE-15879) Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-23 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881827#comment-15881827
 ] 

Rajesh Balamohan commented on HIVE-15879:
-

Thanks for sharing the details [~vihangk1]. I have a different point of view 
here.

I agree that ThreadPoolExecutor.getActiveCount() is approximate. It is 
approximate because, by the time {{getActiveCount()}} iterates over the 
running threads in the worker list, it is possible that some of the threads 
which were executing have completed. 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/concurrent/ThreadPoolExecutor.java#l1818.
So the reported number could be slightly higher than the number of actually 
running threads. But it would never be less, as a new Worker in 
ThreadPoolExecutor is added under {{mainLock}}. 

In the context of the MSCK logic, this approximation should not be a problem. 
This is due to the check "(pool.getActiveCount() < pool.getMaximumPoolSize())". 
If the executor reports an approximate value (i.e., higher than the actual 
number of threads), the thread pool will not be used per the current logic. So 
in corner cases there can be instances wherein the thread pool could have been 
used, but wasn't, due to the approximate (higher) values reported by 
ThreadPoolExecutor. 

> Fix HiveMetaStoreChecker.checkPartitionDirs method
> --
>
> Key: HIVE-15879
> URL: https://issues.apache.org/jira/browse/HIVE-15879
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-15879.01.patch
>
>
> HIVE-15803 fixes the msck hang issue in 
> HiveMetaStoreChecker.checkPartitionDirs method by adding a check to see if 
> the Threadpool has any spare threads. If not it uses single threaded listing 
> of the files.
> {noformat}
> if (pool != null) {
>   synchronized (pool) {
> // In case of recursive calls, it is possible to deadlock with TP. 
> Check TP usage here.
> if (pool.getActiveCount() < pool.getMaximumPoolSize()) {
>   useThreadPool = true;
> }
> if (!useThreadPool) {
>   if (LOG.isDebugEnabled()) {
> LOG.debug("Not using threadPool as active count:" + 
> pool.getActiveCount()
> + ", max:" + pool.getMaximumPoolSize());
>   }
> }
>   }
> }
> {noformat}
> Based on the javadoc of getActiveCount() below,
> bq. Returns the approximate number of threads that are actively executing 
> tasks.
> it returns only an approximate number of threads, and it cannot be guaranteed 
> that it always returns the exact number of active threads. This still exposes 
> the method implementation to the msck hang bug in rare corner cases.
> We could either:
> 1. Use an atomic counter to track exactly how many threads are actively running
> 2. Relook at the method itself to make it much simpler. E.g., look into 
> the possibility of changing the recursive implementation to an iterative 
> implementation where worker threads pick tasks from a queue until the queue 
> is empty.





[jira] [Updated] (HIVE-16004) OutOfMemory in SparkReduceRecordHandler with vectorization mode

2017-02-23 Thread Ferdinand Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdinand Xu updated HIVE-16004:

   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Committed to master. Thanks [~colin_mjj] for the contribution and [~xuefuz] 
for the review.

> OutOfMemory in SparkReduceRecordHandler with vectorization mode
> ---
>
> Key: HIVE-16004
> URL: https://issues.apache.org/jira/browse/HIVE-16004
> Project: Hive
>  Issue Type: Bug
>Reporter: Colin Ma
>Assignee: Colin Ma
> Fix For: 2.2.0
>
> Attachments: HIVE-16004.001.patch, HIVE-16004.002.patch
>
>
> For query 28 of TPCx-BB with 1 TB of data, the executor memory is set to 30 GB. 
> We get the following exception:
> java.lang.OutOfMemoryError
>   at 
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setVector(VectorizedBatchUtil.java:467)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatchFrom(VectorizedBatchUtil.java:238)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processVectors(SparkReduceRecordHandler.java:367)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:286)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:220)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:49)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>   at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1974)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745) 
> I think the DataOutputBuffer isn't cleared in time, which causes this problem.
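> If that is the cause, the usual fix is to reuse and reset one buffer per row 
> instead of letting it grow; a minimal sketch only (the method shape is an 
> assumption, not the committed patch):
> {code:java}
> import java.io.IOException;
> import java.io.OutputStream;
> import org.apache.hadoop.io.DataOutputBuffer;
> 
> void writeRows(Iterable<byte[]> serializedRows, OutputStream out) throws IOException {
>   DataOutputBuffer buffer = new DataOutputBuffer();
>   for (byte[] row : serializedRows) {
>     buffer.reset();     // reuse the backing byte[] instead of letting it grow unboundedly
>     buffer.write(row);  // stands in for the per-row serialization step
>     out.write(buffer.getData(), 0, buffer.getLength());
>   }
> }
> {code}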



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881805#comment-15881805
 ] 

Hive QA commented on HIVE-16014:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854275/HIVE-16014.02.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 10257 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=140)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3737/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3737/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3737/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854275 - PreCommit-HIVE-Build

> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of 
> hive.mv.files.thread for pool size
> --
>
> Key: HIVE-16014
> URL: https://issues.apache.org/jira/browse/HIVE-16014
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-16014.01.patch, HIVE-16014.02.patch
>
>
> HiveMetastoreChecker uses hive.mv.files.thread configuration value for 
> determining the pool size as below :
> {noformat}
> private void checkPartitionDirs(Path basePath, Set<Path> allDirs, int 
> maxDepth) throws IOException, HiveException {
> ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
> basePaths.add(basePath);
> Set<Path> dirSet = Collections.newSetFromMap(new ConcurrentHashMap<Path, Boolean>());
> // Here we just reuse the THREAD_COUNT configuration for
> // HIVE_MOVE_FILES_THREAD_COUNT
> int poolSize = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 
> 15);
> // Check if too low config is provided for move files. 2x CPU is 
> reasonable max count.
> poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
> Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add the missing partitions for a table from the 
> filesystem. In such a case, different pool sizes for HMSHandler and 
> HiveMetastoreChecker can affect performance. E.g. if 
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
> {{hive.mv.files.thread}} is much higher like 100, or vice versa, the smaller 
> pool will become the bottleneck. It would be good to use 
> {{hive.metastore.fshandler.threads}} to size the pool for 
> HiveMetastoreChecker, since the number of missing partitions and the number 
> of partitions to be added will most likely be the same. In such a case the 
> performance of the query will be optimal when both pool sizes are the same.
> Since it is possible to tune both configs individually, it is very likely 
> that they will differ. But since there is a strong correlation between the 
> amount of work done by HiveMetastoreChecker and the 
> HiveMetastore.add_partitions call, it makes sense to use 
> {{hive.metastore.fshandler.threads}} for the pool size instead of 
> {{hive.mv.files.thread}}.
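> A one-line sketch of the proposed change (the exact ConfVars constant name 
> here is an assumption):
> {code:java}
> // Size the msck pool from the same knob that sizes the HMSHandler pool,
> // so neither side becomes the bottleneck.
> int poolSize = conf.getInt(ConfVars.METASTORE_FS_HANDLER_THREADS_COUNT.varname, 15);
> {code}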



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15856) Hive export/import (hive.exim.uri.scheme.whitelist) to support s3a

2017-02-23 Thread John Zhuge (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881798#comment-15881798
 ] 

John Zhuge commented on HIVE-15856:
---

With [~stakiar]'s help, I was able to run {{hive-blobstore}} itests on ADLS 
after adding {{adl}} to the whitelist. The itests eventually failed at other 
places; we will file a separate JIRA.

> Hive export/import (hive.exim.uri.scheme.whitelist) to support s3a
> --
>
> Key: HIVE-15856
> URL: https://issues.apache.org/jira/browse/HIVE-15856
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>
> Add support for export / import operations on S3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15882) HS2 generating high memory pressure with many partitions and concurrent queries

2017-02-23 Thread Misha Dmitriev (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881788#comment-15881788
 ] 

Misha Dmitriev commented on HIVE-15882:
---

[~lirui] sure - the RB for the first change (string interning), which matches 
this patch, has already been created. See https://reviews.apache.org/r/56687/

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> ---
>
> Key: HIVE-15882
> URL: https://issues.apache.org/jira/browse/HIVE-15882
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
> Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:1 -n admin -p 
> admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in a HS2 
> server with -Xmx200m and with 50 queries - in the one with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be highly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or have just 1 element. By modifying the code that creates and 
> populates these collections, we may likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. It 
> looks like these objects are highly duplicate, since for each Partition each 
> concurrently running query creates its own copy of Partition, PartitionDesc and 
> Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
> Properties in memory. By interning/deduplicating these objects we may be able 
> to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.
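> For items 1 and 3, the shape of the fix is roughly the following, as a 
> sketch only (it assumes the cached Properties are effectively read-only 
> after creation; this is not the actual patch):
> {code:java}
> import java.util.Properties;
> import java.util.concurrent.ConcurrentHashMap;
> 
> public final class Interner {
>   // 1. Intern duplicated strings at their creation sites, e.g.:
>   //      this.inputDir = dir.intern();
> 
>   // 3. Canonicalize content-equal Properties so all PartitionDescs share one copy.
>   //    Properties inherits content-based equals()/hashCode() from Hashtable.
>   private static final ConcurrentHashMap<Properties, Properties> CACHE =
>       new ConcurrentHashMap<>();
> 
>   public static Properties canonicalize(Properties p) {
>     Properties prev = CACHE.putIfAbsent(p, p);
>     return prev != null ? prev : p;  // return the first equal instance seen
>   }
> }
> {code}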



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-13864) Beeline ignores the command that follows a semicolon and comment

2017-02-23 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881784#comment-15881784
 ] 

Yongzhi Chen commented on HIVE-13864:
-

The failures are not related.

> Beeline ignores the command that follows a semicolon and comment
> 
>
> Key: HIVE-13864
> URL: https://issues.apache.org/jira/browse/HIVE-13864
> Project: Hive
>  Issue Type: Bug
>Reporter: Muthu Manickam
>Assignee: Yongzhi Chen
> Attachments: HIVE-13864.01.patch, HIVE-13864.02.patch, 
> HIVE-13864.3.patch, HIVE-13864.4.patch
>
>
> Beeline ignores the next line/command that follows a command with semicolon 
> and comments.
> Example 1:
> select *
> from table1; -- comments
> select * from table2;
> In this case, only the first command is executed.. second command "select * 
> from table2" is not executed.
> --
> Example 2:
> select *
> from table1; -- comments
> select * from table2;
> select * from table3;
> In this case, first command and third command is executed. second command 
> "select * from table2" is not executed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15743) vectorized text parsing: speed up double parse

2017-02-23 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881743#comment-15881743
 ] 

Sergey Shelukhin commented on HIVE-15743:
-

Hmm... should this be committed?

> vectorized text parsing: speed up double parse
> --
>
> Key: HIVE-15743
> URL: https://issues.apache.org/jira/browse/HIVE-15743
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Teddy Choi
> Attachments: HIVE-15743.1.patch, HIVE-15743.2.patch, 
> HIVE-15743.3.patch, HIVE-15743.4.patch, tpch-without.png
>
>
> {noformat}
> Double.parseDouble(
> new String(bytes, fieldStart, fieldLength, 
> StandardCharsets.UTF_8));{noformat}
> This takes ~25% of the query time in some cases.
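> One common way to attack this (a sketch of the general trick, not the 
> attached patch) is a fast path that parses the frequent "[-]digits[.digits]" 
> shape straight from the bytes and falls back to Double.parseDouble for 
> everything else:
> {code:java}
> static final double[] POW10 = {1d, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7,
>     1e8, 1e9, 1e10, 1e11, 1e12, 1e13, 1e14, 1e15};
> 
> static double parseDouble(byte[] bytes, int start, int len) {
>   long mantissa = 0;
>   int digits = 0, fracDigits = 0, i = start, end = start + len;
>   boolean negative = false, seenDot = false;
>   if (i < end && (bytes[i] == '-' || bytes[i] == '+')) {
>     negative = bytes[i] == '-';
>     i++;
>   }
>   for (; i < end; i++) {
>     byte b = bytes[i];
>     if (b >= '0' && b <= '9') {
>       mantissa = mantissa * 10 + (b - '0');
>       digits++;
>       if (seenDot) fracDigits++;
>     } else if (b == '.' && !seenDot) {
>       seenDot = true;
>     } else {
>       return slowPath(bytes, start, len);  // exponents, NaN, whitespace, ...
>     }
>   }
>   if (digits == 0 || digits > 15) {        // empty, or too long to stay exact
>     return slowPath(bytes, start, len);
>   }
>   double value = mantissa / POW10[fracDigits];
>   return negative ? -value : value;
> }
> 
> static double slowPath(byte[] bytes, int start, int len) {
>   return Double.parseDouble(
>       new String(bytes, start, len, java.nio.charset.StandardCharsets.UTF_8));
> }
> {code}
> The fast path stays exact because both the mantissa (at most 15 digits) and 
> the power of ten are exactly representable as doubles, so the final division 
> is correctly rounded.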



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Issue Comment Deleted] (HIVE-15924) move ORC PPD failure message caused by a dynamic value to DEBUG level

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15924:

Comment: was deleted

(was: +1. Is it necessary to log the exception? I noticed that for the case of 
dynamic values this is being logged many times; I wonder if on debug level it 
would log > 10 exception stacks.)

> move ORC PPD failure message caused by a dynamic value to DEBUG level
> -
>
> Key: HIVE-15924
> URL: https://issues.apache.org/jira/browse/HIVE-15924
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15924.1.patch
>
>
> Several WARN msgs are observed like below when running LLAP with default 
> configurations
> {code}
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-3 
> (1484282558103_6753_2_05_57_0)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_13_item_ss_item_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-3 
> (1484282558103_6753_2_05_57_0)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_13_item_ss_item_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> {code}
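> The proposed change is essentially a log-level demotion; a sketch of the 
> pattern (illustrative only, not the exact patch):
> {code:java}
> } catch (IllegalStateException e) {
>   // A DynamicValue that is not yet available is expected per row group,
>   // so don't WARN; guard the call so the message isn't even built
>   // unless DEBUG is enabled.
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping ORC PPD. Exception while evaluating predicate: " + e);
>   }
> }
> {code}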



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15879) Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881740#comment-15881740
 ] 

Ashutosh Chauhan commented on HIVE-15879:
-

cc: [~rajesh.balamohan] [~pxiong]

> Fix HiveMetaStoreChecker.checkPartitionDirs method
> --
>
> Key: HIVE-15879
> URL: https://issues.apache.org/jira/browse/HIVE-15879
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-15879.01.patch
>
>
> HIVE-15803 fixes the msck hang issue in 
> HiveMetaStoreChecker.checkPartitionDirs method by adding a check to see if 
> the Threadpool has any spare threads. If not it uses single threaded listing 
> of the files.
> {noformat}
> if (pool != null) {
>   synchronized (pool) {
> // In case of recursive calls, it is possible to deadlock with TP. 
> Check TP usage here.
> if (pool.getActiveCount() < pool.getMaximumPoolSize()) {
>   useThreadPool = true;
> }
> if (!useThreadPool) {
>   if (LOG.isDebugEnabled()) {
> LOG.debug("Not using threadPool as active count:" + 
> pool.getActiveCount()
> + ", max:" + pool.getMaximumPoolSize());
>   }
> }
>   }
> }
> {noformat}
> Based on the javadoc of getActiveCount() below:
> bq. Returns the approximate number of threads that are actively executing 
> tasks.
> it returns only an approximate number of threads, and it cannot be 
> guaranteed to always return the exact number of active threads. This still 
> exposes the method implementation to the msck hang bug in rare corner cases.
> We could either:
> 1. Use an atomic counter to track exactly how many threads are actively 
> running
> 2. Rework the method itself to make it much simpler, e.g. look into the 
> possibility of changing the recursive implementation to an iterative 
> implementation where worker threads pick tasks from a queue until the queue 
> is empty (see the sketch below).
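> A minimal sketch of option 2, assuming a Hadoop FileSystem and an existing 
> executor (the method {{listDirsIteratively}} and its shape are illustrative, 
> not actual HiveMetaStoreChecker code):
> {code:java}
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Future;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> 
> // Level-by-level BFS: only the coordinator blocks on futures; worker tasks
> // never submit work and then wait on it, so the pool cannot deadlock
> // regardless of its size.
> List<Path> listDirsIteratively(FileSystem fs, Path basePath, ExecutorService pool)
>     throws InterruptedException, ExecutionException {
>   List<Path> allDirs = new ArrayList<>();
>   List<Path> current = Collections.singletonList(basePath);
>   while (!current.isEmpty()) {
>     List<Future<List<Path>>> futures = new ArrayList<>();
>     for (Path dir : current) {
>       futures.add(pool.submit(() -> {
>         List<Path> children = new ArrayList<>();
>         for (FileStatus st : fs.listStatus(dir)) {  // IOException surfaces via the Future
>           if (st.isDirectory()) {
>             children.add(st.getPath());
>           }
>         }
>         return children;
>       }));
>     }
>     List<Path> next = new ArrayList<>();
>     for (Future<List<Path>> f : futures) {
>       next.addAll(f.get());
>     }
>     allDirs.addAll(current);
>     current = next;
>   }
>   return allDirs;
> }
> {code}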



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15924) move ORC PPD failure message caused by a dynamic value to DEBUG level

2017-02-23 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881741#comment-15881741
 ] 

Sergey Shelukhin commented on HIVE-15924:
-

+1. Is it necessary to log the exception? I noticed that for the case of 
dynamic values this is being logged many times; I wonder if on debug level it 
would log > 10 exception stacks.

> move ORC PPD failure message caused by a dynamic value to DEBUG level
> -
>
> Key: HIVE-15924
> URL: https://issues.apache.org/jira/browse/HIVE-15924
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15924.1.patch
>
>
> Several WARN msgs are observed like below when running LLAP with default 
> configurations
> {code}
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-3 
> (1484282558103_6753_2_05_57_0)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_13_item_ss_item_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-3 
> (1484282558103_6753_2_05_57_0)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_13_item_ss_item_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15924) move ORC PPD failure message caused by a dynamic value to DEBUG level

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15924:

Status: Patch Available  (was: Open)

> move ORC PPD failure message caused by a dynamic value to DEBUG level
> -
>
> Key: HIVE-15924
> URL: https://issues.apache.org/jira/browse/HIVE-15924
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15924.1.patch
>
>
> Several WARN msgs are observed like below when running LLAP with default 
> configurations
> {code}
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-3 
> (1484282558103_6753_2_05_57_0)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_13_item_ss_item_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-3 
> (1484282558103_6753_2_05_57_0)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_13_item_ss_item_sk_min StatsType: 
> Long PredicateType: null
> 2017-02-14T17:42:06,665  WARN [IO-Elevator-Thread-8 
> (1484282558103_6753_2_05_30_2)] impl.RecordReaderImpl: 
> IllegalStateException when evaluating predicate. Skipping ORC PPD. Exception: 
> Failed to retrieve dynamic value for RS_19_store_ss_store_sk_min StatsType: 
> Long PredicateType: null
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15882) HS2 generating high memory pressure with many partitions and concurrent queries

2017-02-23 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881733#comment-15881733
 ] 

Rui Li commented on HIVE-15882:
---

Thanks [~mi...@cloudera.com] for the benchmark results. It looks promising. 
Would you mind creating an RB entry for your patch? You can update it to 
combine your other patch if you'd like to have it in this JIRA.

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> ---
>
> Key: HIVE-15882
> URL: https://issues.apache.org/jira/browse/HIVE-15882
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
> Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:1 -n admin -p 
> admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in a HS2 
> server with -Xmx200m and with 50 queries - in the one with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be highly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or have just 1 element. By modifying the code that creates and 
> populates these collections, we may likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. It 
> looks like these objects are highly duplicate, since for each Partition each 
> concurrently running query creates its own copy of Partition, PartitionDesc and 
> Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
> Properties in memory. By interning/deduplicating these objects we may be able 
> to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16022) BloomFilter check not showing up in MERGE statement queries

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881719#comment-15881719
 ] 

Hive QA commented on HIVE-16022:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854159/HIVE-16022.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 10258 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=136)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_partition_pruning]
 (batchId=144)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_dynamic_partition_pruning]
 (batchId=144)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel 
(batchId=211)
org.apache.hive.service.server.TestHS2HttpServer.testContextRootUrlRewrite 
(batchId=186)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3736/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3736/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3736/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854159 - PreCommit-HIVE-Build

> BloomFilter check not showing up in MERGE statement queries
> ---
>
> Key: HIVE-16022
> URL: https://issues.apache.org/jira/browse/HIVE-16022
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-16022.1.patch, HIVE-16022.2.patch
>
>
> Running explain on a MERGE statement with runtime filtering enabled, I see 
> the min/max being applied on the large table, but not the bloom filter check:
> {noformat}
> explain merge into acidTbl as t using nonAcidOrcTbl s ON t.a = s.a
> WHEN MATCHED AND s.a > 8 THEN DELETE
> WHEN MATCHED THEN UPDATE SET b = 7
> WHEN NOT MATCHED THEN INSERT VALUES(s.a, s.b)
> ...
> Map 1
> Map Operator Tree:
> TableScan
>   alias: t
>   Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
>   Filter Operator
> predicate: a BETWEEN DynamicValue(RS_3_s_a_min) AND 
> DynamicValue(RS_3_s_a_max) (type: boolean)
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
> {noformat}
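> For comparison, when the bloom filter check is generated one would expect 
> the predicate to look roughly like this (illustrative, based on how other 
> runtime-filtered plans print):
> {noformat}
> predicate: (a BETWEEN DynamicValue(RS_3_s_a_min) AND 
> DynamicValue(RS_3_s_a_max) and in_bloom_filter(a, 
> DynamicValue(RS_3_s_a_bloom_filter))) (type: boolean)
> {noformat}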



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-14990) run all tests for MM tables and fix the issues that are found

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-14990:

Attachment: HIVE-14990.14.patch

The patch again, after fixing a bug introduced during the recent merge

> run all tests for MM tables and fix the issues that are found
> -
>
> Key: HIVE-14990
> URL: https://issues.apache.org/jira/browse/HIVE-14990
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-14990.01.patch, HIVE-14990.02.patch, 
> HIVE-14990.03.patch, HIVE-14990.04.patch, HIVE-14990.04.patch, 
> HIVE-14990.05.patch, HIVE-14990.05.patch, HIVE-14990.06.patch, 
> HIVE-14990.06.patch, HIVE-14990.07.patch, HIVE-14990.08.patch, 
> HIVE-14990.09.patch, HIVE-14990.10.patch, HIVE-14990.10.patch, 
> HIVE-14990.10.patch, HIVE-14990.12.patch, HIVE-14990.13.patch, 
> HIVE-14990.14.patch, HIVE-14990.patch
>
>
> Expected failures 
> 1) All HCat tests (cannot write MM tables via the HCat writer)
> 2) Almost all merge tests (alter .. concat is not supported).
> 3) Tests that run dfs commands with specific paths (path changes).
> 4) Truncate column (not supported).
> 5) Describe formatted will have the new table fields in the output (before 
> merging MM with ACID).
> 6) Many tests w/explain extended - diff in partition "base file name" (path 
> changes).
> 7) TestTxnCommands - all the conversion tests, as they check for bucket count 
> using file lists (path changes).
> 8) HBase metastore tests because methods are not implemented.
> 9) Some load and ExIm tests that export a table and then rely on specific 
> path for load (path changes).
> 10) Bucket map join/etc. - diffs; disabled the optimization for MM tables due 
> to how it accounts for buckets
> 11) rand - different results due to different sequence of processing.
> 12) many (not all i.e. not the ones with just one insert) tests that have 
> stats output, such as file count, for obvious reasons
> 13) materialized views, not handled by design - the test check erroneously 
> makes them "mm", no easy way to tell them apart, I don't want to plumb more 
> stuff thru just for this test
> I'm filing jiras for some test failures that are not obvious and need an 
> investigation later



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15964) LLAP: Llap IO codepath not getting invoked due to file column id mismatch

2017-02-23 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881701#comment-15881701
 ] 

Rajesh Balamohan commented on HIVE-15964:
-

Committed the addendum patch to master.

> LLAP: Llap IO codepath not getting invoked due to file column id mismatch
> -
>
> Key: HIVE-15964
> URL: https://issues.apache.org/jira/browse/HIVE-15964
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15964.1.patch, HIVE-15964.2.patch, 
> HIVE-15964.3.patch, HIVE-15964.4.patch, HIVE-15964.addendum.patch
>
>
> LLAP IO codepath is not getting invoked in certain cases when schema 
> evolution checks are done. Though "int --> long" (fileType to readerType) 
> conversions are allowed, the file type columns are not matched correctly when 
> such conversions need to happen. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-14901) HiveServer2: Use user supplied fetch size to determine #rows serialized in tasks

2017-02-23 Thread Norris Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norris Lee updated HIVE-14901:
--
Status: Patch Available  (was: In Progress)

> HiveServer2: Use user supplied fetch size to determine #rows serialized in 
> tasks
> 
>
> Key: HIVE-14901
> URL: https://issues.apache.org/jira/browse/HIVE-14901
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2, JDBC, ODBC
>Affects Versions: 2.1.0
>Reporter: Vaibhav Gumashta
>Assignee: Norris Lee
> Attachments: HIVE-14901.1.patch, HIVE-14901.2.patch, 
> HIVE-14901.3.patch, HIVE-14901.4.patch, HIVE-14901.5.patch, 
> HIVE-14901.6.patch, HIVE-14901.patch
>
>
> Currently, we use {{hive.server2.thrift.resultset.max.fetch.size}} to decide 
> the max number of rows that we write in tasks. However, we should ideally use 
> the user supplied value (which can be extracted from the 
> ThriftCLIService.FetchResults' request parameter) to decide how many rows to 
> serialize in a blob in the tasks. We should however use 
> {{hive.server2.thrift.resultset.max.fetch.size}} to have an upper bound on 
> it, so that we don't go OOM in tasks and HS2. 
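> The clamping logic amounts to something like this (a hypothetical sketch; 
> the names are not from the patch):
> {code:java}
> static int resolveFetchSize(long clientRequested, int serverDefault, int serverMax) {
>   if (clientRequested <= 0) {
>     return serverDefault;                        // client sent no explicit fetch size
>   }
>   return (int) Math.min(clientRequested, serverMax);  // never exceed the server cap
> }
> {code}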



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-14901) HiveServer2: Use user supplied fetch size to determine #rows serialized in tasks

2017-02-23 Thread Norris Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norris Lee updated HIVE-14901:
--
Status: In Progress  (was: Patch Available)

> HiveServer2: Use user supplied fetch size to determine #rows serialized in 
> tasks
> 
>
> Key: HIVE-14901
> URL: https://issues.apache.org/jira/browse/HIVE-14901
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2, JDBC, ODBC
>Affects Versions: 2.1.0
>Reporter: Vaibhav Gumashta
>Assignee: Norris Lee
> Attachments: HIVE-14901.1.patch, HIVE-14901.2.patch, 
> HIVE-14901.3.patch, HIVE-14901.4.patch, HIVE-14901.5.patch, 
> HIVE-14901.6.patch, HIVE-14901.patch
>
>
> Currently, we use {{hive.server2.thrift.resultset.max.fetch.size}} to decide 
> the max number of rows that we write in tasks. However, we should ideally use 
> the user supplied value (which can be extracted from the 
> ThriftCLIService.FetchResults' request parameter) to decide how many rows to 
> serialize in a blob in the tasks. We should however use 
> {{hive.server2.thrift.resultset.max.fetch.size}} to have an upper bound on 
> it, so that we don't go OOM in tasks and HS2. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-14901) HiveServer2: Use user supplied fetch size to determine #rows serialized in tasks

2017-02-23 Thread Norris Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norris Lee updated HIVE-14901:
--
Attachment: HIVE-14901.6.patch

> HiveServer2: Use user supplied fetch size to determine #rows serialized in 
> tasks
> 
>
> Key: HIVE-14901
> URL: https://issues.apache.org/jira/browse/HIVE-14901
> Project: Hive
>  Issue Type: Sub-task
>  Components: HiveServer2, JDBC, ODBC
>Affects Versions: 2.1.0
>Reporter: Vaibhav Gumashta
>Assignee: Norris Lee
> Attachments: HIVE-14901.1.patch, HIVE-14901.2.patch, 
> HIVE-14901.3.patch, HIVE-14901.4.patch, HIVE-14901.5.patch, 
> HIVE-14901.6.patch, HIVE-14901.patch
>
>
> Currently, we use {{hive.server2.thrift.resultset.max.fetch.size}} to decide 
> the max number of rows that we write in tasks. However, we should ideally use 
> the user supplied value (which can be extracted from the 
> ThriftCLIService.FetchResults' request parameter) to decide how many rows to 
> serialize in a blob in the tasks. We should however use 
> {{hive.server2.thrift.resultset.max.fetch.size}} to have an upper bound on 
> it, so that we don't go OOM in tasks and HS2. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HIVE-16017) MM tables - many queries duplicate the data after master merge

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin reassigned HIVE-16017:
---

Assignee: Sergey Shelukhin

> MM tables - many queries duplicate the data after master merge
> --
>
> Key: HIVE-16017
> URL: https://issues.apache.org/jira/browse/HIVE-16017
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Fix For: hive-14535
>
>
> Update: it looks like this happens on many more queries, and it started 
> after a recent master merge (I hadn't been working on the feature for a 
> while).
> This duplicates the data (given that the original query is a self-union, it 
> essentially outputs it 4 times instead of 2) for either MM or non-MM tables, 
> on the MM branch.
> It seems to be adding the correct inputs (esp. in the non-MM case the inputs 
> are the same as before). Presumably something in the output changes on the 
> branch is broken for this case. Not sure what yet.
> {noformat}
> CREATE TABLE tbl1_mm(key int, value string) CLUSTERED BY (key) SORTED BY 
> (key) INTO 2 BUCKETS;
> insert overwrite table tbl1_mm select * from src where key < 10;
> select key, value from tbl1_mm a where key < 6
> union all
> select key, value from tbl1_mm a where key < 6;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16017) MM tables - many queries duplicate the data after master merge

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-16017:

Fix Version/s: hive-14535

> MM tables - many queries duplicate the data after master merge
> --
>
> Key: HIVE-16017
> URL: https://issues.apache.org/jira/browse/HIVE-16017
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Fix For: hive-14535
>
>
> Update: it looks like this happens on many more queries, and it started 
> after a recent master merge (I hadn't been working on the feature for a 
> while).
> This duplicates the data (given that the original query is a self-union, it 
> essentially outputs it 4 times instead of 2) for either MM or non-MM tables, 
> on the MM branch.
> It seems to be adding the correct inputs (esp. in the non-MM case the inputs 
> are the same as before). Presumably something in the output changes on the 
> branch is broken for this case. Not sure what yet.
> {noformat}
> CREATE TABLE tbl1_mm(key int, value string) CLUSTERED BY (key) SORTED BY 
> (key) INTO 2 BUCKETS;
> insert overwrite table tbl1_mm select * from src where key < 10;
> select key, value from tbl1_mm a where key < 6
> union all
> select key, value from tbl1_mm a where key < 6;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (HIVE-16017) MM tables - many queries duplicate the data after master merge

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin resolved HIVE-16017.
-
Resolution: Fixed

Stupid merge issue in Utilities...

> MM tables - many queries duplicate the data after master merge
> --
>
> Key: HIVE-16017
> URL: https://issues.apache.org/jira/browse/HIVE-16017
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sergey Shelukhin
>
> Update: it looks like this happens on many more queries, and it started 
> after a recent master merge (I hadn't been working on the feature for a 
> while).
> This duplicates the data (given that the original query is a self-union, it 
> essentially outputs it 4 times instead of 2) for either MM or non-MM tables, 
> on the MM branch.
> It seems to be adding the correct inputs (esp. in the non-MM case the inputs 
> are the same as before). Presumably something in the output changes on the 
> branch is broken for this case. Not sure what yet.
> {noformat}
> CREATE TABLE tbl1_mm(key int, value string) CLUSTERED BY (key) SORTED BY 
> (key) INTO 2 BUCKETS;
> insert overwrite table tbl1_mm select * from src where key < 10;
> select key, value from tbl1_mm a where key < 6
> union all
> select key, value from tbl1_mm a where key < 6;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16029) COLLECT_SET and COLLECT_LIST does not return NULL in the result

2017-02-23 Thread Eric Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881671#comment-15881671
 ] 

Eric Lin commented on HIVE-16029:
-

Review request sent: https://reviews.apache.org/r/57009/

> COLLECT_SET and COLLECT_LIST does not return NULL in the result
> ---
>
> Key: HIVE-16029
> URL: https://issues.apache.org/jira/browse/HIVE-16029
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.1.1
>Reporter: Eric Lin
>Assignee: Eric Lin
>Priority: Minor
> Attachments: HIVE-16029.patch
>
>
> See the test case below:
> {code}
> 0: jdbc:hive2://localhost:1/default> select * from collect_set_test;
> +---------------------+
> | collect_set_test.a  |
> +---------------------+
> | 1                   |
> | 2                   |
> | NULL                |
> | 4                   |
> | NULL                |
> +---------------------+
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +-----------+
> |    _c0    |
> +-----------+
> | [1,2,4]   |
> +-----------+
> {code}
> The correct result should be:
> {code}
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +----------------+
> |      _c0       |
> +----------------+
> | [1,2,null,4]   |
> +----------------+
> {code}
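> The expected semantics can be illustrated with a plain-Java sketch (this is 
> not the actual GenericUDAFCollectSet code): the aggregation buffer has to 
> accept null instead of skipping it.
> {code:java}
> import java.util.ArrayList;
> import java.util.LinkedHashSet;
> import java.util.List;
> import java.util.Set;
> 
> static List<Object> collectSet(Iterable<Object> column) {
>   Set<Object> values = new LinkedHashSet<>();  // LinkedHashSet permits a null element
>   for (Object v : column) {
>     values.add(v);                             // no null check that drops the value
>   }
>   return new ArrayList<>(values);              // for the data above: [1, 2, null, 4]
> }
> {code}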



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15964) LLAP: Llap IO codepath not getting invoked due to file column id mismatch

2017-02-23 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HIVE-15964:

Attachment: HIVE-15964.addendum.patch

Yes [~prasanth_j]. Attaching the addendum patch for 
orc_ppd_schema_evol_3a.q.out. Will commit shortly.

> LLAP: Llap IO codepath not getting invoked due to file column id mismatch
> -
>
> Key: HIVE-15964
> URL: https://issues.apache.org/jira/browse/HIVE-15964
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15964.1.patch, HIVE-15964.2.patch, 
> HIVE-15964.3.patch, HIVE-15964.4.patch, HIVE-15964.addendum.patch
>
>
> LLAP IO codepath is not getting invoked in certain cases when schema 
> evolution checks are done. Though "int --> long" (fileType to readerType) 
> conversions are allowed, the file type columns are not matched correctly when 
> such conversions need to happen. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15935) ACL is not set in ATS data

2017-02-23 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated HIVE-15935:
--
Attachment: HIVE-15935.4.patch

Rebase after HIVE-15830.

> ACL is not set in ATS data
> --
>
> Key: HIVE-15935
> URL: https://issues.apache.org/jira/browse/HIVE-15935
> Project: Hive
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Attachments: HIVE-15935.1.patch, HIVE-15935.2.patch, 
> HIVE-15935.3.patch, HIVE-15935.4.patch
>
>
> When publishing ATS info, Hive does not set an ACL, which makes Hive ATS 
> entries visible to all users. On the other hand, Tez ATS entries use the Tez 
> DAG ACL, which limits both the view and modify ACLs to the end user only. We 
> shall make them consistent. In this JIRA, I am going to limit the ACL to the 
> end user for both Tez ATS and Hive ATS, and also provide the configs 
> "hive.view.acls" and "hive.modify.acls" in case users need to override them.
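> A sketch of how the restriction could look with the YARN timeline API (the 
> surrounding method and domain-id scheme are illustrative, not the actual 
> patch):
> {code:java}
> import org.apache.hadoop.yarn.api.records.timeline.TimelineDomain;
> import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
> import org.apache.hadoop.yarn.client.api.TimelineClient;
> 
> void publishWithAcl(TimelineClient client, String queryId, String user)
>     throws Exception {
>   // Publish a domain whose reader/writer ACLs are limited to the end user.
>   TimelineDomain domain = new TimelineDomain();
>   domain.setId("hive_" + queryId);
>   domain.setReaders(user);   // e.g. value of hive.view.acls
>   domain.setWriters(user);   // e.g. value of hive.modify.acls
>   client.putDomain(domain);
> 
>   // Entities published under this domain inherit its ACLs.
>   TimelineEntity entity = new TimelineEntity();
>   entity.setDomainId(domain.getId());
>   client.putEntities(entity);
> }
> {code}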



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-13864) Beeline ignores the command that follows a semicolon and comment

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881632#comment-15881632
 ] 

Hive QA commented on HIVE-13864:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854254/HIVE-13864.4.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10259 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=136)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=223)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3735/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3735/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3735/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854254 - PreCommit-HIVE-Build

> Beeline ignores the command that follows a semicolon and comment
> 
>
> Key: HIVE-13864
> URL: https://issues.apache.org/jira/browse/HIVE-13864
> Project: Hive
>  Issue Type: Bug
>Reporter: Muthu Manickam
>Assignee: Yongzhi Chen
> Attachments: HIVE-13864.01.patch, HIVE-13864.02.patch, 
> HIVE-13864.3.patch, HIVE-13864.4.patch
>
>
> Beeline ignores the next line/command that follows a command with semicolon 
> and comments.
> Example 1:
> select *
> from table1; -- comments
> select * from table2;
> In this case, only the first command is executed.. second command "select * 
> from table2" is not executed.
> --
> Example 2:
> select *
> from table1; -- comments
> select * from table2;
> select * from table3;
> In this case, first command and third command is executed. second command 
> "select * from table2" is not executed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16023) Wrong estimation for number of rows generated by IN expression

2017-02-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881601#comment-15881601
 ] 

Ashutosh Chauhan commented on HIVE-16023:
-

+1 
All the log messages should be at debug(). Also log estimated row count for IN.

> Wrong estimation for number of rows generated by IN expression
> --
>
> Key: HIVE-16023
> URL: https://issues.apache.org/jira/browse/HIVE-16023
> Project: Hive
>  Issue Type: Bug
>  Components: Statistics
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-16023.patch
>
>
> The code seems to be wrong: we are using the number of rows to create the 
> multiplying factor, instead of the NDV for the given column(s) and the 
> number of distinct values in the IN clause.
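> In other words, for a predicate like {{col IN (v1, ..., vk)}} the intended 
> estimate is along these lines (a sketch of the standard formula, not the 
> patch itself):
> {code:java}
> // selectivity(col IN (v1..vk)) ~ k / ndv(col), capped at 1
> static long estimateInRows(long numRows, long ndv, int numInValues) {
>   double factor = Math.min(1.0, (double) numInValues / Math.max(1L, ndv));
>   return (long) Math.ceil(numRows * factor);
> }
> {code}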



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881570#comment-15881570
 ] 

Ashutosh Chauhan commented on HIVE-16018:
-

+1

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, HIVE-16018.02.patch, qfile.q, 
> qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-16018:
---
Status: Patch Available  (was: Open)

Addressed [~ashutoshc]'s comments.

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, HIVE-16018.02.patch, qfile.q, 
> qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-16018:
---
Attachment: HIVE-16018.02.patch

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, HIVE-16018.02.patch, qfile.q, 
> qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-16018:
---
Status: Open  (was: Patch Available)

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, HIVE-16018.02.patch, qfile.q, 
> qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15944) The order of cols is error in ColumnPrunerReduceSinkProc because of sort operator

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881555#comment-15881555
 ] 

Hive QA commented on HIVE-15944:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854227/HIVE-15944.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 142 failed/errored test(s), 10258 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnstats_partlvl] 
(batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnstats_partlvl_dp] 
(batchId=47)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnstats_tbllvl] 
(batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[complex_alias] 
(batchId=16)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[constant_prop_3] 
(batchId=40)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[correlationoptimizer13] 
(batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[display_colstats_tbllvl] 
(batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[distinct_windowing_no_cbo]
 (batchId=60)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[druid_basic2] 
(batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynamic_rdd_cache] 
(batchId=50)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[except_all] (batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby9] (batchId=6)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_join_pushdown] 
(batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_position] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[having2] (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[index_auto_update] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[limit_pushdown_negative] 
(batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[multi_insert_gby3] 
(batchId=69)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[multigroupby_singlemr] 
(batchId=64)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization]
 (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_gby2] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_gby] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_windowing1] 
(batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ptfgroupbyjoin] 
(batchId=78)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[reduce_deduplicate_extended2]
 (batchId=55)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_in_having] 
(batchId=53)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notexists] 
(batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notexists_having]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notin_having] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_unqualcolumnrefs]
 (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[temp_table_display_colstats_tbllvl]
 (batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_decimal_aggregate]
 (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_decimal_round_2] 
(batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_groupby_3] 
(batchId=60)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_groupby_reduce] 
(batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_orderby_5] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_13] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_15] 
(batchId=59)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_limit] 
(batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types]
 (batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[windowing_gby2] 
(batchId=33)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[dynamic_partition_pruning_2]
 (batchId=136)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[explainuser_2] 
(batchId=137)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_stats] 
(batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=136)

[jira] [Updated] (HIVE-15958) LLAP: IPC connections are not being reused for umbilical protocol

2017-02-23 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-15958:
-
Attachment: HIVE-15958.4.patch

AMNodeInfo cleanup is done after sending kill for pending fragments. Also, 
heartbeat is done only when the task count >0.

[~sseth] can you please take a look?

> LLAP: IPC connections are not being reused for umbilical protocol
> -
>
> Key: HIVE-15958
> URL: https://issues.apache.org/jira/browse/HIVE-15958
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Rajesh Balamohan
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15958.1.patch, HIVE-15958.2.patch, 
> HIVE-15958.3.patch, HIVE-15958.4.patch
>
>
> During concurrency testing, observed 1000s of ipc thread creations. Ideally, 
> the connections to same hosts should be reused.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HIVE-16030) LLAP: All rolled over logs should be compressed

2017-02-23 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran reassigned HIVE-16030:



> LLAP: All rolled over logs should be compressed
> ---
>
> Key: HIVE-16030
> URL: https://issues.apache.org/jira/browse/HIVE-16030
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>
> When we roll over the logs we don't compress them. I have seen 256MB of 
> uncompressed logs go down to 20MB after compression. This can save disk 
> space significantly.
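> For reference, log4j2 compresses on rollover when the file pattern carries a 
> compression suffix; an illustrative properties snippet (the appender name 
> and paths are examples only):
> {noformat}
> appender.llap.type = RollingRandomAccessFile
> appender.llap.name = llap
> appender.llap.fileName = ${sys:llap.daemon.log.dir}/llap-daemon.log
> # the .gz suffix makes log4j2 gzip each rolled-over file
> appender.llap.filePattern = ${sys:llap.daemon.log.dir}/llap-daemon.log.%i.gz
> appender.llap.policies.type = Policies
> appender.llap.policies.size.type = SizeBasedTriggeringPolicy
> appender.llap.policies.size.size = 256MB
> appender.llap.strategy.type = DefaultRolloverStrategy
> appender.llap.strategy.max = 10
> {noformat}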



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881531#comment-15881531
 ] 

Ashutosh Chauhan commented on HIVE-16018:
-

Wrap these calls in parseContext().getContext().getExplainConfig() so these 
are called only for explain?

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, qfile.q, qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16029) COLLECT_SET and COLLECT_LIST does not return NULL in the result

2017-02-23 Thread Eric Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Lin updated HIVE-16029:

Attachment: HIVE-16029.patch

Adding first patch.

> COLLECT_SET and COLLECT_LIST does not return NULL in the result
> ---
>
> Key: HIVE-16029
> URL: https://issues.apache.org/jira/browse/HIVE-16029
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.1.1
>Reporter: Eric Lin
>Assignee: Eric Lin
>Priority: Minor
> Attachments: HIVE-16029.patch
>
>
> See the test case below:
> {code}
> 0: jdbc:hive2://localhost:1/default> select * from collect_set_test;
> +---------------------+
> | collect_set_test.a  |
> +---------------------+
> | 1                   |
> | 2                   |
> | NULL                |
> | 4                   |
> | NULL                |
> +---------------------+
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +-----------+
> |    _c0    |
> +-----------+
> | [1,2,4]   |
> +-----------+
> {code}
> The correct result should be:
> {code}
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +---+
> |  _c0  |
> +---+
> | [1,2,null,4]  |
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16029) COLLECT_SET and COLLECT_LIST does not return NULL in the result

2017-02-23 Thread Eric Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Lin updated HIVE-16029:

Status: Patch Available  (was: Open)

> COLLECT_SET and COLLECT_LIST does not return NULL in the result
> ---
>
> Key: HIVE-16029
> URL: https://issues.apache.org/jira/browse/HIVE-16029
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.1.1
>Reporter: Eric Lin
>Assignee: Eric Lin
>Priority: Minor
> Attachments: HIVE-16029.patch
>
>
> See the test case below:
> {code}
> 0: jdbc:hive2://localhost:1/default> select * from collect_set_test;
> +----------------------+
> | collect_set_test.a   |
> +----------------------+
> | 1                    |
> | 2                    |
> | NULL                 |
> | 4                    |
> | NULL                 |
> +----------------------+
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +------------+
> |    _c0     |
> +------------+
> | [1,2,4]    |
> +------------+
> {code}
> The correct result should be:
> {code}
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +-----------------+
> |       _c0       |
> +-----------------+
> | [1,2,null,4]    |
> +-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15964) LLAP: Llap IO codepath not getting invoked due to file column id mismatch

2017-02-23 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881528#comment-15881528
 ] 

Prasanth Jayachandran commented on HIVE-15964:
--

[~rajesh.balamohan] orc_ppd_schema_evol_3a.q failure looks related?

> LLAP: Llap IO codepath not getting invoked due to file column id mismatch
> -
>
> Key: HIVE-15964
> URL: https://issues.apache.org/jira/browse/HIVE-15964
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15964.1.patch, HIVE-15964.2.patch, 
> HIVE-15964.3.patch, HIVE-15964.4.patch
>
>
> LLAP IO codepath is not getting invoked in certain cases when schema 
> evolution checks are done. Though "int --> long" (fileType to readerType) 
> conversions are allowed, the file type columns are not matched correctly when 
> such conversions need to happen. 
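
To make the conversion rule concrete, a hedged sketch of the kind of promotion 
check involved (a hypothetical helper; the real logic lives in the ORC schema 
evolution code, where columns must first be matched by file column id):

{code}
// Hypothetical stand-ins for the relevant type categories.
enum Category { INT, LONG, FLOAT, DOUBLE }

class TypePromotion {
  // The reader type may be wider than the file type.
  static boolean canPromote(Category file, Category reader) {
    if (file == reader) return true;
    if (file == Category.INT && reader == Category.LONG) return true;     // int --> long
    if (file == Category.FLOAT && reader == Category.DOUBLE) return true; // float --> double
    return false;
  }
}
{code}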



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16005) miscellaneous small fixes to help with llap debuggability

2017-02-23 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-16005:
--
Attachment: HIVE-16005.03.patch

Updated to fix the unit tests.

> miscellaneous small fixes to help with llap debuggability
> -
>
> Key: HIVE-16005
> URL: https://issues.apache.org/jira/browse/HIVE-16005
> Project: Hive
>  Issue Type: Task
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: HIVE-16005.01.patch, HIVE-16005.02.patch, 
> HIVE-16005.03.patch
>
>
> - Include proc_ in cli, beeline, metastore, hs2 process args
> - LLAP history logger - log QueryId instead of dagName (dag name is free 
> flowing text)
> - LLAP JXM ExecutorStatus - Log QueryId instead of dagName. Sort by running / 
> queued
> - Include thread name in TaskRunnerCallable so that it shows up in stack 
> traces (will cause extra output in logs); see the sketch below
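
For the last item, a minimal sketch of the usual thread-naming pattern 
(hypothetical helper, not the actual patch):

{code}
// Hypothetical helper: name the thread for the duration of a task so it is
// visible in jstack output, then restore the original name.
class NamedTask {
  static Runnable withThreadName(String taskId, Runnable work) {
    return () -> {
      Thread t = Thread.currentThread();
      String original = t.getName();
      t.setName(original + " - " + taskId);
      try {
        work.run();
      } finally {
        t.setName(original);
      }
    };
  }
}
{code}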



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16024) MSCK Repair Requires nonstrict hive.mapred.mode

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881462#comment-15881462
 ] 

Hive QA commented on HIVE-16024:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854226/HIVE-16024.01.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 10241 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=140)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=223)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=130)
org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel 
(batchId=211)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3733/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3733/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3733/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854226 - PreCommit-HIVE-Build

> MSCK Repair Requires nonstrict hive.mapred.mode
> ---
>
> Key: HIVE-16024
> URL: https://issues.apache.org/jira/browse/HIVE-16024
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 2.2.0
>Reporter: Barna Zsombor Klara
>Assignee: Barna Zsombor Klara
> Attachments: HIVE-16024.01.patch
>
>
> MSCK repair fails when hive.mapred.mode is set to strict.
> HIVE-13788 modified the way we load the partitions for a table to improve 
> performance. Unfortunately it uses PartitionPruner to load the partitions, 
> which in turn checks hive.mapred.mode.
> The previous code did not check hive.mapred.mode.
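
Until that is fixed, a hedged workaround sketch (assumes HiveConf is on the 
classpath; the repair invocation itself is elided):

{code}
import org.apache.hadoop.hive.conf.HiveConf;

class MsckWorkaround {
  // Relax strict mode only around the repair call, then restore the old value.
  static void runWithNonStrictMode(HiveConf conf, Runnable msck) {
    String previous = conf.getVar(HiveConf.ConfVars.HIVEMAPREDMODE);
    conf.setVar(HiveConf.ConfVars.HIVEMAPREDMODE, "nonstrict");
    try {
      msck.run();
    } finally {
      if (previous == null) {
        conf.unset(HiveConf.ConfVars.HIVEMAPREDMODE.varname);
      } else {
        conf.setVar(HiveConf.ConfVars.HIVEMAPREDMODE, previous);
      }
    }
  }
}
{code}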



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15951) Make sure base persist directory is unique and deleted

2017-02-23 Thread slim bouguerra (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

slim bouguerra updated HIVE-15951:
--
Status: Patch Available  (was: In Progress)

> Make sure base persist directory is unique and deleted
> --
>
> Key: HIVE-15951
> URL: https://issues.apache.org/jira/browse/HIVE-15951
> Project: Hive
>  Issue Type: Bug
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: slim bouguerra
>Assignee: slim bouguerra
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: HIVE-15951.2.patch, HIVE-15951.patch
>
>
> In some cases the base persist directory will contain old data or be shared 
> between reducers in the same physical VM.
> That will lead to job failures until the directory is cleaned.
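
A minimal sketch of the uniqueness idea, using plain java.io (the actual patch 
works against the Druid storage handler's persist configuration):

{code}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.UUID;

class PersistDirs {
  // Create a fresh, unique base persist directory so concurrent reducers on
  // the same host never share (or inherit) state.
  static File createUniqueBaseDir(File parent) throws IOException {
    File base = new File(parent, "persist-" + UUID.randomUUID());
    Files.createDirectories(base.toPath());
    base.deleteOnExit(); // best-effort; the task should also delete it explicitly
    return base;
  }
}
{code}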



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16020) LLAP : Reduce IPC connection misses

2017-02-23 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HIVE-16020:

   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Thanks [~sershe]. Committed to master.

> LLAP : Reduce IPC connection misses
> ---
>
> Key: HIVE-16020
> URL: https://issues.apache.org/jira/browse/HIVE-16020
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 2.2.0
>
> Attachments: HIVE-16020.1.patch
>
>
> The {{RPC.getProxy}} call in {{TaskRunnerCallable}} does not pass a 
> SocketFactory. This causes a new SocketFactory to be created every time, 
> which defeats connection reuse. Also, UserGroupInformation can be 
> reused at the query level.
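
A hedged sketch of the reuse pattern (NetUtils.getDefaultSocketFactory is 
Hadoop API; the holder class itself is illustrative):

{code}
import javax.net.SocketFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.net.NetUtils;

class SocketFactoryHolder {
  private static volatile SocketFactory factory;

  // Hadoop's IPC client keys its connection cache partly on the SocketFactory,
  // so handing every proxy the same instance lets connections be reused.
  static SocketFactory get(Configuration conf) {
    if (factory == null) {
      synchronized (SocketFactoryHolder.class) {
        if (factory == null) {
          factory = NetUtils.getDefaultSocketFactory(conf);
        }
      }
    }
    return factory;
  }
}
{code}

The shared factory would then be passed to RPC.getProxy instead of letting it 
construct a new one per call.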



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15830) Allow additional ACLs for tez jobs

2017-02-23 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-15830:
--
Summary: Allow additional ACLs for tez jobs  (was: Allow additional view 
ACLs for tez jobs)

> Allow additional ACLs for tez jobs
> --
>
> Key: HIVE-15830
> URL: https://issues.apache.org/jira/browse/HIVE-15830
> Project: Hive
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Fix For: 2.2.0
>
> Attachments: HIVE-15830.01.patch, HIVE-15830.02.patch, 
> HIVE-15830.03.patch, HIVE-15830.05.patch, HIVE-15830.06.patch, 
> HIVE-15830.07.patch
>
>
> Allow users to grant view access to additional users when running tez jobs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15830) Allow additional ACLs for tez jobs

2017-02-23 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-15830:
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

> Allow additional ACLs for tez jobs
> --
>
> Key: HIVE-15830
> URL: https://issues.apache.org/jira/browse/HIVE-15830
> Project: Hive
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Fix For: 2.2.0
>
> Attachments: HIVE-15830.01.patch, HIVE-15830.02.patch, 
> HIVE-15830.03.patch, HIVE-15830.05.patch, HIVE-15830.06.patch, 
> HIVE-15830.07.patch
>
>
> Allow users to grant view access to additional users when running tez jobs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15830) Allow additional view ACLs for tez jobs

2017-02-23 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881413#comment-15881413
 ] 

Siddharth Seth commented on HIVE-15830:
---

Thanks for the review. Committing.

> Allow additional view ACLs for tez jobs
> ---
>
> Key: HIVE-15830
> URL: https://issues.apache.org/jira/browse/HIVE-15830
> Project: Hive
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: HIVE-15830.01.patch, HIVE-15830.02.patch, 
> HIVE-15830.03.patch, HIVE-15830.05.patch, HIVE-15830.06.patch, 
> HIVE-15830.07.patch
>
>
> Allow users to grant view access to additional users when running tez jobs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15964) LLAP: Llap IO codepath not getting invoked due to file column id mismatch

2017-02-23 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HIVE-15964:

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Thanks [~prasanth_j] and [~sershe]. Committed to master.

> LLAP: Llap IO codepath not getting invoked due to file column id mismatch
> -
>
> Key: HIVE-15964
> URL: https://issues.apache.org/jira/browse/HIVE-15964
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15964.1.patch, HIVE-15964.2.patch, 
> HIVE-15964.3.patch, HIVE-15964.4.patch
>
>
> LLAP IO codepath is not getting invoked in certain cases when schema 
> evolution checks are done. Though "int --> long" (fileType to readerType) 
> conversions are allowed, the file type columns are not matched correctly when 
> such conversions need to happen. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881403#comment-15881403
 ] 

Pengcheng Xiong commented on HIVE-16018:


[~ashutoshc], could you take a look? Thanks.

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, qfile.q, qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881402#comment-15881402
 ] 

Pengcheng Xiong commented on HIVE-16018:


Because the output file masks "DagId", we are not able to submit a patch with 
tests. Thus, I have attached the test and its output from before the masking, 
for the reviewers' reference.

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, qfile.q, qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16029) COLLECT_SET and COLLECT_LIST does not return NULL in the result

2017-02-23 Thread Eric Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Lin updated HIVE-16029:

Description: 
See the test case below:

{code}
0: jdbc:hive2://localhost:1/default> select * from collect_set_test;
+----------------------+
| collect_set_test.a   |
+----------------------+
| 1                    |
| 2                    |
| NULL                 |
| 4                    |
| NULL                 |
+----------------------+

0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
collect_set_test;
+------------+
|    _c0     |
+------------+
| [1,2,4]    |
+------------+

{code}

The correct result should be:

{code}
0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
collect_set_test;
+-----------------+
|       _c0       |
+-----------------+
| [1,2,null,4]    |
+-----------------+
{code}

  was:
See the test case below:

0: jdbc:hive2://localhost:1/default> select * from collect_set_test;
+----------------------+
| collect_set_test.a   |
+----------------------+
| 1                    |
| 2                    |
| NULL                 |
| 4                    |
| NULL                 |
+----------------------+

0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
collect_set_test;
+------------+
|    _c0     |
+------------+
| [1,2,4]    |
+------------+

The correct result should be:

0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
collect_set_test;
+-----------------+
|       _c0       |
+-----------------+
| [1,2,null,4]    |
+-----------------+


> COLLECT_SET and COLLECT_LIST does not return NULL in the result
> ---
>
> Key: HIVE-16029
> URL: https://issues.apache.org/jira/browse/HIVE-16029
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.1.1
>Reporter: Eric Lin
>Assignee: Eric Lin
>Priority: Minor
>
> See the test case below:
> {code}
> 0: jdbc:hive2://localhost:1/default> select * from collect_set_test;
> +----------------------+
> | collect_set_test.a   |
> +----------------------+
> | 1                    |
> | 2                    |
> | NULL                 |
> | 4                    |
> | NULL                 |
> +----------------------+
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +------------+
> |    _c0     |
> +------------+
> | [1,2,4]    |
> +------------+
> {code}
> The correct result should be:
> {code}
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +-----------------+
> |       _c0       |
> +-----------------+
> | [1,2,null,4]    |
> +-----------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HIVE-16029) COLLECT_SET and COLLECT_LIST does not return NULL in the result

2017-02-23 Thread Eric Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Lin reassigned HIVE-16029:
---


> COLLECT_SET and COLLECT_LIST does not return NULL in the result
> ---
>
> Key: HIVE-16029
> URL: https://issues.apache.org/jira/browse/HIVE-16029
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.1.1
>Reporter: Eric Lin
>Assignee: Eric Lin
>Priority: Minor
>
> See the test case below:
> 0: jdbc:hive2://localhost:1/default> select * from collect_set_test;
> +----------------------+
> | collect_set_test.a   |
> +----------------------+
> | 1                    |
> | 2                    |
> | NULL                 |
> | 4                    |
> | NULL                 |
> +----------------------+
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +------------+
> |    _c0     |
> +------------+
> | [1,2,4]    |
> +------------+
> The correct result should be:
> 0: jdbc:hive2://localhost:1/default> select collect_set(a) from 
> collect_set_test;
> +-----------------+
> |       _c0       |
> +-----------------+
> | [1,2,null,4]    |
> +-----------------+



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15882) HS2 generating high memory pressure with many partitions and concurrent queries

2017-02-23 Thread Misha Dmitriev (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881399#comment-15881399
 ] 

Misha Dmitriev commented on HIVE-15882:
---

I've just measured the CPU performance impact of my changes, using the same 
benchmark with the same high heap size (-Xmx3g) to exclude the effects of 
excessive GC. I measured the total time spent across all beeline clients. To 
do that, I ran the beeline clients with /usr/bin/time as follows:

{code}
for i in `seq 1 50`; do /usr/bin/time -p -o hive-timings-withchanges.txt 
--append beeline -u jdbc:hive2://localhost:1 -n admin -p admin -e "select 
count(i_f_1) from misha_table;" & done
{code}

I then calculated the sum of all timings in the file with another fun bash 
script:

{code}
sum=0; for s in `grep real hive-timings-withchanges.txt`; do t=${s/real/}; 
t=${t/\.*/}; echo $t; sum=$((sum+t)); done; echo $sum
{code}

The result is:
- before my changes: 17401s
- after my changes: 17012s

So, my changes have no negative CPU impact, and may even result in 1-2% CPU 
time improvement. This is not surprising given that my changes reduce the 
number of objects in memory, and thus ultimately reduce GC time.

Do I really need another JIRA ticket to post a patch that covers my other 
change (interning Properties objects in PartitionDesc)?

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> ---
>
> Key: HIVE-15882
> URL: https://issues.apache.org/jira/browse/HIVE-15882
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
> Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:1 -n admin -p 
> admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in a HS2 
> server with -Xmx200m and with 50 queries - in the one with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be highly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or have just 1 element. By modifying the code that creates and 
> populates these collections, we may likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. It 
> looks like these objects are highly duplicated, since for each Partition each 
> concurrently running query creates its own copy of Partition, PartitionDesc and 
> Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
> Properties in memory. By interning/deduplicating these objects we may be able 
> to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-16018:
---
Attachment: qfile.q.out
qfile.q

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, qfile.q, qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-16018:
---
Status: Patch Available  (was: Open)

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, qfile.q, qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16018) Add more information for DynamicPartitionPruningOptimization

2017-02-23 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-16018:
---
Attachment: HIVE-16018.01.patch

> Add more information for DynamicPartitionPruningOptimization
> 
>
> Key: HIVE-16018
> URL: https://issues.apache.org/jira/browse/HIVE-16018
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-16018.01.patch, qfile.q, qfile.q.out
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16028) Fail UPDATE/DELETE/MERGE queries when Ranger authorization manager is used

2017-02-23 Thread Wei Zheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Zheng updated HIVE-16028:
-
Attachment: HIVE-16028.2.patch

Patch 2 adds a test.

> Fail UPDATE/DELETE/MERGE queries when Ranger authorization manager is used
> --
>
> Key: HIVE-16028
> URL: https://issues.apache.org/jira/browse/HIVE-16028
> Project: Hive
>  Issue Type: Bug
>  Components: Authorization, Transactions
>Affects Versions: 2.2.0
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-16028.1.patch, HIVE-16028.2.patch
>
>
> This is a followup of HIVE-15891. In that jira, error-out logic was added, 
> but the assumption that we need to do row filtering/column masking for 
> entries in a non-empty list of tables returned by 
> applyRowFilterAndColumnMasking is wrong, because on the Ranger side, 
> RangerHiveAuthorizer#applyRowFilterAndColumnMasking will unconditionally 
> return a list of tables no matter whether row filtering/column masking is 
> applicable to them.
> The fix for Hive, for now, is to move the error-out logic to after we figure 
> out that there is no replacement text for the query. But ideally we should 
> consider modifying the Ranger logic to only return the tables that need to be masked.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16028) Fail UPDATE/DELETE/MERGE queries when Ranger authorization manager is used

2017-02-23 Thread Wei Zheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Zheng updated HIVE-16028:
-
Component/s: Transactions

> Fail UPDATE/DELETE/MERGE queries when Ranger authorization manager is used
> --
>
> Key: HIVE-16028
> URL: https://issues.apache.org/jira/browse/HIVE-16028
> Project: Hive
>  Issue Type: Bug
>  Components: Authorization, Transactions
>Affects Versions: 2.2.0
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-16028.1.patch
>
>
> This is a followup of HIVE-15891. In that jira, error-out logic was added, 
> but the assumption that we need to do row filtering/column masking for 
> entries in a non-empty list of tables returned by 
> applyRowFilterAndColumnMasking is wrong, because on the Ranger side, 
> RangerHiveAuthorizer#applyRowFilterAndColumnMasking will unconditionally 
> return a list of tables no matter whether row filtering/column masking is 
> applicable to them.
> The fix for Hive, for now, is to move the error-out logic to after we figure 
> out that there is no replacement text for the query. But ideally we should 
> consider modifying the Ranger logic to only return the tables that need to be masked.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15882) HS2 generating high memory pressure with many partitions and concurrent queries

2017-02-23 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881337#comment-15881337
 ] 

Vihang Karajgaonkar commented on HIVE-15882:


37% is a great improvement [~mi...@cloudera.com]. Have you already created 
another upstream JIRA for the other patch as well?

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> ---
>
> Key: HIVE-15882
> URL: https://issues.apache.org/jira/browse/HIVE-15882
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
> Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:1 -n admin -p 
> admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in a HS2 
> server with -Xmx200m and with 50 queries - in the one with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be highly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or have just 1 element. By modifying the code that creates and 
> populates these collections, we may likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. It 
> looks like these objects are highly duplicated, since for each Partition each 
> concurrently running query creates its own copy of Partition, PartitionDesc and 
> Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
> Properties in memory. By interning/deduplicating these objects we may be able 
> to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15944) The order of cols is error in ColumnPrunerReduceSinkProc because of sort operator

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881338#comment-15881338
 ] 

Hive QA commented on HIVE-15944:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854227/HIVE-15944.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 142 failed/errored test(s), 10252 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnstats_partlvl] 
(batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnstats_partlvl_dp] 
(batchId=47)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnstats_tbllvl] 
(batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[complex_alias] 
(batchId=16)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[constant_prop_3] 
(batchId=40)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[correlationoptimizer13] 
(batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[display_colstats_tbllvl] 
(batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[distinct_windowing_no_cbo]
 (batchId=60)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[druid_basic2] 
(batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynamic_rdd_cache] 
(batchId=50)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[except_all] (batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby9] (batchId=6)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_join_pushdown] 
(batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_position] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[having2] (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[index_auto_update] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[limit_pushdown_negative] 
(batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[multi_insert_gby3] 
(batchId=69)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[multigroupby_singlemr] 
(batchId=64)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_non_dictionary_encoding_vectorization]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_types_vectorization]
 (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_gby2] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_gby] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_windowing1] 
(batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ptfgroupbyjoin] 
(batchId=78)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[reduce_deduplicate_extended2]
 (batchId=55)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_in_having] 
(batchId=53)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notexists] 
(batchId=82)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notexists_having]
 (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notin_having] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_unqualcolumnrefs]
 (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[temp_table_display_colstats_tbllvl]
 (batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_decimal_aggregate]
 (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_decimal_round_2] 
(batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_groupby_3] 
(batchId=60)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_groupby_reduce] 
(batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_orderby_5] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_13] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_15] 
(batchId=59)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorization_limit] 
(batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_parquet_types]
 (batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[windowing_gby2] 
(batchId=33)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[dynamic_partition_pruning_2]
 (batchId=136)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[explainuser_2] 
(batchId=137)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_stats] 
(batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_groupby]
 (batchId=154)

[jira] [Commented] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size

2017-02-23 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881319#comment-15881319
 ] 

Vihang Karajgaonkar commented on HIVE-16014:


I discussed with Sahil offline. It may be possible to avoid one getStatus call 
by passing a flag indicating that the directory has already been created by the 
DDLTask. We can take that up in another JIRA since it is unrelated to this one.

> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of 
> hive.mv.files.thread for pool size
> --
>
> Key: HIVE-16014
> URL: https://issues.apache.org/jira/browse/HIVE-16014
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-16014.01.patch, HIVE-16014.02.patch
>
>
> HiveMetastoreChecker uses the hive.mv.files.thread configuration value for 
> determining the pool size, as below:
> {noformat}
> private void checkPartitionDirs(Path basePath, Set<Path> allDirs, int 
> maxDepth) throws IOException, HiveException {
>   ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
>   basePaths.add(basePath);
>   Set<Path> dirSet = Collections.newSetFromMap(new ConcurrentHashMap<Path, 
> Boolean>());
>   // Here we just reuse the THREAD_COUNT configuration for
>   // HIVE_MOVE_FILES_THREAD_COUNT
>   int poolSize = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 
> 15);
>   // Check if too low a config is provided for move files. 2x CPU is a 
>   // reasonable max count.
>   poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
>       Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add the missing partitions for a table from the 
> filesystem. In such a case, different pool sizes for HMSHandler and 
> HiveMetastoreChecker can affect performance. E.g. if 
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
> {{hive.mv.files.thread}} is much higher like 100, or vice versa, the smaller 
> pool will become the bottleneck. It would be good to use 
> {{hive.metastore.fshandler.threads}} to size the pool for 
> HiveMetastoreChecker, since the number of missing partitions and the number of 
> partitions to be added will most likely be the same. In that case the 
> performance of the query will be optimum when both pool sizes are the same.
> Since it is possible to tune both configs individually, it is very 
> likely that they will differ. But since there is a strong correlation 
> between the amount of work done by HiveMetastoreChecker and the 
> HiveMetastore.add_partitions call, it might be a good idea to use 
> {{hive.metastore.fshandler.threads}} for the pool size instead of 
> {{hive.mv.files.thread}}.
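
For reference, a hedged sketch of the proposed direction 
(METASTORE_FS_HANDLER_THREADS_COUNT is the constant referenced elsewhere in 
this thread; the wrapper itself is illustrative):

{code}
import org.apache.hadoop.hive.conf.HiveConf;

class PoolSizing {
  // Size the checker pool from the same knob as the metastore's FS handlers
  // so neither side becomes the bottleneck during msck repair.
  static int checkerPoolSize(HiveConf conf) {
    return conf.getIntVar(HiveConf.ConfVars.METASTORE_FS_HANDLER_THREADS_COUNT);
  }
}
{code}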



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-14731) Use Tez cartesian product edge in Hive (unpartitioned case only)

2017-02-23 Thread Zhiyuan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiyuan Yang updated HIVE-14731:

Attachment: HIVE-14731.14.patch

Patch rebased. Now we're detecting both SIMPLE_EDGE and CUSTOM_SIMPLE_EDGE in 
CrossProductHandler. [~hagleitn]
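
For reference, a minimal sketch of the detection described above (the EdgeType 
values exist in Hive's TezEdgeProperty; the helper is hypothetical):

{code}
// Stand-in for the relevant TezEdgeProperty.EdgeType values.
enum EdgeType { SIMPLE_EDGE, CUSTOM_SIMPLE_EDGE, BROADCAST_EDGE }

class CrossProductCheck {
  // Both plain and custom shuffle edges can feed a cross product.
  static boolean isShuffleEdge(EdgeType type) {
    return type == EdgeType.SIMPLE_EDGE || type == EdgeType.CUSTOM_SIMPLE_EDGE;
  }
}
{code}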

> Use Tez cartesian product edge in Hive (unpartitioned case only)
> 
>
> Key: HIVE-14731
> URL: https://issues.apache.org/jira/browse/HIVE-14731
> Project: Hive
>  Issue Type: Bug
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: HIVE-14731.10.patch, HIVE-14731.11.patch, 
> HIVE-14731.12.patch, HIVE-14731.13.patch, HIVE-14731.14.patch, 
> HIVE-14731.1.patch, HIVE-14731.2.patch, HIVE-14731.3.patch, 
> HIVE-14731.4.patch, HIVE-14731.5.patch, HIVE-14731.6.patch, 
> HIVE-14731.7.patch, HIVE-14731.8.patch, HIVE-14731.9.patch
>
>
> Given cartesian product edge is available in Tez now (see TEZ-3230), let's 
> integrate it into Hive on Tez. This allows us to have more than one reducer 
> in cross product queries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15830) Allow additional view ACLs for tez jobs

2017-02-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881272#comment-15881272
 ] 

Daniel Dai commented on HIVE-15830:
---

+1

> Allow additional view ACLs for tez jobs
> ---
>
> Key: HIVE-15830
> URL: https://issues.apache.org/jira/browse/HIVE-15830
> Project: Hive
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: HIVE-15830.01.patch, HIVE-15830.02.patch, 
> HIVE-15830.03.patch, HIVE-15830.05.patch, HIVE-15830.06.patch, 
> HIVE-15830.07.patch
>
>
> Allow users to grant view access to additional users when running tez jobs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15882) HS2 generating high memory pressure with many partitions and concurrent queries

2017-02-23 Thread Misha Dmitriev (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881244#comment-15881244
 ] 

Misha Dmitriev commented on HIVE-15882:
---

I've measured how much memory is saved with my change. It turned out to be 
more difficult/time-consuming than expected to obtain the "threshold" number of 
concurrent requests that my benchmark can sustain with the same small heap 
(500M), so I switched to a different approach. I set the heap size to a high 
number (3G), sufficient for my benchmark to pass without any GC issues with or 
without my changes. Then I ran it first without, then with my changes, 
measuring the live set of the heap every 4 seconds, i.e. the size of live 
objects immediately after a full GC. I did this using the following script:

{code}
PID=$1
while [ true ] ; do
  # Force full GC 
  sudo -u hive jmap -histo:live $PID > /dev/null
  # Get the total amount of memory used
  sudo -u hive jstat -gc $PID | tail -n 1 | awk '{split($0,a," "); 
sum=a[3]+a[4]+a[6]+a[8]; print sum}'
  sleep 4
done
{code}

Then I checked the highest number printed by this script, i.e. the biggest live 
heap size when running my benchmark. I ended up with:

1173M - without my changes
743M - with my changes

That means that my changes (String interning plus interning Properties objects 
in PartitionDesc, which will be posted in a separate patch) collectively save 
37% of memory.
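
For illustration, a sketch of the deduplication idea using Guava interners (an 
assumption; the actual patch may intern differently):

{code}
import java.util.Properties;
import com.google.common.collect.Interner;
import com.google.common.collect.Interners;

class PropsInterner {
  // Identical Properties objects across PartitionDesc instances collapse to
  // one canonical, weakly-held copy.
  private static final Interner<Properties> INTERNER = Interners.newWeakInterner();

  static Properties intern(Properties props) {
    return props == null ? null : INTERNER.intern(props);
  }
}
{code}

A weak interner keeps the canonical copies collectable once no PartitionDesc 
references them anymore.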

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> ---
>
> Key: HIVE-15882
> URL: https://issues.apache.org/jira/browse/HIVE-15882
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
> Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:1 -n admin -p 
> admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in a HS2 
> server with -Xmx200m and with 50 queries - in the one with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be highly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or have just 1 element. By modifying the code that creates and 
> populates these collections, we may likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. It 
> looks like these objects are highly duplicated, since for each Partition each 
> concurrently running query creates its own copy of Partition, PartitionDesc and 
> Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
> Properties in memory. By interning/deduplicating these objects we may be able 
> to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16007) When the query does not compile the LogRunnable never stops

2017-02-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881235#comment-15881235
 ] 

Hive QA commented on HIVE-16007:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854210/HIVE-16007.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10251 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=235)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=140)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=223)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=223)
org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver.org.apache.hadoop.hive.cli.TestSparkNegativeCliDriver
 (batchId=230)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3731/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3731/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3731/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854210 - PreCommit-HIVE-Build

> When the query does not compile the LogRunnable never stops
> ---
>
> Key: HIVE-16007
> URL: https://issues.apache.org/jira/browse/HIVE-16007
> Project: Hive
>  Issue Type: Bug
>  Components: Beeline
>Affects Versions: 2.2.0
>Reporter: Peter Vary
>Assignee: Peter Vary
> Attachments: HIVE-16007.2.patch, HIVE-16007.patch
>
>
> When issuing a SQL command that does not compile, the LogRunnable thread 
> is never closed.
> The issue can easily be detected by running beeline with showWarnings=true.
> {code}
> $ ./beeline -u "jdbc:hive2://localhost:1 pvary pvary" --showWarnings=true
> [..]
> Connecting to jdbc:hive2://localhost:1
> Connected to: Apache Hive (version 2.2.0-SNAPSHOT)
> Driver: Hive JDBC (version 2.2.0-SNAPSHOT)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 2.2.0-SNAPSHOT by Apache Hive
> 0: jdbc:hive2://localhost:1> selekt;
> Warning: java.sql.SQLException: Method getQueryLog() failed. Because the 
> stmtHandle in HiveStatement is null and the statement execution might fail. 
> (state=,code=0)
> [..]
> Warning: java.sql.SQLException: Can't getQueryLog after statement has been 
> closed (state=,code=0)
> [..]
> {code}
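
A minimal sketch of the fix idea (the names mirror the description, but the 
surrounding code is hypothetical): stop the log-polling thread in a finally 
block so a compilation failure cannot leak it.

{code}
static void executeWithLogPolling(java.sql.Statement statement, String sql,
    Runnable logRunnable) throws java.sql.SQLException, InterruptedException {
  Thread logThread = new Thread(logRunnable, "query-log-poller");
  logThread.start();
  try {
    statement.execute(sql);      // may throw when the statement does not compile
  } finally {
    logThread.interrupt();       // ask the poller to stop even on failure
    logThread.join(10_000L);     // bounded wait for it to exit
  }
}
{code}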



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15882) HS2 generating high memory pressure with many partitions and concurrent queries

2017-02-23 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881203#comment-15881203
 ] 

Sahil Takiar commented on HIVE-15882:
-

[~mi...@cloudera.com] you mentioned this earlier, but it would be good to see 
how many more concurrent requests can be supported before vs. after these 
changes, for a fixed size heap.

> HS2 generating high memory pressure with many partitions and concurrent 
> queries
> ---
>
> Key: HIVE-15882
> URL: https://issues.apache.org/jira/browse/HIVE-15882
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
> Attachments: HIVE-15882.01.patch, hs2-crash-2000p-500m-50q.txt
>
>
> I've created a Hive table with 2000 partitions, each backed by two files, 
> with one row in each file. When I execute some number of concurrent queries 
> against this table, e.g. as follows
> {code}
> for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:1 -n admin -p 
> admin -e "select count(i_f_1) from misha_table;" & done
> {code}
> it results in a big memory spike. With 20 queries I caused an OOM in a HS2 
> server with -Xmx200m and with 50 queries - in the one with -Xmx500m.
> I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
> that was generated in the 50queries/500m heap scenario. It suggests that 
> there are several opportunities to reduce memory pressure with not very 
> invasive changes to the code:
> 1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
> String.intern() calls added in the ~10 relevant places in the code, this 
> overhead can be highly reduced.
> 2. Almost 20% of memory is wasted due to various suboptimally used 
> collections (see section 8). There are many maps and lists that are either 
> empty or have just 1 element. By modifying the code that creates and 
> populates these collections, we may likely save 5-10% of memory.
> 3. Almost 20% of memory is used by instances of java.util.Properties. It 
> looks like these objects are highly duplicated, since for each Partition each 
> concurrently running query creates its own copy of Partition, PartitionDesc and 
> Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
> Properties in memory. By interning/deduplicating these objects we may be able 
> to save perhaps 15% of memory.
> So overall, I think there is a good chance to reduce HS2 memory consumption 
> in this scenario by ~40%.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size

2017-02-23 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881183#comment-15881183
 ] 

Vihang Karajgaonkar commented on HIVE-16014:


[~spena] Do you think we should rename the config to something more generic 
now? HiveMetastoreChecker lives in HS2. The config name sounds like it is 
specific to metastore.

> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of 
> hive.mv.files.thread for pool size
> --
>
> Key: HIVE-16014
> URL: https://issues.apache.org/jira/browse/HIVE-16014
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-16014.01.patch, HIVE-16014.02.patch
>
>
> HiveMetastoreChecker uses the hive.mv.files.thread configuration value for 
> determining the pool size, as below:
> {noformat}
> private void checkPartitionDirs(Path basePath, Set<Path> allDirs, int 
> maxDepth) throws IOException, HiveException {
>   ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
>   basePaths.add(basePath);
>   Set<Path> dirSet = Collections.newSetFromMap(new ConcurrentHashMap<Path, 
> Boolean>());
>   // Here we just reuse the THREAD_COUNT configuration for
>   // HIVE_MOVE_FILES_THREAD_COUNT
>   int poolSize = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 
> 15);
>   // Check if too low a config is provided for move files. 2x CPU is a 
>   // reasonable max count.
>   poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
>       Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add the missing partitions for a table from the 
> filesystem. In such a case, different pool sizes for HMSHandler and 
> HiveMetastoreChecker can affect performance. E.g. if 
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
> {{hive.mv.files.thread}} is much higher like 100, or vice versa, the smaller 
> pool will become the bottleneck. It would be good to use 
> {{hive.metastore.fshandler.threads}} to size the pool for 
> HiveMetastoreChecker, since the number of missing partitions and the number of 
> partitions to be added will most likely be the same. In that case the 
> performance of the query will be optimum when both pool sizes are the same.
> Since it is possible to tune both configs individually, it is very 
> likely that they will differ. But since there is a strong correlation 
> between the amount of work done by HiveMetastoreChecker and the 
> HiveMetastore.add_partitions call, it might be a good idea to use 
> {{hive.metastore.fshandler.threads}} for the pool size instead of 
> {{hive.mv.files.thread}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size

2017-02-23 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881182#comment-15881182
 ] 

Sahil Takiar commented on HIVE-16014:
-

[~vihangk1] - the config METASTORE_FS_HANDLER_THREADS_COUNT controls the size of 
a threadpool in the metastore; that threadpool calls two main methods: 
{{createLocationForAddedPartition}} and {{initializeAddedPartition}}.

The method {{createLocationForAddedPartition}} checks if the input directory 
exists, which seems unnecessary since when running {{msck}} the partition 
folders should always exist.

The method {{initializeAddedPartition}} calls 
{{MetaStoreUtils.updatePartitionStatsFast}} which may or may not list the 
filestatus of the partition directory to collect basic statistics.

I believe the whole threadpool logic in the {{HiveMetaStore}} was added to 
address the overhead of the above two methods when running against S3.

> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of 
> hive.mv.files.thread for pool size
> --
>
> Key: HIVE-16014
> URL: https://issues.apache.org/jira/browse/HIVE-16014
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-16014.01.patch, HIVE-16014.02.patch
>
>
> HiveMetastoreChecker uses the hive.mv.files.thread configuration value for 
> determining the pool size, as below:
> {noformat}
> private void checkPartitionDirs(Path basePath, Set<Path> allDirs, int 
> maxDepth) throws IOException, HiveException {
>   ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
>   basePaths.add(basePath);
>   Set<Path> dirSet = Collections.newSetFromMap(new ConcurrentHashMap<Path, 
> Boolean>());
>   // Here we just reuse the THREAD_COUNT configuration for
>   // HIVE_MOVE_FILES_THREAD_COUNT
>   int poolSize = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 
> 15);
>   // Check if too low a config is provided for move files. 2x CPU is a 
>   // reasonable max count.
>   poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
>       Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add the missing partitions for a table from the 
> filesystem. In such a case, different pool sizes for HMSHandler and 
> HiveMetastoreChecker can affect performance. E.g. if 
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
> {{hive.mv.files.thread}} is much higher like 100, or vice versa, the smaller 
> pool will become the bottleneck. It would be good to use 
> {{hive.metastore.fshandler.threads}} to size the pool for 
> HiveMetastoreChecker, since the number of missing partitions and the number of 
> partitions to be added will most likely be the same. In that case the 
> performance of the query will be optimum when both pool sizes are the same.
> Since it is possible to tune both configs individually, it is very 
> likely that they will differ. But since there is a strong correlation 
> between the amount of work done by HiveMetastoreChecker and the 
> HiveMetastore.add_partitions call, it might be a good idea to use 
> {{hive.metastore.fshandler.threads}} for the pool size instead of 
> {{hive.mv.files.thread}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15958) LLAP: IPC connections are not being reused for umbilical protocol

2017-02-23 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881180#comment-15881180
 ] 

Prasanth Jayachandran commented on HIVE-15958:
--

Just noticed another race between cleanup and kill task. Will fix it shortly.

> LLAP: IPC connections are not being reused for umbilical protocol
> -
>
> Key: HIVE-15958
> URL: https://issues.apache.org/jira/browse/HIVE-15958
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Rajesh Balamohan
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15958.1.patch, HIVE-15958.2.patch, 
> HIVE-15958.3.patch
>
>
> During concurrency testing, observed 1000s of ipc thread creations. Ideally, 
> the connections to same hosts should be reused.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size

2017-02-23 Thread Vihang Karajgaonkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vihang Karajgaonkar updated HIVE-16014:
---
Attachment: HIVE-16014.02.patch

> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of 
> hive.mv.files.thread for pool size
> --
>
> Key: HIVE-16014
> URL: https://issues.apache.org/jira/browse/HIVE-16014
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-16014.01.patch, HIVE-16014.02.patch
>
>
> HiveMetastoreChecker uses the hive.mv.files.thread configuration value to 
> determine the pool size, as below:
> {noformat}
> private void checkPartitionDirs(Path basePath, Set<Path> allDirs,
>     int maxDepth) throws IOException, HiveException {
>   ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
>   basePaths.add(basePath);
>   Set<Path> dirSet =
>       Collections.newSetFromMap(new ConcurrentHashMap<Path, Boolean>());
>   // Here we just reuse the THREAD_COUNT configuration for
>   // HIVE_MOVE_FILES_THREAD_COUNT
>   int poolSize =
>       conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 15);
>   // Check if too low config is provided for move files. 2x CPU is a
>   // reasonable max count.
>   poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
>       Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add missing partitions for a table from the 
> filesystem. In such a case, different pool sizes for HMSHandler and 
> HiveMetastoreChecker can affect performance. E.g., if 
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
> {{hive.mv.files.thread}} is much higher like 100, or vice versa, the 
> smaller pool becomes the bottleneck. It would be good to use 
> {{hive.metastore.fshandler.threads}} to size the pool for 
> HiveMetastoreChecker, since the number of missing partitions and the number 
> of partitions to be added will most likely be the same; in that case query 
> performance is optimal when both pool sizes match.
> Since the two configs can be tuned individually, they may well end up 
> different. But given the strong correlation between the amount of work done 
> by HiveMetastoreChecker and the HiveMetastore.add_partitions call, it makes 
> sense to use {{hive.metastore.fshandler.threads}} for the pool size instead 
> of {{hive.mv.files.thread}}.
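
A hedged sketch of the proposed swap inside checkPartitionDirs (the actual 
patch may differ in details); {{getIntVar}} is HiveConf's typed accessor and 
METASTORE_FS_HANDLER_THREADS_COUNT is the constant backing 
{{hive.metastore.fshandler.threads}}:
{code:java}
// Sketch (details may differ from the actual patch): drive the checker
// pool from the same knob as the metastore FS handlers, so msck's
// directory scan and HiveMetaStore.add_partitions stay in step.
int poolSize =
    conf.getIntVar(HiveConf.ConfVars.METASTORE_FS_HANDLER_THREADS_COUNT);
// Keep the existing semantics: 0 disables the pool; anything else is
// raised to at least 2x the CPU count.
poolSize = poolSize == 0 ? poolSize
    : Math.max(poolSize, Runtime.getRuntime().availableProcessors() * 2);
{code}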



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size

2017-02-23 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881178#comment-15881178
 ] 

Vihang Karajgaonkar commented on HIVE-16014:


Attached the second version of the patch, which adds more relevant comments.

> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of 
> hive.mv.files.thread for pool size
> --
>
> Key: HIVE-16014
> URL: https://issues.apache.org/jira/browse/HIVE-16014
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-16014.01.patch, HIVE-16014.02.patch
>
>
> HiveMetastoreChecker uses the hive.mv.files.thread configuration value to 
> determine the pool size, as below:
> {noformat}
> private void checkPartitionDirs(Path basePath, Set<Path> allDirs,
>     int maxDepth) throws IOException, HiveException {
>   ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
>   basePaths.add(basePath);
>   Set<Path> dirSet =
>       Collections.newSetFromMap(new ConcurrentHashMap<Path, Boolean>());
>   // Here we just reuse the THREAD_COUNT configuration for
>   // HIVE_MOVE_FILES_THREAD_COUNT
>   int poolSize =
>       conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 15);
>   // Check if too low config is provided for move files. 2x CPU is a
>   // reasonable max count.
>   poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
>       Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add missing partitions for a table from the 
> filesystem. In such a case, different pool sizes for HMSHandler and 
> HiveMetastoreChecker can affect performance. E.g., if 
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
> {{hive.mv.files.thread}} is much higher like 100, or vice versa, the 
> smaller pool becomes the bottleneck. It would be good to use 
> {{hive.metastore.fshandler.threads}} to size the pool for 
> HiveMetastoreChecker, since the number of missing partitions and the number 
> of partitions to be added will most likely be the same; in that case query 
> performance is optimal when both pool sizes match.
> Since the two configs can be tuned individually, they may well end up 
> different. But given the strong correlation between the amount of work done 
> by HiveMetastoreChecker and the HiveMetastore.add_partitions call, it makes 
> sense to use {{hive.metastore.fshandler.threads}} for the pool size instead 
> of {{hive.mv.files.thread}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15879) Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-23 Thread Vihang Karajgaonkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vihang Karajgaonkar updated HIVE-15879:
---
Status: Patch Available  (was: Open)

> Fix HiveMetaStoreChecker.checkPartitionDirs method
> --
>
> Key: HIVE-15879
> URL: https://issues.apache.org/jira/browse/HIVE-15879
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-15879.01.patch
>
>
> HIVE-15803 fixes the msck hang issue in the 
> HiveMetaStoreChecker.checkPartitionDirs method by adding a check for spare 
> threads in the thread pool; if there are none, it falls back to 
> single-threaded listing of the files.
> {noformat}
> if (pool != null) {
>   synchronized (pool) {
>     // In case of recursive calls, it is possible to deadlock with the
>     // thread pool. Check its usage here.
>     if (pool.getActiveCount() < pool.getMaximumPoolSize()) {
>       useThreadPool = true;
>     }
>     if (!useThreadPool) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Not using threadPool as active count:" +
>             pool.getActiveCount() + ", max:" + pool.getMaximumPoolSize());
>       }
>     }
>   }
> }
> {noformat}
> However, per the javadoc of getActiveCount():
> bq. Returns the approximate number of threads that are actively executing 
> tasks.
> it returns only an approximate count, with no guarantee of the exact number 
> of active threads. This still exposes the method to the msck hang bug in 
> rare corner cases.
> We could either:
> 1. Use an atomic counter to track exactly how many threads are actively 
> running, or
> 2. Rework the method to be much simpler, e.g. by changing the recursive 
> implementation to an iterative one where worker threads pick tasks from a 
> queue until the queue is empty (see the sketch below).
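
For illustration, a self-contained sketch of option 2 under stated 
assumptions: it walks java.nio paths rather than Hadoop's FileSystem, and 
every name is invented for the example, not taken from any patch. The key 
property is the per-level barrier: only the coordinating thread waits on 
futures, so pool workers never block on work submitted by other workers, 
which is what made the recursive version deadlock-prone.
{code:java}
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Illustrative only: iterative, level-by-level directory walk.
class IterativeDirScanner {
  static Set<Path> scan(Path base, int maxDepth, ExecutorService pool)
      throws InterruptedException, ExecutionException {
    Set<Path> allDirs = ConcurrentHashMap.newKeySet();
    List<Path> current = Collections.singletonList(base);
    for (int depth = 0; depth < maxDepth && !current.isEmpty(); depth++) {
      // Submit one listing task per directory at this depth.
      List<Future<List<Path>>> futures = new ArrayList<>();
      for (Path dir : current) {
        futures.add(pool.submit(() -> {
          List<Path> children = new ArrayList<>();
          try (DirectoryStream<Path> ds =
              Files.newDirectoryStream(dir, Files::isDirectory)) {
            for (Path child : ds) {
              children.add(child);
            }
          }
          return children;
        }));
      }
      // Barrier: finish this level before descending. Workers never wait
      // on other workers' results, so the pool cannot deadlock on itself.
      List<Path> next = new ArrayList<>();
      for (Future<List<Path>> f : futures) {
        next.addAll(f.get());
      }
      allDirs.addAll(next);
      current = next;
    }
    return allDirs;
  }
}
{code}
With this shape the pool size is purely a throughput knob and can no longer 
cause a hang, regardless of how the pool is configured.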



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15958) LLAP: IPC connections are not being reused for umbilical protocol

2017-02-23 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-15958:
-
Attachment: HIVE-15958.3.patch

Addressed review comments:
- Do not stop the umbilical: some taskKilled events are sent at the end of 
query completion and still need it
- Avoid iterating over known app masters

> LLAP: IPC connections are not being reused for umbilical protocol
> -
>
> Key: HIVE-15958
> URL: https://issues.apache.org/jira/browse/HIVE-15958
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Rajesh Balamohan
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15958.1.patch, HIVE-15958.2.patch, 
> HIVE-15958.3.patch
>
>
> During concurrency testing, we observed thousands of IPC thread creations. 
> Ideally, connections to the same host should be reused.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15879) Fix HiveMetaStoreChecker.checkPartitionDirs method

2017-02-23 Thread Vihang Karajgaonkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vihang Karajgaonkar updated HIVE-15879:
---
Attachment: HIVE-15879.01.patch

Created a review board link as well.

> Fix HiveMetaStoreChecker.checkPartitionDirs method
> --
>
> Key: HIVE-15879
> URL: https://issues.apache.org/jira/browse/HIVE-15879
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-15879.01.patch
>
>
> HIVE-15803 fixes the msck hang issue in the 
> HiveMetaStoreChecker.checkPartitionDirs method by adding a check for spare 
> threads in the thread pool; if there are none, it falls back to 
> single-threaded listing of the files.
> {noformat}
> if (pool != null) {
>   synchronized (pool) {
>     // In case of recursive calls, it is possible to deadlock with the
>     // thread pool. Check its usage here.
>     if (pool.getActiveCount() < pool.getMaximumPoolSize()) {
>       useThreadPool = true;
>     }
>     if (!useThreadPool) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Not using threadPool as active count:" +
>             pool.getActiveCount() + ", max:" + pool.getMaximumPoolSize());
>       }
>     }
>   }
> }
> {noformat}
> However, per the javadoc of getActiveCount():
> bq. Returns the approximate number of threads that are actively executing 
> tasks.
> it returns only an approximate count, with no guarantee of the exact number 
> of active threads. This still exposes the method to the msck hang bug in 
> rare corner cases.
> We could either:
> 1. Use an atomic counter to track exactly how many threads are actively 
> running (see the sketch below), or
> 2. Rework the method to be much simpler, e.g. by changing the recursive 
> implementation to an iterative one where worker threads pick tasks from a 
> queue until the queue is empty.
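
For illustration, a minimal sketch of option 1; {{TrackedPool}} and its 
methods are invented names for the example, not from any patch. An exact 
in-flight counter decides whether a task may be handed to the pool, and on 
saturation the caller simply runs the task on its own thread, which is 
always deadlock-free:
{code:java}
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative wrapper: an exact reservation counter instead of the
// approximate ThreadPoolExecutor#getActiveCount().
class TrackedPool {
  private final ThreadPoolExecutor pool;
  private final AtomicInteger inFlight = new AtomicInteger();

  TrackedPool(ThreadPoolExecutor pool) {
    this.pool = pool;
  }

  // Reserve a slot before submitting. Returns false when the pool is
  // saturated, so the caller falls back to running the task in-line.
  boolean tryRun(Runnable task) {
    if (inFlight.incrementAndGet() > pool.getMaximumPoolSize()) {
      inFlight.decrementAndGet();
      return false;
    }
    pool.execute(() -> {
      try {
        task.run();
      } finally {
        inFlight.decrementAndGet();
      }
    });
    return true;
  }
}
{code}
Unlike the getActiveCount() check, the reservation can never over-admit, so 
the corner case described above disappears.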



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16014) HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of hive.mv.files.thread for pool size

2017-02-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HIVE-16014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881142#comment-15881142
 ] 

Sergio Peña commented on HIVE-16014:


I agree that the METASTORE_FS_HANDLER_THREADS_COUNT variable makes more sense 
for the metastore checker. I see it is used to create partitions in parallel, 
so the two are related.

I'm +1

One quick change:
- The comment referring to HIVE_MOVE_FILES_THREAD_COUNT should then be removed.


> HiveMetastoreChecker should use hive.metastore.fshandler.threads instead of 
> hive.mv.files.thread for pool size
> --
>
> Key: HIVE-16014
> URL: https://issues.apache.org/jira/browse/HIVE-16014
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
> Attachments: HIVE-16014.01.patch
>
>
> HiveMetastoreChecker uses the hive.mv.files.thread configuration value to 
> determine the pool size, as below:
> {noformat}
> private void checkPartitionDirs(Path basePath, Set<Path> allDirs,
>     int maxDepth) throws IOException, HiveException {
>   ConcurrentLinkedQueue<Path> basePaths = new ConcurrentLinkedQueue<>();
>   basePaths.add(basePath);
>   Set<Path> dirSet =
>       Collections.newSetFromMap(new ConcurrentHashMap<Path, Boolean>());
>   // Here we just reuse the THREAD_COUNT configuration for
>   // HIVE_MOVE_FILES_THREAD_COUNT
>   int poolSize =
>       conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 15);
>   // Check if too low config is provided for move files. 2x CPU is a
>   // reasonable max count.
>   poolSize = poolSize == 0 ? poolSize : Math.max(poolSize,
>       Runtime.getRuntime().availableProcessors() * 2);
> {noformat}
> msck is commonly used to add missing partitions for a table from the 
> filesystem. In such a case, different pool sizes for HMSHandler and 
> HiveMetastoreChecker can affect performance. E.g., if 
> {{hive.metastore.fshandler.threads}} is set to a lower value like 15 and 
> {{hive.mv.files.thread}} is much higher like 100, or vice versa, the 
> smaller pool becomes the bottleneck. It would be good to use 
> {{hive.metastore.fshandler.threads}} to size the pool for 
> HiveMetastoreChecker, since the number of missing partitions and the number 
> of partitions to be added will most likely be the same; in that case query 
> performance is optimal when both pool sizes match.
> Since the two configs can be tuned individually, they may well end up 
> different. But given the strong correlation between the amount of work done 
> by HiveMetastoreChecker and the HiveMetastore.add_partitions call, it makes 
> sense to use {{hive.metastore.fshandler.threads}} for the pool size instead 
> of {{hive.mv.files.thread}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15881) Use new thread count variable name instead of mapred.dfsclient.parallelism.max

2017-02-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/HIVE-15881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Peña updated HIVE-15881:
---
Attachment: HIVE-15881.5.patch

New patch that adds new unit tests and uses a SizeValidator on the HiveConf 
variable to limit the number of threads (min: 0, max: 1024).
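
For illustration, a small sketch of how the lookup could behave with those 
bounds, assuming the proposed key name from the description and a 
hypothetical fallback to the deprecated key (the real patch may wire this 
differently):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Illustrative only; key names follow the proposal in the description.
class InputListingThreads {
  static final String NEW_KEY = "hive.get.input.listing.num.threads";
  static final String DEPRECATED_KEY = "mapred.dfsclient.parallelism.max";

  // Resolve the thread count, honoring the deprecated key when the new
  // one is unset, and clamp to the validator bounds (min 0, max 1024).
  static int resolve(Configuration conf) {
    int n = conf.getInt(NEW_KEY, 0);
    if (n <= 0) {
      n = conf.getInt(DEPRECATED_KEY, 0);
    }
    return Math.min(Math.max(n, 0), 1024);
  }
}
{code}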

> Use new thread count variable name instead of mapred.dfsclient.parallelism.max
> --
>
> Key: HIVE-15881
> URL: https://issues.apache.org/jira/browse/HIVE-15881
> Project: Hive
>  Issue Type: Task
>  Components: Query Planning
>Reporter: Sergio Peña
>Assignee: Sergio Peña
>Priority: Minor
> Attachments: HIVE-15881.1.patch, HIVE-15881.2.patch, 
> HIVE-15881.3.patch, HIVE-15881.4.patch, HIVE-15881.5.patch
>
>
> The Utilities class has two methods, {{getInputSummary}} and 
> {{getInputPaths}}, that use the variable {{mapred.dfsclient.parallelism.max}} 
> to compute the summary of a list of input locations in parallel. These 
> methods are Hive-specific, but the variable name does not look like it is 
> specific to Hive.
> Also, the above variable is neither in HiveConf nor used anywhere else; I 
> only found a reference in the Hadoop MR1 code.
> I'd like to propose deprecating {{mapred.dfsclient.parallelism.max}} and 
> using a different variable name, such as 
> {{hive.get.input.listing.num.threads}}, that reflects the variable's intent. 
> The removal of the old variable could happen in Hive 3.x.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

