[jira] [Commented] (HIVE-15290) Stripe size smaller than specified.

2017-01-18 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829464#comment-15829464
 ] 

Prasanth Jayachandran commented on HIVE-15290:
--

Please file this issue under the ORC project, as the ORC module will be removed 
from Hive soon. 

> Stripe size smaller than specified.
> ---
>
> Key: HIVE-15290
> URL: https://issues.apache.org/jira/browse/HIVE-15290
> Project: Hive
>  Issue Type: Bug
>  Components: ORC
>Affects Versions: 1.2.0, 1.2.1, 2.0.0, 2.1.0, 2.0.1
>Reporter: Yuxing Yao
>
> In Hive 1.2.0, the actual stripe size of an output ORC file will be very small 
> if most of the table data is empty; as a result, too many column statistics 
> objects are created and consume most of the memory.
> This is better in Hive 2.0.1, but the stripe size is still much smaller than 
> specified.
> I saw that a JIRA item, https://issues.apache.org/jira/browse/HIVE-13232, 
> moved the `compressed = null` assignment out of the if block. That change 
> helps a lot, but to fix this completely another change is needed in 
> `OutStream.getBufferSize()`.
> I've created the PR:
> https://github.com/apache/hive/pull/118
> Please take a look.
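
For context, a minimal sketch of the kind of per-stream buffer sizing involved; 
the class, method names, and constants below are hypothetical, not the actual 
Hive/ORC writer code. A writer that divides the configured stripe size across 
all column streams ends up with tiny buffers, and hence tiny flushed stripes, 
when a table has many mostly-empty columns:

{code}
// Simplified, hypothetical sketch of per-stream buffer sizing in an ORC-like
// writer; names and constants are illustrative, not the actual Hive/ORC code.
public final class BufferSizing {
  static int estimateBufferSize(long stripeSize, int columnCount, int defaultBufferSize) {
    // Assume roughly two streams (data + present/length) per column.
    long estimatedStreams = columnCount * 2L;
    long perStream = stripeSize / estimatedStreams;
    // If this cap is applied unconditionally, sparse/empty columns force tiny
    // buffers, and the writer flushes stripes far smaller than stripeSize.
    return (int) Math.min(defaultBufferSize, perStream);
  }

  public static void main(String[] args) {
    // 64 MB stripes, 1000 mostly-empty columns, 256 KB default buffers:
    System.out.println(estimateBufferSize(64L << 20, 1000, 256 << 10)); // ~33 KB
  }
}
{code}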



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15586) Make Insert and Create statement Transactional

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829452#comment-15829452
 ] 

Hive QA commented on HIVE-15586:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848167/HIVE-15586.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 78 failed/errored test(s), 10960 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_joins] 
(batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_predicate_pushdown]
 (batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_single_sourced_multi_insert]
 (batchId=218)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[ctas] 
(batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[write_final_output_blobstore]
 (batchId=230)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_table_stats] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[analyze_tbl_part] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input19] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[sample5] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[serde_opencsv] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_math_funcs] 
(batchId=19)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[dboutput] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[fileformat_base64]
 (batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[udf_row_sequence] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[url_hook] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[case_with_row_sequence]
 (batchId=224)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[invalid_row_sequence]
 (batchId=224)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[serde_regex]
 (batchId=224)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_partition_dynamic]
 (batchId=158)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_partition_static]
 (batchId=156)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_values]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_join_unencrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_join_with_different_encryption_keys]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_load_data_to_encrypted_tables]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_move_tbl]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_select_read_only_encrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_select_read_only_unencrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_unencrypted_nonhdfs_external_tables]
 (batchId=157)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[cascade_dbdrop]
 (batchId=225)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[generatehfiles_require_family_path]
 (batchId=225)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_llap_counters]
 (batchId=137)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape1] 
(batchId=139)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape2] 
(batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_part]
 (batchId=149)
org.apache.hadoop.hive.cli.Test

[jira] [Commented] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-18 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829438#comment-15829438
 ] 

Lefty Leverenz commented on HIVE-15147:
---

Doc note:  Four new configuration parameters need to be documented in the wiki.

#  *hive.llap.io.encode.formats*
#  *hive.llap.io.encode.alloc.size*
#  *hive.llap.io.encode.slice.row.count*
#  *hive.llap.io.encode.slice.lrr*

* [Configuration Properties -- LLAP | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-LLAP]

Added a TODOC2.2 label. 
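
For reference, a hedged sketch of setting these parameters programmatically; 
the property names come from the list above, while the values are purely 
illustrative, not documented defaults:

{code}
// Illustrative only: setting the four new parameters on a HiveConf.
// The values shown are examples, not recommended defaults.
import org.apache.hadoop.hive.conf.HiveConf;

public class LlapEncodeConfigExample {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Input formats eligible for the ORC-encoding cache path:
    conf.set("hive.llap.io.encode.formats", "org.apache.hadoop.mapred.TextInputFormat");
    conf.set("hive.llap.io.encode.alloc.size", "262144");      // allocation size, bytes
    conf.set("hive.llap.io.encode.slice.row.count", "100000"); // rows per cache slice
    conf.set("hive.llap.io.encode.slice.lrr", "true");         // slice on LRR boundaries
  }
}
{code}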

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>  Labels: TODOC2.2
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch, perf-top-cache.png, pre-cache.svg, 
> writerimpl-addrow.png
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.
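
As a rough illustration of the pipeline described above (all types below are 
stand-ins, not the actual LLAP classes): rows come in through the file's 
original InputFormat/SerDe, get "uncompressed" into columnar ORC form by an 
ORC-writer-like encoder, and are cached one horizontal slice at a time:

{code}
// Conceptual sketch only; every type here is a stand-in, not an LLAP class.
import java.util.Iterator;

interface RowSource { Iterator<Object[]> rows(); }      // original IF + SerDe
interface OrcEncoder { void addRow(Object[] row); }     // ORC WriterImpl-like
interface ColumnCache { void put(String file, int slice, OrcEncoder data); }

final class TextToOrcCacheLoader {
  private final int sliceRowCount;                      // horizontal granularity
  TextToOrcCacheLoader(int sliceRowCount) { this.sliceRowCount = sliceRowCount; }

  // "Uncompress" row-format data into its columnar ORC representation,
  // caching one horizontal slice at a time.
  void load(String file, RowSource source, ColumnCache cache,
            java.util.function.Supplier<OrcEncoder> encoders) {
    Iterator<Object[]> it = source.rows();
    int slice = 0;
    while (it.hasNext()) {
      OrcEncoder enc = encoders.get();                  // heavyweight opts disabled
      for (int n = 0; n < sliceRowCount && it.hasNext(); n++) {
        enc.addRow(it.next());
      }
      cache.put(file, slice++, enc);                    // stored by column inside
    }
  }
}
{code}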



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14827) Micro benchmark for Parquet vectorized reader

2017-01-18 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829434#comment-15829434
 ] 

Ferdinand Xu commented on HIVE-14827:
-

LGTM +1


> Micro benchmark for Parquet vectorized reader
> -
>
> Key: HIVE-14827
> URL: https://issues.apache.org/jira/browse/HIVE-14827
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ferdinand Xu
>Assignee: Colin Ma
> Attachments: HIVE-14827.001.patch, HIVE-14827.002.patch
>
>
> We need a microbenchmark to evaluate the throughput and execution time for 
> Parquet vectorized reader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15390) Orc reader unnecessarily reading stripe footers with hive.optimize.index.filter set to true

2017-01-18 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829436#comment-15829436
 ] 

Prasanth Jayachandran commented on HIVE-15390:
--

+1

> Orc reader unnecessarily reading stripe footers with 
> hive.optimize.index.filter set to true
> ---
>
> Key: HIVE-15390
> URL: https://issues.apache.org/jira/browse/HIVE-15390
> Project: Hive
>  Issue Type: Bug
>  Components: ORC
>Affects Versions: 1.2.1
>Reporter: Abhishek Somani
>Assignee: Abhishek Somani
> Attachments: HIVE-15390.1.patch, HIVE-15390.patch
>
>
> In a split given to a task, the task's orc reader is unnecessarily reading 
> stripe footers for stripes that are not its responsibility to read. This is 
> happening with hive.optimize.index.filter set to true.
> Assuming one split per task (no Tez grouping considered), a task should not 
> need to read beyond the split's end offset. Even with split computation 
> strategies where a split's end offset can fall in the middle of a stripe, the 
> task should not need to read more than one stripe beyond the split's end 
> offset (to fully read a stripe that started within it). However, I see some 
> tasks making unnecessary filesystem calls to read all the stripe footers in a 
> file, from the split's start offset to the end of the file.
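
The expected behavior can be sketched as follows (hypothetical types; the real 
split/stripe logic is more involved): only stripes whose start offset falls 
inside the split's range are considered, so at most one stripe past the 
split's end is ever read in full:

{code}
// Illustrative stripe selection; StripeInfo is a stand-in, not the ORC API.
import java.util.ArrayList;
import java.util.List;

final class StripeSelection {
  record StripeInfo(long offset, long length) {}

  // Keep only stripes whose start offset falls inside the split. A stripe that
  // starts inside the split but ends past splitEnd is still read in full, so a
  // reader never needs footers more than one stripe beyond the split's end.
  static List<StripeInfo> stripesForSplit(List<StripeInfo> all,
                                          long splitStart, long splitEnd) {
    List<StripeInfo> out = new ArrayList<>();
    for (StripeInfo s : all) {
      if (s.offset() >= splitStart && s.offset() < splitEnd) {
        out.add(s);
      }
    }
    return out;
  }
}
{code}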



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-18 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829433#comment-15829433
 ] 

Lefty Leverenz commented on HIVE-15147:
---

Nudge:  Please update the fix version.

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>  Labels: TODOC2.2
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch, perf-top-cache.png, pre-cache.svg, 
> writerimpl-addrow.png
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-18 Thread Lefty Leverenz (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lefty Leverenz updated HIVE-15147:
--
Labels: TODOC2.2  (was: )

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>  Labels: TODOC2.2
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch, perf-top-cache.png, pre-cache.svg, 
> writerimpl-addrow.png
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13014) RetryingMetaStoreClient is retrying too aggressively

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829393#comment-15829393
 ] 

Hive QA commented on HIVE-13014:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848115/HIVE-13014.06.patch

{color:green}SUCCESS:{color} +1 due to 3 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 81 failed/errored test(s), 10963 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_joins] 
(batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_predicate_pushdown]
 (batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_single_sourced_multi_insert]
 (batchId=218)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[ctas] 
(batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[write_final_output_blobstore]
 (batchId=230)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_table_stats] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[analyze_tbl_part] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input19] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[sample5] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[serde_opencsv] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_math_funcs] 
(batchId=19)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[dboutput] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[fileformat_base64]
 (batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[udf_row_sequence] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[url_hook] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[case_with_row_sequence]
 (batchId=224)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[invalid_row_sequence]
 (batchId=224)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[serde_regex]
 (batchId=224)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_partition_dynamic]
 (batchId=158)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_partition_static]
 (batchId=156)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_values]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_join_unencrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_join_with_different_encryption_keys]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_load_data_to_encrypted_tables]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_move_tbl]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_select_read_only_encrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_select_read_only_unencrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_unencrypted_nonhdfs_external_tables]
 (batchId=157)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[cascade_dbdrop]
 (batchId=225)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[generatehfiles_require_family_path]
 (batchId=225)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_llap_counters]
 (batchId=137)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=136)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype]
 (batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape1] 
(batchId=139)
org.apache.hadoop.h

[jira] [Commented] (HIVE-15656) Place powermock in correct dependency management section root pom.xml

2017-01-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829386#comment-15829386
 ] 

ASF GitHub Bot commented on HIVE-15656:
---

Github user anishek closed the pull request at:

https://github.com/apache/hive/pull/134


> Place powermock in correct dependency management section root pom.xml
> -
>
> Key: HIVE-15656
> URL: https://issues.apache.org/jira/browse/HIVE-15656
> Project: Hive
>  Issue Type: Bug
>  Components: Build Infrastructure
>Affects Versions: 2.2.0
>Reporter: anishek
>Assignee: anishek
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15656.patch
>
>
> As part of committing HIVE-15550 to master, powermock was included in the 
> root pom.xml. This should not be the case, and it led to build failures, 
> fixed in HIVE-15648.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-15580:
---
Attachment: HIVE-15580.4.patch

> Replace Spark's groupByKey operator with something with bounded memory
> --
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, 
> HIVE-15580.4.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15297) Hive should not split semicolon within quoted string literals

2017-01-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829364#comment-15829364
 ] 

Xuefu Zhang commented on HIVE-15297:


I noticed that a couple of trailing spaces/tabs are introduced in the patch. 
Could we get them removed, maybe via a separate JIRA? Thanks.

> Hive should not split semicolon within quoted string literals
> -
>
> Key: HIVE-15297
> URL: https://issues.apache.org/jira/browse/HIVE-15297
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Fix For: 2.2.0
>
> Attachments: HIVE-15297.01.patch, HIVE-15297.02.patch, 
> HIVE-15297.03.patch, HIVE-15297.04.patch, HIVE-15297.05.patch
>
>
> String literals in a query cannot contain reserved symbols. The same set of 
> queries works fine in MySQL and PostgreSQL. 
> {code}
> hive> CREATE TABLE ts(s varchar(550));
> OK
> Time taken: 0.075 seconds
> hive> INSERT INTO ts VALUES ('Mozilla/5.0 (iPhone; CPU iPhone OS 5_0');
> MismatchedTokenException(14!=326)
>   at 
> org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecognizer.java:617)
>   at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.valueRowConstructor(HiveParser_FromClauseParser.java:7271)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.valuesTableConstructor(HiveParser_FromClauseParser.java:7370)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.valuesClause(HiveParser_FromClauseParser.java:7510)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.valuesClause(HiveParser.java:51854)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:45432)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:44578)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:8)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1694)
>   at 
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1176)
>   at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:204)
>   at 
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:402)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:326)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1169)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1288)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1095)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1083)
>   at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
>   at 
> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
>   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
>   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> FAILED: ParseException line 1:31 mismatched input '/' expecting ) near 
> 'Mozilla' in value row constructor
> hive>
> {code}
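
The direction of the fix can be illustrated with a quote-aware splitter (a 
simplified sketch, not the actual CliDriver code): a semicolon inside a 
single- or double-quoted literal must not terminate the statement.

{code}
// Simplified sketch of quote-aware statement splitting; the real CliDriver
// logic also needs to handle comments and quoting more completely.
import java.util.ArrayList;
import java.util.List;

final class StatementSplitter {
  static List<String> split(String line) {
    List<String> stmts = new ArrayList<>();
    StringBuilder cur = new StringBuilder();
    char quote = 0;                       // 0 = outside any quoted literal
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (quote != 0) {                   // inside 'x' or "x"
        if (c == '\\' && i + 1 < line.length()) {  // keep escaped chars intact
          cur.append(c).append(line.charAt(++i));
          continue;
        }
        if (c == quote) quote = 0;
        cur.append(c);
      } else if (c == '\'' || c == '"') {
        quote = c;
        cur.append(c);
      } else if (c == ';') {              // real statement boundary
        stmts.add(cur.toString());
        cur.setLength(0);
      } else {
        cur.append(c);
      }
    }
    if (cur.length() > 0) stmts.add(cur.toString());
    return stmts;
  }

  public static void main(String[] args) {
    System.out.println(split("INSERT INTO ts VALUES ('a;b'); SELECT 1"));
    // -> [INSERT INTO ts VALUES ('a;b'),  SELECT 1]
  }
}
{code}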



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829346#comment-15829346
 ] 

Hive QA commented on HIVE-15580:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848103/HIVE-15580.3.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3028/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3028/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3028/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2017-01-19 05:35:19.448
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-3028/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-01-19 05:35:19.451
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at ef33237  IVE-15297: Hive should not split semicolon within 
quoted string literals (Pengcheng Xiong, reviewed by Ashutosh Chauhan) 
(addendum I)
+ git clean -f -d
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at ef33237  IVE-15297: Hive should not split semicolon within 
quoted string literals (Pengcheng Xiong, reviewed by Ashutosh Chauhan) 
(addendum I)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-01-19 05:35:20.803
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh 
/data/hiveptest/working/scratch/build.patch
error: patch failed: 
ql/src/test/results/clientpositive/spark/union_top_level.q.out:324
error: ql/src/test/results/clientpositive/spark/union_top_level.q.out: patch 
does not apply
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12848103 - PreCommit-HIVE-Build

> Replace Spark's groupByKey operator with something with bounded memory
> --
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15572) Improve the response time for query canceling when it happens during acquiring locks

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829344#comment-15829344
 ] 

Hive QA commented on HIVE-15572:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848224/HIVE-15572.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 79 failed/errored test(s), 10960 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_joins] 
(batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_predicate_pushdown]
 (batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_single_sourced_multi_insert]
 (batchId=218)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[ctas] 
(batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[write_final_output_blobstore]
 (batchId=230)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_table_stats] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[analyze_tbl_part] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input19] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[sample5] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[serde_opencsv] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vectorized_math_funcs] 
(batchId=19)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[dboutput] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[fileformat_base64]
 (batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[udf_row_sequence] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribCliDriver.testCliDriver[url_hook] 
(batchId=221)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[case_with_row_sequence]
 (batchId=224)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[invalid_row_sequence]
 (batchId=224)
org.apache.hadoop.hive.cli.TestContribNegativeCliDriver.testCliDriver[serde_regex]
 (batchId=224)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_partition_dynamic]
 (batchId=158)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_partition_static]
 (batchId=156)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_insert_values]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_join_unencrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_join_with_different_encryption_keys]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_load_data_to_encrypted_tables]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_move_tbl]
 (batchId=157)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_select_read_only_encrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_select_read_only_unencrypted_tbl]
 (batchId=159)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_unencrypted_nonhdfs_external_tables]
 (batchId=157)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[cascade_dbdrop]
 (batchId=225)
org.apache.hadoop.hive.cli.TestHBaseNegativeCliDriver.testCliDriver[generatehfiles_require_family_path]
 (batchId=225)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_llap_counters]
 (batchId=137)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_schema_evol_3a]
 (batchId=136)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape1] 
(batchId=139)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[escape2] 
(batchId=154)
org.apache.hadoop.hive.cli.TestMiniLl

[jira] [Updated] (HIVE-14827) Micro benchmark for Parquet vectorized reader

2017-01-18 Thread Colin Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Ma updated HIVE-14827:

Attachment: HIVE-14827.002.patch

[~Ferd], thanks for your review; the patch is updated to fix the problems 
noted in your comments.

> Micro benchmark for Parquet vectorized reader
> -
>
> Key: HIVE-14827
> URL: https://issues.apache.org/jira/browse/HIVE-14827
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ferdinand Xu
>Assignee: Colin Ma
> Attachments: HIVE-14827.001.patch, HIVE-14827.002.patch
>
>
> We need a microbenchmark to evaluate the throughput and execution time for 
> Parquet vectorized reader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15439) Support INSERT OVERWRITE for internal druid datasources.

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829307#comment-15829307
 ] 

Hive QA commented on HIVE-15439:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848075/HIVE-15439.patch

{color:green}SUCCESS:{color} +1 due to 3 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 351 failed/errored test(s), 10959 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_joins] 
(batchId=217)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_predicate_pushdown]
 (batchId=217)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_queries]
 (batchId=217)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_single_sourced_multi_insert]
 (batchId=217)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[ctas] 
(batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[write_final_output_blobstore]
 (batchId=229)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_subquery] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_table_stats] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[analyze_tbl_part] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_join_pkfk]
 (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[avrocountemptytbl] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udf_percentile2] 
(batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udf_percentile] 
(batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[create_or_replace_view] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input19] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[join46] (batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[join_emit_interval] 
(batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin46] (batchId=53)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[multi_insert_gby4] 
(batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[sample5] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[serde_opencsv] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[smb_mapjoin_46] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_exists] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notexists] 
(batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notin_having] 
(batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_unqualcolumnrefs]
 (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_array_contains] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_conv] (batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_hex] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_java_method] 
(batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_map_keys] 
(batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_map_values] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_negative] (batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_not] (batchId=51)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_percentile] 
(batchId=59)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_positive] 
(batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.te

[jira] [Commented] (HIVE-15565) LLAP: GroupByOperator flushes hash table too frequently

2017-01-18 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829282#comment-15829282
 ] 

Prasanth Jayachandran commented on HIVE-15565:
--

This delays the memory check, hoping memory will be freed up in the meantime, 
although freeing is not guaranteed and may not happen at all because of the 
on-heap metadata cache and allocations by other executors. 

LGTM, +1. Pending tests.

> LLAP: GroupByOperator flushes hash table too frequently
> ---
>
> Key: HIVE-15565
> URL: https://issues.apache.org/jira/browse/HIVE-15565
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15565.1.patch, HIVE-15565.2.patch
>
>
> {{GroupByOperator::isTez}} would be true in LLAP mode, so the current memory 
> computations can go wrong with the {{isTez}} checks in {{GroupByOperator}}. 
> For example, in an LLAP instance with Xmx128G and 12 executors, it would 
> start flushing the hash table for every record once usage reaches around 
> 42GB (hive.tez.container.size=7100, hive.map.aggr.hash.percentmemory=0.5).
> {noformat}
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> {noformat}
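
The arithmetic behind the 42GB figure can be sketched as follows, using only 
the numbers quoted in the description above; this is illustrative, not the 
actual GroupByOperator code:

{code}
// Illustrative arithmetic only, based on the numbers quoted above; this is
// not the actual GroupByOperator code.
public final class GroupByMemorySketch {
  public static void main(String[] args) {
    long containerSize = 7100L << 20;          // hive.tez.container.size = 7100 MB
    double percent = 0.5;                      // hive.map.aggr.hash.percentmemory
    int executors = 12;                        // sharing one 128 GB LLAP heap

    long perTaskThreshold = (long) (containerSize * percent);  // ~3.5 GB
    long combined = perTaskThreshold * executors;              // ~42 GB

    // If the check compares JVM-wide used heap against per-task-derived
    // thresholds, then once the shared heap passes ~42 GB every executor's
    // hash table looks "over budget" and flushes on every record.
    System.out.printf("per-task=%d combined=%d%n", perTaskThreshold, combined);
  }
}
{code}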



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-15664) LLAP text cache: improve first query perf

2017-01-18 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829272#comment-15829272
 ] 

Sergey Shelukhin edited comment on HIVE-15664 at 1/19/17 3:59 AM:
--

This implements 1-2, as well as the config flag.
Skipping is only supported on VectorDeserialize; I started looking at it. It 
should be easy to do after clearing up the initial confusion - VD doesn't 
support complex types anyway, so it should be easy to map the new ORC columns 
to the original column indexes. 
We don't expect that to result in a major gain though (compared to 1-2-4), so 
I postponed it for now.
Unfortunately 1 and 2 don't speed it up enough... we need to do 4 - return 
VRBs from VectorDeserialize and offload ORC writing to a background thread; I 
was looking into that today. I need to wrap my head around the variety of 
array indexes and integer lists that various parts use. Also, interface-wise 
it would be difficult. Will probably piggyback on Orc...Batch


was (Author: sershe):
This implements 1-2, as well as ORC dictionary.
Skipping is only supported on VectorDeserialize; I started looking at it. It 
should be easy to do after clearing up the initial confusion - VD doesn't 
support complex types anyway, so it should be easy to map the new ORC columns 
to the original column indexes. 
We don't expect that to result in a major gain though (compared to 1-2-4), so 
I postponed it for now.
Unfortunately 1 and 2 don't speed it up enough... we need to do 4 - return 
VRBs from VectorDeserialize and offload ORC writing to a background thread; I 
was looking into that today. I need to wrap my head around the variety of 
array indexes and integer lists that various parts use. Also, interface-wise 
it would be difficult. Will probably piggyback on Orc...Batch

> LLAP text cache: improve first query perf
> -
>
> Key: HIVE-15664
> URL: https://issues.apache.org/jira/browse/HIVE-15664
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15664.WIP.patch
>
>
> 1) Don't use ORC dictionary.
> 2) Use VectorDeserialize.
> 3) Don't parse the columns that are not included (cannot avoid reading them).
> 4) Send VRB to the pipeline and write ORC in parallel (in background).
> Also add an option to disable the encoding pipeline server-side.
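
Point 4 above amounts to decoupling the query pipeline from the cache writer. 
A minimal sketch under assumed types (Vrb stands in for VectorizedRowBatch, 
and the writer API is hypothetical, not the LLAP interface):

{code}
// Conceptual sketch of point 4: hand batches to the query pipeline
// immediately and write them into the ORC-encoded cache on a background
// thread. Types and methods here are stand-ins, not the LLAP API.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

final class BackgroundOrcWrite {
  interface Vrb {}                                   // VectorizedRowBatch stand-in
  interface OrcCacheWriter { void write(Vrb batch); }

  private final ExecutorService writer = Executors.newSingleThreadExecutor();

  void deliver(Vrb batch, Consumer<Vrb> pipeline, OrcCacheWriter cache) {
    pipeline.accept(batch);                          // query sees the batch now
    writer.submit(() -> cache.write(batch));         // cache encoding off the hot path
  }
}
{code}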



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-15664) LLAP text cache: improve first query perf

2017-01-18 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin reassigned HIVE-15664:
---

Assignee: Sergey Shelukhin

> LLAP text cache: improve first query perf
> -
>
> Key: HIVE-15664
> URL: https://issues.apache.org/jira/browse/HIVE-15664
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15664.WIP.patch
>
>
> 1) Don't use ORC dictionary.
> 2) Use VectorDeserialize.
> 3) Don't parse the columns that are not included (cannot avoid reading them).
> 4) Send VRB to the pipeline and write ORC in parallel (in background).
> Also add an option to disable the encoding pipeline server-side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15664) LLAP text cache: improve first query perf

2017-01-18 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829274#comment-15829274
 ] 

Sergey Shelukhin commented on HIVE-15664:
-

[~gopalv] fyi

> LLAP text cache: improve first query perf
> -
>
> Key: HIVE-15664
> URL: https://issues.apache.org/jira/browse/HIVE-15664
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15664.WIP.patch
>
>
> 1) Don't use ORC dictionary.
> 2) Use VectorDeserialize.
> 3) Don't parse the columns that are not included (cannot avoid reading them).
> 4) Send VRB to the pipeline and write ORC in parallel (in background).
> Also add an option to disable the encoding pipeline server-side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15664) LLAP text cache: improve first query perf

2017-01-18 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15664:

Attachment: HIVE-15664.WIP.patch

This implements 1-2, as well as ORC dictionary.
Skipping is only supported on VectorDeserialize; I started looking at it. It 
should be easy to do after clearing up the initial confusion - VD doesn't 
support complex types anyway, so it should be easy to map the new ORC columns 
to the original column indexes. 
We don't expect that to result in a major gain though (compared to 1-2-4), so 
I postponed it for now.
Unfortunately 1 and 2 don't speed it up enough... we need to do 4 - return 
VRBs from VectorDeserialize and offload ORC writing to a background thread; I 
was looking into that today. I need to wrap my head around the variety of 
array indexes and integer lists that various parts use. Also, interface-wise 
it would be difficult. Will probably piggyback on Orc...Batch

> LLAP text cache: improve first query perf
> -
>
> Key: HIVE-15664
> URL: https://issues.apache.org/jira/browse/HIVE-15664
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
> Attachments: HIVE-15664.WIP.patch
>
>
> 1) Don't use ORC dictionary.
> 2) Use VectorDeserialize.
> 3) Don't parse the columns that are not included (cannot avoid reading them).
> 4) Send VRB to the pipeline and write ORC in parallel (in background).
> Also add an option to disable the encoding pipeline server-side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15572) Improve the response time for query canceling when it happens during acquiring locks

2017-01-18 Thread Yongzhi Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongzhi Chen updated HIVE-15572:

Attachment: HIVE-15572.2.patch

> Improve the response time for query canceling when it happens during 
> acquiring locks
> 
>
> Key: HIVE-15572
> URL: https://issues.apache.org/jira/browse/HIVE-15572
> Project: Hive
>  Issue Type: Improvement
>Reporter: Yongzhi Chen
>Assignee: Yongzhi Chen
> Attachments: HIVE-15572.1.patch, HIVE-15572.2.patch
>
>
> When a query-cancel command is sent while Hive is acquiring locks (from 
> ZooKeeper), Hive will still finish acquiring all the locks and then release 
> them, as shown in the following log: it took 165s to finish acquiring the 
> locks, then 81s to release them.
> We can improve the response time by not acquiring any more locks, and by 
> releasing the held locks, as soon as the query-cancel command is received. 
> {noformat}
> 2017-01-03 10:50:35,413 INFO  org.apache.hadoop.hive.ql.log.PerfLogger: 
> [HiveServer2-Background-Pool: Thread-224]: <PERFLOG 
> method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
> 2017-01-03 10:51:00,671 INFO  org.apache.hadoop.hive.ql.log.PerfLogger: 
> [HiveServer2-Background-Pool: Thread-218]: </PERFLOG 
> method=acquireReadWriteLocks start=1483469295080 end=1483469460671 
> duration=165591 from=org.apache.hadoop.hive.ql.Driver>
> 2017-01-03 10:51:00,672 ERROR org.apache.hadoop.hive.ql.Driver: 
> [HiveServer2-Background-Pool: Thread-218]: FAILED: query select count(*) from 
> manyparttbl has been cancelled
> 2017-01-03 10:51:00,673 INFO  org.apache.hadoop.hive.ql.log.PerfLogger: 
> [HiveServer2-Background-Pool: Thread-218]: <PERFLOG method=releaseLocks 
> from=org.apache.hadoop.hive.ql.Driver>
> 2017-01-03 10:51:40,755 INFO  org.apache.hadoop.hive.ql.log.PerfLogger: 
> [HiveServer2-Background-Pool: Thread-215]: </PERFLOG method=releaseLocks 
> start=1483469419487 end=1483469500755 duration=81268 
> from=org.apache.hadoop.hive.ql.Driver>
> {noformat}
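
The proposed improvement can be sketched as a cancel check inside the 
acquisition loop (hypothetical types; the actual change touches the Driver 
and lock manager): stop acquiring as soon as cancellation is observed and 
immediately release whatever is already held.

{code}
// Simplified sketch; Lock/LockManager are stand-ins for the ZooKeeper-backed
// lock manager, and the AtomicBoolean for the Driver's cancel state.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class CancellableLockAcquisition {
  interface Lock {}
  interface LockManager { Lock acquire(String obj); void release(Lock l); }

  static List<Lock> acquireAll(List<String> objects, LockManager mgr,
                               AtomicBoolean cancelled) {
    List<Lock> held = new ArrayList<>();
    for (String obj : objects) {
      if (cancelled.get()) {                 // check between every lock
        for (Lock l : held) mgr.release(l);  // release immediately, not at the end
        throw new IllegalStateException("query cancelled during lock acquisition");
      }
      held.add(mgr.acquire(obj));
    }
    return held;
  }
}
{code}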



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15572) Improve the response time for query canceling when it happens during acquiring locks

2017-01-18 Thread Yongzhi Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongzhi Chen updated HIVE-15572:

Attachment: (was: HIVE-15572.2.patch)

> Improve the response time for query canceling when it happens during 
> acquiring locks
> 
>
> Key: HIVE-15572
> URL: https://issues.apache.org/jira/browse/HIVE-15572
> Project: Hive
>  Issue Type: Improvement
>Reporter: Yongzhi Chen
>Assignee: Yongzhi Chen
> Attachments: HIVE-15572.1.patch, HIVE-15572.2.patch
>
>
> When a query-cancel command is sent while Hive is acquiring locks (from 
> ZooKeeper), Hive will still finish acquiring all the locks and then release 
> them, as shown in the following log: it took 165s to finish acquiring the 
> locks, then 81s to release them.
> We can improve the response time by not acquiring any more locks, and by 
> releasing the held locks, as soon as the query-cancel command is received. 
> {noformat}
> 2017-01-03 10:50:35,413 INFO  org.apache.hadoop.hive.ql.log.PerfLogger: 
> [HiveServer2-Background-Pool: Thread-224]: <PERFLOG 
> method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
> 2017-01-03 10:51:00,671 INFO  org.apache.hadoop.hive.ql.log.PerfLogger: 
> [HiveServer2-Background-Pool: Thread-218]: </PERFLOG 
> method=acquireReadWriteLocks start=1483469295080 end=1483469460671 
> duration=165591 from=org.apache.hadoop.hive.ql.Driver>
> 2017-01-03 10:51:00,672 ERROR org.apache.hadoop.hive.ql.Driver: 
> [HiveServer2-Background-Pool: Thread-218]: FAILED: query select count(*) from 
> manyparttbl has been cancelled
> 2017-01-03 10:51:00,673 INFO  org.apache.hadoop.hive.ql.log.PerfLogger: 
> [HiveServer2-Background-Pool: Thread-218]: <PERFLOG method=releaseLocks 
> from=org.apache.hadoop.hive.ql.Driver>
> 2017-01-03 10:51:40,755 INFO  org.apache.hadoop.hive.ql.log.PerfLogger: 
> [HiveServer2-Background-Pool: Thread-215]: </PERFLOG method=releaseLocks 
> start=1483469419487 end=1483469500755 duration=81268 
> from=org.apache.hadoop.hive.ql.Driver>
> {noformat}
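> A minimal sketch of the proposed check (illustrative only -- the class and 
> method names below are hypothetical, not the ones in the patch):
> {code}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.atomic.AtomicBoolean;
> 
> class CancelAwareLockAcquirer {
>   interface LockManager {
>     String lock(String obj);       // blocking ZooKeeper lock
>     void unlock(String lock);
>   }
> 
>   private final AtomicBoolean cancelRequested = new AtomicBoolean(false);
> 
>   void cancel() { cancelRequested.set(true); }
> 
>   List<String> acquireLocks(LockManager mgr, List<String> objs) throws Exception {
>     List<String> held = new ArrayList<>();
>     for (String obj : objs) {
>       // Check before every lock, not only once at the end, so a cancel
>       // does not wait for the remaining (possibly thousands of)
>       // partition locks to be acquired and then released.
>       if (cancelRequested.get()) {
>         for (String lock : held) {
>           mgr.unlock(lock);        // give back what we already hold
>         }
>         throw new Exception("query has been cancelled");
>       }
>       held.add(mgr.lock(obj));
>     }
>     return held;
>   }
> }
> {code}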



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15147) LLAP: use LLAP cache for non-columnar formats in a somewhat general way

2017-01-18 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-15147:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to master

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> ---
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch, 
> HIVE-15147.WIP.noout.patch, perf-top-cache.png, pre-cache.svg, 
> writerimpl-addrow.png
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has 
> been evicted or is otherwise missing, "all the columns" have to be read for 
> the corresponding slice to cache and read that one column. The vague plan is 
> to handle this implicitly, similarly to how ORC reader handles CB-RG overlaps 
> - it will just so happen that a missing column in disk range list to retrieve 
> will expand the disk-range-to-read into the whole horizontal slice of the 
> file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of 
> scope for now.
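> To make the general idea above concrete, here is a minimal sketch of 
> "uncompressing" a row-based source (CSV lines) into ORC via the public ORC 
> writer API. The schema, path and data are made up for illustration; this is 
> not the actual LLAP cache code:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
> import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
> import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
> import org.apache.orc.OrcFile;
> import org.apache.orc.TypeDescription;
> import org.apache.orc.Writer;
> 
> public class CsvToOrcSketch {
>   public static void main(String[] args) throws Exception {
>     TypeDescription schema =
>         TypeDescription.fromString("struct<id:bigint,name:string>");
>     Writer writer = OrcFile.createWriter(new Path("/tmp/cached-slice.orc"),
>         OrcFile.writerOptions(new Configuration()).setSchema(schema));
>     VectorizedRowBatch batch = schema.createRowBatch();
>     String[] csvRows = {"1,alice", "2,bob"};  // stand-in for the IF/serde output
>     for (String row : csvRows) {
>       String[] f = row.split(",");
>       int r = batch.size++;
>       ((LongColumnVector) batch.cols[0]).vector[r] = Long.parseLong(f[0]);
>       ((BytesColumnVector) batch.cols[1]).setVal(r, f[1].getBytes("UTF-8"));
>       if (batch.size == batch.getMaxSize()) {
>         writer.addRowBatch(batch);            // flush a full batch
>         batch.reset();
>       }
>     }
>     if (batch.size > 0) {
>       writer.addRowBatch(batch);
>     }
>     writer.close();
>   }
> }
> {code}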



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15565) LLAP: GroupByOperator flushes hash table too frequently

2017-01-18 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HIVE-15565:

Status: Patch Available  (was: Reopened)

> LLAP: GroupByOperator flushes hash table too frequently
> ---
>
> Key: HIVE-15565
> URL: https://issues.apache.org/jira/browse/HIVE-15565
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15565.1.patch, HIVE-15565.2.patch
>
>
> {{GroupByOperator::isTez}} would be true in LLAP mode. The current memory 
> computations can go wrong with the {{isTez}} checks in {{GroupByOperator}}. 
> E.g., in an LLAP instance with Xmx128G and 12 executors, it would start 
> flushing the hash table for every record once it reaches around 42GB 
> (hive.tez.container.size=7100, hive.map.aggr.hash.percentmemory=0.5). (See 
> the arithmetic note after the log.)
> {noformat}
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> {noformat}
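> For reference, the ~42GB figure lines up with the per-container threshold 
> multiplied across all 12 executors (a back-of-envelope reading of the 
> numbers above, not a statement about the exact code path):
> {noformat}
> 12 executors * (hive.map.aggr.hash.percentmemory = 0.5)
>              * (hive.tez.container.size = 7100 MB)
>   = 12 * 3550 MB
>   ~= 42.6 GB
> {noformat}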



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15565) LLAP: GroupByOperator flushes hash table too frequently

2017-01-18 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HIVE-15565:

Attachment: HIVE-15565.2.patch

> LLAP: GroupByOperator flushes hash table too frequently
> ---
>
> Key: HIVE-15565
> URL: https://issues.apache.org/jira/browse/HIVE-15565
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15565.1.patch, HIVE-15565.2.patch
>
>
> {{GroupByOperator::isTez}} would be true in LLAP mode. The current memory 
> computations can go wrong with the {{isTez}} checks in {{GroupByOperator}}. 
> E.g., in an LLAP instance with Xmx128G and 12 executors, it would start 
> flushing the hash table for every record once it reaches around 42GB 
> (hive.tez.container.size=7100, hive.map.aggr.hash.percentmemory=0.5).
> {noformat}
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15615) Fix unit tests failures caused by HIVE-13696

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829250#comment-15829250
 ] 

Hive QA commented on HIVE-15615:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848067/HIVE-15615.1.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3024/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3024/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3024/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2017-01-19 03:41:19.131
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-3024/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-01-19 03:41:19.134
+ cd apache-github-source-source
+ git fetch origin
From https://github.com/apache/hive
   c9f81d2..4449c99  master -> origin/master
+ git reset --hard HEAD
HEAD is now at c9f81d2 HIVE-15576 : Fix bug in QTestUtil where lines after a 
partial mask will not be masked (Thomas Poepping, reviewed by Sergey Shelukhin)
+ git clean -f -d
Removing metastore/scripts/upgrade/derby/038-HIVE-10562.derby.sql
Removing metastore/scripts/upgrade/mssql/023-HIVE-10562.mssql.sql
Removing metastore/scripts/upgrade/mysql/038-HIVE-10562.mysql.sql
Removing metastore/scripts/upgrade/oracle/038-HIVE-10562.oracle.sql
Removing metastore/scripts/upgrade/postgres/037-HIVE-10562.postgres.sql
+ git checkout master
Already on 'master'
Your branch is behind 'origin/master' by 7 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)
+ git reset --hard origin/master
HEAD is now at 4449c99 Revert "HIVE-15565: LLAP: GroupByOperator flushes hash 
table too frequently (Rajesh Balamohan, reviewed by Sergey Shelukhin)"
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-01-19 03:41:29.636
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh 
/data/hiveptest/working/scratch/build.patch
error: a/ql/src/java/org/apache/hadoop/hive/ql/Driver.java: No such file or 
directory
error: 
a/shims/scheduler/src/main/java/org/apache/hadoop/hive/schshim/FairSchedulerShim.java:
 No such file or directory
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12848067 - PreCommit-HIVE-Build

> Fix unit tests failures caused by HIVE-13696
> 
>
> Key: HIVE-15615
> URL: https://issues.apache.org/jira/browse/HIVE-15615
> Project: Hive
>  Issue Type: Bug
>Reporter: Yongzhi Chen
>Assignee: Yongzhi Chen
> Attachments: HIVE-15615.1.patch
>
>
> The following unit tests failed with the same stack trace:
> org.apache.hadoop.hive.ql.security.authorization.plugin.TestHiveAuthorizerCheckInvocation
> org.apache.hadoop.hive.ql.security.authorization.plugin.TestHiveAuthorizerShowFilters
> {noformat}
> 2017-01-11T15:02:27,774 ERROR [main] ql.Driver: FAILED: NullPointerException 
> null
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.cleanName(QueuePlacementRule.java:351)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$User.getQueueForApp(QueuePlacementRule.java:132)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQue

[jira] [Commented] (HIVE-15519) BitSet not computed properly for ColumnBuffer subset

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829246#comment-15829246
 ] 

Hive QA commented on HIVE-15519:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848058/HIVE-15519.5-branch-1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 161 failed/errored test(s), 8083 tests 
executed
*Failed tests:*
{noformat}
TestAcidOnTez - did not produce a TEST-*.xml file (likely timed out) 
(batchId=376)
TestAdminUser - did not produce a TEST-*.xml file (likely timed out) 
(batchId=358)
TestAuthorizationPreEventListener - did not produce a TEST-*.xml file (likely 
timed out) (batchId=391)
TestAuthzApiEmbedAuthorizerInEmbed - did not produce a TEST-*.xml file (likely 
timed out) (batchId=368)
TestAuthzApiEmbedAuthorizerInRemote - did not produce a TEST-*.xml file (likely 
timed out) (batchId=374)
TestBeeLineWithArgs - did not produce a TEST-*.xml file (likely timed out) 
(batchId=398)
TestCLIAuthzSessionContext - did not produce a TEST-*.xml file (likely timed 
out) (batchId=416)
TestClearDanglingScratchDir - did not produce a TEST-*.xml file (likely timed 
out) (batchId=383)
TestClientSideAuthorizationProvider - did not produce a TEST-*.xml file (likely 
timed out) (batchId=390)
TestCompactor - did not produce a TEST-*.xml file (likely timed out) 
(batchId=379)
TestCreateUdfEntities - did not produce a TEST-*.xml file (likely timed out) 
(batchId=378)
TestCustomAuthentication - did not produce a TEST-*.xml file (likely timed out) 
(batchId=399)
TestDBTokenStore - did not produce a TEST-*.xml file (likely timed out) 
(batchId=342)
TestDDLWithRemoteMetastoreSecondNamenode - did not produce a TEST-*.xml file 
(likely timed out) (batchId=377)
TestDynamicSerDe - did not produce a TEST-*.xml file (likely timed out) 
(batchId=345)
TestEmbeddedHiveMetaStore - did not produce a TEST-*.xml file (likely timed 
out) (batchId=355)
TestEmbeddedThriftBinaryCLIService - did not produce a TEST-*.xml file (likely 
timed out) (batchId=402)
TestFilterHooks - did not produce a TEST-*.xml file (likely timed out) 
(batchId=350)
TestFolderPermissions - did not produce a TEST-*.xml file (likely timed out) 
(batchId=385)
TestHS2AuthzContext - did not produce a TEST-*.xml file (likely timed out) 
(batchId=419)
TestHS2AuthzSessionContext - did not produce a TEST-*.xml file (likely timed 
out) (batchId=420)
TestHS2ClearDanglingScratchDir - did not produce a TEST-*.xml file (likely 
timed out) (batchId=406)
TestHS2ImpersonationWithRemoteMS - did not produce a TEST-*.xml file (likely 
timed out) (batchId=407)
TestHiveAuthorizerCheckInvocation - did not produce a TEST-*.xml file (likely 
timed out) (batchId=394)
TestHiveAuthorizerShowFilters - did not produce a TEST-*.xml file (likely timed 
out) (batchId=393)
TestHiveHistory - did not produce a TEST-*.xml file (likely timed out) 
(batchId=396)
TestHiveMetaStoreTxns - did not produce a TEST-*.xml file (likely timed out) 
(batchId=370)
TestHiveMetaStoreWithEnvironmentContext - did not produce a TEST-*.xml file 
(likely timed out) (batchId=360)
TestHiveMetaTool - did not produce a TEST-*.xml file (likely timed out) 
(batchId=373)
TestHiveServer2 - did not produce a TEST-*.xml file (likely timed out) 
(batchId=422)
TestHiveServer2SessionTimeout - did not produce a TEST-*.xml file (likely timed 
out) (batchId=423)
TestHiveSessionImpl - did not produce a TEST-*.xml file (likely timed out) 
(batchId=403)
TestHs2Hooks - did not produce a TEST-*.xml file (likely timed out) 
(batchId=375)
TestHs2HooksWithMiniKdc - did not produce a TEST-*.xml file (likely timed out) 
(batchId=451)
TestJdbcDriver2 - did not produce a TEST-*.xml file (likely timed out) 
(batchId=410)
TestJdbcMetadataApiAuth - did not produce a TEST-*.xml file (likely timed out) 
(batchId=421)
TestJdbcWithLocalClusterSpark - did not produce a TEST-*.xml file (likely timed 
out) (batchId=415)
TestJdbcWithMiniHS2 - did not produce a TEST-*.xml file (likely timed out) 
(batchId=412)
TestJdbcWithMiniKdc - did not produce a TEST-*.xml file (likely timed out) 
(batchId=448)
TestJdbcWithMiniKdcCookie - did not produce a TEST-*.xml file (likely timed 
out) (batchId=447)
TestJdbcWithMiniKdcSQLAuthBinary - did not produce a TEST-*.xml file (likely 
timed out) (batchId=445)
TestJdbcWithMiniKdcSQLAuthHttp - did not produce a TEST-*.xml file (likely 
timed out) (batchId=450)
TestJdbcWithMiniMr - did not produce a TEST-*.xml file (likely timed out) 
(batchId=411)
TestJdbcWithSQLAuthUDFBlacklist - did not produce a TEST-*.xml file (likely 
timed out) (batchId=417)
TestJdbcWithSQLAuthorization - did not produce a TEST-*.xml file (likely timed 
out) (batchId=418)
TestLocationQueries - did not produce a TEST-*.xml file (likely timed out) 
(batchId=382)
TestMTQueries - did not produce a TEST-*.xml file (like

[jira] [Commented] (HIVE-15565) LLAP: GroupByOperator flushes hash table too frequently

2017-01-18 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829231#comment-15829231
 ] 

Rajesh Balamohan commented on HIVE-15565:
-

Reverted the patch. Will post a separate patch for checking 
"numEntriesHashTable==0" for LLAP.

> LLAP: GroupByOperator flushes hash table too frequently
> ---
>
> Key: HIVE-15565
> URL: https://issues.apache.org/jira/browse/HIVE-15565
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15565.1.patch
>
>
> {{GroupByOperator::isTez}} would be true in LLAP mode. The current memory 
> computations can go wrong with the {{isTez}} checks in {{GroupByOperator}}. 
> E.g., in an LLAP instance with Xmx128G and 12 executors, it would start 
> flushing the hash table for every record once it reaches around 42GB 
> (hive.tez.container.size=7100, hive.map.aggr.hash.percentmemory=0.5).
> {noformat}
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HIVE-15565) LLAP: GroupByOperator flushes hash table too frequently

2017-01-18 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan reopened HIVE-15565:
-

> LLAP: GroupByOperator flushes hash table too frequently
> ---
>
> Key: HIVE-15565
> URL: https://issues.apache.org/jira/browse/HIVE-15565
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15565.1.patch
>
>
> {{GroupByOperator::isTez}} would be true in LLAP mode. The current memory 
> computations can go wrong with the {{isTez}} checks in {{GroupByOperator}}. 
> E.g., in an LLAP instance with Xmx128G and 12 executors, it would start 
> flushing the hash table for every record once it reaches around 42GB 
> (hive.tez.container.size=7100, hive.map.aggr.hash.percentmemory=0.5).
> {noformat}
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15565) LLAP: GroupByOperator flushes hash table too frequently

2017-01-18 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829228#comment-15829228
 ] 

Rajesh Balamohan commented on HIVE-15565:
-

Had an offline discussion with [~prasanth_j] on this. We do not need to flush 
the hash table when {{numEntriesHashTable=0}} in LLAP.  We can revert the 
existing patch and check for this condition in LLAP.  That would reduce the 
number of flushes by a large margin.

> LLAP: GroupByOperator flushes hash table too frequently
> ---
>
> Key: HIVE-15565
> URL: https://issues.apache.org/jira/browse/HIVE-15565
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15565.1.patch
>
>
> {{GroupByOperator::isTez}} would be true in LLAP mode. The current memory 
> computations can go wrong with the {{isTez}} checks in {{GroupByOperator}}. 
> E.g., in an LLAP instance with Xmx128G and 12 executors, it would start 
> flushing the hash table for every record once it reaches around 42GB 
> (hive.tez.container.size=7100, hive.map.aggr.hash.percentmemory=0.5).
> {noformat}
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Dapeng Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829218#comment-15829218
 ] 

Dapeng Sun commented on HIVE-15580:
---

Thanks [~xuefuz] for the suggestion. Currently the heap size is 290G for each 
executor; I will try to do more tuning on it.

> Replace Spark's groupByKey operator with something with bounded memory
> --
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15544) Support scalar subqueries

2017-01-18 Thread Vineet Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-15544:
---
Status: Patch Available  (was: Open)

The latest patch contains rewrites for correlated scalar subqueries with 
aggregates, plus more tests.
Note that this patch also disables the following:

* IN/NOT IN correlated subqueries containing aggregates (HIVE checks for such 
queries and throws an exception)
* SCALAR correlated subqueries containing aggregates with non-equi join 
predicates on correlated columns (HIVE throws an exception for such queries)

The above restrictions will need to be documented.
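As an illustration of the kind of rewrite involved (a sketch only, not the 
literal plan Hive produces -- full scalar-subquery semantics in general also 
need an outer join and a cardinality check for empty groups), the correlated 
i_current_price predicate from the description can be decorrelated into a 
join against a grouped derived table:
{code}
-- rewrite sketch for:
--   i.i_current_price > 1.2 * (select avg(j.i_current_price)
--                              from item j
--                              where j.i_category = i.i_category)
select i.*
from item i
join (select i_category, avg(i_current_price) as avg_price
      from item
      group by i_category) j
  on j.i_category = i.i_category
where i.i_current_price > 1.2 * j.avg_price;
{code}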

> Support scalar subqueries
> -
>
> Key: HIVE-15544
> URL: https://issues.apache.org/jira/browse/HIVE-15544
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>  Labels: sub-query
> Attachments: HIVE-15544.1.patch, HIVE-15544.2.patch, 
> HIVE-15544.3.patch, HIVE-15544.4.patch
>
>
> Currently HIVE only supports IN/EXISTS/NOT IN/NOT EXISTS subqueries. HIVE 
> doesn't allow sub-queries such as:
> {code}
> explain select  a.ca_state state, count(*) cnt
>  from customer_address a
>  ,customer c
>  ,store_sales s
>  ,date_dim d
>  ,item i
>  where   a.ca_address_sk = c.c_current_addr_sk
>   and c.c_customer_sk = s.ss_customer_sk
>   and s.ss_sold_date_sk = d.d_date_sk
>   and s.ss_item_sk = i.i_item_sk
>   and d.d_month_seq = 
>(select distinct (d_month_seq)
> from date_dim
>where d_year = 2000
>   and d_moy = 2 )
>   and i.i_current_price > 1.2 * 
>  (select avg(j.i_current_price) 
>from item j 
>where j.i_category = i.i_category)
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>  limit 100;
> {code}
> We initially plan to support such scalar subqueries in filter i.e. WHERE and 
> HAVING



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15544) Support scalar subqueries

2017-01-18 Thread Vineet Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-15544:
---
Attachment: HIVE-15544.4.patch

> Support scalar subqueries
> -
>
> Key: HIVE-15544
> URL: https://issues.apache.org/jira/browse/HIVE-15544
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>  Labels: sub-query
> Attachments: HIVE-15544.1.patch, HIVE-15544.2.patch, 
> HIVE-15544.3.patch, HIVE-15544.4.patch
>
>
> Currently HIVE only supports IN/EXISTS/NOT IN/NOT EXISTS subqueries. HIVE 
> doesn't allow sub-queries such as:
> {code}
> explain select  a.ca_state state, count(*) cnt
>  from customer_address a
>  ,customer c
>  ,store_sales s
>  ,date_dim d
>  ,item i
>  where   a.ca_address_sk = c.c_current_addr_sk
>   and c.c_customer_sk = s.ss_customer_sk
>   and s.ss_sold_date_sk = d.d_date_sk
>   and s.ss_item_sk = i.i_item_sk
>   and d.d_month_seq = 
>(select distinct (d_month_seq)
> from date_dim
>where d_year = 2000
>   and d_moy = 2 )
>   and i.i_current_price > 1.2 * 
>  (select avg(j.i_current_price) 
>from item j 
>where j.i_category = i.i_category)
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>  limit 100;
> {code}
> We initially plan to support such scalar subqueries in filter i.e. WHERE and 
> HAVING



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15544) Support scalar subqueries

2017-01-18 Thread Vineet Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-15544:
---
Status: Open  (was: Patch Available)

> Support scalar subqueries
> -
>
> Key: HIVE-15544
> URL: https://issues.apache.org/jira/browse/HIVE-15544
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>  Labels: sub-query
> Attachments: HIVE-15544.1.patch, HIVE-15544.2.patch, 
> HIVE-15544.3.patch
>
>
> Currently HIVE only supports IN/EXISTS/NOT IN/NOT EXISTS subqueries. HIVE 
> doesn't allow sub-queries such as:
> {code}
> explain select  a.ca_state state, count(*) cnt
>  from customer_address a
>  ,customer c
>  ,store_sales s
>  ,date_dim d
>  ,item i
>  where   a.ca_address_sk = c.c_current_addr_sk
>   and c.c_customer_sk = s.ss_customer_sk
>   and s.ss_sold_date_sk = d.d_date_sk
>   and s.ss_item_sk = i.i_item_sk
>   and d.d_month_seq = 
>(select distinct (d_month_seq)
> from date_dim
>where d_year = 2000
>   and d_moy = 2 )
>   and i.i_current_price > 1.2 * 
>  (select avg(j.i_current_price) 
>from item j 
>where j.i_category = i.i_category)
>  group by a.ca_state
>  having count(*) >= 10
>  order by cnt 
>  limit 100;
> {code}
> We initially plan to support such scalar subqueries in filter i.e. WHERE and 
> HAVING



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829203#comment-15829203
 ] 

Xuefu Zhang commented on HIVE-15580:


[~dapengsun], for the OOM error you get, you can probably increase the executor 
heap size to overcome it.  PartitionedPairBuffer uses a buffer up to 
Int.MaxValue / 2, so the memory usage is big but bounded.
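For a rough sense of that bound (back-of-envelope, based on reading the Spark 
1.6 source, so treat the numbers as approximate): PartitionedPairBuffer caps 
its capacity at Int.MaxValue / 2 ~= 1.07 billion pairs, backed by an AnyRef 
array of 2 * capacity ~= 2.1 billion references, i.e. roughly 17 GB of 
pointers alone (8 bytes each, since compressed oops are off on a 290G heap), 
plus the key/value objects themselves. The final doubling in growArray can 
also request an array length close to Int.MaxValue, which is what the 
"Requested array size exceeds VM limit" message refers to.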

> Replace Spark's groupByKey operator with something with bounded memory
> --
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14827) Micro benchmark for Parquet vectorized reader

2017-01-18 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829202#comment-15829202
 ] 

Ferdinand Xu commented on HIVE-14827:
-

Thank you for the patch. Left some comments on the PR.

> Micro benchmark for Parquet vectorized reader
> -
>
> Key: HIVE-14827
> URL: https://issues.apache.org/jira/browse/HIVE-14827
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ferdinand Xu
>Assignee: Colin Ma
> Attachments: HIVE-14827.001.patch
>
>
> We need a microbenchmark to evaluate the throughput and execution time for 
> Parquet vectorized reader.
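> A minimal JMH-style harness of the kind such a benchmark could use is 
> sketched below (the column data is a stand-in; the real benchmark would 
> drive Hive's vectorized Parquet reader over generated files):
> {code}
> import java.util.concurrent.TimeUnit;
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.BenchmarkMode;
> import org.openjdk.jmh.annotations.Mode;
> import org.openjdk.jmh.annotations.OutputTimeUnit;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.Setup;
> import org.openjdk.jmh.annotations.State;
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
> 
> @State(Scope.Benchmark)
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.MILLISECONDS)
> public class ParquetReadBench {
>   long[] column;                    // stand-in for a decoded column vector
> 
>   @Setup
>   public void setup() {
>     column = new long[1024 * 1024];
>     for (int i = 0; i < column.length; i++) {
>       column[i] = i;
>     }
>   }
> 
>   @Benchmark
>   public long scanColumn() {        // JMH reports average time per scan
>     long sum = 0;
>     for (long v : column) {
>       sum += v;
>     }
>     return sum;
>   }
> 
>   public static void main(String[] args) throws Exception {
>     new Runner(new OptionsBuilder()
>         .include(ParquetReadBench.class.getSimpleName()).build()).run();
>   }
> }
> {code}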



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15651) LLAP: llap status tool enhancements

2017-01-18 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829184#comment-15829184
 ] 

Prasanth Jayachandran commented on HIVE-15651:
--

[~sseth] can you please take a look?

> LLAP: llap status tool enhancements
> ---
>
> Key: HIVE-15651
> URL: https://issues.apache.org/jira/browse/HIVE-15651
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15651.1.patch
>
>
> Per [~sseth], the following enhancements can be made to the llap status tool 
> (a sketch of both follows the list):
> 1) If the state changes from an ACTIVE state to STOPPED - terminate the 
> script immediately (fail fast)
> 2) Add a threshold for what is acceptable in terms of the running state - 
> RUNNING_PARTIAL may be OK if, for example, 80% of the nodes are up.
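> A sketch of both behaviors (class and method names are hypothetical, not 
> the actual llap status tool code):
> {code}
> enum LlapState { LAUNCHING, RUNNING_PARTIAL, RUNNING_ALL, STOPPED }
> 
> class LlapStatusWatchSketch {
>   // 2) threshold: RUNNING_PARTIAL counts as up if enough nodes are live.
>   static boolean acceptable(LlapState s, int liveNodes, int desiredNodes,
>                             float runningThreshold) {
>     return s == LlapState.RUNNING_ALL
>         || (s == LlapState.RUNNING_PARTIAL
>             && liveNodes >= desiredNodes * runningThreshold);
>   }
> 
>   static void check(LlapState previous, LlapState current,
>                     int liveNodes, int desiredNodes) {
>     // 1) fail fast if we ever drop from an ACTIVE state back to STOPPED.
>     if (previous != LlapState.STOPPED && current == LlapState.STOPPED) {
>       throw new IllegalStateException("LLAP went ACTIVE -> STOPPED; aborting");
>     }
>     if (acceptable(current, liveNodes, desiredNodes, 0.8f)) {
>       System.out.println("LLAP up: " + liveNodes + "/" + desiredNodes);
>     }
>   }
> }
> {code}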



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15651) LLAP: llap status tool enhancements

2017-01-18 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-15651:
-
Status: Patch Available  (was: Open)

> LLAP: llap status tool enhancements
> ---
>
> Key: HIVE-15651
> URL: https://issues.apache.org/jira/browse/HIVE-15651
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15651.1.patch
>
>
> Per [~sseth], the following enhancements can be made to the llap status tool:
> 1) If the state changes from an ACTIVE state to STOPPED - terminate the 
> script immediately (fail fast)
> 2) Add a threshold for what is acceptable in terms of the running state - 
> RUNNING_PARTIAL may be OK if, for example, 80% of the nodes are up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15651) LLAP: llap status tool enhancements

2017-01-18 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-15651:
-
Attachment: HIVE-15651.1.patch

> LLAP: llap status tool enhancements
> ---
>
> Key: HIVE-15651
> URL: https://issues.apache.org/jira/browse/HIVE-15651
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
> Attachments: HIVE-15651.1.patch
>
>
> Per [~sseth], the following enhancements can be made to the llap status tool:
> 1) If the state changes from an ACTIVE state to STOPPED - terminate the 
> script immediately (fail fast)
> 2) Add a threshold for what is acceptable in terms of the running state - 
> RUNNING_PARTIAL may be OK if, for example, 80% of the nodes are up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10562) Add version column to NOTIFICATION_LOG table and DbNotificationListener

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829161#comment-15829161
 ] 

Hive QA commented on HIVE-10562:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848165/HIVE-10562.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 364 failed/errored test(s), 10956 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=233)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_joins] 
(batchId=217)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_predicate_pushdown]
 (batchId=217)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_single_sourced_multi_insert]
 (batchId=217)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[ctas] 
(batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=229)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[write_final_output_blobstore]
 (batchId=229)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_subquery] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_table_stats] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[analyze_tbl_part] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_join_pkfk]
 (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[avrocountemptytbl] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udf_percentile2] 
(batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udf_percentile] 
(batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[create_or_replace_view] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input19] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[join46] (batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[join_emit_interval] 
(batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin46] (batchId=53)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[multi_insert_gby4] 
(batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[sample5] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[serde_opencsv] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[smb_mapjoin_46] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_exists] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notexists] 
(batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notin_having] 
(batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_unqualcolumnrefs]
 (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_array_contains] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_conv] (batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_hex] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_java_method] 
(batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_map_keys] 
(batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_map_values] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_negative] (batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_not] (batchId=51)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_percentile] 
(batchId=59)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_positive] 
(batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sort_array] 
(batchId=59)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[

[jira] [Comment Edited] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Dapeng Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829154#comment-15829154
 ] 

Dapeng Sun edited comment on HIVE-15580 at 1/19/17 2:05 AM:


Thanks [~xuefuz], [~csun] and [~Ferd]. We are running a 100TB data-skew test 
case on 50 nodes ([TPC-xBB 
q21|https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench/tree/master/engines/hive/queries/q21]).
 Before the patch, Spark tasks failed with the following error:
{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
 at java.util.Arrays.copyOf(Arrays.java:3181)
 at java.util.ArrayList.grow(ArrayList.java:261)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
 at java.util.ArrayList.add(ArrayList.java:458)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:100)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:75)
 at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
 at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
 at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
 at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:89)
 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{noformat}
After applying the patches (HIVE-15580 and HIVE-15527 respectively), the 
ArrayList growth is fixed in both cases, but PartitionedPairBuffer on the 
Spark side also causes OOM; here is the failed task's exception:
{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at 
org.apache.spark.util.collection.PartitionedPairBuffer.growArray(PartitionedPairBuffer.scala:67)
at 
org.apache.spark.util.collection.PartitionedPairBuffer.insert(PartitionedPairBuffer.scala:48)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:203)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:111)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


was (Author: dapengsun):
Thank [~xuefuz], [~csun] and [~Ferd], we are running a 100TB test case about 
data skew on 50 nodes([TPC-xBB 
q21|https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench/tree/master/engines/hive/queries/q21]),
 before the patch, spark tasks are failed with following error:
{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
 at java.util.Arrays.copyOf(Arrays.java:3181)
 at java.util.ArrayList.grow(ArrayList.java:261)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
 at java.util.ArrayList.add(ArrayList.java:458)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:100)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuf

[jira] [Comment Edited] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Dapeng Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829154#comment-15829154
 ] 

Dapeng Sun edited comment on HIVE-15580 at 1/19/17 2:01 AM:


Thanks [~xuefuz], [~csun] and [~Ferd]. We are running a 100TB data-skew test 
case on 50 nodes ([TPC-xBB 
q21|https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench/tree/master/engines/hive/queries/q21]).
 Before the patch, Spark tasks failed with the following error:
{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
 at java.util.Arrays.copyOf(Arrays.java:3181)
 at java.util.ArrayList.grow(ArrayList.java:261)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
 at java.util.ArrayList.add(ArrayList.java:458)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:100)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:75)
 at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
 at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
 at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
 at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:89)
 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{noformat}
After applying the patch, the ArrayList growth is fixed, but 
PartitionedPairBuffer also causes OOM; here is the failed task's exception:
{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at 
org.apache.spark.util.collection.PartitionedPairBuffer.growArray(PartitionedPairBuffer.scala:67)
at 
org.apache.spark.util.collection.PartitionedPairBuffer.insert(PartitionedPairBuffer.scala:48)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:203)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:111)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}


was (Author: dapengsun):
Thank [~xuefuz], [~csun] and [~Ferd], we are running a 100TB test case about 
data skew on 50 nodes(TPC-xBB q21), before the patch, spark tasks are failed 
with following error:
{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
 at java.util.Arrays.copyOf(Arrays.java:3181)
 at java.util.ArrayList.grow(ArrayList.java:261)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
 at java.util.ArrayList.add(ArrayList.java:458)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:100)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:75)
 at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
 at 
scala.collect

[jira] [Commented] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Dapeng Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829154#comment-15829154
 ] 

Dapeng Sun commented on HIVE-15580:
---

Thanks [~xuefuz], [~csun] and [~Ferd]. We are running a 100TB data-skew test 
case on 50 nodes (TPC-xBB q21). Before the patch, Spark tasks failed with the 
following error:
{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
 at java.util.Arrays.copyOf(Arrays.java:3181)
 at java.util.ArrayList.grow(ArrayList.java:261)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
 at java.util.ArrayList.add(ArrayList.java:458)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:100)
 at 
org.apache.hadoop.hive.ql.exec.spark.SortByShuffler$ShuffleFunction$1.next(SortByShuffler.java:75)
 at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
 at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
 at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
 at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
 at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:89)
 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{noformat}
After applying the patch, the ArrayList growth is fixed, but 
PartitionedPairBuffer also causes OOM; here is the failed task's exception:
{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at 
org.apache.spark.util.collection.PartitionedPairBuffer.growArray(PartitionedPairBuffer.scala:67)
at 
org.apache.spark.util.collection.PartitionedPairBuffer.insert(PartitionedPairBuffer.scala:48)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:203)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:111)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

> Replace Spark's groupByKey operator with something with bounded memory
> --
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15519) BitSet not computed properly for ColumnBuffer subset

2017-01-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829151#comment-15829151
 ] 

Rui Li commented on HIVE-15519:
---

Hi [~thejas], could you have a look at the latest patches? FYI v6 is for master 
and v5 is for branch-1.

> BitSet not computed properly for ColumnBuffer subset
> 
>
> Key: HIVE-15519
> URL: https://issues.apache.org/jira/browse/HIVE-15519
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, JDBC
>Reporter: Bharat Viswanadham
>Assignee: Rui Li
>Priority: Critical
> Attachments: data_type_test(1).txt, HIVE-15519.1.patch, 
> HIVE-15519.2.patch, HIVE-15519.3.patch, HIVE-15519.4.patch, 
> HIVE-15519.5-branch-1.patch, HIVE-15519.6.patch
>
>
> For Hive decimal type columns, the scale (DECIMAL_DIGITS) is returned as 
> zero, even though the column has a scale set.
> Example: for col67 decimal(18,2), the scale is returned as zero.
> Tried with below program.
> {code}
> import java.sql.Connection;
> import java.sql.DatabaseMetaData;
> import java.sql.DriverManager;
> import java.sql.ResultSet;
> 
> try {
>   System.out.println("Opening connection");
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   Connection con =
>       DriverManager.getConnection("jdbc:hive2://x.x.x.x:1/default");
>   DatabaseMetaData dbMeta = con.getMetaData();
>   ResultSet rs = dbMeta.getColumns(null, "DEFAULT", "data_type_test", null);
>   while (rs.next()) {
>     String col = rs.getString("COLUMN_NAME");
>     if (col.equalsIgnoreCase("col48") || col.equalsIgnoreCase("col67")
>         || col.equalsIgnoreCase("col68") || col.equalsIgnoreCase("col122")) {
>       // COLUMN_SIZE is the precision, DECIMAL_DIGITS is the scale
>       System.out.println(col + "\t" + rs.getString("COLUMN_SIZE")
>           + "\t" + rs.getInt("DECIMAL_DIGITS"));
>     }
>   }
>   rs.close();
>   con.close();
> } catch (Exception e) {
>   e.printStackTrace();
> }
> {code}
> The default fetch size is 50. For decimal columns at positions under 50, the 
> precision and scale are returned properly; when the column position is 
> greater than 50, the scale is returned as zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-10562) Add version column to NOTIFICATION_LOG table and DbNotificationListener

2017-01-18 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829142#comment-15829142
 ] 

Thejas M Nair commented on HIVE-10562:
--

+1 pending tests


> Add version column to NOTIFICATION_LOG table and DbNotificationListener
> ---
>
> Key: HIVE-10562
> URL: https://issues.apache.org/jira/browse/HIVE-10562
> Project: Hive
>  Issue Type: Sub-task
>  Components: Import/Export
>Affects Versions: 1.2.0
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Attachments: HIVE-10562.2.patch, HIVE-10562.patch
>
>
> Currently, we have a JSON encoded message being stored in the 
> NOTIFICATION_LOG table.
> If we want to be future proof, we need to allow for versioning of this 
> message, since we might change what gets stored in the message. A prime 
> example of what we'd want to change is as in HIVE-10393.
> MessageFactory already has stubs to allow for versioning of messages, and we 
> could expand on this further in the future. NotificationListener currently 
> encodes the message version into the header for the JMS message it sends, 
> which seems to be the right place for a message version (instead of being 
> contained in the message, for example).
> So, we should have a similar ability for DbEventListener as well, and the 
> place where this makes the most sense is to add a version column to the 
> NOTIFICATION_LOG table.
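> The schema change itself would be a one-line upgrade script per backing 
> database, along the lines of (column name and type are illustrative only, 
> not the actual ones in the patch):
> {code}
> ALTER TABLE NOTIFICATION_LOG ADD COLUMN MESSAGE_FORMAT VARCHAR(16);
> {code}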



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829099#comment-15829099
 ] 

Ferdinand Xu edited comment on HIVE-15580 at 1/19/17 1:52 AM:
--

[~xuefuz], both patches, HIVE-15527 and HIVE-15580, solve the groupByKey OOM 
issue for our query. But we haven't done a full test of either of them.


was (Author: ferd):
[~xuefuz], patch in HIVE-15527 solved the issue and we're trying the patch in 
HIVE-15580. Will keep you post.

> Replace Spark's groupByKey operator with something with bounded memory
> --
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15655) Optimizer: Allow config option to disable n-way JOIN merging

2017-01-18 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-15655:
---
Attachment: HIVE-15655.2.patch

> Optimizer: Allow config option to disable n-way JOIN merging 
> -
>
> Key: HIVE-15655
> URL: https://issues.apache.org/jira/browse/HIVE-15655
> Project: Hive
>  Issue Type: Bug
>  Components: Physical Optimizer
>Affects Versions: 2.2.0
>Reporter: Gopal V
>Assignee: Gopal V
> Attachments: HIVE-15655.1.patch, HIVE-15655.2.patch
>
>
> N-way Joins in Tez produce bad runtime plans whenever they are left-outer 
> joins with map-joins.
> This should be guarded by a safety setting.
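
A minimal sketch of such a safety setting from the Java side; the property name {{hive.merge.nway.joins}} is a guess based on this issue's title, so verify it against the committed patch:

{code}
import org.apache.hadoop.hive.conf.HiveConf;

public class DisableNwayMerge {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Assumed property name: turn off merging of joins into n-way joins.
    conf.setBoolean("hive.merge.nway.joins", false);
    System.out.println(conf.get("hive.merge.nway.joins"));
  }
}
{code}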



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14827) Micro benchmark for Parquet vectorized reader

2017-01-18 Thread Colin Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829114#comment-15829114
 ] 

Colin Ma commented on HIVE-14827:
-

Hi [~Ferd], the patch and the review board link are updated. If there are any 
problems, please let me know. Thanks.

> Micro benchmark for Parquet vectorized reader
> -
>
> Key: HIVE-14827
> URL: https://issues.apache.org/jira/browse/HIVE-14827
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ferdinand Xu
>Assignee: Colin Ma
> Attachments: HIVE-14827.001.patch
>
>
> We need a microbenchmark to evaluate the throughput and execution time for 
> Parquet vectorized reader.
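
A skeleton of what such a microbenchmark might look like with JMH; the reader wiring is a placeholder, since the attached patch defines the real harness:

{code}
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode({Mode.Throughput, Mode.AverageTime})
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class ParquetVectorizedReadBench {

  @Setup
  public void setup() {
    // Placeholder: generate or locate a Parquet test file here.
  }

  @Benchmark
  public long vectorizedRead() {
    // Placeholder: open the vectorized reader against the test file and
    // pull batches until exhausted; return the row count so the JIT
    // cannot eliminate the work.
    return readAllRows();
  }

  private long readAllRows() {
    // Hypothetical helper standing in for the real reader loop.
    return 0L;
  }
}
{code}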



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14827) Micro benchmark for Parquet vectorized reader

2017-01-18 Thread Colin Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Ma updated HIVE-14827:

Status: Patch Available  (was: Open)

> Micro benchmark for Parquet vectorized reader
> -
>
> Key: HIVE-14827
> URL: https://issues.apache.org/jira/browse/HIVE-14827
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ferdinand Xu
>Assignee: Colin Ma
> Attachments: HIVE-14827.001.patch
>
>
> We need a microbenchmark to evaluate the throughput and execution time for 
> Parquet vectorized reader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-14827) Micro benchmark for Parquet vectorized reader

2017-01-18 Thread Colin Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Ma updated HIVE-14827:

Attachment: HIVE-14827.001.patch

To update the patch, I changed the owner to myself. If there are any problems, 
please let me know.

> Micro benchmark for Parquet vectorized reader
> -
>
> Key: HIVE-14827
> URL: https://issues.apache.org/jira/browse/HIVE-14827
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ferdinand Xu
>Assignee: Colin Ma
> Attachments: HIVE-14827.001.patch
>
>
> We need a microbenchmark to evaluate the throughput and execution time for 
> Parquet vectorized reader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HIVE-14827) Micro benchmark for Parquet vectorized reader

2017-01-18 Thread Colin Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Ma reassigned HIVE-14827:
---

Assignee: Colin Ma  (was: Sahil Takiar)

> Micro benchmark for Parquet vectorized reader
> -
>
> Key: HIVE-14827
> URL: https://issues.apache.org/jira/browse/HIVE-14827
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ferdinand Xu
>Assignee: Colin Ma
>
> We need a microbenchmark to evaluate the throughput and execution time for 
> Parquet vectorized reader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15614) Druid splitSelectQuery closes lifecycle object too early

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-15614:
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Pushed to master, thanks for reviewing [~ashutoshc]!

> Druid splitSelectQuery closes lifecycle object too early
> 
>
> Key: HIVE-15614
> URL: https://issues.apache.org/jira/browse/HIVE-15614
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Fix For: 2.2.0
>
> Attachments: HIVE-15614.patch
>
>
> L208 in DruidQueryBasedInputFormat.java.
> Fix includes better handling of lifecycle objects in general.
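
A sketch of the general pattern for the fix, assuming Druid's {{Lifecycle}} utility; the helper names are illustrative:

{code}
// Illustrative pattern: keep the lifecycle (and thus the HTTP client)
// alive until the response has been fully consumed, then stop it in a
// finally block instead of stopping it mid-method.
private List<LocatedSegmentDescriptor> fetchSegments(String query) throws Exception {
  Lifecycle lifecycle = new Lifecycle(); // Druid utility; package varies by version
  lifecycle.start();
  try {
    InputStream response = submitQuery(lifecycle, query); // hypothetical helper
    return parseSegments(response);                       // consume fully before stopping
  } finally {
    lifecycle.stop(); // safe now: nothing still depends on the client
  }
}
{code}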



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15613) Include druid-handler sources in src packaging

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-15613:
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Pushed to master, thanks for reviewing [~ashutoshc]!

> Include druid-handler sources in src packaging
> --
>
> Key: HIVE-15613
> URL: https://issues.apache.org/jira/browse/HIVE-15613
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15613.patch
>
>
> We forgot to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15612) Include Calcite dependency in Druid storage handler jar

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829105#comment-15829105
 ] 

Jesus Camacho Rodriguez commented on HIVE-15612:


Pushed to master, thanks for reviewing [~bslim], [~ashutoshc]!

> Include Calcite dependency in Druid storage handler jar
> ---
>
> Key: HIVE-15612
> URL: https://issues.apache.org/jira/browse/HIVE-15612
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15612.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15612) Include Calcite dependency in Druid storage handler jar

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-15612:
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

> Include Calcite dependency in Druid storage handler jar
> ---
>
> Key: HIVE-15612
> URL: https://issues.apache.org/jira/browse/HIVE-15612
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15612.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15580) Replace Spark's groupByKey operator with something with bounded memory

2017-01-18 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829099#comment-15829099
 ] 

Ferdinand Xu commented on HIVE-15580:
-

[~xuefuz], patch in HIVE-15527 solved the issue and we're trying the patch in 
HIVE-15580. Will keep you post.

> Replace Spark's groupByKey operator with something with bounded memory
> --
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, HIVE-15580.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15613) Include druid-handler sources in src packaging

2017-01-18 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829097#comment-15829097
 ] 

Ashutosh Chauhan commented on HIVE-15613:
-

+1

> Include druid-handler sources in src packaging
> --
>
> Key: HIVE-15613
> URL: https://issues.apache.org/jira/browse/HIVE-15613
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Attachments: HIVE-15613.patch
>
>
> We forgot to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15612) Include Calcite dependency in Druid storage handler jar

2017-01-18 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829093#comment-15829093
 ] 

Ashutosh Chauhan commented on HIVE-15612:
-

+1

> Include Calcite dependency in Druid storage handler jar
> ---
>
> Key: HIVE-15612
> URL: https://issues.apache.org/jira/browse/HIVE-15612
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Attachments: HIVE-15612.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15614) Druid splitSelectQuery closes lifecycle object too early

2017-01-18 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829091#comment-15829091
 ] 

Ashutosh Chauhan commented on HIVE-15614:
-

+1 LGTM

> Druid splitSelectQuery closes lifecycle object too early
> 
>
> Key: HIVE-15614
> URL: https://issues.apache.org/jira/browse/HIVE-15614
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-15614.patch
>
>
> L208 in DruidQueryBasedInputFormat.java.
> Fix includes better handling of lifecycle objects in general.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15578) Simplify IdentifiersParser

2017-01-18 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-15578:
---
Status: Open  (was: Patch Available)

> Simplify IdentifiersParser
> --
>
> Key: HIVE-15578
> URL: https://issues.apache.org/jira/browse/HIVE-15578
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-15578.01.patch, HIVE-15578.02.patch
>
>
> before: 1.72M LOC in IdentifiersParser, after: 1.41M



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15578) Simplify IdentifiersParser

2017-01-18 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-15578:
---
Status: Patch Available  (was: Open)

> Simplify IdentifiersParser
> --
>
> Key: HIVE-15578
> URL: https://issues.apache.org/jira/browse/HIVE-15578
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-15578.01.patch, HIVE-15578.02.patch
>
>
> before: 1.72M LOC in IdentifiersParser, after: 1.41M



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15646) Column level lineage is not available for table Views

2017-01-18 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-15646:
---
Status: Patch Available  (was: Open)

> Column level lineage is not available for table Views
> -
>
> Key: HIVE-15646
> URL: https://issues.apache.org/jira/browse/HIVE-15646
> Project: Hive
>  Issue Type: Bug
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-15646.01.patch, HIVE-15646.02.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15646) Column level lineage is not available for table Views

2017-01-18 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-15646:
---
Attachment: HIVE-15646.02.patch

> Column level lineage is not available for table Views
> -
>
> Key: HIVE-15646
> URL: https://issues.apache.org/jira/browse/HIVE-15646
> Project: Hive
>  Issue Type: Bug
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-15646.01.patch, HIVE-15646.02.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15646) Column level lineage is not available for table Views

2017-01-18 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-15646:
---
Status: Open  (was: Patch Available)

> Column level lineage is not available for table Views
> -
>
> Key: HIVE-15646
> URL: https://issues.apache.org/jira/browse/HIVE-15646
> Project: Hive
>  Issue Type: Bug
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-15646.01.patch, HIVE-15646.02.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15613) Include druid-handler sources in src packaging

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-15613:
---
Component/s: Druid integration

> Include druid-handler sources in src packaging
> --
>
> Key: HIVE-15613
> URL: https://issues.apache.org/jira/browse/HIVE-15613
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Attachments: HIVE-15613.patch
>
>
> We forgot to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15541) Hive OOM when ATSHook enabled and ATS goes down

2017-01-18 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829077#comment-15829077
 ] 

Jason Dere commented on HIVE-15541:
---

[~sershe], can you take a look one more time?

> Hive OOM when ATSHook enabled and ATS goes down
> ---
>
> Key: HIVE-15541
> URL: https://issues.apache.org/jira/browse/HIVE-15541
> Project: Hive
>  Issue Type: Bug
>  Components: Hooks
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-15541.1.patch, HIVE-15541.2.patch, 
> HIVE-15541.3.patch, HIVE-15541.4.patch
>
>
> The ATS API used by the Hive ATSHook is a blocking call, if ATS goes down 
> this can block the ATSHook executor, while the hook continues to submit work 
> to the executor with each query.
> Over time the buildup of queued items can cause OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15613) Include druid-handler sources in src packaging

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829075#comment-15829075
 ] 

Jesus Camacho Rodriguez commented on HIVE-15613:


[~ashutoshc], can you review it? Thanks

> Include druid-handler sources in src packaging
> --
>
> Key: HIVE-15613
> URL: https://issues.apache.org/jira/browse/HIVE-15613
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Attachments: HIVE-15613.patch
>
>
> We forgot to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15633) Hive/Druid integration: Exception when time filter is not in datasource range

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-15633:
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Pushed to master, thanks for reviewing [~ashutoshc]!

> Hive/Druid integration: Exception when time filter is not in datasource range
> -
>
> Key: HIVE-15633
> URL: https://issues.apache.org/jira/browse/HIVE-15633
> Project: Hive
>  Issue Type: Bug
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Fix For: 2.2.0
>
> Attachments: HIVE-15633.patch
>
>
> When _metadataList.isEmpty()_ (L222 in DruidQueryBasedInputFormat) returns 
> true, we throw an Exception. However, this is also true if the query filters 
> on a range that is not within the datasource's timestamp range. Thus, we 
> should only throw the Exception if _metadataList_ is null (a guard along 
> these lines is sketched after the stack trace below).
> The issue can be reproduced with the following query if the timestamp values 
> are all greater than or equal to '1999-11-01 00:00:00':
> {code:sql}
> SELECT COUNT(`__time`)
> FROM store_sales_sold_time_subset
> WHERE `__time` < '1999-11-01 00:00:00';
> {code}
> {noformat}
> Status: Failed
> Vertex failed, vertexName=Map 1, vertexId=vertex_1484282558103_0067_2_00, 
> diagnostics=[Vertex vertex_1484282558103_0067_2_00 [Map 1] killed/failed due 
> to:ROOT_INPUT_INIT_FAILURE, Vertex Input: store_sales_sold_time_subset 
> initializer failed, vertex=vertex_1484282558103_0067_2_00 [Map 1], 
> java.io.IOException: Connected to Druid but could not retrieve datasource 
> information
>   at 
> org.apache.hadoop.hive.druid.io.DruidQueryBasedInputFormat.splitSelectQuery(DruidQueryBasedInputFormat.java:224)
>   at 
> org.apache.hadoop.hive.druid.io.DruidQueryBasedInputFormat.getInputSplits(DruidQueryBasedInputFormat.java:140)
>   at 
> org.apache.hadoop.hive.druid.io.DruidQueryBasedInputFormat.getSplits(DruidQueryBasedInputFormat.java:92)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:367)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:485)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
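
The null-vs-empty guard described above could look like the following; the method context is abbreviated:

{code}
// Distinguish "Druid is unreachable or returned nothing useful" (null)
// from "no segments match the time filter" (empty list).
if (metadataList == null) {
  throw new IOException(
      "Connected to Druid but could not retrieve datasource information");
}
if (metadataList.isEmpty()) {
  return new HiveDruidSplit[0]; // empty result: no splits, not an error
}
{code}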



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15582) Druid CTAS should support BYTE/SHORT/INT types

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-15582:
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Pushed to master, thanks for reviewing [~ashutoshc]!

> Druid CTAS should support BYTE/SHORT/INT types
> --
>
> Key: HIVE-15582
> URL: https://issues.apache.org/jira/browse/HIVE-15582
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Fix For: 2.2.0
>
> Attachments: HIVE-15582.02.patch, HIVE-15582.patch
>
>
> Currently these types are not recognized and we throw an exception when we 
> try to create a table with them.
> {noformat}
> Caused by: org.apache.hadoop.hive.serde2.SerDeException: Unknown type: INT
>   at 
> org.apache.hadoop.hive.druid.serde.DruidSerDe.serialize(DruidSerDe.java:414)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:715)
>   ... 22 more
> {noformat}
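
A sketch of the kind of mapping the serializer needs: widen the small integral types to long before handing values to Druid. The switch below is illustrative, not the committed code:

{code}
// Illustrative: treat BYTE/SHORT/INT as long-valued when serializing.
final long res;
switch (primitiveCategory) {
  case BYTE:
    res = ((ByteObjectInspector) oi).get(o);
    break;
  case SHORT:
    res = ((ShortObjectInspector) oi).get(o);
    break;
  case INT:
    res = ((IntObjectInspector) oi).get(o);
    break;
  case LONG:
    res = ((LongObjectInspector) oi).get(o);
    break;
  default:
    throw new SerDeException("Unknown type: " + primitiveCategory);
}
{code}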



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15612) Include Calcite dependency in Druid storage handler jar

2017-01-18 Thread slim bouguerra (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829072#comment-15829072
 ] 

slim bouguerra commented on HIVE-15612:
---

+1

> Include Calcite dependency in Druid storage handler jar
> ---
>
> Key: HIVE-15612
> URL: https://issues.apache.org/jira/browse/HIVE-15612
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Attachments: HIVE-15612.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15614) Druid splitSelectQuery closes lifecycle object too early

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829068#comment-15829068
 ] 

Jesus Camacho Rodriguez commented on HIVE-15614:


[~bslim], [~ashutoshc], could you take a look? Thanks

> Druid splitSelectQuery closes lifecycle object too early
> 
>
> Key: HIVE-15614
> URL: https://issues.apache.org/jira/browse/HIVE-15614
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-15614.patch
>
>
> L208 in DruidQueryBasedInputFormat.java.
> Fix includes better handling of lifecycle objects in general.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15612) Include Calcite dependency in Druid storage handler jar

2017-01-18 Thread Jesus Camacho Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829070#comment-15829070
 ] 

Jesus Camacho Rodriguez commented on HIVE-15612:


[~bslim], [~ashutoshc], could you take a look? Thanks

> Include Calcite dependency in Druid storage handler jar
> ---
>
> Key: HIVE-15612
> URL: https://issues.apache.org/jira/browse/HIVE-15612
> Project: Hive
>  Issue Type: Improvement
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Minor
> Attachments: HIVE-15612.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15541) Hive OOM when ATSHook enabled and ATS goes down

2017-01-18 Thread Jason Dere (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dere updated HIVE-15541:
--
Attachment: HIVE-15541.4.patch

> Hive OOM when ATSHook enabled and ATS goes down
> ---
>
> Key: HIVE-15541
> URL: https://issues.apache.org/jira/browse/HIVE-15541
> Project: Hive
>  Issue Type: Bug
>  Components: Hooks
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-15541.1.patch, HIVE-15541.2.patch, 
> HIVE-15541.3.patch, HIVE-15541.4.patch
>
>
> The ATS API used by the Hive ATSHook is a blocking call, if ATS goes down 
> this can block the ATSHook executor, while the hook continues to submit work 
> to the executor with each query.
> Over time the buildup of queued items can cause OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15541) Hive OOM when ATSHook enabled and ATS goes down

2017-01-18 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829051#comment-15829051
 ] 

Jason Dere commented on HIVE-15541:
---

Per [~hagleitn], creating the ATS event on the main query thread is expensive 
and should be done on a separate thread.
Will try creating a separate thread pool for sending the event to ATS, so that 
the existing executor does not get tied up in blocking calls.
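
A sketch of that shape; the pool parameters are illustrative, the point being a bounded queue with a shedding policy so a stuck ATS cannot queue events without bound:

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative: a small, bounded pool dedicated to the blocking
// TimelineClient.putEntities() call. If ATS is down, old events are
// discarded instead of accumulating until the JVM runs out of heap.
ThreadPoolExecutor senderPool = new ThreadPoolExecutor(
    1, 1, 60L, TimeUnit.SECONDS,
    new LinkedBlockingQueue<Runnable>(64),          // bounded queue
    new ThreadPoolExecutor.DiscardOldestPolicy());  // shed load, avoid OOM
{code}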

> Hive OOM when ATSHook enabled and ATS goes down
> ---
>
> Key: HIVE-15541
> URL: https://issues.apache.org/jira/browse/HIVE-15541
> Project: Hive
>  Issue Type: Bug
>  Components: Hooks
>Reporter: Jason Dere
>Assignee: Jason Dere
> Attachments: HIVE-15541.1.patch, HIVE-15541.2.patch, 
> HIVE-15541.3.patch
>
>
> The ATS API used by the Hive ATSHook is a blocking call, if ATS goes down 
> this can block the ATSHook executor, while the hook continues to submit work 
> to the executor with each query.
> Over time the buildup of queued items can cause OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15508) LLAP: Find better way to track memory usage per executor

2017-01-18 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829045#comment-15829045
 ] 

Prasanth Jayachandran commented on HIVE-15508:
--

I don't think we are there yet :)


> LLAP: Find better way to track memory usage per executor
> 
>
> Key: HIVE-15508
> URL: https://issues.apache.org/jira/browse/HIVE-15508
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>
> Many hive operators make runtime decisions based on memory usage. For getting 
> the memory usage, Runtime.getUsed() or MemoryMXBean methods are used. This 
> works fine for MR or Tez but for LLAP the entire memory is shared among 
> multiple executors and each executor can have a different memory usage. 
> HIVE-15503 assumes the memory usage is shared across all executors. If we 
> track memory usage per executor, as well as the memory used by the on-heap 
> cache, the operators can make better decisions.
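
For context, both probes mentioned above measure the whole JVM, which is why they mislead per-executor decisions in LLAP (note that Runtime has no getUsed(); the equivalent is totalMemory() minus freeMemory()):

{code}
import java.lang.management.ManagementFactory;

// Both report heap usage for the entire JVM, so every LLAP executor
// sees the same figure regardless of what it allocated itself.
long viaRuntime = Runtime.getRuntime().totalMemory()
    - Runtime.getRuntime().freeMemory();
long viaMXBean = ManagementFactory.getMemoryMXBean()
    .getHeapMemoryUsage().getUsed();
{code}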



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15655) Optimizer: Allow config option to disable n-way JOIN merging

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829043#comment-15829043
 ] 

Hive QA commented on HIVE-15655:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848026/HIVE-15655.1.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 351 failed/errored test(s), 10942 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=102)

[limit_pushdown2.q,skewjoin_noskew.q,leftsemijoin_mr.q,bucket3.q,skewjoinopt13.q,bucketmapjoin9.q,auto_join15.q,ptf.q,join22.q,vectorized_nested_mapjoin.q,sample4.q,union18.q,multi_insert_gby.q,join33.q,join_cond_pushdown_unqual2.q]
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_joins] 
(batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_predicate_pushdown]
 (batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_single_sourced_multi_insert]
 (batchId=218)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[ctas] 
(batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[write_final_output_blobstore]
 (batchId=230)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_subquery] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_table_stats] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[analyze_tbl_part] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_join_pkfk]
 (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[avrocountemptytbl] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udf_percentile2] 
(batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udf_percentile] 
(batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[create_or_replace_view] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[index_auto_mult_tables] 
(batchId=78)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input19] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[join46] (batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[join_emit_interval] 
(batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin46] (batchId=53)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[multi_insert_gby4] 
(batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[sample5] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[serde_opencsv] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[smb_mapjoin_46] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_exists] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notexists] 
(batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notin_having] 
(batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_unqualcolumnrefs]
 (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_array_contains] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_conv] (batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_hex] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_java_method] 
(batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_map_keys] 
(batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_map_values] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDrive

[jira] [Updated] (HIVE-14949) Enforce that target:source is not 1:N

2017-01-18 Thread Eugene Koifman (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-14949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-14949:
--
Description: 
If > 1 row on the source side matches the same row on the target side, that 
means we are forced to update (or delete) the same row in the target more than 
once as part of the same SQL statement.  This should raise an error per the SQL 
Spec
ISO/IEC 9075-2:2011(E)
Section 14.2 under "General Rules" Item 6/Subitem a/Subitem 2/Subitem B

There is no sure way to do this via static analysis of the query.

Can we add something to ROJ operator to pay attention to ROW__ID of target side 
row and compare it with ROW__ID of target side of previous row output?  If they 
are the same, that means > 1 source row matched.
Or perhaps just mark each row in the hash table that it matched.  And if it 
matches again, throw an error.

  was:
If > 1 row on the source side matches the same row on the target side, that 
means we are forced to update (or delete) the same row in the target more than 
once as part of the same SQL statement.  This should raise an error per the SQL 
Spec

There is no sure way to do this via static analysis of the query.

Can we add something to ROJ operator to pay attention to ROW__ID of target side 
row and compare it with ROW__ID of target side of previous row output?  If they 
are the same, that means > 1 source row matched.
Or perhaps just mark each row in the hash table that it matched.  And if it 
matches again, throw an error.


> Enforce that target:source is not 1:N
> -
>
> Key: HIVE-14949
> URL: https://issues.apache.org/jira/browse/HIVE-14949
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>
> If > 1 row on the source side matches the same row on the target side, that 
> means we are forced to update (or delete) the same row in the target more 
> than once as part of the same SQL statement.  This should raise an error per 
> the SQL Spec
> ISO/IEC 9075-2:2011(E)
> Section 14.2 under "General Rules" Item 6/Subitem a/Subitem 2/Subitem B
> There is no sure way to do this via static analysis of the query.
> Can we add something to ROJ operator to pay attention to ROW__ID of target 
> side row and compare it with ROW__ID of target side of previous row output?  
> If they are the same, that means > 1 source row matched.
> Or perhaps just mark each row in the hash table that it matched.  And if it 
> matches again, throw an error.
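
A sketch of the second idea above, marking hash-table entries on first match; the field and accessor names are hypothetical:

{code}
// Hypothetical sketch: flag each target-side row on its first match and
// raise an error on a second match, per the SQL spec's cardinality rule.
if (matchedRow.isMatched()) {
  throw new HiveException(
      "MERGE: more than one source row matched target ROW__ID "
          + matchedRow.getRowId());
}
matchedRow.setMatched(true);
{code}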



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15615) Fix unit tests failures caused by HIVE-13696

2017-01-18 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829027#comment-15829027
 ] 

Yongzhi Chen commented on HIVE-15615:
-

[~sushanth], I agree with you; we need to revert HIVE-13696 to make the tarball 
installation stable.

> Fix unit tests failures caused by HIVE-13696
> 
>
> Key: HIVE-15615
> URL: https://issues.apache.org/jira/browse/HIVE-15615
> Project: Hive
>  Issue Type: Bug
>Reporter: Yongzhi Chen
>Assignee: Yongzhi Chen
> Attachments: HIVE-15615.1.patch
>
>
> Following unit tests failed with same stack:
> org.apache.hadoop.hive.ql.security.authorization.plugin.TestHiveAuthorizerCheckInvocation
> org.apache.hadoop.hive.ql.security.authorization.plugin.TestHiveAuthorizerShowFilters
> {noformat}
> 2017-01-11T15:02:27,774 ERROR [main] ql.Driver: FAILED: NullPointerException 
> null
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.cleanName(QueuePlacementRule.java:351)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule$User.getQueueForApp(QueuePlacementRule.java:132)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementRule.assignAppToQueue(QueuePlacementRule.java:74)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueuePlacementPolicy.assignAppToQueue(QueuePlacementPolicy.java:167)
>   at 
> org.apache.hadoop.hive.schshim.FairSchedulerShim.setJobQueueForUserInternal(FairSchedulerShim.java:96)
>   at 
> org.apache.hadoop.hive.schshim.FairSchedulerShim.validateQueueConfiguration(FairSchedulerShim.java:82)
>   at 
> org.apache.hadoop.hive.ql.session.YarnFairScheduling.validateYarnQueue(YarnFairScheduling.java:68)
>   at org.apache.hadoop.hive.ql.Driver.configureScheduling(Driver.java:671)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:543)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1313)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1453)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1233)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1223)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15629) Set DDLTask’s exception with its subtask’s exception

2017-01-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HIVE-15629:
-
Attachment: (was: HIVE-15629.000.patch)

> Set DDLTask’s exception with its subtask’s exception
> 
>
> Key: HIVE-15629
> URL: https://issues.apache.org/jira/browse/HIVE-15629
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Affects Versions: 2.2.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: HIVE-15629.000.patch
>
>
> Set DDLTask’s exception to its subtask’s exception, so that the exception 
> from a subtask in DDLTask can be propagated to TaskRunner.
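
A minimal sketch of the change inside DDLTask, assuming the usual {{setException}}/{{getException}} accessors on the Task base class; the surrounding method is abbreviated:

{code}
// Sketch: after running a subtask, copy its failure cause up to the
// DDLTask so TaskRunner can report the real exception.
int ret = subtask.executeTask();
if (ret != 0) {
  setException(subtask.getException());
}
return ret;
{code}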



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15629) Set DDLTask’s exception with its subtask’s exception

2017-01-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HIVE-15629:
-
Attachment: HIVE-15629.000.patch

> Set DDLTask’s exception with its subtask’s exception
> 
>
> Key: HIVE-15629
> URL: https://issues.apache.org/jira/browse/HIVE-15629
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Affects Versions: 2.2.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: HIVE-15629.000.patch
>
>
> Set DDLTask’s exception to its subtask’s exception, so that the exception 
> from a subtask in DDLTask can be propagated to TaskRunner.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15630) add operation handle before operation.run instead of after operation.run

2017-01-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HIVE-15630:
-
Attachment: HIVE-15630.000.patch

> add operation handle before operation.run instead of after operation.run
> 
>
> Key: HIVE-15630
> URL: https://issues.apache.org/jira/browse/HIVE-15630
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.2.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: HIVE-15630.000.patch
>
>
> Add the operation handle before operation.run instead of after it, so that 
> when the session is closed, all of the running operations started by 
> {{operation.run}} can also be closed.
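
A sketch of the reordering; the handle-set helper is an assumption about the session implementation:

{code}
// Sketch: register the handle first so closeSession() can reach a
// still-running operation; unregister again if run() fails.
OperationHandle handle = operation.getHandle();
opHandleSet.add(handle);       // assumed per-session handle registry
try {
  operation.run();
} catch (HiveSQLException e) {
  opHandleSet.remove(handle);  // roll back registration on failure
  throw e;
}
return handle;
{code}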



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15630) add operation handle before operation.run instead of after operation.run

2017-01-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HIVE-15630:
-
Attachment: (was: HIVE-15630.000.patch)

> add operation handle before operation.run instead of after operation.run
> 
>
> Key: HIVE-15630
> URL: https://issues.apache.org/jira/browse/HIVE-15630
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.2.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
>
> Add the operation handle before operation.run instead of after it, so that 
> when the session is closed, all of the running operations started by 
> {{operation.run}} can also be closed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15662) check startTime in SparkTask to make sure startTime is not less than submitTime

2017-01-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HIVE-15662:
-
Attachment: HIVE-15662.000.patch

> check startTime in SparkTask to make sure startTime is not less than 
> submitTime
> ---
>
> Key: HIVE-15662
> URL: https://issues.apache.org/jira/browse/HIVE-15662
> Project: Hive
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: HIVE-15662.000.patch
>
>
> Check startTime in SparkTask to make sure startTime is not less than 
> submitTime. We saw a corner case where the SparkTask finishes in less than 
> one second: the startTime may never be set, because RemoteSparkJobMonitor 
> sleeps for one second before checking the state, and by the time it wakes 
> the Spark job has already completed.
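
A minimal sketch of the check, following the description:

{code}
// If the monitor's one-second sleep skipped over the RUNNING state,
// startTime was never set; clamp it so startTime >= submitTime holds.
if (startTime < submitTime) {
  startTime = submitTime;
}
{code}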



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15508) LLAP: Find better way to track memory usage per executor

2017-01-18 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829022#comment-15829022
 ] 

Sergey Shelukhin commented on HIVE-15508:
-

It's Java... how would they? 

> LLAP: Find better way to track memory usage per executor
> 
>
> Key: HIVE-15508
> URL: https://issues.apache.org/jira/browse/HIVE-15508
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>
> Many hive operators make runtime decisions based on memory usage. For getting 
> the memory usage, Runtime.getUsed() or MemoryMXBean methods are used. This 
> works fine for MR or Tez but for LLAP the entire memory is shared among 
> multiple executors and each executors can have different memory usage. 
> HIVE-15503 assumes the memory usage is shared across all executors. If we 
> track memory usage per executor, also memory usage by on-heap cache better 
> decisions can be made by the operators. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15662) check startTime in SparkTask to make sure startTime is not less than submitTime

2017-01-18 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HIVE-15662:
-
Status: Patch Available  (was: Open)

> check startTime in SparkTask to make sure startTime is not less than 
> submitTime
> ---
>
> Key: HIVE-15662
> URL: https://issues.apache.org/jira/browse/HIVE-15662
> Project: Hive
>  Issue Type: Bug
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Minor
> Attachments: HIVE-15662.000.patch
>
>
> Check startTime in SparkTask to make sure startTime is not less than 
> submitTime. We saw a corner case where the SparkTask finishes in less than 
> one second: the startTime may never be set, because RemoteSparkJobMonitor 
> sleeps for one second before checking the state, and by the time it wakes 
> the Spark job has already completed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15546) Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

2017-01-18 Thread Thomas Poepping (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829016#comment-15829016
 ] 

Thomas Poepping commented on HIVE-15546:


I see how that could make sense -- just have the executor treat the empty 
partition as it would any other, by getting all files and parsing. It's just 
that in the case of an empty partition, an empty file is used.

Seems fine to me. I also took a look at the RB, no problems there. Non-binding 
+1 from me.

> Optimize Utilities.getInputPaths() so each listStatus of a partition is done 
> in parallel
> 
>
> Key: HIVE-15546
> URL: https://issues.apache.org/jira/browse/HIVE-15546
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-15546.1.patch, HIVE-15546.2.patch, 
> HIVE-15546.3.patch, HIVE-15546.4.patch, HIVE-15546.5.patch
>
>
> When running on blobstores (like S3) where metadata operations (like 
> listStatus) are costly, Utilities.getInputPaths() can add significant 
> overhead when setting up the input paths for an MR / Spark / Tez job.
> The method performs a listStatus on all input paths in order to check if the 
> path is empty. If the path is empty, a dummy file is created for the given 
> partition. This is all done sequentially. This can be really slow when there 
> are a lot of empty partitions. Even when all partitions have input data, this 
> can take a long time.
> We should either:
> (1) Just remove the logic to check if each input path is empty, and handle 
> any edge cases accordingly.
> (2) Multi-thread the listStatus calls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15546) Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

2017-01-18 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829006#comment-15829006
 ] 

Sahil Takiar commented on HIVE-15546:
-

According to the code, dummy files are added to empty partitions so that the 
operator pipeline does not need to be special cased to handle such scenarios. 
As far as I can tell, this has been the behavior for a while:

{quote}
If any input path points to an empty table or partition a dummy file in the 
scratch dir is instead created and added to the list. This is needed to avoid 
special casing the operator pipeline for these cases.
{quote}

I originally tried to remove the dummy file logic, but that caused a bunch of 
test failures (ref the first Hive QA run). So I decided to make the logic 
multi-threaded instead.

Sergey kindly pointed out some code in NullScanFileSystem and 
ZeroRowsInputFormat that could potentially get around this, but I was going to 
leave that investigation for a separate patch. Even when using the 
NullScanFileSystem and ZeroRowsInputFormat, a listStatus still needs to be done 
to see if the partition is empty or not, which can be costly on blobstores like 
S3.
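
A sketch of the multi-threaded shape of the change; the pool size, the free variables, and the dummy-file helper are illustrative:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative: issue one listStatus per input path in parallel so slow
// blobstore metadata calls overlap instead of running one after another.
ExecutorService pool = Executors.newFixedThreadPool(16);
List<Future<Path>> results = new ArrayList<>();
for (final Path p : inputPaths) {
  results.add(pool.submit(new Callable<Path>() {
    @Override
    public Path call() throws Exception {
      FileSystem fs = p.getFileSystem(conf);
      return fs.listStatus(p).length == 0
          ? createDummyFileForEmptyPartition(p)  // hypothetical helper
          : p;
    }
  }));
}
{code}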

> Optimize Utilities.getInputPaths() so each listStatus of a partition is done 
> in parallel
> 
>
> Key: HIVE-15546
> URL: https://issues.apache.org/jira/browse/HIVE-15546
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-15546.1.patch, HIVE-15546.2.patch, 
> HIVE-15546.3.patch, HIVE-15546.4.patch, HIVE-15546.5.patch
>
>
> When running on blobstores (like S3) where metadata operations (like 
> listStatus) are costly, Utilities.getInputPaths() can add significant 
> overhead when setting up the input paths for an MR / Spark / Tez job.
> The method performs a listStatus on all input paths in order to check if the 
> path is empty. If the path is empty, a dummy file is created for the given 
> partition. This is all done sequentially. This can be really slow when there 
> are a lot of empty partitions. Even when all partitions have input data, this 
> can take a long time.
> We should either:
> (1) Just remove the logic to check if each input path is empty, and handle 
> any edge cases accordingly.
> (2) Multi-thread the listStatus calls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15623) Use customized version of netty for llap

2017-01-18 Thread Wei Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829004#comment-15829004
 ] 

Wei Zheng commented on HIVE-15623:
--

mvn dependency:tree output under the llap-server directory, without and with 
the patch:

w/o patch
{code}
[INFO] +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.7.2:compile
[INFO] |  +- org.apache.hadoop:hadoop-yarn-common:jar:2.7.2:compile
[INFO] |  |  +- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO] |  |  |  \- javax.xml.stream:stax-api:jar:1.0-2:compile
[INFO] |  |  +- com.google.inject:guice:jar:3.0:compile
[INFO] |  |  |  +- javax.inject:javax.inject:jar:1:compile
[INFO] |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO] |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO] |  +- com.google.inject.extensions:guice-servlet:jar:3.0:compile
[INFO] |  \- io.netty:netty:jar:3.6.2.Final:compile

[INFO] +- org.apache.hadoop:hadoop-hdfs:jar:2.7.2:test
[INFO] |  +- commons-daemon:commons-daemon:jar:1.0.13:test
[INFO] |  +- io.netty:netty-all:jar:4.0.29.Final:compile
{code}

with patch
{code}
[INFO] +- io.netty:netty:jar:3.6.2.Final:compile

[INFO] +- org.apache.hadoop:hadoop-hdfs:jar:2.7.2:test
[INFO] |  +- commons-daemon:commons-daemon:jar:1.0.13:test
[INFO] |  +- io.netty:netty-all:jar:4.0.29.Final:compile
{code}

> Use customized version of netty for llap
> 
>
> Key: HIVE-15623
> URL: https://issues.apache.org/jira/browse/HIVE-15623
> Project: Hive
>  Issue Type: Task
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-15623.1.patch, HIVE-15623.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15546) Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

2017-01-18 Thread Thomas Poepping (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828994#comment-15828994
 ] 

Thomas Poepping commented on HIVE-15546:


What is the benefit of using dummy files in empty partitions?

> Optimize Utilities.getInputPaths() so each listStatus of a partition is done 
> in parallel
> 
>
> Key: HIVE-15546
> URL: https://issues.apache.org/jira/browse/HIVE-15546
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-15546.1.patch, HIVE-15546.2.patch, 
> HIVE-15546.3.patch, HIVE-15546.4.patch, HIVE-15546.5.patch
>
>
> When running on blobstores (like S3) where metadata operations (like 
> listStatus) are costly, Utilities.getInputPaths() can add significant 
> overhead when setting up the input paths for an MR / Spark / Tez job.
> The method performs a listStatus on all input paths in order to check if the 
> path is empty. If the path is empty, a dummy file is created for the given 
> partition. This is all done sequentially. This can be really slow when there 
> are a lot of empty partitions. Even when all partitions have input data, this 
> can take a long time.
> We should either:
> (1) Just remove the logic to check if each input path is empty, and handle 
> any edge cases accordingly.
> (2) Multi-thread the listStatus calls



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15565) LLAP: GroupByOperator flushes hash table too frequently

2017-01-18 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828993#comment-15828993
 ] 

Siddharth Seth commented on HIVE-15565:
---

After the patch, we won't flush in LLAP GroupBy, correct? That is, 
used/numExecutors would always be < maxMemory?

Before the patch - it's random, and depends on other operators running in the 
system.
https://issues.apache.org/jira/browse/HIVE-15508 is pretty important - tracking 
memory per operator should help with this.
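
A sketch of the per-executor arithmetic implied above; the names are illustrative:

{code}
import java.lang.management.ManagementFactory;

// Illustrative: approximate this executor's share of the shared LLAP heap
// before deciding whether to flush the group-by hash table.
long usedHeap = ManagementFactory.getMemoryMXBean()
    .getHeapMemoryUsage().getUsed();
long perExecutorUsed = usedHeap / numExecutors;  // crude equal-share estimate
if (perExecutorUsed > maxHashTableMemory) {
  flushHashTable();                              // hypothetical flush hook
}
{code}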

> LLAP: GroupByOperator flushes hash table too frequently
> ---
>
> Key: HIVE-15565
> URL: https://issues.apache.org/jira/browse/HIVE-15565
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: HIVE-15565.1.patch
>
>
> {{GroupByOperator::isTez}} would be true in LLAP mode, so the current memory 
> computations can go wrong with the {{isTez}} checks in {{GroupByOperator}}. 
> For example, in an LLAP instance with Xmx128G and 12 executors, it would 
> start flushing the hash table for every record once usage reaches around 
> 42GB (hive.tez.container.size=7100, hive.map.aggr.hash.percentmemory=0.5).
> {noformat}
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size 
> = 0
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_04_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> 2017-01-08T23:40:21,339 INFO  [TezTaskRunner 
> (1480722417364_1922_7_03_12_1)] 
> org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 
> 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15508) LLAP: Find better way to track memory usage per executor

2017-01-18 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828987#comment-15828987
 ] 

Siddharth Seth commented on HIVE-15508:
---

Do individual operators not track how much memory they are using?

> LLAP: Find better way to track memory usage per executor
> 
>
> Key: HIVE-15508
> URL: https://issues.apache.org/jira/browse/HIVE-15508
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Prasanth Jayachandran
>
> Many hive operators make runtime decisions based on memory usage. For getting 
> the memory usage, Runtime.getUsed() or MemoryMXBean methods are used. This 
> works fine for MR or Tez but for LLAP the entire memory is shared among 
> multiple executors and each executor can have a different memory usage. 
> HIVE-15503 assumes the memory usage is shared across all executors. If we 
> track memory usage per executor, as well as the memory used by the on-heap 
> cache, the operators can make better decisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15586) Make Insert and Create statement Transactional

2017-01-18 Thread slim bouguerra (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

slim bouguerra updated HIVE-15586:
--
Attachment: HIVE-15586.patch

> Make Insert and Create statement Transactional
> --
>
> Key: HIVE-15586
> URL: https://issues.apache.org/jira/browse/HIVE-15586
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Reporter: slim bouguerra
>Assignee: slim bouguerra
> Attachments: HIVE-15586.patch, HIVE-15586.patch, HIVE-15586.patch
>
>
> Currently insert/create will return the handle to the user without waiting 
> for the data to be loaded by the Druid cluster. To avoid that, we will add a 
> passive wait until the segments are loaded by the historicals, provided the 
> coordinator is up.
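
A sketch of the passive wait; the coordinator endpoint and the load check are assumptions about Druid's REST API, and real code would add a deadline:

{code}
// Illustrative polling loop: after publishing segments, block until the
// coordinator reports them fully loaded by the historicals.
// isFullyLoaded() stands in for an HTTP check against something like
// /druid/coordinator/v1/datasources/{dataSource}/loadstatus (assumed).
while (!isFullyLoaded(coordinatorAddress, dataSourceName)) {
  Thread.sleep(5000); // passive wait between polls
}
{code}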



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15036) Druid code recently included in Hive pulls in GPL jar

2017-01-18 Thread slim bouguerra (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828979#comment-15828979
 ] 

slim bouguerra commented on HIVE-15036:
---

This one can be excluded; we don't need `io.airlift:airline`. It is used to 
run the Druid CLI, which is not used here.
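A minimal sketch of the kind of pom change this suggests; the enclosing druid dependency shown here is a placeholder for whichever artifact pulls airline in transitively:

{code:xml}
<dependency>
  <groupId>io.druid</groupId>
  <artifactId>druid-services</artifactId>
  <version>${druid.version}</version>
  <exclusions>
    <!-- Only needed by the Druid CLI, which Hive does not use. -->
    <exclusion>
      <groupId>io.airlift</groupId>
      <artifactId>airline</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}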


> Druid code recently included in Hive pulls in GPL jar
> -
>
> Key: HIVE-15036
> URL: https://issues.apache.org/jira/browse/HIVE-15036
> Project: Hive
>  Issue Type: Bug
>  Components: Druid integration
>Affects Versions: 2.2.0
>Reporter: Alan Gates
>Assignee: Jesus Camacho Rodriguez
>Priority: Blocker
>
> Druid pulls in the jar annotation-2.3.jar. According to its pom file, it is 
> licensed under the GPL. We cannot ship a binary distribution that includes 
> this jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10562) Add version column to NOTIFICATION_LOG table and DbNotificationListener

2017-01-18 Thread Sushanth Sowmyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushanth Sowmyan updated HIVE-10562:

Status: Patch Available  (was: Open)

> Add version column to NOTIFICATION_LOG table and DbNotificationListener
> ---
>
> Key: HIVE-10562
> URL: https://issues.apache.org/jira/browse/HIVE-10562
> Project: Hive
>  Issue Type: Sub-task
>  Components: Import/Export
>Affects Versions: 1.2.0
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Attachments: HIVE-10562.2.patch, HIVE-10562.patch
>
>
> Currently, we have a JSON encoded message being stored in the 
> NOTIFICATION_LOG table.
> If we want to be future-proof, we need to allow for versioning of this 
> message, since we might change what gets stored in the message. A prime 
> example of what we'd want to change is as in HIVE-10393.
> MessageFactory already has stubs to allow for versioning of messages, and we 
> could expand on this further in the future. NotificationListener currently 
> encodes the message version into the header for the JMS message it sends, 
> which seems to be the right place for a message version (instead of being 
> contained in the message, for example).
> So, we should have a similar ability for DbEventListener as well, and the 
> place this makes the most sense is to add a version column to the 
> NOTIFICATION_LOG table.
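A minimal sketch of the versioned-envelope idea, with hypothetical class and field names; it only illustrates keeping the format version beside the JSON payload rather than inside it:

{code:java}
// Hypothetical sketch, not the patch: the version travels next to the
// JSON message (as a column or header) instead of inside the message body.
public class NotificationLogEntry {
    private final long eventId;
    private final int messageFormatVersion; // bumped when the JSON schema changes
    private final String jsonMessage;

    public NotificationLogEntry(long eventId, int messageFormatVersion,
                                String jsonMessage) {
        this.eventId = eventId;
        this.messageFormatVersion = messageFormatVersion;
        this.jsonMessage = jsonMessage;
    }

    public int getMessageFormatVersion() { return messageFormatVersion; }
    public String getJsonMessage() { return jsonMessage; }
}
{code}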



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-10562) Add version column to NOTIFICATION_LOG table and DbNotificationListener

2017-01-18 Thread Sushanth Sowmyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushanth Sowmyan updated HIVE-10562:

Attachment: HIVE-10562.2.patch

Updated patch.

> Add version column to NOTIFICATION_LOG table and DbNotificationListener
> ---
>
> Key: HIVE-10562
> URL: https://issues.apache.org/jira/browse/HIVE-10562
> Project: Hive
>  Issue Type: Sub-task
>  Components: Import/Export
>Affects Versions: 1.2.0
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Attachments: HIVE-10562.2.patch, HIVE-10562.patch
>
>
> Currently, we have a JSON encoded message being stored in the 
> NOTIFICATION_LOG table.
> If we want to be future-proof, we need to allow for versioning of this 
> message, since we might change what gets stored in the message. A prime 
> example of what we'd want to change is as in HIVE-10393.
> MessageFactory already has stubs to allow for versioning of messages, and we 
> could expand on this further in the future. NotificationListener currently 
> encodes the message version into the header for the JMS message it sends, 
> which seems to be the right place for a message version (instead of being 
> contained in the message, for example).
> So, we should have a similar ability for DbEventListener as well, and the 
> place this makes the most sense is to add a version column to the 
> NOTIFICATION_LOG table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15554) Add task information to LLAP AM heartbeat

2017-01-18 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828963#comment-15828963
 ] 

Siddharth Seth commented on HIVE-15554:
---

+1, after removing the TODOs

> Add task information to LLAP AM heartbeat
> -
>
> Key: HIVE-15554
> URL: https://issues.apache.org/jira/browse/HIVE-15554
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-15554.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15546) Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

2017-01-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/HIVE-15546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828961#comment-15828961
 ] 

Sergio Peña commented on HIVE-15546:


Patch looks good (reviewed on RB).
+1

> Optimize Utilities.getInputPaths() so each listStatus of a partition is done 
> in parallel
> 
>
> Key: HIVE-15546
> URL: https://issues.apache.org/jira/browse/HIVE-15546
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-15546.1.patch, HIVE-15546.2.patch, 
> HIVE-15546.3.patch, HIVE-15546.4.patch, HIVE-15546.5.patch
>
>
> When running on blobstores (like S3) where metadata operations (like 
> listStatus) are costly, Utilities.getInputPaths() can add significant 
> overhead when setting up the input paths for an MR / Spark / Tez job.
> The method performs a listStatus on each input path in order to check 
> whether the path is empty. If the path is empty, a dummy file is created for 
> the given partition. This is all done sequentially, which can be very slow 
> when there are a lot of empty partitions. Even when all partitions have 
> input data, this can take a long time.
> We should either:
> (1) Just remove the logic to check if each input path is empty, and handle 
> any edge cases accordingly, or
> (2) Multi-thread the listStatus calls (see the sketch below).
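A minimal sketch of option (2), assuming the standard Hadoop FileSystem API; the pool size and class name are illustrative:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelListStatusSketch {
    // Issues one listStatus per input path from a bounded pool instead of
    // sequentially; an empty result marks a partition that needs a dummy file.
    public static List<FileStatus[]> listAll(FileSystem fs, List<Path> inputPaths)
            throws InterruptedException, ExecutionException {
        int poolSize = Math.max(1, Math.min(16, inputPaths.size())); // arbitrary cap
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<FileStatus[]>> futures = new ArrayList<>();
            for (Path p : inputPaths) {
                futures.add(pool.submit(() -> fs.listStatus(p)));
            }
            List<FileStatus[]> results = new ArrayList<>();
            for (Future<FileStatus[]> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
{code}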



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15623) Use customized version of netty for llap

2017-01-18 Thread Wei Zheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Zheng updated HIVE-15623:
-
Status: Patch Available  (was: Open)

> Use customized version of netty for llap
> 
>
> Key: HIVE-15623
> URL: https://issues.apache.org/jira/browse/HIVE-15623
> Project: Hive
>  Issue Type: Task
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-15623.1.patch, HIVE-15623.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15623) Use customized version of netty for llap

2017-01-18 Thread Wei Zheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Zheng updated HIVE-15623:
-
Attachment: HIVE-15623.2.patch

> Use customized version of netty for llap
> 
>
> Key: HIVE-15623
> URL: https://issues.apache.org/jira/browse/HIVE-15623
> Project: Hive
>  Issue Type: Task
>  Components: llap
>Affects Versions: 2.2.0
>Reporter: Wei Zheng
>Assignee: Wei Zheng
> Attachments: HIVE-15623.1.patch, HIVE-15623.2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15647) Combination of a boolean condition and null-safe comparison leads to NPE

2017-01-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828957#comment-15828957
 ] 

Hive QA commented on HIVE-15647:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12848025/HIVE-15647.02.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 350 failed/errored test(s), 10957 tests 
executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) 
(batchId=234)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_joins] 
(batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_predicate_pushdown]
 (batchId=218)
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_single_sourced_multi_insert]
 (batchId=218)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[ctas] 
(batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=230)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[write_final_output_blobstore]
 (batchId=230)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_subquery] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_table_stats] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[analyze_tbl_part] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_join_pkfk]
 (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[avrocountemptytbl] 
(batchId=74)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udf_percentile2] 
(batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udf_percentile] 
(batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[create_or_replace_view] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input19] (batchId=79)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[insert_overwrite_directory]
 (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[join46] (batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[join_emit_interval] 
(batchId=10)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin46] (batchId=53)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[multi_insert_gby4] 
(batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nested_column_pruning] 
(batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[sample5] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[serde_opencsv] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[smb_mapjoin_46] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[specialChar] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_exists] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notexists] 
(batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_notin_having] 
(batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_unqualcolumnrefs]
 (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_array_contains] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_conv] (batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_hex] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_java_method] 
(batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_map_keys] 
(batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_map_values] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_negative] (batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_not] (batchId=51)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_percentile] 
(batchId=59)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_positive] 
(batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sort_array] 
(batchId=59)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDri
