[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-11-12 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002983#comment-15002983
 ] 

Sergey Shelukhin commented on HIVE-11583:
-

You could generate it in the test by repeatedly cross joining. Or does the file 
have to be in a specific form that is not reproducible by the queries? 
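
For illustration only, a cross-join generator might look like the sketch below. It assumes the standard qtest table src (columns key and value) is available. The 'k1' partition key, the grp assignment, and the repeat() padding are hypothetical choices, and nothing here establishes that the generated rows reproduce the on-disk block and split layout that triggers the bug.

{code:sql}
-- Sketch only: build a wide, single-partition data set from a small table.
-- Assumes the qtest table src (key STRING, value STRING) exists.
CREATE TABLE ptf_big_src AS
SELECT
  CAST(a.key AS INT)   AS id,
  'k1'                 AS key,    -- hypothetical: one key, so one large window
  CASE pmod(hash(a.key, b.key), 3)
    WHEN 0 THEN 'A'
    WHEN 1 THEN 'B'
    ELSE 'C'
  END                  AS grp,
  repeat(b.value, 100) AS value   -- hypothetical padding to widen each row
FROM src a
CROSS JOIN src b
LIMIT 50001;
{code}

If hive.mapred.mode is set to strict, the Cartesian product may be rejected, so the test would need to run in nonstrict mode.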

> When PTF is used over a large partitions result could be corrupted
> --
>
> Key: HIVE-11583
> URL: https://issues.apache.org/jira/browse/HIVE-11583
> Project: Hive
>  Issue Type: Bug
>  Components: PTF-Windowing
>Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
> Environment: Hadoop 2.6 + Apache hive built from trunk
>Reporter: Illya Yalovyy
>Assignee: Illya Yalovyy
>Priority: Critical
> Fix For: 1.3.0, 2.0.0
>
> Attachments: HIVE-11583.patch
>
>
> Dataset: 
>  Window has 50001 records (2 blocks on disk and 1 block in memory)
>  Size of the second block is >32 MB (2 splits)
> Result:
> When the last block is read from disk, only the first split is actually 
> loaded; the second split is missed. The total count of the result dataset 
> is correct, but some records are missing and others are duplicated.
> Example:
> {code:sql}
> CREATE TABLE ptf_big_src (
>   id INT,
>   key STRING,
>   grp STRING,
>   value STRING
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> LOAD DATA LOCAL INPATH '../../data/files/ptf_3blocks.txt.gz' OVERWRITE INTO 
> TABLE ptf_big_src;
> SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp ORDER BY cnt desc;
> ---
> -- A  25000
> -- B  20000
> -- C  5001
> ---
> CREATE TABLE ptf_big_trg AS SELECT *, row_number() OVER (PARTITION BY key 
> ORDER BY grp) grp_num FROM ptf_big_src;
> SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp ORDER BY cnt desc;
> -- 
> -- A  34296
> -- B  15704
> -- C  1
> ---
> {code}
> Counts by 'grp' are incorrect!
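>
> A per-group comparison can make the mismatch easier to spot than reading the 
> two listings side by side. The query below is only a sketch. It uses the two 
> tables from the example and returns one row for every grp whose count differs 
> between them.
> {code:sql}
> -- Sketch: list groups whose row count changed after the windowing step.
> SELECT COALESCE(s.grp, t.grp) AS grp, s.cnt AS src_cnt, t.cnt AS trg_cnt
> FROM (SELECT grp, COUNT(1) cnt FROM ptf_big_src GROUP BY grp) s
> FULL OUTER JOIN
>      (SELECT grp, COUNT(1) cnt FROM ptf_big_trg GROUP BY grp) t
>   ON s.grp = t.grp
> WHERE s.cnt IS NULL OR t.cnt IS NULL OR s.cnt <> t.cnt;
> {code}
> An empty result would mean the per-group counts match.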



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-11-12 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002955#comment-15002955
 ] 

Sergey Shelukhin commented on HIVE-11583:
-

This was committed a while ago, so the test can be created in a separate JIRA if 
needed. I don't have background on this issue. Yesterday I bulk-commented on a 
large list of issues, obtained via a script, whose titles look like bugs and 
that were committed to master but not to branch-1.



[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-11-12 Thread Illya Yalovyy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002949#comment-15002949
 ] 

Illya Yalovyy commented on HIVE-11583:
--

What about a qtest for this issue? What is the best course of action?



[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-11-12 Thread Illya Yalovyy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002963#comment-15002963
 ] 

Illya Yalovyy commented on HIVE-11583:
--

In a nutshell, the question is: what is the best way to upload or provide a 
rather big binary file to the test? Should I just attach it to the ticket?



[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-11-12 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002729#comment-15002729
 ] 

Sergey Shelukhin commented on HIVE-11583:
-

Committed to branch-1



[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-11-11 Thread Illya Yalovyy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001569#comment-15001569
 ] 

Illya Yalovyy commented on HIVE-11583:
--

Yes. It is quite a critical bug.

> When PTF is used over a large partitions result could be corrupted
> --
>
> Key: HIVE-11583
> URL: https://issues.apache.org/jira/browse/HIVE-11583
> Project: Hive
>  Issue Type: Bug
>  Components: PTF-Windowing
>Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
> Environment: Hadoop 2.6 + Apache hive built from trunk
>Reporter: Illya Yalovyy
>Assignee: Illya Yalovyy
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HIVE-11583.patch


[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-09-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908368#comment-14908368
 ] 

Ashutosh Chauhan commented on HIVE-11583:
-

Spilling is controlled by the config {{hive.join.cache.size}}. Perhaps you can 
set it to a very low value in the q test to trigger spilling, and thus test 
this without needing large input data.
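
A q test could lower the setting at the top of the script, roughly as below. The value 10 is an arbitrary illustration, not a recommended setting, and whether this alone is enough to reproduce the bug is not confirmed here.

{code:sql}
-- Sketch: spill after a small number of rows (10 is an arbitrary example value).
SET hive.join.cache.size=10;

CREATE TABLE ptf_big_trg AS
SELECT *, row_number() OVER (PARTITION BY key ORDER BY grp) grp_num
FROM ptf_big_src;
{code}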



[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-09-25 Thread Illya Yalovyy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908437#comment-14908437
 ] 

Illya Yalovyy commented on HIVE-11583:
--

Oh... I was thinking about all possible ways to reduce the size of the file. 
Cache size is only one piece of the puzzle. The important thing is the physical 
file system blocks, and it seems I cannot control them from within a Hive script.



[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-09-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908457#comment-14908457
 ] 

Ashutosh Chauhan commented on HIVE-11583:
-

Hive q tests use 
hive-shims-common/src/main/java/org/apache/hadoop/fs/ProxyLocalFileSystem.java. 
I think you can configure its block size via {{fs.local.block.size}}.



[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-09-25 Thread Illya Yalovyy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908338#comment-14908338
 ] 

Illya Yalovyy commented on HIVE-11583:
--

[~ashutoshc], I have a qtest for this issue, but it includes a rather big 
gzip-compressed file. What is the best way to contribute it? Specifically, how 
do I create a patch that contains this big binary file?




[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-09-25 Thread Illya Yalovyy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908480#comment-14908480
 ] 

Illya Yalovyy commented on HIVE-11583:
--

I tried, but setting it from the Hive script had no effect. Is there any way 
to reconfigure it BEFORE the test?



[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-09-14 Thread Illya Yalovyy (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743864#comment-14743864
 ] 

Illya Yalovyy commented on HIVE-11583:
--

I have implemented a qtest for this issue, but it requires a rather big data 
file. What is the best way to submit this file? It is a gzip file, size = 
204Kb. I can attach this file to the ticket.

> When PTF is used over a large partitions result could be corrupted
> --
>
> Key: HIVE-11583
> URL: https://issues.apache.org/jira/browse/HIVE-11583
> Project: Hive
>  Issue Type: Bug
>  Components: PTF-Windowing
>Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
> Environment: Hadoop 2.6 + Apache hive built from trunk
>Reporter: Illya Yalovyy
>Priority: Critical
> Attachments: HIVE-11583.patch


[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-09-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743882#comment-14743882
 ] 

Ashutosh Chauhan commented on HIVE-11583:
-

+1

> When PTF is used over a large partitions result could be corrupted
> --
>
> Key: HIVE-11583
> URL: https://issues.apache.org/jira/browse/HIVE-11583
> Project: Hive
>  Issue Type: Bug
>  Components: PTF-Windowing
>Affects Versions: 0.14.0, 0.13.1, 0.14.1, 1.0.0, 1.2.0, 1.2.1
> Environment: Hadoop 2.6 + Apache hive built from trunk
>Reporter: Illya Yalovyy
>Assignee: Illya Yalovyy
>Priority: Critical
> Attachments: HIVE-11583.patch


[jira] [Commented] (HIVE-11583) When PTF is used over a large partitions result could be corrupted

2015-09-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744582#comment-14744582
 ] 

Hive QA commented on HIVE-11583:




{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12755773/HIVE-11583.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 9412 tests executed
*Failed tests:*
{noformat}
TestParseNegative - did not produce a TEST-*.xml file
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5276/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5276/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-5276/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12755773 - PreCommit-HIVE-TRUNK-Build
