[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2020-04-07 Thread Alan Gates (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated HIVE-22579:
--
Fix Version/s: 2.4.0  (was: 2.3.7)

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: https://issues.apache.org/jira/browse/HIVE-22579
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: HIVE-22579.01.branch-2.patch, 
> HIVE-22579.01.branch-2.patch
>
>
> In some scenarios, different SplitGenerator instances cover the same 
> delta-only buckets (buckets with no base file) more than once, so multiple 
> OrcSplit instances can be generated for the same delta file. Several tasks 
> then read that delta file, which produces duplicate records even in a 
> simple select star query.
> File structure for a 256-bucket table:
> {code}
> drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013
> -rw-r--r--   3 hive hadoop   353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00012
> -rw-r--r--   3 hive hadoop  1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00140
> drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_
> -rwxrwxrwx   3 hive hadoop   348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00012
> -rwxrwxrwx   3 hive hadoop  1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00140
> drwxrwxrwx   - hive hadoop     0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_
> -rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00012
> -rwxrwxrwx   3 hive hadoop  1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00140
> drwxrwxrwx   - hive hadoop     0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_
> -rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00043
> -rwxrwxrwx   3 hive hadoop  1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00171
> {code}
> In this case, bucket_00171 contains a record but has no base file, so a 
> select * with the ETL split strategy can generate 2 splits for the same 
> delta bucket.
> The scenario of the issue:
> 1. ETLSplitStrategy contains a [covered[] 
> array|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L763]
>  which is [shared between the SplitInfo 
> instances|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L824]
>  to be created
> 2. a SplitInfo instance is created for [every base file (2 in this 
> case)|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L809]
> 3. for every SplitInfo, [a SplitGenerator is 
> created|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L925-L926],
>  and in the constructor, [parent's getSplits is 
> called|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1251],
>  which tries to take care of the deltas
> I'm not sure at the moment what the intention of this is, but this way a 
> duplicated delta split can be generated, which can cause a duplicated read 
> later (note that both tasks read the same delta file: bucket_00171)
> {code}
> 2019-12-01T16:24:53,669  INFO [TezTR-127843_16_30_0_171_0 
> (1575040127843_0016_30_00_000171_0)] orc.ReaderImpl: Reading ORC rows from 
> hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00171
>  with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true], offset: 0, length: 9223372036854775807, schema: 
> struct}
> 2019-12-01T16:24:53,672  INFO [TezTR-127843_16_30_0_171_0 
> (1575040127843_0016_30_00_000171_0)] lib.MRReaderMapred: Processing split: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit 
> 

[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: https://issues.apache.org/jira/browse/HIVE-22579
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 2.3.7
>
> Attachments: HIVE-22579.01.branch-2.patch, 
> HIVE-22579.01.branch-2.patch
>
>

[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Fix Version/s: 2.3.7

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: https://issues.apache.org/jira/browse/HIVE-22579
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 2.3.7
>
> Attachments: HIVE-22579.01.branch-2.patch, 
> HIVE-22579.01.branch-2.patch
>
>

[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Attachment: HIVE-22579.01.branch-2.patch

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: https://issues.apache.org/jira/browse/HIVE-22579
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: HIVE-22579.01.branch-2.patch, 
> HIVE-22579.01.branch-2.patch
>
>

[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Description: 
In some scenarios, different SplitGenerator instances cover the same 
delta-only buckets (buckets with no base file) more than once, so multiple 
OrcSplit instances can be generated for the same delta file. Several tasks 
then read that delta file, which produces duplicate records even in a 
simple select star query.

File structure for a 256-bucket table:
{code}
drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013
-rw-r--r--   3 hive hadoop   353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00012
-rw-r--r--   3 hive hadoop  1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00012
-rwxrwxrwx   3 hive hadoop  1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00012
-rwxrwxrwx   3 hive hadoop  1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00043
-rwxrwxrwx   3 hive hadoop  1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00171
{code}

In this case, bucket_00171 contains a record but has no base file, so a 
select * with the ETL split strategy can generate 2 splits for the same 
delta bucket.
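
To make "delta-only bucket" concrete, here is a small standalone sketch that finds the buckets present in some delta directory but in no base directory, using the file paths from the listing above. It is only an illustration: the class name, the method name and the regular expression are made up for this example and are not part of Hive.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeltaOnlyBuckets {
  // Hypothetical pattern: ".../base_<id>/bucket_<n>" or ".../delta_<ids>/bucket_<n>".
  private static final Pattern BUCKET_FILE =
      Pattern.compile(".*/(base|delta)_[^/]*/bucket_(\\d+)$");

  /** Returns the bucket ids found in a delta directory but in no base directory. */
  static SortedSet<Integer> deltaOnlyBuckets(List<String> paths) {
    Set<Integer> baseBuckets = new HashSet<>();
    SortedSet<Integer> deltaBuckets = new TreeSet<>();
    for (String path : paths) {
      Matcher m = BUCKET_FILE.matcher(path);
      if (!m.matches()) {
        continue; // directory entries and unrelated files
      }
      int bucket = Integer.parseInt(m.group(2));
      if (m.group(1).equals("base")) {
        baseBuckets.add(bucket);
      } else {
        deltaBuckets.add(bucket);
      }
    }
    deltaBuckets.removeAll(baseBuckets);
    return deltaBuckets;
  }

  public static void main(String[] args) {
    String root = "/apps/hive/warehouse/naresh.db/test1/";
    List<String> paths = Arrays.asList(
        root + "base_013/bucket_00012",
        root + "base_013/bucket_00140",
        root + "delta_014_014_/bucket_00012",
        root + "delta_014_014_/bucket_00140",
        root + "delta_015_015_/bucket_00012",
        root + "delta_015_015_/bucket_00140",
        root + "delta_016_016_/bucket_00043",
        root + "delta_016_016_/bucket_00171");
    System.out.println(deltaOnlyBuckets(paths)); // prints [43, 171]
  }
}
```

For this listing the delta-only buckets are 43 and 171, and bucket 171 is the one that shows up twice in the task logs.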

the scenario of the issue:
1. ETLSplitStrategy contains a [covered[] 
array|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L763]
 which is [shared between the SplitInfo 
instances|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L824]
 to be created
2. a SplitInfo instance is created for [every base file (2 in this 
case)|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L809]
3. for every SplitInfo, [a SplitGenerator is 
created|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L925-L926],
 and in the constructor, [parent's getSplits is 
called|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1251],
 which tries to take care of the deltas
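
The three steps above can be sketched as a toy model. This is not the OrcInputFormat code: the class and method names are hypothetical, and it assumes that each generator emits a delta-only split for every bucket not yet flagged in the shared covered[] array, and that every bucket without a base file has deltas. The markAsCovered flag models the behaviour the issue title asks for.

```java
import java.util.ArrayList;
import java.util.List;

public class CoveredArrayDemo {
  /** Hypothetical stand-in for one generator's pass over the uncovered buckets. */
  static List<String> deltaOnlySplits(boolean[] covered, boolean markAsCovered) {
    List<String> splits = new ArrayList<>();
    for (int b = 0; b < covered.length; b++) {
      if (!covered[b]) {
        splits.add("delta-only split for bucket " + b);
        if (markAsCovered) {
          covered[b] = true; // later generators will skip this bucket
        }
      }
    }
    return splits;
  }

  /** Runs one pass per base file against a single shared covered[] array. */
  static List<String> generate(int numBuckets, int[] baseBuckets, boolean markAsCovered) {
    boolean[] covered = new boolean[numBuckets]; // shared by all generators
    for (int b : baseBuckets) {
      covered[b] = true; // buckets that have a base file
    }
    List<String> all = new ArrayList<>();
    for (int i = 0; i < baseBuckets.length; i++) { // one generator per base file
      all.addAll(deltaOnlySplits(covered, markAsCovered));
    }
    return all;
  }

  public static void main(String[] args) {
    // Toy table: 4 buckets, base files for buckets 0 and 1, so 2 and 3 are delta-only.
    System.out.println(generate(4, new int[] {0, 1}, false).size()); // prints 4
    System.out.println(generate(4, new int[] {0, 1}, true).size());  // prints 2
  }
}
```

Without marking, each of the two generators emits a split for buckets 2 and 3, so every delta-only bucket is read twice; with marking, the second generator finds nothing left to cover.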

I'm not sure at the moment what the intention of this is, but this way a 
duplicated delta split can be generated, which can cause a duplicated read 
later (note that both tasks read the same delta file: bucket_00171)
{code}
2019-12-01T16:24:53,669  INFO [TezTR-127843_16_30_0_171_0 
(1575040127843_0016_30_00_000171_0)] orc.ReaderImpl: Reading ORC rows from 
hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00171
 with {include: [true, true, true, true, true, true, true, true, true, true, 
true, true], offset: 0, length: 9223372036854775807, schema: 
struct}
2019-12-01T16:24:53,672  INFO [TezTR-127843_16_30_0_171_0 
(1575040127843_0016_30_00_000171_0)] lib.MRReaderMapred: Processing split: 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit 
[hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1,
 start=171, length=0, isOriginal=false, fileLength=9223372036854775807, 
hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: 
[0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 16 
stmtIds: [0] }]]
2019-12-01T16:24:55,807  INFO [TezTR-127843_16_30_0_425_0 
(1575040127843_0016_30_00_000425_0)] orc.ReaderImpl: Reading ORC rows from 
hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00171
 with {include: [true, true, true, true, true, true, true, true, true, true, 
true, true], offset: 0, length: 9223372036854775807, schema: 
struct}

[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Description: 
In some scenarios, different SplitGenerator instances cover the same 
delta-only buckets (buckets with no base file) more than once, so multiple 
OrcSplit instances can be generated for the same delta file. Several tasks 
then read that delta file, which produces duplicate records even in a 
simple select star query.

File structure for a 256-bucket table:
{code}
drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013
-rw-r--r--   3 hive hadoop   353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00012
-rw-r--r--   3 hive hadoop  1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00012
-rwxrwxrwx   3 hive hadoop  1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00012
-rwxrwxrwx   3 hive hadoop  1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00043
-rwxrwxrwx   3 hive hadoop  1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00171
{code}

In this case, bucket_00171 contains a record but has no base file, so a 
select * with the ETL split strategy can generate 2 splits for the same 
delta bucket.

It seems this issue doesn't affect ACID v2, as getSplits() there returns an 
empty collection or throws an exception in case of unexpected deltas (in the 
case described here, the deltas were not unexpected):
https://github.com/apache/hive/blob/8ee3497f87f81fa84ee1023e891dc54087c2cd5e/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1178-L1197
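
As a rough illustration of the behaviour described above (an empty collection, or an exception on unexpected deltas), a guard of that kind could look like the sketch below. The class name, the signature, the exception type and the message are all assumptions made for this example; see the linked lines for the actual code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnexpectedDeltasDemo {
  /** Hypothetical guard: never builds delta-only splits, only rejects unexpected deltas. */
  static List<String> getSplits(List<String> deltas) {
    if (!deltas.isEmpty()) {
      // Assumed exception type and message, for illustration only.
      throw new IllegalStateException("unexpected deltas: " + deltas);
    }
    return Collections.emptyList();
  }

  public static void main(String[] args) {
    System.out.println(getSplits(Collections.<String>emptyList())); // prints []
    try {
      getSplits(Arrays.asList("delta_016_016_"));
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Because a method like this never emits a delta-only split, running it once per generator cannot produce the duplicate splits seen with ACID v1.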

  was:
In some scenarios, different SplitGenerator instances cover the same 
delta-only buckets (buckets with no base file) more than once, so multiple 
OrcSplit instances can be generated for the same delta file. Several tasks 
then read that delta file, which produces duplicate records even in a 
simple select star query.

File structure for a 256-bucket table:
{code}
drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013
-rw-r--r--   3 hive hadoop   353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00012
-rw-r--r--   3 hive hadoop  1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00012
-rwxrwxrwx   3 hive hadoop  1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00012
-rwxrwxrwx   3 hive hadoop  1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00043
-rwxrwxrwx   3 hive hadoop  1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00171
{code}

In this case, bucket_00171 contains a record but has no base file, so a 
select * with the ETL split strategy can generate 2 splits for the same 
delta bucket.


> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: 

[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Description: 
In some scenarios, different SplitGenerator instances cover the same 
delta-only buckets (buckets with no base file) more than once, so multiple 
OrcSplit instances can be generated for the same delta file. Several tasks 
then read that delta file, which produces duplicate records even in a 
simple select star query.

File structure for a 256-bucket table:
{code}
drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013
-rw-r--r--   3 hive hadoop   353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00012
-rw-r--r--   3 hive hadoop  1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_013/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00012
-rwxrwxrwx   3 hive hadoop  1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_014_014_/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00012
-rwxrwxrwx   3 hive hadoop  1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_015_015_/bucket_00140
drwxrwxrwx   - hive hadoop     0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_
-rwxrwxrwx   3 hive hadoop   348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00043
-rwxrwxrwx   3 hive hadoop  1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_016_016_/bucket_00171
{code}

In this case, bucket_00171 contains a record but has no base file, so a 
select * with the ETL split strategy can generate 2 splits for the same 
delta bucket.

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: https://issues.apache.org/jira/browse/HIVE-22579
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: HIVE-22579.01.branch-2.patch
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Status: Patch Available  (was: Open)

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: https://issues.apache.org/jira/browse/HIVE-22579
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: HIVE-22579.01.branch-2.patch
>
>






[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Attachment: HIVE-22579.01.branch-2.patch

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: https://issues.apache.org/jira/browse/HIVE-22579
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: HIVE-22579.01.branch-2.patch
>
>






[jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)

2019-12-04 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-22579:

Summary: ACID v1: covered delta-only splits (without base) should be marked 
as covered (branch-2)  (was: ACID v1: covered delta splits (without base) 
should be marked as covered (branch-2))

> ACID v1: covered delta-only splits (without base) should be marked as covered 
> (branch-2)
> 
>
> Key: HIVE-22579
> URL: https://issues.apache.org/jira/browse/HIVE-22579
> Project: Hive
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
>



