[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-9660:
-
Labels: pull-request-available  (was: )

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch, 
> HIVE-9660.patch, owen-hive-9660.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now the end offset is estimated, which in some cases results in a lot 
> of extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores the number 
> of compressed buffers for each RG, or the end offset, or something similar, to 
> remove this estimation magic.
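
To illustrate the difference between an estimated and a stored end offset, here is a minimal sketch. The class name, the method names, and the slop formula (two worst-case compressed blocks past the start of the next row group) are assumptions made for illustration; they are not taken from the Hive or ORC sources.

```java
final class RgRangeSketch {
    // Estimated end: the reader only knows where the next row group starts,
    // so it pads that offset with a worst-case allowance (assumed here to be
    // two full compression blocks plus their headers), clamped to the stream.
    static long estimatedEnd(long nextRgStart, long streamLength,
                             int bufferSize, int headerSize) {
        long slop = 2L * (headerSize + bufferSize);
        return Math.min(streamLength, nextRgStart + slop);
    }

    // Exact end: if the index stores the compressed length of the row group,
    // the end offset is simply start + length and no padding is needed.
    static long exactEnd(long rgStart, long storedLength) {
        return rgStart + storedLength;
    }
}
```

Under these assumptions, a 256 KiB buffer makes the estimate reach roughly half a megabyte past the next row group's start, while the stored length bounds the read exactly.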



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-05-20 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-9660:

Attachment: HIVE-9660.patch

This patch:
* Implements a PositionedOutputStream.Callback to track when compression blocks 
and RLE blocks are finished.
* Adds lengths to the OrcProto.RowIndexEntry.
* Uses the lengths to determine the number of bytes to read when doing 
predicate push down.
* Creates a callback for RowIndexEntry in the WriterImpl so that the entry 
isn't finalized until all of the streams have made their callbacks. To ensure 
that the entry isn't finalized before all of the streams are added, there is an 
activation step after the last stream has been added to the RowIndexEntry.
* Removes the positions and lengths from the RowIndexEntry softly when the 
isPresent stream is suppressed, so that the remaining callbacks aren't affected.
* Significantly cleans up the code dealing with string columns and dictionary 
vs. direct encoding.
* Splits TreeWriter.writeStripe so that a separate flush method finalizes all 
of the streams.
* Updates many test cases for the changed ORC file sizes.
* Adds a new test case that tests the callbacks.
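
The finalize-after-all-callbacks idea with an activation step can be sketched roughly as follows. This is not the code from the patch; PendingIndexEntry and all of its methods are hypothetical names, and the real entry is a protobuf builder rather than a plain list.

```java
import java.util.ArrayList;
import java.util.List;

final class PendingIndexEntry {
    private final List<Long> lengths = new ArrayList<>();
    private int outstanding = 0;        // streams that have not called back yet
    private boolean activated = false;  // set once the last stream was added
    private boolean finalized = false;

    // Registers one stream and returns the slot its length will be stored in.
    int addStream() {
        if (activated) {
            throw new IllegalStateException("entry already activated");
        }
        lengths.add(-1L);
        outstanding++;
        return lengths.size() - 1;
    }

    // Called once, after the last stream has been added to the entry.
    void activate() {
        activated = true;
        maybeFinalize();
    }

    // Callback from a stream once its pending block has been written out.
    void onStreamFinished(int slot, long length) {
        if (lengths.get(slot) != -1L) {
            throw new IllegalStateException("slot " + slot + " already reported");
        }
        lengths.set(slot, length);
        outstanding--;
        maybeFinalize();
    }

    // The entry is final only when it is activated and no callback is pending.
    private void maybeFinalize() {
        if (activated && outstanding == 0) {
            finalized = true;
        }
    }

    boolean isFinalized() { return finalized; }

    List<Long> lengths() { return lengths; }
}
```

Without the activation flag, an entry whose first stream calls back before the second stream is added would see zero outstanding callbacks and be finalized too early.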

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch, 
> HIVE-9660.patch, owen-hive-9660.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-05-03 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HIVE-9660:

Attachment: owen-hive-9660.patch

Here's my first pass. As I've discussed:
* PositionedOutputStream gets a registerCallback method that gets called when 
the next compression block finishes.
* IntegerWriter, BitFieldWriter, and RunLengthByteWriter all get a 
registerCallback method that gets called when the RLE block and then the 
compression block finish.
* StringRedBlackTree gets a method that writes to an OutputStream without 
copying the bytes to a Text object first.
* The StringTreeWriter gets refactored so that once the decision to use direct 
encoding is made, it works just like BinaryTreeWriter.
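
As a rough illustration of the registerCallback idea in the first bullet, the sketch below fires each registered callback once, when the block that is currently being filled is finished. BlockStreamSketch is a hypothetical stand-in, not the real PositionedOutputStream; it only counts bytes, and uncompressed byte counts stand in for compressed sizes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

final class BlockStreamSketch {
    private final int blockSize;
    private int buffered = 0;   // bytes in the current, unfinished block
    private long flushed = 0;   // bytes in all finished blocks
    private final List<LongConsumer> callbacks = new ArrayList<>();

    BlockStreamSketch(int blockSize) {
        this.blockSize = blockSize;
    }

    // Ask to be told the end offset of the block that is being filled now.
    void registerCallback(LongConsumer callback) {
        callbacks.add(callback);
    }

    // Accounts for nBytes of data, finishing blocks as they fill up.
    void write(int nBytes) {
        while (nBytes > 0) {
            int take = Math.min(nBytes, blockSize - buffered);
            buffered += take;
            nBytes -= take;
            if (buffered == blockSize) {
                finishBlock();
            }
        }
    }

    // Finishes a partially filled block, for example at the end of a stripe.
    void flush() {
        if (buffered > 0) {
            finishBlock();
        }
    }

    private void finishBlock() {
        flushed += buffered;
        buffered = 0;
        // Copy and clear first, so a callback may safely register a new one.
        List<LongConsumer> pending = new ArrayList<>(callbacks);
        callbacks.clear();
        for (LongConsumer callback : pending) {
            callback.accept(flushed);
        }
    }
}
```

A writer that records a row group boundary mid-block would register a callback at that point and learn the true end offset only once the block is written out.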

There are a couple of things that I haven't addressed yet:
* isPresent stream suppression doesn't remove the first length from the vector.
* The reading code doesn't use the length vector yet.
* I haven't added unit tests for the new code.
* I know the current ORC unit tests are failing. The changes to the OutStream 
and run length encoders are very straightforward, but I'm tracking down what 
I've messed up in WriterImpl.


> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch, 
> owen-hive-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-26 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.11.patch

Rebased again and removed the writer version (which was also the cause of some 
test failures).

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-22 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.10.patch

Addressing the CR feedback

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-22 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.10.patch

Rebased the patch

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, HIVE-9660.patch, 
> HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-11 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.09.patch

Removed the writer setting, added the reader setting.

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-08 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.08.patch

Fixing the tests so they could run over the weekend.
Will address the rest of the feedback later.

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-07 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.07.patch

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, HIVE-9660.patch, 
> HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-07 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.07.patch

Addressing review comments. The biggest change is the index-to-kind change for 
length tracking.

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-06 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.06.patch

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-06 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.05.patch

Small update.

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, HIVE-9660.patch, 
> HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-05 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.04.patch

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-04 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.03.patch

Fixing one more issue and updating the test outputs (they are stats changes, 
file dump changes, and such).

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-04-01 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.02.patch

Fixing some tests. I will fix the FileDump tests later; they failed for obvious 
reasons, and it appears there's no auto-update for them.

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, HIVE-9660.patch, 
> HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: (was: HIVE-9660.WIP2.patch)

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.01.patch

Rebased the patch and fixed a small issue.

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.01.patch, HIVE-9660.WIP2.patch, 
> HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-22 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.patch

Attempt #1.

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP2.patch, HIVE-9660.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-22 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Status: Patch Available  (was: Open)

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP2.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-21 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.patch

This doesn't quite work for uncompressed files yet; I need to fix some things. 
I was able to see at least some tests pass with it, though. It also needs some 
comments and cleanup. Let's see what fails...

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP2.patch, HIVE-9660.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-20 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: (was: HIVE-9660.WIP0.patch)

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP2.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-19 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: (was: HIVE-9660.WIP1.patch)

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP2.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-19 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.WIP2.patch

This patch writes ORC files that look correct, but there are issues during 
reading. A lot of debugging will be needed to determine what's going on... 
Attaching here for now.

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP2.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-14 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.WIP1.patch

> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP0.patch, HIVE-9660.WIP1.patch
>


[jira] [Updated] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

2016-03-10 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HIVE-9660:
---
Attachment: HIVE-9660.WIP0.patch

WIP patch that takes care of the reading; the writing is only done for the 
compressed path and is not done for the string writer yet because its logic is 
different... whether it works at all is an open question.
Also, my head hurts now... I feel like I did after researching how Kerberos works.


> store end offset of compressed data for RG in RowIndex in ORC
> -
>
> Key: HIVE-9660
> URL: https://issues.apache.org/jira/browse/HIVE-9660
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: HIVE-9660.WIP0.patch
>