Re: ORC double encoding optimization proposal

2018-03-31 Thread Dain Sundstrom
> On Mar 30, 2018, at 8:37 AM, Owen O'Malley wrote:
>
> Ok, so what I'm trying is:
> * Move the dictionaries (the string contents and lengths) between the
> indexes and the data.

If we’re talking about moving stuff around, ideally, the index would be at the end of the

Re: ORC double encoding optimization proposal

2018-03-30 Thread Owen O'Malley
On Wed, Mar 28, 2018 at 1:01 AM, Xiening Dai wrote: > This modification will increase the complexity of implementation, and I am > not sure how much we will gain by not closing compression and rle chunks. > You probably had some data when you first designed row group and

Re: ORC double encoding optimization proposal

2018-03-28 Thread Xiening Dai
So we could modify my #2 proposal to be sensitive to rle and compression chunks. At the end of the row group, we wait until the rle and compression chunks close and then interleave the streams. That means that for a column with three streams and two row groups, we could have something like: I think

Re: ORC double encoding optimization proposal

2018-03-27 Thread Owen O'Malley
Going back to the point of double split encoding, it would make sense to try a variant where we combine the sign and the mantissa. That should remove the sign stream at a relatively little cost of making the mantissa stream signed. Thinking more about the layout options... Another consideration
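
As an illustration of the variant described above, here is a minimal sketch that splits an IEEE 754 double into an exponent and a signed mantissa, so no separate sign stream is needed. The thread does not say how the sign and mantissa would be combined; using the one's complement for negative values is an assumption made here so that a negative value with a zero mantissa (for example -1.0 or -0.0) stays distinguishable. The function names are hypothetical and are not part of ORC.

```python
import struct

MANTISSA_BITS = 52
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1

def split_double(value):
    """Split a double into (exponent, signed_mantissa)."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63
    exponent = (bits >> MANTISSA_BITS) & 0x7FF
    mantissa = bits & MANTISSA_MASK
    # Assumption: fold the sign in as a one's complement, so that
    # sign=1 maps to -mantissa - 1 and never collides with sign=0.
    signed_mantissa = ~mantissa if sign else mantissa
    return exponent, signed_mantissa

def join_double(exponent, signed_mantissa):
    """Inverse of split_double."""
    if signed_mantissa < 0:
        sign, mantissa = 1, ~signed_mantissa
    else:
        sign, mantissa = 0, signed_mantissa
    bits = (sign << 63) | (exponent << MANTISSA_BITS) | mantissa
    return struct.unpack(">d", struct.pack(">Q", bits))[0]
```

In a real writer the exponent and signed mantissa would each go to their own stream; this sketch only shows that the sign can be carried by the mantissa without losing information.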

Re: ORC double encoding optimization proposal

2018-03-27 Thread Sergey Shelukhin
Afair ORC used to have some threshold below which it would still do one read if the gap is small. On 18/3/25, 23:47, "Gopal Vijayaraghavan" wrote: > >>2. Under seek or predicate pushdown scenario, there’s no need to >>load the entire stream. > >Yes, that is a valid
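
The gap threshold mentioned above can be sketched as a range-coalescing step: byte ranges that are closer together than some threshold are merged into one read. This is only an illustration of the idea; the function name, the `max_gap` parameter, and the (offset, length) representation are assumptions, not the ORC reader's actual API.

```python
def coalesce_reads(ranges, max_gap):
    """Merge (offset, length) ranges whose gap is at most max_gap bytes.

    Returns a list of (offset, length) tuples sorted by offset.
    """
    merged = []  # each entry is [start, end), kept as a mutable list
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead
            # of issuing a separate read.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]
```

For example, with `max_gap=32` the ranges `(0, 100)` and `(120, 50)` become a single read of 170 bytes, while a range starting at offset 1000 stays separate.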

Re: ORC double encoding optimization proposal

2018-03-26 Thread Dain Sundstrom
For our installations, we sort the streams based on size before writing them. This places all the small streams next to each other so a single IO can grab all of them, and then the large streams are typically so large they need multiple IOs anyway. This really helps when you have (small)
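
A minimal sketch of the size-ordered layout described above: streams are sorted by size before offsets are assigned, so the small streams end up adjacent at the front. The stream names and the (name, size) representation are hypothetical, and this is not the writer code used in those installations.

```python
def layout_by_size(streams):
    """Assign offsets to (name, size) streams, smallest first.

    Returns a list of (name, offset, size) tuples.
    """
    layout = []
    offset = 0
    for name, size in sorted(streams, key=lambda s: s[1]):
        layout.append((name, offset, size))
        offset += size
    return layout
```

With the small streams placed next to each other, one read that covers the start of the region can fetch all of them at once.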

Re: ORC double encoding optimization proposal

2018-03-26 Thread Owen O'Malley
This is a really interesting conversation. Of course, the original use case for ORC was that you were never reading less than a stripe. So putting all of the data streams for a column back to back, which isn't in the spec, but should be, was optimal in terms of seeks. There are two cases that

Re: ORC double encoding optimization proposal

2018-03-26 Thread Gopal Vijayaraghavan
> the bad thing is that we still have TWO encodings to discuss. Two is exactly what we need, not five - from the existing ORC configs hive.exec.orc.encoding.strategy=[SPEED, COMPRESSION]; FLIP8 was my original suggestion to Teddy from the byteuniq UDF runs, though the regressions in

Re: ORC double encoding optimization proposal

2018-03-26 Thread Xiening Dai
Where does the 2x IO drop come from? Based on Cheng Xu’s data, Split + Zstd has a ~15% improvement over PlainV2 + Zstd in terms of file size. If I understand correctly, the total number of IO reads is almost the same, but Split will need an additional seek for each read. The random IOPS

RE: ORC double encoding optimization proposal

2018-03-26 Thread Xu, Cheng A
@orc.apache.org Cc: u...@orc.apache.org Subject: Re: ORC double encoding optimization proposal >2. Under seek or predicate pushdown scenario, there’s no need to load the > entire stream. Yes, that is a valid scenario where the reader reads partial-streams & causes random IO. The curr

Re: ORC double encoding optimization proposal

2018-03-26 Thread Gopal Vijayaraghavan
>2. Under seek or predicate pushdown scenario, there’s no need to load the > entire stream. Yes, that is a valid scenario where the reader reads partial-streams & causes random IO. The current double encoding is actually 2 streams today & will continue to use 2 streams for the FLIP

Re: ORC double encoding optimization proposal

2018-03-26 Thread Xiening Dai
Hi Gopal, The ORC spec doesn’t guarantee that streams belonging to the same column are stored together. Even if that’s guaranteed, there are reasons why we cannot read adjacent streams with one single IO - 1. Streams can be large. Reading the whole stream(s) will add unnecessary memory pressure. 2. Under

Re: ORC double encoding optimization proposal

2018-03-25 Thread Gopal Vijayaraghavan
Hi, > Since Split creates two separated streams, reading one data batch will need > an additional seek in order to reconstruct the column data If you are seeing a seek like that, we've messed up something else higher up in the pipeline & that can be fixed. ORC columnar reads only do random

RE: ORC double encoding optimization proposal

2018-03-25 Thread Xu, Cheng A
on friendly - SPLIT Any thoughts on this? Thanks Ferdinand Xu -Original Message- From: Xiening Dai [mailto:xndai@live.com] Sent: Monday, March 26, 2018 11:07 AM To: Gopal Vijayaraghavan <gop...@apache.org>; dev@orc.apache.org; u...@orc.apache.org Subject: Re: ORC double encod

Re: ORC double encoding optimization proposal

2018-03-25 Thread Xiening Dai
with very busy IOs. From: Gopal Vijayaraghavan <gop...@apache.org> Sent: Monday, March 19, 2018 3:45 PM To: dev@orc.apache.org; u...@orc.apache.org Subject: Re: ORC double encoding optimization proposal > existing work [1] from Teddy Choi and Owen

Re: ORC double encoding optimization proposal

2018-03-19 Thread Gopal Vijayaraghavan
> existing work [1] from Teddy Choi and Owen O'Malley with some new compression > codec (e.g. ZSTD and Brotli), we proposed to promote FLIP as the default > encoding for ORC double type to move this feature forward. Since we're discussing these, I'm going to summarize my existing notes on this,