> On Mar 30, 2018, at 8:37 AM, Owen O'Malley wrote:
>
> Ok, so what I'm trying is:
> * Move the dictionaries (the string contents and lengths) between the
> indexes and the data.
If we’re talking about moving stuff around, ideally, the index would be at the
end of the
On Wed, Mar 28, 2018 at 1:01 AM, Xiening Dai wrote:
> This modification will increase the complexity of the implementation, and I am
> not sure how much we will gain by not closing compression and rle chunks.
> You probably have some data from when you first designed row group and
So we could modify my #2 proposal to be sensitive to rle and compression
chunks. If at the end of the row group, we wait until the rle and compression
chunks close and interleave the streams. That means that for a column with
three streams and two row groups, we could have something like:
I think
Going back to the point of double split encoding, it would make sense to
try a variant where we combine the sign and the mantissa. That should
remove the sign stream at the relatively small cost of making the mantissa
stream signed.
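
To make the sign-plus-mantissa variant concrete, here is a minimal sketch of one
way it could work. It is not the ORC implementation: the function names are
hypothetical, and folding the sign in with a bitwise complement (so that -0.0
stays distinct from 0.0) is an assumption the thread does not specify.

```python
import struct

MANTISSA_BITS = 52                       # IEEE 754 double: 52 mantissa bits
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1
EXPONENT_MASK = 0x7FF                    # 11 exponent bits


def split_double(value):
    """Split a double into (exponent, signed_mantissa). Hypothetical helper."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    sign = bits >> 63
    exponent = (bits >> MANTISSA_BITS) & EXPONENT_MASK
    mantissa = bits & MANTISSA_MASK
    # Assumption: negative values store the complement of the mantissa, so a
    # zero mantissa with the sign bit set becomes -1 instead of colliding with 0.
    signed_mantissa = ~mantissa if sign else mantissa
    return exponent, signed_mantissa


def join_double(exponent, signed_mantissa):
    """Inverse of split_double: rebuild the original double bit for bit."""
    sign = 1 if signed_mantissa < 0 else 0
    mantissa = ~signed_mantissa if sign else signed_mantissa
    bits = (sign << 63) | (exponent << MANTISSA_BITS) | mantissa
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

With this layout only two streams remain (exponent and signed mantissa), and the
mantissa stream would need an encoding that handles negative integers.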
Thinking more about the layout options...
Another consideration
AFAIR, ORC used to have a threshold below which it would still do a single
read if the gap is small.
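
A gap threshold of this kind can be sketched as follows. This is only an
illustration of the idea, not the ORC reader's code: the function name, the
(offset, length) representation, and the max_gap parameter are all assumptions.

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (offset, length) byte ranges whose gap is at most max_gap bytes.

    Hypothetical helper: each merged range stands for one read that also
    fetches the small gap between neighbours instead of seeking over it.
    """
    merged = []
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend that read.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]
```

For example, coalesce_ranges([(0, 100), (150, 50), (10000, 10)], 64) merges the
first two ranges into one read and leaves the distant third range separate.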
On 18/3/25, 23:47, "Gopal Vijayaraghavan" wrote:
>
>>2. Under seek or predicate pushdown scenario, there’s no need to
>>load the entire stream.
>
>Yes, that is a valid
For our installations, we sort the streams based on size before writing them.
This places all the small streams next to each other so a single IO can grab
all of them, and then the large streams are typically so large they need
multiple IOs anyway. This really helps when you have (small)
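
The size-based ordering described in this message can be sketched roughly as
below. The stream names, sizes, and helper functions are hypothetical, and the
sketch assumes a single fixed-size IO starting at offset 0.

```python
def layout_by_size(streams):
    """Place (name, size) streams smallest first; return (name, offset, size)."""
    placed = []
    offset = 0
    # sorted() is stable, so equally sized streams keep their original order.
    for name, size in sorted(streams, key=lambda s: s[1]):
        placed.append((name, offset, size))
        offset += size
    return placed


def streams_in_first_read(placed, io_size):
    """Names of streams fully covered by one read of io_size bytes from offset 0."""
    return [name for name, offset, size in placed if offset + size <= io_size]


# Hypothetical stripe with three small streams and two large ones.
example = [
    ("col1.data", 4_000_000),
    ("col1.present", 200),
    ("col2.length", 1_500),
    ("col2.data", 9_000_000),
    ("col3.present", 300),
]
small = streams_in_first_read(layout_by_size(example), 1_000_000)
```

Here the three small streams end up adjacent at the start of the layout, so one
1 MB read covers all of them.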
This is a really interesting conversation. Of course, the original use case
for ORC was that you were never reading less than a stripe. So putting all
of the data streams for a column back to back, which isn't in the spec, but
should be, was optimal in terms of seeks.
There are two cases that
> the bad thing is that we still have TWO encodings to discuss.
Two is exactly what we need, not five - from the existing ORC configs
hive.exec.orc.encoding.strategy=[SPEED, COMPRESSION];
FLIP8 was my original suggestion to Teddy from the byteuniq UDF runs, though
the regressions in
Where does the 2x IO drop come from? Based on Cheng Xu’s data, Split + Zstd has
~15% improvement over PlainV2 + Zstd in terms of the file size. If I understand
correctly, the total number of IO reads is almost the same, but Split will
need an additional seek for each read.
The random IOPS
@orc.apache.org
Cc: u...@orc.apache.org
Subject: Re: ORC double encoding optimization proposal
>2. Under seek or predicate pushdown scenario, there’s no need to load the
> entire stream.
Yes, that is a valid scenario where the reader reads partial-streams & causes
random IO.
The current double encoding is actually 2 streams today & will continue to use
2 streams for the FLIP
Hi Gopal,
The ORC spec doesn’t guarantee that streams belonging to the same column are stored
together. Even if that’s guaranteed, there are reasons why we cannot read
adjacent streams with one single IO -
1. Streams can be large. Reading the whole stream(s) will add unnecessary
memory pressure.
2. Under
Hi,
> Since Split creates two separate streams, reading one data batch will need
> an additional seek in order to reconstruct the column data
If you are seeing a seek like that, we've messed up something else higher up in
the pipeline & that can be fixed.
ORC columnar reads only do random
on friendly - SPLIT
Any thoughts on this?
Thanks
Ferdinand Xu
-----Original Message-----
From: Xiening Dai [mailto:xndai@live.com]
Sent: Monday, March 26, 2018 11:07 AM
To: Gopal Vijayaraghavan <gop...@apache.org>; dev@orc.apache.org;
u...@orc.apache.org
Subject: Re: ORC double encod
with very busy IOs.
From: Gopal Vijayaraghavan <gop...@apache.org>
Sent: Monday, March 19, 2018 3:45 PM
To: dev@orc.apache.org; u...@orc.apache.org
Subject: Re: ORC double encoding optimization proposal
> existing work [1] from Teddy Choi and Owen O'Malley with some new compression
> codecs (e.g. ZSTD and Brotli), we proposed to promote FLIP as the default
> encoding for ORC double type to move this feature forwards.
Since we're discussing these, I'm going to summarize my existing notes on this,