Thanks for all the work you've done on benchmarking here; it seems like it
could be a big improvement. I can't seem to find decompression numbers in
your spreadsheet. I think those should be where some of these newer codecs
really shine. E.g. zstd's own numbers look really impressive:
good case for using RLE codecs. If
> you can guarantee that you won't have the msb set unless the number really
> is large, then why not allow people to use them?
>
> rb
>
> On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
>> FW
Parquet implementations because there is no place to store the
> bit
> > width.
> >
> > On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <tarmstr...@cloudera.com>
> > wrote:
> >
> >> > Using the RLE encoding will be different from the plain encoding
> bec
The current RLE coding has bit-packing baked into it, so I'm wondering what
it even means to bit-pack a lot of the types, particularly if you don't
have bounds on the range of values.
I can see if you have a logical int8 column stored in an int32, you have
bounds on the values, so bit-packing would
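To make the bit-packing point above concrete, here is a rough sketch (helper names are mine, not from any Parquet implementation) of what packing a bounded column means: the bit width comes from the value range, and values are packed LSB-first as in Parquet's RLE/bit-packed hybrid encoding. Without a known bound there is no meaningful width to pack to.

```python
def bit_width(max_value: int) -> int:
    """Minimum bits needed to represent values in [0, max_value]."""
    return max(1, max_value.bit_length())

def bit_pack(values, width):
    """Pack non-negative ints into a little-endian bit stream, LSB first
    (the layout Parquet's bit-packed runs use)."""
    buf = 0
    nbits = 0
    out = bytearray()
    for v in values:
        buf |= v << nbits
        nbits += width
        while nbits >= 8:
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:
        out.append(buf & 0xFF)
    return bytes(out)
```

So a logical int8 column bounded by [0, 127] packs to 7 bits per value (`bit_width(127) == 7`), while a full unbounded int32 gains nothing from packing.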
using dictionary encoding, which is most of
> the time), and the repetition and definition levels.
>
> - Wes
>
> On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
> > The current RLE coding has bit-packing baked into it, so I'm wondering
> what
>
ent RLE implementations and it may lead to subtle
> > bugs. I would therefore add a maximum run length to the spec. If there is
> > really a need for having longer runs, then someone needs to step up and
> > make the changes to the spec and the implementations. As long as there is
> > no gre
I'm looking at an Impala bug with decoding Parquet RLE with run lengths >=
2^31. The bug was found by fuzz testing rather than a realistic file. I'm
trying to determine whether the Parquet spec actually allows runs of that
length, but Encodings.md does not seem to specify any upper bound. It
mentions
Congrats Lars, Zoltan and Deepak!
On Fri, Oct 27, 2017 at 10:12 AM, Wes McKinney wrote:
> Congrats, and thanks for your hard work!
>
> On Fri, Oct 27, 2017 at 1:10 PM, Julien Le Dem
> wrote:
> > Zoltan Ivanfi and Lars Volker are now Parquet
Impala defaults to 64KB:
https://github.com/apache/impala/blob/daff8eb0ca19aa612c9fc7cc2ddd647735b31266/be/src/exec/hdfs-parquet-table-writer.h#L83
I think larger pages probably have slightly less runtime and encoding
overhead associated with handling page boundaries, but consume more memory
and
There is an extensibility mechanism with the ColumnOrder union - I think
that was meant to avoid the need to add new stat fields?
Given that the bug was in the Parquet spec, we'll need to make a spec
change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
at the same time as
>
> On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
> > There is an extensibility mechanism with the ColumnOrder union - I think
> > that was meant to avoid the need to add new stat fields?
> >
> > Given that the bug was
> > That fix does not preclude a more thorough solution in the future, but it
> > addresses the common case quickly.
> >
> > For existing data files we could check the writer version ignore filters
> on
> > float/double. I don't know whether min/max
> Yeah, I missed that. We set it per column, so all other types could keep
> TypeDefinedOrder and floats could have something like NanAwareDoubleOrder.
>
> On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
> > We wouldn't need to rev th
We could also consider treating NaN similar to NULL and having a separate
piece of information with a count of NaN values (or just a bit indicating
presence/absence of NaN). I'm not sure if that is easier or harder to
implement than a total order.
On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal
We could drop NaNs and require that -0 be normalised to +0 when writing out
stats. That would remove any degrees of freedom from the writer and then
straightforward comparison with =, <, >, >=, <=, != would work as expected.
On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi
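The normalisation suggested above can be sketched as follows (function name is mine; this is an illustration, not code from any writer). Note why it matters: Python's `min`/`max` are order-dependent in the presence of NaN, since every comparison with NaN is false, so naive stats computation silently produces garbage.

```python
import math

def normalized_min_max(values):
    """Compute column min/max stats, skipping NaNs and canonicalising
    -0.0 to +0.0, so plain <, > comparisons on the stats behave
    predictably for readers."""
    cleaned = [0.0 if v == 0.0 else v  # -0.0 == 0.0, so this maps -0.0 to +0.0
               for v in values
               if not math.isnan(v)]
    if not cleaned:
        return None  # all values were NaN: omit the stats entirely
    return min(cleaned), max(cleaned)
```

With this, `normalized_min_max([-0.0, float('nan'), 2.0])` yields `(0.0, 2.0)` with a positive-signed zero, regardless of the order the NaN appeared in.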
I don't have a direct stake in this beyond wanting to see Parquet be
successful, but I thought I'd give my two cents.
For me, the thing that makes the biggest difference in contributing to a
new codebase is the number of steps in the workflow for writing, testing,
posting and iterating on a
I had a similar concern to Uwe - if there are a large number of columns
with variable size there does seem to be a real risk of having many tiny
pages.
I wonder if we could do something in-between where we allow different page
sizes for different columns, but require that the row ranges for pages
I think you and I have different priors on this, Wes. It's definitely not
clear-cut. I think it's an interesting point to discuss and it's
unfortunate that you feel that way.
Partially the current state of things is due to path-dependence, but there
are some parts of the Impala runtime that it's
optimised and the bar is lower.
- Tim
On Thu, Mar 7, 2019 at 11:07 AM Wes McKinney wrote:
> hi Tim,
>
> On Thu, Mar 7, 2019 at 11:52 AM Tim Armstrong
> wrote:
> >
> > I think you and I have different priors on this, Wes. It's definitely not
> > clear-cut. I think it's a
Congratulations!
On Tue, Jun 25, 2019 at 7:12 AM 俊杰陈 wrote:
> Congrats Fokko!
>
> On Tue, Jun 25, 2019 at 7:08 PM Zoltan Ivanfi
> wrote:
>
> > Hi,
> >
> > The Project Management Committee (PMC) for Apache Parquet has invited
> Fokko
> > Driesprong to become a committer and we are pleased to
Congratulations!
On Tue, Jun 25, 2019 at 7:13 AM 俊杰陈 wrote:
> Congrats Nandor!
>
> On Tue, Jun 25, 2019 at 7:09 PM Zoltan Ivanfi
> wrote:
>
> > Hi,
> >
> > The Project Management Committee (PMC) for Apache Parquet has invited
> > Nandor Kollar to become a committer and we are pleased to
at writer/reader.
>
> NaNs representing missing values occur frequently in a myriad of SAS use
> cases. Other data types may be NULL as well, so I'm wondering if using def
> level to indicate NULLs is safer (with consideration to other readers) and
> also consumes less memory/storage
Parquet float/double values can hold any IEEE floating point value -
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413.
So there's no reason you can't write NaN to the files. If a reader isn't
handling NaN values correctly, that seems like an issue with that
Not reusing file names is generally a good idea - there are a bunch of
interesting consistency issues, particularly on object stores, if you reuse
file paths. This has come up for us with things like INSERT OVERWRITE in
Hive, which tends to generate the same file names.
I think there's an
Congrats Gabor!
On Fri, Jun 28, 2019 at 10:08 AM Wes McKinney wrote:
> Congrats!
>
> On Fri, Jun 28, 2019 at 10:34 AM Lars Volker
> wrote:
> >
> > Congratulations Gabor!
> >
> > On Fri, Jun 28, 2019, 08:32 Anna Szonyi wrote:
> >
> > > Congrats Gabor!! Best news I've heard in a while :)
> > >
I think generally the best solution, if it's supported by the tools you're
using, is to do schema evolution by *not* rewriting the files and just
updating the metadata, and rely on the engine that's querying the table to
promote the int32 to int64 if the parquet file has an int32 but the hive
to upgrade Impala and Hive to fix this. We only
> need to update the metadata after upgrading the engines.
>
> Thanks a lot, and I wish you a nice day.
>
> Best Regards,
> Ronnie
> ----------
> *From:* Tim Armstrong
> *Sent:* Wednesday, July 17, 2019 12:5
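The metadata-only schema evolution described in this thread can be sketched as follows. This is a hypothetical illustration (the table of widenings and the function are mine, not Hive's or Impala's actual code): the file keeps its physical int32 type, the table schema says int64, and the scanner widens losslessly at read time, so old files never need rewriting.

```python
# Lossless promotions an engine could apply at scan time.
WIDENINGS = {("int32", "int64"), ("float", "double")}

def scan_column(file_values, file_type, table_type):
    """Return column values as the table's declared type, widening
    on the fly when the file was written with a narrower type."""
    if file_type == table_type:
        return list(file_values)
    if (file_type, table_type) in WIDENINGS:
        # Widening is lossless, so the data files stay untouched.
        return [float(v) if table_type == "double" else int(v)
                for v in file_values]
    raise TypeError(f"cannot promote {file_type} to {table_type}")
```

Narrowing (int64 to int32) is rejected, which matches the constraint that only the metadata-compatible direction works without rewriting files.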
The usual solution is to partition the data based on the criteria you want
to filter by. E.g. for Hive tables, you would partition by date and have a
separate directory per date.
If you have a relatively modern version of Parquet, stats and page indices
will allow the reader to filter out files
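The partition-pruning idea above can be sketched like this (the directory layout and helper are hypothetical, following the common Hive-style `key=value` path convention): filtering happens on file paths alone, before any Parquet footer is even opened.

```python
# Hypothetical Hive-style layout, partitioned by date:
#   warehouse/events/date=2019-07-01/part-000.parquet
#   warehouse/events/date=2019-07-02/part-000.parquet

def prune(paths, wanted_date):
    """Keep only files whose partition directory matches the filter,
    without reading any file contents."""
    return [p for p in paths if f"date={wanted_date}/" in p]
```

Stats and page indices then filter further *within* the surviving files; partitioning just eliminates whole directories up front.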
ty on what is required/not required for
> > implementations.
> >
> > e.g. I don't think the delta encoding reflects current best practices in
> > > the literature.
> >
> >
> > Could you expand on this?
> >
> >
> > [1] https://
can later
promote things based on community consensus.
On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong
wrote:
> I think it would be good for the project to define a core set of features
> that a Parquet implementation must support to be able to correctly read
> files all written by another
I think it would be good for the project to define a core set of features
that a Parquet implementation must support to be able to correctly read
files all written by another compliant writer with the same version.
There are then additional extensions like page indices that are not
required to
create a jira and a PR where we can
> start/continue the discussion about the core features themselves.
>
>
> On Mon, Dec 7, 2020 at 9:00 PM Tim Armstrong
> wrote:
>
> > > Introducing new logical types as "experimental" is a bit tricky.
> > Maybe exp
I'm chiming in a bit late here, but I wanted to mention that Hive has a
tendency to reuse file names when you do an INSERT OVERWRITE of a
partition. We've had to deal with a number of problems related to this when
caching data from parquet files - it's necessary to be scrupulous about
comparing
It seems like you would be best off writing out N separate parquet files of
the desired size. That seems better than having N files with one row group
each and a shared footer that you have to stitch together to read. I guess
there would be a small amount of redundancy between footer contents, but
scenario on the CephFS + RADOS stack but with
> the added capability to push down filters and projections to the storage
> layer.
>
> On Thu, Dec 31, 2020 at 8:28 AM Tim Armstrong
> wrote:
>
> > It seems like you would be best off writing out N separate parquet files
> of
>
[ https://issues.apache.org/jira/browse/PARQUET-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839024#comment-15839024 ]
Tim Armstrong commented on PARQUET-843:
---
Created https://issues.cloudera.org/browse/IMPALA-4826
[ https://issues.apache.org/jira/browse/PARQUET-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281223#comment-16281223 ]
Tim Armstrong commented on PARQUET-1171:
BIT_PACKED is deprecated so I don't think it makes
[ https://issues.apache.org/jira/browse/PARQUET-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281277#comment-16281277 ]
Tim Armstrong commented on PARQUET-1171:
Thanks for the explanation. I agree that this wasn't
[ https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong resolved PARQUET-1290.
Resolution: Fixed
> Clarify maximum run lengths for RLE encoding
Tim Armstrong created PARQUET-1290:
--
Summary: Clarify maximum run lengths for RLE encoding
Key: PARQUET-1290
URL: https://issues.apache.org/jira/browse/PARQUET-1290
Project: Parquet
Issue
[ https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461333#comment-16461333 ]
Tim Armstrong commented on PARQUET-1290:
I can take this on if someone will assign it to me