Re: Compression test data

2017-09-28 Thread Tim Armstrong
Thanks for all the work you've done on benchmarking here, seems like it could be a big improvement. I can't seem to find decompression numbers in your spreadsheet. I think those should be where some of these newer codecs really shine. E.g. zstd's own numbers look really impressive:
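Measuring decompression separately from compression is easy to sketch; zstd itself would need the third-party `zstandard` package, so this illustrative harness uses stdlib zlib as a stand-in codec (all names and the payload are hypothetical):

```python
import time
import zlib

def bench_codec(name, compress, decompress, payload, reps=5):
    """Time compression and decompression separately; decompression speed
    is often where newer codecs like zstd shine."""
    comp = compress(payload)
    t0 = time.perf_counter()
    for _ in range(reps):
        compress(payload)
    t_comp = (time.perf_counter() - t0) / reps
    t0 = time.perf_counter()
    for _ in range(reps):
        out = decompress(comp)
    t_dec = (time.perf_counter() - t0) / reps
    assert out == payload  # sanity: the codec must round-trip
    return {"codec": name, "ratio": len(payload) / len(comp),
            "compress_s": t_comp, "decompress_s": t_dec}

payload = b"some repetitive parquet-like data " * 10_000
result = bench_codec("zlib", zlib.compress, zlib.decompress, payload)
print(result["codec"], round(result["ratio"], 1))
```

Swapping in another codec only requires passing its compress/decompress callables, which keeps the comparison apples-to-apples.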

Re: Clarifying valid uses for RLE encoding type

2017-12-07 Thread Tim Armstrong
good case for using RLE codecs. If you can guarantee that you won't have the msb set unless the number really is large, then why not allow people to use them? - rb On Thu, Dec 7, 2017 at 11:33 AM, Tim Armstrong <tarmstr...@cloudera.com> wrote: > FW

Re: Clarifying valid uses for RLE encoding type

2017-12-08 Thread Tim Armstrong
Parquet implementations because there is no place to store the bit width. On Thu, Dec 7, 2017 at 3:48 PM, Tim Armstrong <tarmstr...@cloudera.com> wrote: > Using the RLE encoding will be different from the plain encoding bec

Re: Clarifying valid uses for RLE encoding type

2017-12-06 Thread Tim Armstrong
The current RLE coding has bit-packing baked into it, so I'm wondering what it even means to bit-pack a lot of the types, particularly if you don't have bounds on the range of values. I can see if you have a logic int8 column stored in an int32, you have bounds on the values, so bit-packing would
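The bounded-range point can be made concrete: once you have a bound on the values, the bit width needed for packing follows directly (a sketch; `bit_width` is a hypothetical helper, not part of any Parquet API):

```python
def bit_width(max_value: int) -> int:
    """Minimum bits needed to bit-pack non-negative values in [0, max_value].
    Without a bound on the range, bit-packing degenerates to the full width."""
    if max_value < 0:
        raise ValueError("this sketch assumes non-negative values")
    return max_value.bit_length()

# A logical int8 column stored in an int32: values are bounded by 127,
# so 7 bits per value suffice instead of 32.
print(bit_width(127))        # 7
print(bit_width(2**31 - 1))  # 31: no useful saving without tighter bounds
```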

Re: Clarifying valid uses for RLE encoding type

2017-12-07 Thread Tim Armstrong
using dictionary encoding, which is most of the time), and the repetition and definition levels. - Wes On Wed, Dec 6, 2017 at 8:46 PM, Tim Armstrong <tarmstr...@cloudera.com> wrote: > The current RLE coding has bit-packing baked into it, so I'm wondering what

Re: What is the maximum run length in the RLE encoding?

2018-05-01 Thread Tim Armstrong
ent RLE implementations and it may lead to subtle bugs. I would therefore add a maximum run length to the spec. If there is really a need for having longer runs, then someone needs to step up and make the changes to the spec and the implementations. As long as there is no gre

What is the maximum run length in the RLE encoding?

2018-04-30 Thread Tim Armstrong
I'm looking at an Impala bug with decoding Parquet RLE with run lengths >= 2^31. The bug was found by fuzz testing rather than a realistic file. I'm trying to determine whether the Parquet spec actually allows runs of that length, but Encodings.md does not seem to specify any upper bound. It mentions
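For context on where such large runs bite: the RLE/bit-packing hybrid encodes each run with a ULEB128 varint header whose low bit distinguishes RLE runs from bit-packed runs, so a run length of 2^31 is perfectly expressible in the wire format even if a reader's signed 32-bit arithmetic cannot hold the decoded header. A minimal sketch (Python, illustrative only):

```python
def uleb128(n: int) -> bytes:
    """Encode an unsigned int as a ULEB128 varint, as used for the
    run headers in Parquet's RLE/bit-packing hybrid encoding."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

def rle_run_header(run_length: int) -> bytes:
    # RLE runs use header = run_length << 1 (LSB of 0 marks an RLE run).
    return uleb128(run_length << 1)

# A run of 2**31 values needs a 5-byte header; a reader that decodes the
# header into a signed 32-bit int overflows here.
print(len(rle_run_header(2**31)))  # 5
```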

Re: [Announce] Congrats to our new Parquet committers

2017-10-27 Thread Tim Armstrong
Congrats Lars, Zoltan and Deepak! On Fri, Oct 27, 2017 at 10:12 AM, Wes McKinney wrote: > Congrats, and thanks for your hard work! On Fri, Oct 27, 2017 at 1:10 PM, Julien Le Dem wrote: > Zoltan Ivanfi and Lars Volker are now Parquet

Re: Recommended page size controversy

2018-01-09 Thread Tim Armstrong
Impala defaults to 64KB: https://github.com/apache/impala/blob/daff8eb0ca19aa612c9fc7cc2ddd647735b31266/be/src/exec/hdfs-parquet-table-writer.h#L83 I think larger pages probably have slightly less runtime and encoding overhead associated with handling page boundaries, but consume more memory and

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
There is an extensibility mechanism with the ColumnOrder union - I think that was meant to avoid the need to add new stat fields? Given that the bug was in the Parquet spec, we'll need to make a spec change anyway, so we could add a new ColumnOrder (FloatingPointTotalOrder?) at the same time as

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <tarmstr...@cloudera.com> wrote: > There is an extensibility mechanism with the ColumnOrder union - I think that was meant to avoid the need to add new stat fields? Given that the bug was

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
> That fix does not preclude a more thorough solution in the future, but it addresses the common case quickly. For existing data files we could check the writer version and ignore filters on float/double. I don't know whether min/max

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Tim Armstrong
> Yeah, I missed that. We set it per column, so all other types could keep TypeDefinedOrder and floats could have something like NanAwareDoubleOrder. On Fri, Feb 16, 2018 at 9:18 AM, Tim Armstrong <tarmstr...@cloudera.com> wrote: > We wouldn't need to rev th

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-15 Thread Tim Armstrong
We could also consider treating NaN similar to NULL and having a separate piece of information with a count of NaN values (or just a bit indicating presence/absence of NaN). I'm not sure if that is easier or harder to implement than a total order. On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-19 Thread Tim Armstrong
We could drop NaNs and require that -0 be normalised to +0 when writing out stats. That would remove any degrees of freedom from the writer and then straightforward comparison with =, <, >, >=, <=, != would work as expected. On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi
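The proposed writer-side rule can be sketched as a small stats pass: skip NaNs entirely and collapse -0.0 to +0.0 before updating min/max (`float_stats` is a hypothetical helper, not any implementation's actual code):

```python
import math

def float_stats(values):
    """Min/max column stats per the proposal: NaNs are dropped and -0.0 is
    normalised to +0.0, so plain <, >, <=, >= comparisons on the stats
    behave consistently regardless of which writer produced them."""
    lo = hi = None
    for v in values:
        if math.isnan(v):
            continue      # NaN is excluded from the stats entirely
        if v == 0.0:
            v = 0.0       # collapses -0.0 to +0.0 (they compare equal)
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    return lo, hi

print(float_stats([float("nan"), -0.0, 3.5, -1.25]))  # (-1.25, 3.5)
```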

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-01 Thread Tim Armstrong
I don't have a direct stake in this beyond wanting to see Parquet be successful, but I thought I'd give my two cents. For me, the thing that makes the biggest difference in contributing to a new codebase is the number of steps in the workflow for writing, testing, posting and iterating on a

Re: Status of column index in parquet-mr

2018-08-20 Thread Tim Armstrong
I had a similar concern to Uwe - if there are a large number of columns with variable size there does seem to be a real risk of having many tiny pages. I wonder if we could do something in-between where we allow different page sizes for different columns, but require that the row ranges for pages

Re: Column index testing break down

2019-03-07 Thread Tim Armstrong
I think you and I have different priors on this, Wes. It's definitely not clear-cut. I think it's an interesting point to discuss and it's unfortunate that you feel that way. Partially the current state of things is due to path-dependence, but there are some parts of the Impala runtime that it's

Re: Impala/parquet-cpp

2019-03-07 Thread Tim Armstrong
optimised and the bar is lower. - Tim On Thu, Mar 7, 2019 at 11:07 AM Wes McKinney wrote: > hi Tim, On Thu, Mar 7, 2019 at 11:52 AM Tim Armstrong wrote: > I think you and I have different priors on this, Wes. It's definitely not clear-cut. I think it's a

Re: New committer: Fokko Driesprong

2019-06-25 Thread Tim Armstrong
Congratulations! On Tue, Jun 25, 2019 at 7:12 AM 俊杰陈 wrote: > Congrats Fokko! On Tue, Jun 25, 2019 at 7:08 PM Zoltan Ivanfi wrote: > Hi, > The Project Management Committee (PMC) for Apache Parquet has invited Fokko Driesprong to become a committer and we are pleased to

Re: New committer: Nandor Kollar

2019-06-25 Thread Tim Armstrong
Congratulations! On Tue, Jun 25, 2019 at 7:13 AM 俊杰陈 wrote: > Congrats Nandor! On Tue, Jun 25, 2019 at 7:09 PM Zoltan Ivanfi wrote: > Hi, > The Project Management Committee (PMC) for Apache Parquet has invited Nandor Kollar to become a committer and we are pleased to

Re: Definition Levels and Null

2019-05-13 Thread Tim Armstrong
that writer/reader. > NaNs representing missing values occur frequently in a myriad of SAS use cases. Other data types may be NULL as well, so I'm wondering if using def level to indicate NULLs is safer (with consideration to other readers) and also consumes less memory/storage

Re: Definition Levels and Null

2019-05-13 Thread Tim Armstrong
Parquet float/double values can hold any IEEE floating point value - https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413. So there's no reason you can't write NaN to the files. If a reader isn't handling NaN values correctly, that seems like an issue with that
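The point that any IEEE value is representable can be checked directly: Parquet's plain encoding stores DOUBLE as 8 little-endian IEEE 754 bytes, and NaN round-trips through that layout like any other value (a stdlib sketch):

```python
import struct

# Parquet DOUBLE is an IEEE 754 binary64, so any bit pattern a double can
# hold, including NaN, survives the 8-byte little-endian round trip that
# the plain encoding uses for this type.
raw = struct.pack("<d", float("nan"))
assert len(raw) == 8
value, = struct.unpack("<d", raw)
print(value != value)  # only NaN compares unequal to itself: True
```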

Re: Parquet File Naming Convention Standards

2019-05-22 Thread Tim Armstrong
Not reusing file names is generally a good idea - there are a bunch of interesting consistency issues, particularly on object stores, if you reuse file paths. This has come up for us with things like INSERT OVERWRITE in Hive, which tends to generate the same file names. I think there's an

Re: New PMC member: Gabor Szadovszky

2019-06-28 Thread Tim Armstrong
Congrats Gabor! On Fri, Jun 28, 2019 at 10:08 AM Wes McKinney wrote: > Congrats! On Fri, Jun 28, 2019 at 10:34 AM Lars Volker wrote: > Congratulations Gabor! On Fri, Jun 28, 2019, 08:32 Anna Szonyi wrote: > Congrats Gabor!! Best news I've heard in a while :)

Re: [Question] Change Column Type in Parquet File

2019-07-17 Thread Tim Armstrong
I think generally the best solution, if it's supported by the tools you're using, is to do schema evolution by *not* rewriting the files and just updating the metadata, and rely on the engine that's querying the table to promote the int32 to int64 if the parquet file has an int32 but the hive
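The read-time widening this relies on can be sketched as follows (`promote_int32_to_int64` is a hypothetical helper; real engines do this inside their scanners, and in Python the 64-bit slot is only notional):

```python
def promote_int32_to_int64(values):
    """Schema-evolution read path sketch: the file still stores int32, the
    table schema says BIGINT, and the engine widens on read instead of
    rewriting the file. Widening is lossless, so no data changes."""
    INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
    for v in values:
        # Values from a well-formed int32 column must fit in 32 bits.
        assert INT32_MIN <= v <= INT32_MAX, "not a valid int32 value"
        yield int(v)  # in a real engine this lands in a 64-bit slot

print(list(promote_int32_to_int64([1, -5, 2**31 - 1])))
```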

Re: [Question] Change Column Type in Parquet File

2019-07-18 Thread Tim Armstrong
to upgrade Impala and Hive to fix this. We only need to update the metadata after the engine upgrade. > Thanks a lot and wish you have a nice day. Best Regards, Ronnie ---------- From: Tim Armstrong Sent: Wednesday, July 17, 2019 12:5

Re: How to incrementally store timeseries in Parquet files for efficient retrieval?

2020-07-20 Thread Tim Armstrong
The usual solution is to partition the data based on the criteria you want to filter by. E.g. for Hive tables, you would partition by date and have a separate directory per date. If you have a relatively modern version of Parquet, stats and page indices will allow the reader to filter out files
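The layout described above can be sketched as a path builder (the table root, the partition column name `event_date`, and the file names are all hypothetical):

```python
from datetime import date, timedelta

def partition_path(table_root: str, d: date, file_name: str) -> str:
    """Hive-style partition layout: one directory per date, so a reader
    filtering on the partition column can skip whole directories without
    opening any files."""
    return f"{table_root}/event_date={d.isoformat()}/{file_name}"

# Each incrementally appended day of timeseries data lands in its own
# partition directory.
start = date(2020, 7, 1)
paths = [partition_path("/warehouse/ts_table", start + timedelta(days=i),
                        f"part-{i:05d}.parquet") for i in range(3)]
print(paths[0])
```

A query with a predicate on `event_date` then only touches the matching directories; stats and page indices prune further within each file.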

Re: Parquet File Meta Data & Compatibility

2020-12-07 Thread Tim Armstrong
ty on what is required/not required for implementations. > e.g. I don't think the delta encoding reflects current best practices in the literature. > Could you expand on this? [1] https://

Re: Parquet File Meta Data & Compatibility

2020-12-04 Thread Tim Armstrong
can later promote things based on community consensus. On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong wrote: > I think it would be good for the project to define a core set of features that a Parquet implementation must support to be able to correctly read files all written by another

Re: Parquet File Meta Data & Compatibility

2020-12-04 Thread Tim Armstrong
I think it would be good for the project to define a core set of features that a Parquet implementation must support to be able to correctly read all files written by another compliant writer with the same version. There are then additional extensions like page indices that are not required to

Re: Parquet File Meta Data & Compatibility

2020-12-08 Thread Tim Armstrong
Create a jira and a PR where we can start/continue the discussion about the core features themselves. On Mon, Dec 7, 2020 at 9:00 PM Tim Armstrong wrote: > Introducing new logical types as "experimental" is a bit tricky. Maybe exp

Re: Metadata summary file deprecation

2020-12-16 Thread Tim Armstrong
I'm chiming in a bit late here, but I wanted to mention that Hive has a tendency to reuse file names when you do an INSERT OVERWRITE of a partition. We've had to deal with a number of problems related to this when caching data from parquet files - it's necessary to be scrupulous about comparing
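The scrupulous comparison can be sketched as a cache key that includes size and mtime alongside the path (a sketch under stated assumptions, not Impala's actual code; real systems might also use an etag or content hash):

```python
import os
import tempfile

def cache_key(path: str) -> tuple:
    """Key cached file data on more than the name: INSERT OVERWRITE can
    rewrite a partition using the very same file names, so size and mtime
    must be part of the identity check to avoid serving stale bytes."""
    st = os.stat(path)
    return (path, st.st_size, st.st_mtime_ns)

# Simulate a file being overwritten in place under the same name.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"row group bytes")
    name = f.name
k1 = cache_key(name)
with open(name, "wb") as f:
    f.write(b"overwritten with different length")
k2 = cache_key(name)
print(k1 != k2)  # same path, but size/mtime changed: True
os.remove(name)
```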

Re: Query on striping parquet files while maintaining Row group alignment

2020-12-30 Thread Tim Armstrong
It seems like you would be best off writing out N separate parquet files of the desired size. That seems better than having N files with one row group each and a shared footer that you have to stitch together to read. I guess there would be a small amount of redundancy between footer contents, but

Re: Query on striping parquet files while maintaining Row group alignment

2021-01-05 Thread Tim Armstrong
scenario on the CephFS + RADOS stack but with the added capability to push down filters and projections to the storage layer. On Thu, Dec 31, 2020 at 8:28 AM Tim Armstrong wrote: > It seems like you would be best off writing out N separate parquet files of

[jira] [Commented] (PARQUET-843) [C++] Impala unable to read files created by parquet-cpp

2017-01-25 Thread Tim Armstrong (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839024#comment-15839024 ] Tim Armstrong commented on PARQUET-843: --- Created https://issues.cloudera.org/browse/IMPALA-4826

[jira] [Commented] (PARQUET-1171) [C++] Support RLE and BITPACKED as encodings for data

2017-12-06 Thread Tim Armstrong (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281223#comment-16281223 ] Tim Armstrong commented on PARQUET-1171: BIT_PACKED is deprecated so I don't think it makes

[jira] [Commented] (PARQUET-1171) [C++] Support RLE and BITPACKED as encodings for data

2017-12-06 Thread Tim Armstrong (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281277#comment-16281277 ] Tim Armstrong commented on PARQUET-1171: Thanks for the explanation. I agree that this wasn't

[jira] [Resolved] (PARQUET-1290) Clarify maximum run lengths for RLE encoding

2018-05-07 Thread Tim Armstrong (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Armstrong resolved PARQUET-1290. Resolution: Fixed > Clarify maximum run lengths for RLE encoding

[jira] [Created] (PARQUET-1290) Clarify maximum run lengths for RLE encoding

2018-05-01 Thread Tim Armstrong (JIRA)
Tim Armstrong created PARQUET-1290: -- Summary: Clarify maximum run lengths for RLE encoding Key: PARQUET-1290 URL: https://issues.apache.org/jira/browse/PARQUET-1290 Project: Parquet Issue

[jira] [Commented] (PARQUET-1290) Clarify maximum run lengths for RLE encoding

2018-05-02 Thread Tim Armstrong (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461333#comment-16461333 ] Tim Armstrong commented on PARQUET-1290: I can take this on if someone will assign it to me