Re: Maintenance of backward-compatible IO wrappers in Parquet C++

2019-12-03 Thread Micah Kornfield
Hi Wes, Is there much maintenance needed on the wrappers? I'd like to see them maintained for at least one more release of Arrow (i.e. 1.0.0). Thanks, Micah On Tue, Dec 3, 2019 at 12:23 PM Wes McKinney wrote: > About 4 months ago I migrated the Parquet C++ project to use the Arrow > IO APIs

Re: Following up on Parquet file format issue posted on StackOverflow.

2020-01-28 Thread Micah Kornfield
Hi Raphael, One suggestion that might make your use cases work better in parquet is to build some encoding/decoding logic into your application. If I understand it correctly you are storing strings in the format "${ISO_8601 timestamp}.${String1}.${string2}". If this is the case you can split the
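A minimal sketch of the application-level decoding suggested above, assuming values are stored as `<ISO 8601 timestamp>.<string1>.<string2>` and that neither trailing string contains a dot. The function name `split_record` is hypothetical; each returned part could then be written to its own column.

```python
def split_record(value):
    """Split one combined string into (timestamp, string1, string2)."""
    # Split from the right so dots inside the timestamp
    # (e.g. fractional seconds) stay with the timestamp.
    timestamp, string1, string2 = value.rsplit(".", 2)
    return timestamp, string1, string2
```

Splitting from the right is a design choice here: it only works if the two trailing strings are dot-free, which is an assumption about the data rather than something stated in the thread.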

Re: [C++] Adding RLEEncoder/Decoder to parquet API column writer/reader API

2020-03-10 Thread Micah Kornfield
do this late than never. > > - Wes > > On Sat, Mar 7, 2020 at 4:54 PM Micah Kornfield > wrote: > > > > The current API for writing repetition and definition levels takes arrays > > of int16_t values for the levels. This seems inefficient when there are > > "

[C++] Adding RLEEncoder/Decoder to parquet API column writer/reader API

2020-03-07 Thread Micah Kornfield
The current API for writing repetition and definition levels takes arrays of int16_t values for the levels. This seems inefficient when there are "runs" of the same level. When there are runs, writers that don't already have the data in array form, need to explode it out to an array which
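To illustrate the "runs" idea in the message above, here is a small sketch that collapses an exploded array of levels into `(level, run_length)` pairs and back. It only shows the concept; it is not Parquet's RLE/bit-packing hybrid wire format, and the function names are hypothetical.

```python
from itertools import groupby

def levels_to_runs(levels):
    """Collapse consecutive equal levels into (level, run_length) pairs."""
    return [(level, sum(1 for _ in group)) for level, group in groupby(levels)]

def runs_to_levels(runs):
    """Expand (level, run_length) pairs back into a flat list of levels."""
    out = []
    for level, count in runs:
        out.extend([level] * count)
    return out
```

A writer that already holds its data as runs would, under this sketch, hand over the pairs directly instead of first materializing the flat list.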

Re: Spark + Parquet, parquet dictionary

2020-09-09 Thread Micah Kornfield
> > 1. Where is the parquet dictionary is stored in ParquetFile? Is it stored > in the Footer of the file? Or is it stored in each page? It is stored in its own page [1] 2. When Spark reads a Parquet File, how is an RDD partitioned to read a > ParquetFile? Does it allocate one RDD partition

Re: Current status of Data Page V2?

2020-10-08 Thread Micah Kornfield
s well. That said, I'd love to make some > progress on better encodings and finalizing v2 so we can use them! > > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield > wrote: > >> What is the current status of support for Data Page V2? Is it recommended >> for production

Current status of Data Page V2?

2020-10-08 Thread Micah Kornfield
What is the current status of support for Data Page V2? Is it recommended for production workloads? Thanks, Micah

Re: Current status of Data Page V2?

2020-10-13 Thread Micah Kornfield
2 only. > > > > > > > > > > > > https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html > > > > > > > > > > > > > > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau > > > wrote: > >

Re: Parquet File Meta Data & Compatibility

2020-10-16 Thread Micah Kornfield
> > IMHO, shouldn't the spec mention - quite precisely - what versions exist > and what features can be used in which version, so an implementation can > say "yes, I can fully write this versions" or "no, I can't" instead of > having a fuzzy set of features where some are described to "not work on

Re: Current status of Data Page V2?

2020-10-09 Thread Micah Kornfield
ion for that in V1. > > On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield > wrote: > >> Thanks for the quick reply Ryan. >> >> >> > We only use v1 and it still works well. That said, I'd love to make some >> > progress on better encodings and f

Re: Implementing support for interpreting ColumnChunk.file_path in parquet file readers

2020-09-28 Thread Micah Kornfield
> > Concurrent writing will also benefit wide Parquet schemas that have > hundreds or thousands of columns. The impression I got was this use-case was typically outside of what parquet typically supports? > Concurrent writing will also benefit wide Parquet schemas that have > hundreds or

Re: Current status of Data Page V2?

2020-10-21 Thread Micah Kornfield
can we start recommending it for production use? Thanks, Micah On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield wrote: > I am not sure 2.0 means the v2 pages here. I think there was/is a bit of >> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the >> parquet-

Re: [DISCUSS] Parquet data masking/anonymization

2020-08-07 Thread Micah Kornfield
Hi Gidon, Was there prior discussion on this on the mailing list? I left a comment on the document, but it isn't clear to me why this particular use-case needs to be part of the core parquet library. Are there motivating use-cases that wouldn't be served by an external library/application level?

Re: Writing very large rowgroups to Apache Parquet

2020-07-11 Thread Micah Kornfield
they behave. It makes more sense to discuss further ideas once I >> have some performance numbers. >> >> Thanks, >> Roman >> >> >> Am Fr., 10. Juli 2020 um 06:47 Uhr schrieb Micah Kornfield < >> emkornfi...@gmail.com>: >> >> >

Re: Does Parquet format provide indexing for quick retrieval based on column filters?

2020-07-10 Thread Micah Kornfield
Hi Yash, there are a few mechanisms in Parquet that can help with this. Not all of them will be present in every parquet file. And not all implementations make use of them or populate them (i.e. C++ lacks a few): 1. Per Column statistics per-row-group and data pages [1]. Includes min/max
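As a rough illustration of how per-row-group min/max statistics can help with column filters, the sketch below keeps only the row groups whose value range could overlap a requested range. The `ColumnStats` class and `row_groups_to_read` function are hypothetical stand-ins, not the API of any Parquet library.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    # Hypothetical container for one column's statistics in one row group.
    min_value: int
    max_value: int

def row_groups_to_read(stats, lo, hi):
    """Return indices of row groups whose [min, max] range overlaps [lo, hi]."""
    return [
        i for i, s in enumerate(stats)
        if s.max_value >= lo and s.min_value <= hi
    ]
```

Row groups that are skipped are guaranteed not to contain matching values; row groups that are kept still need to be scanned, since min/max only bound the values.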

Re: Writing very large rowgroups to Apache Parquet

2020-07-09 Thread Micah Kornfield
+parquet-dev as this seems more concerned with the non-arrow pieces of parquet Hi Roman, Answers inline. One way to solve that problem would be to use memory mapped files instead > of plain memory buffers. That way, the amount of required memory can be > limited by the number of columns times

Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

2020-06-22 Thread Micah Kornfield
Instead of a custom compressor name, would exposing more metadata about the parameters a particular codec used for compression (e.g. compression level or block size) be sufficient? I'm not sure how standardized these are across given implementations/versions of the codecs

Re: Arrow 1404: Adding index for Page-level Skipping

2020-06-22 Thread Micah Kornfield
iee > > > Regards, > > Arun Balajiee > > > From: Micah Kornfield > Sent: Thursday, June 18, 2020 11:21:42 PM > To: dev@parquet.apache.org > Subject: Re: Arrow 1404: Adding index for Page-level Skipping > > Is this internally in the class or adding a parameter i

Re: Arrow 1404: Adding index for Page-level Skipping

2020-06-18 Thread Micah Kornfield
Is this internally in the class or adding a parameter in the API? What is the use case? On Saturday, June 13, 2020, Lekshmi Narayanan, Arun Balajiee < arl...@pitt.edu> wrote: > Hi Dev > > Thanks Wes for these comments. > > As Informed in other threads, I have completed most of it. Will try to >

Re: Writing very large rowgroups to Apache Parquet

2020-07-17 Thread Micah Kornfield
data already encoded/compressed). But on the other > > hand, having one memory mapped file per column is not something that > seems > > to fit well with the current design of arrow. > > > > Thanks for the feedback, > > Roman > > > > Am So., 12.

Re: Parquet File Meta Data & Compatibility

2020-12-06 Thread Micah Kornfield
er row groups. What should I do if I > anticipate > > > that my library will be used to write files where this will overflow? > Just > > > use it for the first 2^15 row groups and then leave it out? Or don't > write > > > it at all for any row group? > > >

Re: [ANNOUNCE] New Parquet PMC member - Xinli Shang

2020-11-09 Thread Micah Kornfield
Congrats! On Mon, Nov 9, 2020 at 4:52 PM Julien Le Dem wrote: > On behalf of the Apache Parquet PMC, I'm happy to announce that Xinli Shang > has accepted to join the PMC. > > Congrats Xinli! >

Re: [DISCUSS] Alternative design for KMS interaction in parquet-cpp

2020-11-12 Thread Micah Kornfield
I skimmed through and this seems like a clean design (I would have to reread the PR to do a comparison). A few thoughts off the top of my head: > - Multiple internal classes are left public in header files, where it > would be > preferred that public classes be kept to a minimum. I think some

Re: Current status of Data Page V2?

2020-10-22 Thread Micah Kornfield
to me if we want to recommend V2 for production use > at all or simply introduce the new encodings for V1. I would suggest > discussing this topic on the parquet sync next Tuesday. > > On Thu, Oct 22, 2020 at 6:04 AM Micah Kornfield > wrote: > > > I've created https://githu

Re: Query on incoherent total_byte_size and offset difference calculation results

2021-01-05 Thread Micah Kornfield
Hi Jayjeet, It isn't clear from your description whether the files being produced are corrupt or can be read but do not match your expectations. Either way some sample code and a more detailed explanation would be helpful in trying to figure out where the problem is. Thanks, Micah On Tue, Jan

Re: Bloom filter for apache parquet

2021-01-25 Thread Micah Kornfield
Welcome Vivianna, I think taking a look at https://issues.apache.org/jira/browse/PARQUET-41 and sub-issues should give you a sense of the current implementation. Java seems to have an implementation. The python implementation of parquet is a binding on top of the C++ implementation. Bloom

Purpose of isAdjustedToUTC for time type?

2021-06-15 Thread Micah Kornfield
It seems that Time has an isAdjustedToUTC flag [1] in the parquet.thrift, the current documentation covers the use of this flag clearly for Timestamp type [2], however given the explanation it isn't immediately clear to me how the flag should be interpreted for Time. For instance in Java [3] it

Re: Self Referencing Protobuf Solution

2021-06-14 Thread Micah Kornfield
> > That all being said, has there been any thought put into these types of > protos and how to effectively deal with them? Or is it just assumed any > proto being converted to parquet has no self-referenced attributes? Typically, the way I've seen this handled in other systems is to have a

Re: Purpose of isAdjustedToUTC for time type?

2021-06-21 Thread Micah Kornfield
xample: assertion error, not implemented or unsupported data type). > > Br, > > Zoltan > > On Wed, Jun 16, 2021 at 12:41 AM Micah Kornfield > wrote: > >> It seems that Time has an isAdjustedToUTC flag [1] in the parquet.thrift, >> the current documentation cove

New Interval Logical Type?

2021-05-12 Thread Micah Kornfield
For context Arrow is leaning towards introducing a new 3-field 16-byte interval type [1] (int32 months, int32 days, int64 nanos). This differs from the existing Interval converted type [2] in Parquet which is all unsigned and only supports millisecond granularity. It appears that a Logical type
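For a concrete picture of the 3-field, 16-byte layout mentioned above (int32 months, int32 days, int64 nanos), here is a sketch using Python's `struct` module. The little-endian byte order and the field order in the buffer are assumptions made for illustration; the snippet does not specify them.

```python
import struct

# Assumed layout: little-endian int32 months, int32 days, int64 nanos.
INTERVAL_FORMAT = "<iiq"

def pack_interval(months, days, nanos):
    """Pack the three signed fields into a 16-byte buffer."""
    return struct.pack(INTERVAL_FORMAT, months, days, nanos)

def unpack_interval(raw):
    """Unpack a 16-byte buffer into (months, days, nanos)."""
    return struct.unpack(INTERVAL_FORMAT, raw)
```

All three fields are signed here, which is the contrast the message draws with the existing unsigned, millisecond-granularity Interval converted type.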

Re: New Parquet PMC chair

2021-05-29 Thread Micah Kornfield
Congratulations Xinli! On Fri, May 28, 2021 at 10:16 PM Gidon Gershinsky wrote: > Congratulations Xinli, well deserved!! > > Cheers, Gidon > > > On Sat, May 29, 2021 at 12:34 AM Julien Le Dem > wrote: > > > Hello Parquet community, > > The Parquet PMC discussed and decided some time ago to

Re: Order of encodings?

2021-06-02 Thread Micah Kornfield
the write path. Meanwhile, > based on the parquet-mr code, it seems that the scenario explained can be > read properly. (If we think we have/will have writers that support such > scenarios we shall write unit tests for them.) > > Cheers, > Gabor > > On Tue, Jun 1, 2021 at

Re: Parquet sync meeting May 2021

2021-05-25 Thread Micah Kornfield
> > Iceberg filtering / V2 api > 3. > Adopt Arrow as data model Curious if more notes, or discussion around the two points above will happen on the mailing list? Or are there relevant JIRAs? Thanks, Micah On Tue, May 25, 2021 at 9:45 AM Xinli shang wrote: > 5/25/2021 > >

Order of encodings?

2021-06-01 Thread Micah Kornfield
I couldn't find anything in the specification on this, but is there any constraint on the ordering of encoded pages in a column for a row group? I think in practice most implementations try to dictionary-encode first and then fall back to another encoding if the dictionary doesn't yield benefits

Re: New parquet-format release?

2021-04-02 Thread Micah Kornfield
> > "Core features" is clearly not in a shape to be finalized soon so we > can postpone it to the release after. What do we think we need to do to get it to a releasable state? On Tue, Mar 30, 2021 at 6:44 AM Gabor Szadovszky wrote: > Thanks a lot, Antoine for the summary and heads up. #166

Re: [C++] Changing the versioning string for Parquet-CPP

2021-03-09 Thread Micah Kornfield
, 5 Mar 2021 10:26:57 -0800 > Micah Kornfield > wrote: > > > > I'd like to propose that we change the default version string [1] for > > parquet-cpp to reflect arrow releases (e.g. "parquet-cpp-arrow version > > 3.0.0" instead of "parquet-cpp version

Re: [Announce] new committer: Gidon Gershinsky

2021-04-07 Thread Micah Kornfield
Congrats Gidon, well deserved. On Wed, Apr 7, 2021 at 5:10 AM Nándor Kollár wrote: > Congrats Gidon! > > On 2021/04/07 11:55:45, Gabor Szadovszky wrote: > > The Project Management Committee (PMC) for Apache Parquet > > has invited Gidon Gershinsky to become a committer and we are pleased > >

Re: Request deprecation / removal of LZ4 compression

2021-02-17 Thread Micah Kornfield
out that for features like compression codecs the specification is > the key. We might always have specific data that we did not include in the > integration tests and still fails at runtime. > > On Wed, Feb 17, 2021 at 4:59 AM Micah Kornfield > wrote: > > > > > >

Re: Request deprecation / removal of LZ4 compression

2021-02-16 Thread Micah Kornfield
> > I think it would be a mistake for someone who has written Hadoop-Lz4 for > several years with parquet-mr to all of sudden be no longer able to read > their files. (I believe that parquet-mr with this pattern has been > incorporated into various libraries for several years now--correct me if >

Re: Concatenation of parquet files

2021-10-15 Thread Micah Kornfield
Hi David, I'm not sure I understand. Concatenating files like this would likely break things. In particular in the example: > Merged: > > > ROW GROUP A1 > > > FOOTER A1 > > > ROW GROUP A2 > > > FOOTER A2 > > > ROW GROUP B1 > > > FOOTER B1 > > > ROW GROUP B2 > > > FOOTER B2 There should only

RE: Concatenation of parquet files

2021-10-23 Thread Micah Kornfield
e. > > The footer includes the file schema (column names and their types) as well > as details about every row group (total size, number of rows, min/max > statistics, number of NULL values for every column). > > Note that this column statistics is per row group, not for the entir

Re: [ANNOUNCEMENT] Gidon Gershinsky as Apache Parquet PMC

2021-11-24 Thread Micah Kornfield
Congrats Gidon! On Wed, Nov 24, 2021 at 2:12 PM Driesprong, Fokko wrote: > Congrats Gidon, well deserved! > > Op wo 24 nov. 2021 om 22:46 schreef Chao Sun > > > Congratulations Gidon! > > > > On Wed, Nov 24, 2021 at 1:27 PM Xinli shang > > wrote: > > > > > Hi all, > > > > > > The Project

Map Type duplicate keys

2021-10-25 Thread Micah Kornfield
Hi dev@parquet, The Logical Type Specification [1] has the following to say about duplicate keys. If there are multiple key-value pairs for the same key, then the final > value for that key must be the last value. Other values may be ignored or > may be added with replacement to the map container
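The "last value wins" reading of the quoted rule can be sketched as follows. This assumes a reader that materializes the key-value pairs in file order into an ordinary dictionary; `build_map` is a hypothetical helper, not part of any Parquet reader.

```python
def build_map(pairs):
    """Build a map from (key, value) pairs; later duplicates replace earlier ones."""
    result = {}
    for key, value in pairs:
        result[key] = value  # "added with replacement": the last value for a key wins
    return result
```

A reader that instead ignores the earlier duplicates up front ends with the same final values, which is why the quoted text permits either behavior.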

Re: Support for DELTA_LENGTH_BYTE_ARRAY?

2021-07-20 Thread Micah Kornfield
Hi Jorge, This has been discussed previously; there is an open PR [1], which has lost some steam, that tries to figure out minimally supported features. -Micah [1] https://github.com/apache/parquet-format/pull/164 On Tue, Jul 20, 2021 at 1:15 AM Kyle Bendickson wrote: > Oh my apologies - > >

Re: statistics null count in nested types

2021-07-16 Thread Micah Kornfield
I agree this is non-intuitive based on field names but seems consistent with the text noted below (15 values are present and only 11 are written). It seems another way of defining the value for this field would be number of definition levels written that aren't less than the max definition level?

Re: num_values vs num_rows vs num_nulls

2021-07-15 Thread Micah Kornfield
> > I don't have any experience in pyarrow but either it writes wrong values > into these fields or the schema is not the same as the one in your example. The number of rows from pyarrow is clearly a bug (the code passes num_values for both). I think it might be worth discussing the null count

Re: num_values vs num_rows vs num_nulls

2021-07-16 Thread Micah Kornfield
> Thanks, that was exactly what I was looking for. > > I do think we could offer this or other examples in the spec to make it > clear what they represent (including the null count). > > I filed ARROW-13349 to track the pyarrow discrepancy. > > Best, > Jorge > >

Re: num_values vs num_rows vs num_nulls

2021-07-16 Thread Micah Kornfield
at 11:51 PM Micah Kornfield wrote: > Yeah I guess we only ever write 4 values for the example so even though > the wording is strange in num_values = 6 (which I don't think anyone is > debating it must be 2). Still a little confusing. > > On Thu, Jul 15, 2021 at 11:43 PM Jorge

Re: Decode DELTA_LENGTH_BYTE_ARRAY in chunks

2022-02-05 Thread Micah Kornfield
Hi Jorge, I agree, I don't think you can decode the actual binary data without doing some level of decoding the lengths. It seems possible to skip over some decoding by only encoding the block header data and the bit widths in each miniblock to skip to the next block. Cheers, -Micah On
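A short sketch of why the lengths are needed before the binary data can be sliced: in DELTA_LENGTH_BYTE_ARRAY the byte array payloads are concatenated after the lengths, so the start of value `i` is the sum of all earlier lengths. The sketch assumes the lengths have already been decoded into a plain list; `slice_values` is a hypothetical helper.

```python
from itertools import accumulate

def slice_values(lengths, data, start, stop):
    """Return values[start:stop] from already-decoded lengths and concatenated bytes."""
    # Prefix sums give the byte offset at which each value begins.
    offsets = [0, *accumulate(lengths)]
    return [data[offsets[i]:offsets[i + 1]] for i in range(start, stop)]
```

Reading a chunk in the middle still requires the sum of every preceding length, which is the dependency the message describes.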

Re: Decode DELTA_LENGTH_BYTE_ARRAY in chunks

2022-02-06 Thread Micah Kornfield
Apologies, I just realized the last sentence doesn't make sense. TL;DR: I think you can skip over decoding most of the mini-blocks if this helps your use-case. On Sat, Feb 5, 2022 at 10:49 PM Micah Kornfield wrote: > Hi Jorge, > I agree, I don't think you can decode the

Re: Parquet dictionary size limits?

2023-09-15 Thread Micah Kornfield
a bit of work. Additionally, one would likely > > need heuristics for when to potentially use the new mode versus a > complete > > fallback. > > > > Got it, thanks for the explanation! It does seem like a huge amount of work > > > Best, > Claire >

Re: Parquet dictionary size limits?

2023-09-15 Thread Micah Kornfield
/DictionaryValuesWriter.java#L124 On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield wrote: > I'm glad I was looking at the right setting for dictionary size. I just >> tried it out with 10x, 50x, and even total file size, though, and still am >> not seeing a dictionary get created. Is i

Re: Parquet dictionary size limits?

2023-09-14 Thread Micah Kornfield
> > - What's the heuristic for Parquet dictionary writing to succeed for a > given column? https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117 > - Is that

Re: Lossy compression of floating point data

2023-11-03 Thread Micah Kornfield
Hi Michael, Taking a quick scan of the repo it seems like there is only a C++ implementation for SZ3? If so, I don't think this is a good candidate unless the algorithm is easy to port to other languages. As a secondary concern, I don't think that BYTE_STREAM_SPLIT has even been widely adopted

[VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-06 Thread Micah Kornfield
We would like to add statistics to better estimate size of pages and column chunks after they are read back into memory from parquet: https://github.com/apache/parquet-format/pull/197 Additionally, this metadata can support finer grained null filters and list lengths for nested types. At a

Re: Parquet dictionary size limits?

2023-10-01 Thread Micah Kornfield
t; > > Claire > > >> > > > > > > > > >> > > > > > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty < > > >> > > > > > claire.d.mcgi...@gmail.com > > >> > > > > >

Re: [Request] Send automated notifications to a separate mailing-list

2023-10-01 Thread Micah Kornfield
+1 I think we need a PMC member to make this change? On Tue, Aug 29, 2023 at 7:27 PM Gang Wu wrote: > I think we can send a notification email to the dev@ so that > people can know what is going on and subscribe to what they > want after the split. We should also update the website to > reflect

Re: [VOTE][Format] Add Float16 type to specification

2023-10-06 Thread Micah Kornfield
I'm +1 (non-binding) for the proposal in general. I do have a concern that we should be implementing https://issues.apache.org/jira/browse/PARQUET-2182 (ignoring stats for logical types the reader doesn't understand) and its equivalent in other libraries first, but given potential low usage we

Re: [VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-10 Thread Micah Kornfield
Hello, we need one more PMC member to approve this before the result can become official. Would someone mind chiming in? Thanks, Micah On Wed, Nov 8, 2023 at 8:55 AM Gábor Szádovszky wrote: > +1 (binding) > > Cheers, > Gabor > > On 2023/11/07 02:46:37 Xinli shang wrote: > > +1 (binding) > > >

Re: Forward & Backwards Compatibility

2022-05-29 Thread Micah Kornfield
I'd be in favor of maybe adding a flag that allows this type of schema which is by default false. Another option is if we can identify the writer of these files, we can make an exception specifically for those versions? On Wed, May 18, 2022 at 4:02 PM William Butler wrote: > > > > Well, why is

Re: Interest in adding the float16 logical type to the Parquet spec

2022-09-04 Thread Micah Kornfield
Just as a follow-up on the proposal PR [1]. A blocker came up based on the fact that we have never fully addressed how statistics for floating point values should be handled (PARQUET-1222). [1] https://github.com/apache/parquet-format/pull/184 On Wed, Aug 24,

Re: IMPORTANT: specification bugs around v2 data pages

2023-01-06 Thread Micah Kornfield
> > - https://issues.apache.org/jira/browse/PARQUET-2221: Encoding spec > incorrect for dictionary fallback The way I've always interpreted the encodings on the writer's side is that any fallback (or series of fallbacks) should be considered valid, even though that isn't as the spec reads, and

Re: [Format] Clarifying Sort Order Requirements for Floating Points and Logical Types

2022-12-07 Thread Micah Kornfield
https://github.com/apache/parquet-format/pull/185 has been merged. On Fri, Nov 4, 2022 at 9:54 PM Micah Kornfield wrote: > A new proposal for adding a logical annotation to support Float16 values > [1] reopened the discussion on specifying how parquet should deal with > edge cases for

Re: parquet checksum coverage

2022-12-13 Thread Micah Kornfield
where the CRC checks place and the actual reader code. > > > On Thu, 1 Dec 2022 at 05:53, Micah Kornfield > wrote: > > > Hi Steve, > > > > 1. What data in a parquet file is covered by CRC checks, and are there > > >any blocks of data (footers, summari

Re: parquet checksum coverage

2022-11-30 Thread Micah Kornfield
Hi Steve, 1. What data in a parquet file is covered by CRC checks, and are there >any blocks of data (footers, summaries etc) which aren't checksummed? https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642 I think has the best summary. My understanding is
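As a sketch of what a page-level check might look like, the code below computes a standard CRC-32 over a page's bytes with Python's `zlib`. It assumes the checksum is the common CRC-32 (the one gzip uses) taken over the page data as written, excluding the page header; that is my reading of the comment in `parquet.thrift`, not something confirmed by the truncated message. `page_crc` and `verify_page` are hypothetical helpers.

```python
import zlib

def page_crc(page_bytes):
    """Standard CRC-32 of the page bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF

def verify_page(page_bytes, expected_crc):
    """Return True if the stored checksum matches the recomputed one."""
    # Mask the stored value too, in case it was read as a signed 32-bit field.
    return page_crc(page_bytes) == (expected_crc & 0xFFFFFFFF)
```

Anything outside the page data, such as the footer, would not be covered by a check of this kind.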

[Format] Clarifying Sort Order Requirements for Floating Points and Logical Types

2022-11-04 Thread Micah Kornfield
A new proposal for adding a logical annotation to support Float16 values [1] reopened the discussion on specifying how parquet should deal with edge cases for floating point types (PARQUET-1222 [2]). To try to resolve this the consensus from the JIRA is to not try to specify an ordering when

Re: Add FilteredPageReader to filter rows based on page statistics

2022-10-31 Thread Micah Kornfield
Hi Fatemah, I think there are likely two things to consider here: 1. How will expressions be modeled? There are already some examples of using expressions in Arrow for pruning predicates [1]. Do you plan to re-use them? 2. Along these lines is the proposed approach taken because the API to

[DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-24 Thread Micah Kornfield
Parquet metadata currently tracks uncompressed and compressed page/column sizes [1][2]. Uncompressed size here corresponds to encoded size which can differ substantially from the plain encoding size due to RLE/Dictionary encoding. When doing query planning/execution it can be useful to
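The gap between plain-encoded size and encoded size can be illustrated with a rough calculation. The first function relies on the PLAIN layout for BYTE_ARRAY (a 4-byte length prefix followed by the bytes). The second is a deliberately simplified, hypothetical estimate of a dictionary-encoded column that ignores RLE run headers and page overhead.

```python
import math

def plain_byte_array_size(values):
    """Bytes needed to PLAIN-encode BYTE_ARRAY values: 4-byte length prefix + payload."""
    return sum(4 + len(v) for v in values)

def rough_dictionary_size(values):
    """Simplified estimate: plain-encoded dictionary plus bit-packed indices."""
    distinct = set(values)
    if not distinct:
        return 0
    bit_width = max(1, math.ceil(math.log2(len(distinct))))
    index_bytes = math.ceil(len(values) * bit_width / 8)
    return plain_byte_array_size(distinct) + index_bytes
```

For a column holding the same 3-byte string 1000 times, the plain size under this sketch is 7000 bytes while the simplified dictionary estimate is 132 bytes, which is the kind of difference that makes the encoded size a poor proxy for in-memory size.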

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
e it seems like a plausible approach, there might be others. On Sat, Mar 25, 2023 at 4:59 PM Micah Kornfield wrote: > 1. How primitive types are computed? Should we simply compute the raw size >> by assuming the data is plain-encoded? >> For example, does INT16 use the same b

Re: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
complex to track. I'd > > definitely be interested to hear from folks who have worked on the > > implementations for the other size fields what the level of difficulty is > > to implement such a field. > > > > Best, > > > > Will Jones > > >

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
> implementations for the other size fields what the level of difficulty is > > to implement such a field. > > > > Best, > > > > Will Jones > > > > [1] https://github.com/apache/arrow/issues/34712 > > > > On Fri, Mar 24, 2023 at 9:27 

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
if there are other approaches. On Sat, Mar 25, 2023 at 5:37 PM Micah Kornfield wrote: > 2. For repeated values, I think it is sufficient to get a reasonable >> estimate to know the number of start arrays (this includes nested arrays) >> contained in a page/column chunk and we can

Re: Gang Wu as new Apache Parquet committer

2023-03-04 Thread Micah Kornfield
Congrats! On Monday, February 27, 2023, Xinli shang wrote: > The Project Management Committee (PMC) for Apache Parquet has invited Gang > Wu (gangwu) to become a committer and we are pleased to announce that he > has accepted. > > Congratulations and welcome, Gang! > > -- > Xinli Shang >

Re: Parquet Null logical type question

2023-02-28 Thread Micah Kornfield
It is a validation bug that you can read and write values to the column. My understanding of the use-case for the type is coming from more loosely typed systems that infer schemas on the fly and then write to Parquet. In these systems if a column contains all Null values then the actual type

Re: Fwd: [C++] Parquet and Arrow overlap

2023-04-14 Thread Micah Kornfield
Bumping this thread again to see if any Parquet PMC members can chime in/maybe start a formal vote to move governance of Parquet-CPP under the umbrella. -Micah On 2023/02/02 10:34:25 Antoine Pitrou wrote: > > > Hi Will, > > Le 01/02/2023 à 20:27, Will Jones a écrit : > > > > First, it's not

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-04-19 Thread Micah Kornfield
> variable-size-summary + sizeof(ByteArray) * value-count > 3. Some time Arrow data is not equal to Parquet data, like Decimal stored > as int32 or int64. > Hope that helps. > > Best, Xuwei Fu > > On 2023/03/24 16:26:51 Micah Kornfield wrote: > > Parquet metadata c

Re: [Discussion] Parquet “parquet.writer.version” info is missing in the parquet metadata/dump

2023-04-22 Thread Micah Kornfield
I'm not familiar with it but I would think the show metadata command would work to get general metadata. Please note the version field is not entirely helpful as some implementations always hard-code it to a certain value. The application/created by is generally a better way to determine the

Re: [C++] Parquet and Arrow overlap

2023-02-12 Thread Micah Kornfield
> > I am a committer on Arrow, > but not on Parquet right now. Does that mean I should only merge Parquet > C++ PRs for code changes in parquet/arrow? FWIW, This was the mode I was operating under. My preference here would be to continue to operate under this mode for the governance perspective.

Re: [DISCUSS] Time to release parquet format 2.10.0?

2023-07-16 Thread Micah Kornfield
uld have a new > release. In addition to PARQUET-2261, there was also a discussion in Feb > with PMCs for PARQUET-758. We may want to check for the plan with Antoine > Pitrou <https://github.com/pitrou> if PARQUET-758 wants to be in also. > > > > On Sat, May 13, 2023 at 9:

Re: Bloom filters for full-text search and predicate pushdown

2023-06-07 Thread Micah Kornfield
Hi Marco, Could you describe how your proposal differs from tokenizing the target string and storing the list of tokens in a column that has a bloom filter attached? I think this should be supportable today by the format at least if not existing libraries. Thanks, Micah On Wednesday, June 7,
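The alternative raised above (tokenize the string, then attach a Bloom filter to the token list) can be sketched with a toy filter. This is a generic Bloom filter built on `hashlib`, not Parquet's split-block Bloom filter, and `TinyBloom` and `index_text` are hypothetical names; whitespace tokenization and lowercasing are also assumptions.

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter; sizes and hash scheme are illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python integer

    def _positions(self, token):
        # Derive several bit positions by hashing the token with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(token))

def index_text(text):
    """Tokenize on whitespace and add every lowercased token to a filter."""
    bloom = TinyBloom()
    for token in text.lower().split():
        bloom.add(token)
    return bloom
```

A predicate such as "contains the word X" could then skip any chunk whose filter reports X as definitely absent, at the cost of occasional false positives.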

Re: Bloom filters for full-text search and predicate pushdown

2023-06-07 Thread Micah Kornfield
am using Apache Arrow in particular to deal with Parquet files > > Il Mer 7 Giu 2023, 16:00 Micah Kornfield ha > scritto: > > > Hi Marco, > > Could you describe how your proposal differs from tokenizing the target > > string and storing the list of tokens in a colu

Re: Rewrite Parquet List columns

2023-07-29 Thread Micah Kornfield
I think this is referring to changing the schema for List logical types [1] from the deprecated schema layout to the "required" 3 level schema layout. [1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists On Sun, Jul 23, 2023 at 8:32 AM Xinli shang wrote: > Hi Rajesh, >

Re: [DISCUSS] Time to release parquet format 2.10.0?

2023-05-13 Thread Micah Kornfield
> > BTW, I'd like to see the implementation from Micah to fully > understand the use case. If he is too busy to do that, I can do it based on > my understanding. I can allocate some time to try to make a PoC in C++ next month if we are willing to wait until then. On Fri, May 12, 2023 at 5:04 

Re: Pitch for Pcodec Encoding in Parquet

2024-01-04 Thread Micah Kornfield
st toolchain to use a jar built with Rust > sources. > > Thanks. > > On Wed, Jan 3, 2024, 13:43 Micah Kornfield wrote: > > > Hi Martin, > > The results are impressive. However I'll point you to a recent prior > > discussion on a proposed new encoding/compression techniqu

Re: Pitch for Pcodec Encoding in Parquet

2024-01-03 Thread Micah Kornfield
Hi Martin, The results are impressive. However I'll point you to a recent prior discussion on a proposed new encoding/compression technique [1]. While this seems to avoid the lossiness concerns, there are also suggested benchmarks

Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Micah Kornfield
> > I don't believe Apache has any restriction against Rust. We are not > collectively beholden to any other organization's restrictions, are we? It is correct that Apache does not have any restrictions. The point is mostly about: 1. Even if there is no restriction, maintainers of Apache

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Micah Kornfield
Martin Loncaric > wrote: > > > Micah: I've added a format doc now: > > https://github.com/mwlon/pcodec/blob/main/docs/format.md. Would > appreciate > > any feedback or thoughts on it. > > > > On Thu, Jan 11, 2024 at 11:47 PM Micah Kornfield > > wrote: >

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Micah Kornfield
> > Pco could technically work as a Parquet encoding, but people are wary of > its newness and weak FFI support. It seems there is no immediate action to > take, but would be worthwhile to consider this again further in the future. I guess I'm more optimistic on the potential gaps. I think if

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Micah Kornfield
have a sense of how much the difference-in-difference vs the tANS step contributes to the compression experimental ratio? Thanks, Micah On Sat, Jan 13, 2024 at 9:43 PM Micah Kornfield wrote: > Hi Martin, > I agree with Gang's point about tAns. I opened up an issue against the > pcodec

Re: [Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY

2024-01-07 Thread Micah Kornfield
I responded there but generally, this doesn't seem like it imposes a lot of implementation burden and can be useful. On Thu, Dec 14, 2023 at 12:59 PM Antoine Pitrou wrote: > > Hello, > > Just a heads up here so as to reach a wider audience: I've posted a > format addition proposal in >

Files with inconsistent num_rows and num_values?

2023-11-28 Thread Micah Kornfield
We've recently encountered files that have inconsistencies between the number of rows specified in the row group [1] and the total number of values in a column [2] for non-repeated columns (within a file there is inconsistency between columns but all counts appear to be greater than or equal to
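The consistency check described in this message can be sketched as below. This is a minimal illustration that assumes simplified stand-in metadata records, not any real Parquet library API: `ColumnChunk`, `RowGroup`, and `find_mismatches` are hypothetical names. It relies on the premise in the message that, for a non-repeated column (maximum repetition level 0), the column's `num_values` should equal the row group's `num_rows`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for a column chunk's metadata.
    path: str
    num_values: int
    max_repetition_level: int


@dataclass
class RowGroup:
    # Hypothetical stand-in for a row group's metadata.
    num_rows: int
    columns: List[ColumnChunk]


def find_mismatches(row_groups: List[RowGroup]) -> List[Tuple[int, str, int, int]]:
    """Return (row_group_index, column_path, num_rows, num_values) for each
    non-repeated column whose value count differs from the row count."""
    mismatches = []
    for index, group in enumerate(row_groups):
        for column in group.columns:
            # Repeated columns may legitimately hold more values than rows.
            if column.max_repetition_level > 0:
                continue
            if column.num_values != group.num_rows:
                mismatches.append(
                    (index, column.path, group.num_rows, column.num_values)
                )
    return mismatches
```

In practice the records would be filled from whatever metadata reader is in use; the sketch only shows the comparison itself.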

Re: Files with inconsistent num_rows and num_values?

2023-11-28 Thread Micah Kornfield
ttps://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108 > > Best, > Gang > > On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield > wrote: > > > We've recently encountered files that have inconsistencies between the > > number of rows speci

Re: Files with inconsistent num_rows and num_values?

2023-12-05 Thread Micah Kornfield
may have > their own column-wise column writer implementations and only write pages > to the parquet-mr layer. > > Best, > Gang > > On Wed, Nov 29, 2023 at 2:14 PM Micah Kornfield > wrote: > > > Hi Gang, > > For writes I'm seeing "parquet-mr version 1.11.1"

Re: [VOTE][RESULT][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-14 Thread Micah Kornfield
> > > > > > Kind regards, > > > Fokko Driesprong > > > > > > (Also pinged two more PMC members, hopefully they have time to jump in > > > here) > > > > > > Op vr 10 nov 2023 om 19:40 schreef Micah Kornfield < > > em

Re: [VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-08 Thread Micah Kornfield
+1 (non-binding) On Thursday, March 7, 2024, Gang Wu wrote: > +1 (non-binding) > > Best, > Gang > > On Fri, Mar 8, 2024 at 5:05 AM Edward Seidl wrote: > > > +1 (non-binding) > > > > Thanks for your work on this! > > Ed > > > > From: Antoine Pitrou > > Sent:

Re: Selecting format_version=2.6 ?

2024-03-15 Thread Micah Kornfield
> > Thanks for raising the issue! You are right that the version is always > 1 written by parquet-mr. Last I checked at least Impala fails if the version is not set to 1 (not sure if there are other engines). On Fri, Mar 15, 2024 at 9:07 AM Gang Wu wrote: > Hi Stephen, > > Thanks for raising

Re: Repeated fields spec clarification

2024-05-10 Thread Micah Kornfield
> >- I.e., is a parquet file with a page that starts at an r-level > 0 ill >formed? I.e., is this a bug in pyarrow.parquet? As noted above, my understanding is that it is only ill-formed if a page index is present OR data-page V2 is present. If neither holds, then I think it is a valid
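The rule as stated in this reply can be sketched as follows. This only illustrates the understanding expressed above; it is not a statement of what the specification requires. `Page`, `page_starts_mid_row`, and `column_chunk_is_ill_formed` are hypothetical names, the inputs are simplified, and the sketch assumes the data-page V2 condition applies to the individual page that starts mid-row.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    # Hypothetical simplified page: repetition levels plus a data page V2 flag.
    repetition_levels: List[int]
    is_data_page_v2: bool = False


def page_starts_mid_row(page: Page) -> bool:
    """True when the first repetition level is > 0, i.e. the page continues a row."""
    return bool(page.repetition_levels) and page.repetition_levels[0] > 0


def column_chunk_is_ill_formed(pages: List[Page], has_page_index: bool) -> bool:
    """Apply the rule as understood in the reply: a page starting mid-row is a
    problem only when a page index is present or the page is a data page V2
    (the per-page reading of the V2 condition is an assumption)."""
    for page in pages:
        if page_starts_mid_row(page) and (has_page_index or page.is_data_page_v2):
            return True
    return False
```

Under this reading, a V1 file without a page index whose pages split a row across a page boundary would not be flagged.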

Re: Archival of parquet-cpp repository

2024-05-11 Thread Micah Kornfield
I think this is a great idea, thanks for driving it Uwe. On Mon, May 6, 2024 at 4:50 AM Uwe L. Korn wrote: > Hi, > > Given that we haven't the parquet-cpp for over six years now, I made a PR > https://github.com/apache/parquet-cpp/pull/504 that removes most of the > contents over at

Interest in Parquet V3

2024-05-11 Thread Micah Kornfield
Hi Parquet Dev, I wanted to start a conversation within the community about working on a new revision of Parquet. For context there have been a bunch of new formats [1][2][3] that show there is decent room for improvement across data encodings and how metadata is organized. Specifically, in a

Re: [ANNOUNCE] New Parquet PMC Member: Gang Wu

2024-05-11 Thread Micah Kornfield
Congrats Gang! On Sat, May 11, 2024 at 12:15 PM Vinoo Ganesh wrote: > Congrats, Gang!! > > > > > > On Sat, May 11, 2024 at 8:45 PM Claire McGinty > > wrote: > > > Congrats Gang!! Well deserved! > > > > - Claire > > > > On Sat, May 11, 2024 at 6:22 PM Fokko Driesprong > wrote: > > > > >
