Re: Selecting format_version=2.6 ?

2024-03-15 Thread Micah Kornfield
> > Thanks for raising the issue! You are right that the version is always > 1 written by parquet-mr. Last I checked at least Impala fails if the version is not set to 1 (not sure if there are other engines). On Fri, Mar 15, 2024 at 9:07 AM Gang Wu wrote: > Hi Stephen, > > Thanks for raising

Re: [VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-08 Thread Micah Kornfield
+1 (non-binding) On Thursday, March 7, 2024, Gang Wu wrote: > +1 (non-binding) > > Best, > Gang > > On Fri, Mar 8, 2024 at 5:05 AM Edward Seidl wrote: > > > +1 (non-binding) > > > > Thanks for your work on this! > > Ed > > > > From: Antoine Pitrou > > Sent:

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Micah Kornfield
have a sense of how much the difference-in-difference vs the tANS step contributes to the compression experimental ratio? Thanks, Micah On Sat, Jan 13, 2024 at 9:43 PM Micah Kornfield wrote: > Hi Martin, > I agree with Gang's point about tAns. I opened up an issue against the > pcodec

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Micah Kornfield
Martin Loncaric > wrote: > > > Micah: I've added a format doc now: > > https://github.com/mwlon/pcodec/blob/main/docs/format.md. Would > appreciate > > any feedback or thoughts on it. > > > > On Thu, Jan 11, 2024 at 11:47 PM Micah Kornfield > > wrote:

Re: Pitch for Pcodec Encoding in Parquet

2024-01-11 Thread Micah Kornfield
> > Pco could technically work as a Parquet encoding, but people are wary of > its newness and weak FFI support. It seems there is no immediate action to > take, but would be worthwhile to consider this again further in the future. I guess I'm more optimistic on the potential gaps. I think if

Re: [Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY

2024-01-07 Thread Micah Kornfield
I responded there but generally, this doesn't seem like it imposes a lot of implementation burden and can be useful. On Thu, Dec 14, 2023 at 12:59 PM Antoine Pitrou wrote: > > Hello, > > Just a heads up here so as to reach a wider audience: I've posted a > format addition proposal in >

Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Micah Kornfield
> > I don't believe Apache has any restriction against Rust. We are not > collectively beholden to any other organization's restrictions, are we? It is correct that Apache does not have any restrictions. The point is mostly about: 1. Even if there is no restriction, maintainers of Apache

Re: Pitch for Pcodec Encoding in Parquet

2024-01-04 Thread Micah Kornfield
st toolchain to use a jar built with Rust > sources. > > Thanks. > > On Wed, Jan 3, 2024, 13:43 Micah Kornfield wrote: > > > Hi Martin, > > The results are impressive. However I'll point you to a recent prior > > discussion on a proposed new encoding/compression techniqu

Re: Pitch for Pcodec Encoding in Parquet

2024-01-03 Thread Micah Kornfield
Hi Martin, The results are impressive. However, I'll point you to a recent prior discussion on a proposed new encoding/compression technique [1]. While this seems to avoid the lossiness concerns, there are also suggested benchmarks

Re: Files with inconsistent num_rows and num_values?

2023-12-05 Thread Micah Kornfield
may have > their own column-wise column writer implementations and only write pages > to the parquet-mr layer. > > Best, > Gang > > On Wed, Nov 29, 2023 at 2:14 PM Micah Kornfield > wrote: > > > Hi Gang, > > For writes I'm seeing "parquet-mr version 1.11.1"

Re: Files with inconsistent num_rows and num_values?

2023-11-28 Thread Micah Kornfield
ttps://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108 > > Best, > Gang > > On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield > wrote: > > > We've recently encountered files that have inconsistencies between the > > number of rows speci

Files with inconsistent num_rows and num_values?

2023-11-28 Thread Micah Kornfield
We've recently encountered files that have inconsistencies between the number of rows specified in the row group [1] and the total number of values in a column [2] for non-repeated columns (within a file there is inconsistency between columns but all counts appear to be greater than or equal to
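A minimal sketch of the kind of check described above, assuming the row-group metadata has already been read into plain Python dictionaries. The field names `num_rows`, `num_values`, `repeated`, and `path` are hypothetical stand-ins for illustration, not any library's API. It only flags non-repeated columns whose value count differs from the row count of the row group.

```python
def find_inconsistent_columns(row_group):
    """Return paths of non-repeated columns whose num_values != num_rows.

    `row_group` is a hypothetical dict layout used only for illustration.
    """
    expected = row_group["num_rows"]
    return [
        col["path"]
        for col in row_group["columns"]
        # Repeated columns may legitimately hold more values than rows.
        if not col["repeated"] and col["num_values"] != expected
    ]


# Hypothetical metadata for one row group with three columns.
row_group = {
    "num_rows": 100,
    "columns": [
        {"path": "id", "repeated": False, "num_values": 100},
        {"path": "name", "repeated": False, "num_values": 104},
        {"path": "tags.list.element", "repeated": True, "num_values": 250},
    ],
}

bad_columns = find_inconsistent_columns(row_group)
print(bad_columns)
```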

[jira] [Commented] (PARQUET-2221) [Format] Encoding spec incorrect for dictionary fallback

2023-11-21 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788519#comment-17788519 ] Micah Kornfield commented on PARQUET-2221: -- I agree with [~wgtmac] here.  I think we should

Re: [VOTE][RESULT][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-14 Thread Micah Kornfield
> > > > > > Kind regards, > > > Fokko Driesprong > > > > > > (Also pinged two more PMC members, hopefully they have time to jump in > > > here) > > > > > > Op vr 10 nov 2023 om 19:40 schreef Micah Kornfield < > > em

Re: [VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-10 Thread Micah Kornfield
Hello, we need one more PMC member to approve this before the result can become official. Would someone mind chiming in? Thanks, Micah On Wed, Nov 8, 2023 at 8:55 AM Gábor Szádovszky wrote: > +1 (binding) > > Cheers, > Gabor > > On 2023/11/07 02:46:37 Xinli shang wrote: > > +1 (binding) > > >

[VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-06 Thread Micah Kornfield
We would like to add statistics to better estimate the size of pages and column chunks after they are read back into memory from Parquet: https://github.com/apache/parquet-format/pull/197 Additionally, this metadata can support finer-grained null filters and list lengths for nested types. At a

Re: Lossy compression of floating point data

2023-11-03 Thread Micah Kornfield
Hi Michael, Taking a quick scan of the repo it seems like there is only a C++ implementation for SZ3? If so, I don't think this is a good candidate unless the algorithm is easy to port to other languages. As a secondary concern, I don't think that BYTE_STREAM_SPLIT has even been widely adopted

Re: [VOTE][Format] Add Float16 type to specification

2023-10-06 Thread Micah Kornfield
I'm +1 (non-binding) for the proposal in general. I do have a concern that we should be implementing https://issues.apache.org/jira/browse/PARQUET-2182 (ignoring stats for logical types the reader doesn't understand) and its equivalent in other libraries first, but given potential low usage we

Re: [Request] Send automated notifications to a separate mailing-list

2023-10-01 Thread Micah Kornfield
+1 I think we need a PMC member to make this change? On Tue, Aug 29, 2023 at 7:27 PM Gang Wu wrote: > I think we can send a notification email to the dev@ so that > people can know what is going on and subscribe to what they > want after the split. We should also update the website to > reflect

[jira] [Commented] (PARQUET-2345) The Parquet Spec doesn't specify whether multiple columns are allowed to have the same name.

2023-10-01 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770902#comment-17770902 ] Micah Kornfield commented on PARQUET-2345: -- I've at least seen in the wild two columns

Re: Parquet dictionary size limits?

2023-10-01 Thread Micah Kornfield
> > > Claire > > >> > > > > > > > > >> > > > > > > On Tue, Sep 19, 2023 at 9:35 AM Claire McGinty < > > >> > > > > > claire.d.mcgi...@gmail.com > > >> > > > > >

Re: Parquet dictionary size limits?

2023-09-15 Thread Micah Kornfield
/DictionaryValuesWriter.java#L124 On Fri, Sep 15, 2023 at 9:51 AM Micah Kornfield wrote: > I'm glad I was looking at the right setting for dictionary size. I just >> tried it out with 10x, 50x, and even total file size, though, and still am >> not seeing a dictionary get created. Is i

Re: Parquet dictionary size limits?

2023-09-15 Thread Micah Kornfield
a bit of work. Additionally, one would likely > > need heuristics for when to potentially use the new mode versus a > complete > > fallback. > > > > Got it, thanks for the explanation! It does seem like a huge amount of work > > > Best, > Claire >

Re: Parquet dictionary size limits?

2023-09-14 Thread Micah Kornfield
> > - What's the heuristic for Parquet dictionary writing to succeed for a > given column? https://github.com/apache/parquet-mr/blob/9b5a962df3007009a227ef421600197531f970a5/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/DictionaryValuesWriter.java#L117 > - Is that

Re: Rewrite Parquet List columns

2023-07-29 Thread Micah Kornfield
I think this is referring to changing the schema for List logical types [1] from the deprecated schema layout to the "required" 3-level schema layout. [1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists On Sun, Jul 23, 2023 at 8:32 AM Xinli shang wrote: > Hi Rajesh, >

Re: [DISCUSS] Time to release parquet format 2.10.0?

2023-07-16 Thread Micah Kornfield
uld have a new > release. In addition to PARQUET-2261, there was also a discussion in Feb > with PMCs for PARQUET-758. We may want to check for the plan with Antoine > Pitrou <https://github.com/pitrou> if PARQUET-758 wants to be in also. > > > > On Sat, May 13, 2023 at 9:

Re: Bloom filters for full-text search and predicate pushdown

2023-06-07 Thread Micah Kornfield
am using Apache Arrow in particular to deal with Parquet files > > Il Mer 7 Giu 2023, 16:00 Micah Kornfield ha > scritto: > > > Hi Marco, > > Could you describe how your proposal differs from tokenizing the target > > string and storing the list of tokens in a colu

Re: Bloom filters for full-text search and predicate pushdown

2023-06-07 Thread Micah Kornfield
Hi Marco, Could you describe how your proposal differs from tokenizing the target string and storing the list of tokens in a column that has a bloom filter attached? I think this should be supportable today by the format at least if not existing libraries. Thanks, Micah On Wednesday, June 7,

Re: [DISCUSS] Time to release parquet format 2.10.0?

2023-05-13 Thread Micah Kornfield
> > BTW, I'd like to see the implementation from Micah to fully > understand the use case. If he is too busy to do that, I can do it based on > my understanding. I can allocate some time to try to make a PoC in C++ next month if we are willing to wait until then. On Fri, May 12, 2023 at 5:04 

Re: [Discussion] Parquet “parquet.writer.version” info is missing in the parquet metadata/dump

2023-04-22 Thread Micah Kornfield
I'm not familiar with it, but I would think the show metadata command would work to get general metadata. Please note the version field is not entirely helpful, as some implementations always hard-code it to a certain value. The application/created by is generally a better way to determine the

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-04-19 Thread Micah Kornfield
> variable-size-summary + sizeof(ByteArray) * value-count > 3. Some time Arrow data is not equal to Parquet data, like Decimal stored > as int32 or int64. > Hope that helps. > > Best, Xuwei Fu > > On 2023/03/24 16:26:51 Micah Kornfield wrote: > > Parquet metadata c

Re: Fwd: [C++] Parquet and Arrow overlap

2023-04-14 Thread Micah Kornfield
Bumping this thread again to see if any Parquet PMC members can chime in/maybe start a formal vote to move governance of Parquet-CPP under the umbrella. -Micah On 2023/02/02 10:34:25 Antoine Pitrou wrote: > > > Hi Will, > > Le 01/02/2023 à 20:27, Will Jones a écrit : > > > > First, it's not

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
if there are other approaches. On Sat, Mar 25, 2023 at 5:37 PM Micah Kornfield wrote: > 2. For repeated values, I think it is sufficient to get a reasonable >> estimate to know the number of start arrays (this includes nested arrays) >> contained in a page/column chunk and we can

[jira] [Created] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread Micah Kornfield (Jira)
Micah Kornfield created PARQUET-2261: Summary: [Format] Add statistics that reflect decoded size to metadata Key: PARQUET-2261 URL: https://issues.apache.org/jira/browse/PARQUET-2261 Project

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
e it seems like a plausible approach, there might be others. On Sat, Mar 25, 2023 at 4:59 PM Micah Kornfield wrote: > 1. How primitive types are computed? Should we simply compute the raw size >> by assuming the data is plain-encoded? >> For example, does INT16 use the same b

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
> implementations for the other size fields what the level of difficulty is > > to implement such a field. > > > > Best, > > > > Will Jones > > > > [1] https://github.com/apache/arrow/issues/34712 > > > > On Fri, Mar 24, 2023 at 9:27 

Re: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
complex to track. I'd > > definitely be interested to hear from folks who have worked on the > > implementations for the other size fields what the level of difficulty is > > to implement such a field. > > > > Best, > > > > Will Jones > > >

[DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-24 Thread Micah Kornfield
Parquet metadata currently tracks uncompressed and compressed page/column sizes [1][2]. Uncompressed size here corresponds to encoded size which can differ substantially from the plain encoding size due to RLE/Dictionary encoding. When doing query planning/execution it can be useful to

Re: Gang Wu as new Apache Parquet committer

2023-03-04 Thread Micah Kornfield
Congrats! On Monday, February 27, 2023, Xinli shang wrote: > The Project Management Committee (PMC) for Apache Parquet has invited Gang > Wu (gangwu) to become a committer and we are pleased to announce that he > has accepted. > > Congratulations and welcome, Gang! > > -- > Xinli Shang >

[jira] [Resolved] (PARQUET-2225) [C++] Allow reading dense with RecordReader

2023-03-03 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2225. -- Fix Version/s: cpp-11.0.0 Resolution: Fixed Issue resolved by pull request

Re: Parquet Null logical type question

2023-02-28 Thread Micah Kornfield
It is a validation bug that you can read and write values to the column. My understanding of the use-case for the type is coming from more loosely typed systems that infer schemas on the fly and then write in the parquet. In these systems if a column contains all Null values then the actual type

[jira] [Assigned] (PARQUET-2201) Add Stress test for RecordReader SkipRecords

2023-02-23 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned PARQUET-2201: Assignee: fatemah > Add Stress test for RecordReader SkipReco

[jira] [Resolved] (PARQUET-2201) Add Stress test for RecordReader SkipRecords

2023-02-23 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2201. -- Fix Version/s: cpp-11.0.0 Resolution: Fixed Issue resolved by pull request

Re: [C++] Parquet and Arrow overlap

2023-02-12 Thread Micah Kornfield
> > I am a committer on Arrow, > but not on Parquet right now. Does that mean I should only merge Parquet > C++ PRs for code changes in parquet/arrow? FWIW, This was the mode I was operating under. My preference here would be to continue to operate under this mode for the governance perspective.

[jira] [Assigned] (PARQUET-2210) [C++] Skip pages based on header metadata using a callback

2023-01-12 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned PARQUET-2210: Assignee: fatemah > [C++] Skip pages based on header metadata using a callb

[jira] [Resolved] (PARQUET-2210) [C++] Skip pages based on header metadata using a callback

2023-01-12 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2210. -- Fix Version/s: cpp-11.0.0 Resolution: Fixed Issue resolved by pull request

[jira] [Commented] (PARQUET-2219) ParquetFileReader throws a runtime exception when a file contains only headers and now row data

2023-01-06 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1762#comment-1762 ] Micah Kornfield commented on PARQUET-2219: -- I'm not aware of anything in the specification

Re: IMPORTANT: specification bugs around v2 data pages

2023-01-06 Thread Micah Kornfield
> > - https://issues.apache.org/jira/browse/PARQUET-2221: Encoding spec > incorrect for dictionary fallback The way I've always interpreted the encodings on the writer's side is that any fallback (or series of fallbacks) should be considered valid, even though that isn't as the spec reads, and

Re: parquet checksum coverage

2022-12-13 Thread Micah Kornfield
where the CRC checks place and the actual reader code. > > > On Thu, 1 Dec 2022 at 05:53, Micah Kornfield > wrote: > > > Hi Steve, > > > > 1. What data in a parquet file is covered by CRC checks, and are there > > >any blocks of data (footers, summari

Re: [Format] Clarifying Sort Order Requirements for Floating Points and Logical Types

2022-12-07 Thread Micah Kornfield
https://github.com/apache/parquet-format/pull/185 has been merged. On Fri, Nov 4, 2022 at 9:54 PM Micah Kornfield wrote: > A new proposal for adding a logical annotation to support Float16 values > [1] reopened the discussion on specifying how parquet should deal with > edge cases for

Re: parquet checksum coverage

2022-11-30 Thread Micah Kornfield
Hi Steve, 1. What data in a parquet file is covered by CRC checks, and are there >any blocks of data (footers, summaries etc) which aren't checksummed? https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642 I think has the best summary. My understanding is

[Format] Clarifying Sort Order Requirements for Floating Points and Logical Types

2022-11-04 Thread Micah Kornfield
A new proposal for adding a logical annotation to support Float16 values [1] reopened the discussion on specifying how parquet should deal with edge cases for floating point types (PARQUET-1222 [2]). To try to resolve this the consensus from the JIRA is to not try to specify an ordering when

Re: Add FilteredPageReader to filter rows based on page statistics

2022-10-31 Thread Micah Kornfield
Hi Fatemah, I think there are likely two things to consider here: 1. How will expressions be modeled? There are already some examples of using expressions in Arrow for pruning predicates [1]. Do you plan to re-use them? 2. Along these lines is the proposed approach taken because the API to

[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-10-08 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614581#comment-17614581 ] Micah Kornfield commented on PARQUET-1222: -- Elevating the specification level seems fine. I

[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-09-29 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611356#comment-17611356 ] Micah Kornfield commented on PARQUET-1222: -- I'd propose the following "fix": -

Re: Interest in adding the float16 logical type to the Parquet spec

2022-09-04 Thread Micah Kornfield
Just as a follow-up on the proposal PR [1]. A blocker came up based on the fact that we have never fully addressed how statistics for floating point values (PARQUET-1222). [1] https://github.com/apache/parquet-format/pull/184 On Wed, Aug 24,

[jira] [Commented] (PARQUET-2175) Skip method skips levels and not rows for repeated fields

2022-08-24 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584388#comment-17584388 ] Micah Kornfield commented on PARQUET-2175: -- I think the current signature is [Skip

[jira] [Resolved] (PARQUET-2172) [C++] Make field return const NodePtr& instead of forcing copy of shared_ptr

2022-08-12 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2172. -- Resolution: Fixed > [C++] Make field return const NodePtr& instead of forci

[jira] [Updated] (PARQUET-2172) [C++] Make field return const NodePtr& instead of forcing copy of shared_ptr

2022-08-12 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated PARQUET-2172: - Fix Version/s: cpp-10.0.0 > [C++] Make field return const NodePtr& instead of

[jira] [Created] (PARQUET-2172) [C++] Make field return const NodePtr& instead of forcing copy of shared_ptr

2022-08-12 Thread Micah Kornfield (Jira)
Micah Kornfield created PARQUET-2172: Summary: [C++] Make field return const NodePtr& instead of forcing copy of shared_ptr Key: PARQUET-2172 URL: https://issues.apache.org/jira/browse/PARQUET-

[jira] [Assigned] (PARQUET-2172) [C++] Make field return const NodePtr& instead of forcing copy of shared_ptr

2022-08-12 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned PARQUET-2172: Assignee: Micah Kornfield > [C++] Make field return const NodePtr&

[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-07-23 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570331#comment-17570331 ] Micah Kornfield commented on PARQUET-1711: -- {quote}[~emkornfield] Can we expect a fix any time

[jira] [Resolved] (PARQUET-2163) Parquet C++ Float Runtime Error in Decimal Schema

2022-07-06 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2163. -- Fix Version/s: cpp-9.0.0 Resolution: Fixed Issue resolved by pull request

Re: Forward & Backwards Compatibility

2022-05-29 Thread Micah Kornfield
I'd be in favor of maybe adding a flag that allows this type of schema which is by default false. Another option is if we can identify the writer of these files, we can make an exception specifically for those versions? On Wed, May 18, 2022 at 4:02 PM William Butler wrote: > > > > Well, why is

[jira] [Commented] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2022-05-29 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543672#comment-17543672 ] Micah Kornfield commented on PARQUET-1711: -- the way one could handle this is allow users

[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-05-09 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534123#comment-17534123 ] Micah Kornfield commented on PARQUET-2122: -- I believe the answer is the Bloom filter

[jira] [Commented] (PARQUET-2133) Support Int8 and Int16 as basic type

2022-04-08 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519738#comment-17519738 ] Micah Kornfield commented on PARQUET-2133: -- before we start working on it it should probably

[jira] [Resolved] (PARQUET-2131) Number values decoded DCHECKs should be exceptions

2022-03-04 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2131. -- Fix Version/s: cpp-8.0.0 Resolution: Fixed Issue resolved by pull request

[jira] [Resolved] (PARQUET-2130) Crash on non-standard map key name in debug

2022-03-04 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2130. -- Fix Version/s: cpp-8.0.0 Resolution: Fixed Issue resolved by pull request

Re: Decode DELTA_LENGTH_BYTE_ARRAY in chunks

2022-02-06 Thread Micah Kornfield
Apologies, I just realized the last sentence doesn't make sense. TL;DR: I think you can skip over decoding most of the miniblocks if this helps your use case. On Sat, Feb 5, 2022 at 10:49 PM Micah Kornfield wrote: > Hi Jorge, > I agree, I don't think you can decode the

[jira] [Updated] (PARQUET-2118) [C++] thift_internal.h assumes shared_ptr type in some cases

2022-02-06 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated PARQUET-2118: - Component/s: parquet-cpp > [C++] thift_internal.h assumes shared_ptr type in s

[jira] [Updated] (PARQUET-2118) [C++] thift_internal.h assumes shared_ptr type in some cases

2022-02-06 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated PARQUET-2118: - Summary: [C++] thift_internal.h assumes shared_ptr type in some cases

[jira] [Moved] (PARQUET-2118) thift_internal.h assumes shared_ptr type in some cases

2022-02-06 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield moved ARROW-15596 to PARQUET-2118: -- Key: PARQUET-2118 (was: ARROW-15596) Workflow

Re: Decode DELTA_LENGTH_BYTE_ARRAY in chunks

2022-02-05 Thread Micah Kornfield
Hi Jorge, I agree, I don't think you can decode the actual binary data without doing some level of decoding the lengths. It seems possible to skip over some decoding by only encoding the block header data and the bit widths in each miniblock to skip to the next block. Cheers, -Micah On
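As a rough illustration of the skipping idea above: in delta binary packing, each miniblock stores a fixed number of values bit-packed at that miniblock's bit width, so the byte size of a block's packed data follows from the bit widths alone. The sketch below assumes a full block, a values-per-miniblock count that is a multiple of 32, and bit widths already read from the block header. The function name is hypothetical and the code is not taken from any Parquet implementation.

```python
def packed_block_size(bit_widths, values_per_miniblock):
    """Bytes occupied by the bit-packed miniblocks of one full block.

    Assumes every miniblock in the block is present (not a trailing
    partial block) and that values_per_miniblock is a multiple of 32,
    so each miniblock ends on a byte boundary.
    """
    if values_per_miniblock % 32 != 0:
        raise ValueError("values_per_miniblock must be a multiple of 32")
    return sum(width * values_per_miniblock // 8 for width in bit_widths)


# Four miniblocks of 32 values each, with bit widths 3, 0, 7 and 12.
size = packed_block_size([3, 0, 7, 12], 32)
print(size)  # 12 + 0 + 28 + 48 bytes
```

A reader that only needs to reach a later block could advance by this many bytes after the block header instead of unpacking each value, though it would still not know the lengths stored in the skipped block.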

[jira] [Updated] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2022-01-03 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated PARQUET-1361: - Component/s: parquet-mr > [C++] 1.4.1 library allows creation of parquet file w/N

Re: [ANNOUNCEMENT] Gidon Gershinsky as Apache Parquet PMC

2021-11-24 Thread Micah Kornfield
Congrats Gidon! On Wed, Nov 24, 2021 at 2:12 PM Driesprong, Fokko wrote: > Congrats Gidon, well deserved! > > Op wo 24 nov. 2021 om 22:46 schreef Chao Sun > > > Congratulations Gidon! > > > > On Wed, Nov 24, 2021 at 1:27 PM Xinli shang > > wrote: > > > > > Hi all, > > > > > > The Project

Map Type duplicate keys

2021-10-25 Thread Micah Kornfield
Hi dev@parquet, The Logical Type Specification [1] has the following to say about duplicate keys. If there are multiple key-value pairs for the same key, then the final > value for that key must be the last value. Other values may be ignored or > may be added with replacement to the map container
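The quoted rule (the final value for a key is the last one seen) can be sketched as a simple fold over key-value pairs. This is only an illustration of the "last value wins" reading, written with plain Python lists and dicts; the function name is hypothetical and this is not how any particular Parquet reader materializes maps.

```python
def collapse_duplicate_keys(pairs):
    """Fold (key, value) pairs into a dict; a later pair replaces an earlier one."""
    result = {}
    for key, value in pairs:
        result[key] = value  # overwriting keeps only the last value per key
    return result


# The key "a" appears twice; the second occurrence determines its value.
pairs = [("a", 1), ("b", 2), ("a", 3)]
collapsed = collapse_duplicate_keys(pairs)
print(collapsed)
```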

RE: Concatenation of parquet files

2021-10-23 Thread Micah Kornfield
e. > > The footer includes the file schema (column names and their types) as well > as details about every row group (total size, number of rows, min/max > statistics, number of NULL values for every column). > > Note that this column statistics is per row group, not for the entir

[jira] [Resolved] (PARQUET-2095) [C++] Read Parquet file with MapArray

2021-10-23 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2095. -- Resolution: Not A Problem > [C++] Read Parquet file with MapAr

Re: Concatenation of parquet files

2021-10-15 Thread Micah Kornfield
Hi David, I'm not sure I understand. Concatenating files like this would likely break things. In particular in the example: > Merged: > > > ROW GROUP A1 > > > FOOTER A1 > > > ROW GROUP A2 > > > FOOTER A2 > > > ROW GROUP B1 > > > FOOTER B1 > > > ROW GROUP B2 > > > FOOTER B2 There should only

[jira] [Commented] (PARQUET-2095) [C++] Read Parquet file with MapArray

2021-10-07 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425958#comment-17425958 ] Micah Kornfield commented on PARQUET-2095: -- Hi [~longshanpdd] did the above response fix your

[jira] [Created] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2021-09-30 Thread Micah Kornfield (Jira)
Micah Kornfield created PARQUET-2099: Summary: [C++] Statistics::num_values() is misleading Key: PARQUET-2099 URL: https://issues.apache.org/jira/browse/PARQUET-2099 Project: Parquet

[jira] [Commented] (PARQUET-2095) [C++] Read Parquet file with MapArray

2021-09-26 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420316#comment-17420316 ] Micah Kornfield commented on PARQUET-2095: -- Can you run ValidateFull on the array? This would

[jira] [Commented] (PARQUET-2095) [C++] Read Parquet file with MapArray

2021-09-25 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420158#comment-17420158 ] Micah Kornfield commented on PARQUET-2095: -- Hi it isn't clear if this is reporting a bug

[jira] [Commented] (PARQUET-2092) [Go] Fix in go implementation

2021-09-14 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415060#comment-17415060 ] Micah Kornfield commented on PARQUET-2092: -- OK, would you mind closing this and opening up

[jira] [Commented] (PARQUET-2092) [Go] Fix in go implementation

2021-09-14 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415056#comment-17415056 ] Micah Kornfield commented on PARQUET-2092: -- [~zeroshade] the move option if you are allowed

[jira] [Commented] (PARQUET-2092) [Go] Fix in go implementation

2021-09-14 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415050#comment-17415050 ] Micah Kornfield commented on PARQUET-2092: -- Hmm, it doesn't look like I have permissions

[jira] [Commented] (PARQUET-2092) [Go] Fix in go implementation

2021-09-14 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415048#comment-17415048 ] Micah Kornfield commented on PARQUET-2092: -- I'm going to move this to the Arrow tracker

[jira] [Resolved] (PARQUET-2090) [C++] Parquet writes incorrect file_offset

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved PARQUET-2090. -- Resolution: Invalid > [C++] Parquet writes incorrect file_off

[jira] [Commented] (PARQUET-2090) [C++] Parquet writes incorrect file_offset

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414716#comment-17414716 ] Micah Kornfield commented on PARQUET-2090: -- [~csun] according to the [spec|https://github.com

[jira] [Assigned] (PARQUET-2089) [C++] RowGroupMetaData file_offset set incorrectly

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned PARQUET-2089: Assignee: Micah Kornfield > [C++] RowGroupMetaData file_offset set incorrec

[jira] [Assigned] (PARQUET-2090) [C++] Parquet writes incorrect file_offset

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned PARQUET-2090: Assignee: Micah Kornfield > [C++] Parquet writes incorrect file_off

[jira] [Updated] (PARQUET-2089) [C++] RowGroupMetaData file_offset set incorrectly

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated PARQUET-2089: - Summary: [C++] RowGroupMetaData file_offset set incorrectly (was: RowGroupMetaData

[jira] [Commented] (PARQUET-2089) RowGroupMetaData file_offset set incorrectly

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414695#comment-17414695 ] Micah Kornfield commented on PARQUET-2089: -- CC [~zeroshade] > RowGroupMetaData file_off

[jira] [Commented] (PARQUET-2090) [C++] Parquet writes incorrect file_offset

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414694#comment-17414694 ] Micah Kornfield commented on PARQUET-2090: -- CC [~zeroshade] > [C++] Parquet writes incorr

[jira] [Moved] (PARQUET-2090) [C++] Parquet writes incorrect file_offset

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield moved ARROW-13941 to PARQUET-2090: -- Component/s: (was: Parquet) parquet-cpp

[jira] [Moved] (PARQUET-2089) RowGroupMetaData file_offset set incorrectly

2021-09-13 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield moved ARROW-13609 to PARQUET-2089: -- Component/s: (was: C++) parquet-cpp

[jira] [Commented] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2021-08-22 Thread Micah Kornfield (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402882#comment-17402882 ] Micah Kornfield commented on PARQUET-1361: -- Sorry for the late reply, but I think

Re: Support for DELTA_LENGTH_BYTE_ARRAY?

2021-07-20 Thread Micah Kornfield
Hi Jorge, This has been discussed previously; there is an open PR [1], which has lost some steam, that tries to figure out the minimally supported features. -Micah [1] https://github.com/apache/parquet-format/pull/164 On Tue, Jul 20, 2021 at 1:15 AM Kyle Bendickson wrote: > Oh my apologies - > >

Re: statistics null count in nested types

2021-07-16 Thread Micah Kornfield
I agree this is non-intuitive based on field names but seems consistent with the text noted below (15 values are present and only 11 are written). It seems another way of defining the value for this field would be number of definition levels written that aren't less than the max definition level?
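The alternative definition floated above (count the definition levels that are not less than the max definition level) can be sketched roughly as follows. This is only an illustration, not how any Parquet implementation computes its statistics: the function name, the sample definition levels, and the max definition level of 2 are all made up, chosen so that 15 levels are present and 11 reach the max level, matching the numbers in the message.

```python
from typing import Sequence


def count_fully_defined(def_levels: Sequence[int], max_def_level: int) -> int:
    """Count definition levels that are not less than the max definition level.

    Hypothetical helper for illustration; not part of any Parquet library.
    """
    return sum(1 for level in def_levels if level >= max_def_level)


# Made-up data: 15 definition levels for a nested column whose max level is 2.
# Levels below 2 stand for entries that are null or empty at some ancestor,
# so no leaf value is written for them.
def_levels = [2, 2, 0, 2, 1, 2, 2, 2, 1, 2, 2, 0, 2, 2, 2]
max_def_level = 2

written = count_fully_defined(def_levels, max_def_level)
not_written = len(def_levels) - written

print(f"levels present: {len(def_levels)}, values written: {written}, "
      f"not written: {not_written}")
```

Under this reading, the count reflects only the leaf values actually stored, while the number of levels also covers nulls and empty lists higher up in the nesting.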
