Re: [VOTE] Migration of parquet-* issues from Jira to GitHub

2024-06-13 Thread Gang Wu
+1 (binding) Best, Gang On Fri, Jun 14, 2024 at 2:26 AM Ed Seidl wrote: > +1 (non-binding) > > Thanks! > Ed > > On 6/13/24 11:20 AM, Micah Kornfield wrote: > > +1 (non-binding) > > > > On Thu, Jun 13, 2024 at 11:14 AM Rok Mihevc > wrote: > > > >> Hi all, > >> > >> Following the ML discussion

Re: [DISCUSS] Migration of parquet-* issues from Jira to GitHub

2024-06-13 Thread Gang Wu
+1 on this. BTW, I created the following PRs to enable GitHub issues on these repos: - https://github.com/apache/parquet-format/pull/255 - https://github.com/apache/parquet-java/pull/1362 - https://github.com/apache/parquet-testing/pull/50 I will not merge them until the formal vote passes. Best,

[VOTE] Release Apache Parquet-Java 1.14.1 RC0

2024-06-13 Thread Gang Wu
Hi everyone, I propose the following RC to be released as the official Apache Parquet-Java 1.14.1 release. The commit id is 97ede968377400d1d79e3196636ba3de392196ba * This corresponds to the tag: apache-parquet-1.14.1-rc0 *

Re: [DISCUSS] Patch release for parquet-java 1.14.1?

2024-06-12 Thread Gang Wu
t have > any fixes that can go in, so from my end, we're good for starting the > release process. > > Kind regards, > Fokko > > Op di 4 jun 2024 om 09:03 schreef Gang Wu : > > > Hi, > > > > It seems that we need a patch release 1.14.1 to fix [1]. Al

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-06-12 Thread Gang Wu
et-cpp repos before the > action. > > Agreed. Did we discuss this enough to call for a vote yet? > > On Wed, Jun 12, 2024 at 5:23 PM Gang Wu wrote: > > > Thanks Rok for the update! > > > > Yes, the copied issues look good to me. Perhaps we need a separate > >

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-06-12 Thread Gang Wu
Fri, May 31, 2024 at 10:04 AM Rok Mihevc wrote: > > > Would we also want to add issue templates to encourage some structure? > See > > [1] for inspiration. > > > > [1] https://github.com/apache/arrow/blob/main/.github/ISSUE_TEMPLATE > > > > On Fri, May 31, 2024

[DISCUSS] Patch release for parquet-java 1.14.1?

2024-06-04 Thread Gang Wu
Hi, It seems that we need a patch release 1.14.1 to fix [1]. All new commits in branch 1.14.x can be viewed at [2]. If there is any additional fix to be included, please let me know. If the community believes the release is necessary, I can volunteer to be the release manager. [1]

Re: ColumnMetaData location

2024-06-03 Thread Gang Wu
> modifying the spec to state that the ColumnMetaData following > the chunk data is also optional +1 on this > adding language to the effect that if the value of file_offset is 0, > then no such metadata is present in the file. What about marking this as deprecated and discouraged to use it?

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-30 Thread Gang Wu
che/arrow-nanoarrow/blob/81711045e8bb4ded1cb3b5a6fa354b35f18aa4e7/.asf.yaml#L24-L25 > > On Wed, May 29, 2024 at 10:39 PM Gang Wu wrote: > > > > Just want to mention that these apache/parquet-* Github repositories > > have not yet enabled issues and INFRA tickets are required before > > mi

Re: [DISCUSS] Extensibility of Parquet

2024-05-30 Thread Gang Wu
This is similar to what we do internally to provide non-standard encoding by duplicating data in the customized index pages. Vendors are free to choose to pay the extra storage cost for better encoding support. So I like this idea to support encoding extensions. Best, Gang On Thu, May 30, 2024 at

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-29 Thread Gang Wu
I'm interested in experimenting and implementing new encodings. Will follow up with concrete proposals or findings. Best, Gang On Thu, May 30, 2024 at 3:29 AM Ed Seidl wrote: > Maybe this is putting the cart too far in front of the horse, but I'd be > willing to implement an encoding like this

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-29 Thread Gang Wu
Just want to mention that these apache/parquet-* Github repositories have not yet enabled issues and INFRA tickets are required before migration. Best, Gang On Thu, May 30, 2024 at 1:55 AM Micah Kornfield wrote: > SGTM +1 > > On Wed, May 29, 2024 at 10:50 AM Rok Mihevc wrote: > > > On Wed,

Re: [DISCUSS] Unify Record / Row terminology (to Row)

2024-05-29 Thread Gang Wu
Hi, I agree that row sounds clearer than record, however we have a class RecordReader in the parquet cpp: [1]. Not sure if we need to rename it and it is still considered an internal class. [1]

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Gang Wu
+1 (binding for Parquet) Thanks! Gang On Wed, May 29, 2024 at 10:47 PM Fokko Driesprong wrote: > +1 (non-binding) > > Op wo 29 mei 2024 om 16:46 schreef Felipe Oliveira Carvalho < > felipe...@gmail.com>: > > > +1 (non-binding) > > > > On Wed, 29 May 2024 at 11:30 Micah Kornfield > > wrote: >

Re: [DISCUSS] Extension types in Parquet?

2024-05-28 Thread Gang Wu
I think adding extension type support will make it easier to add a tensor or vector type, which is what [1] is trying to target. However, the geometry type does not seem to fit easily into the extension type model. It would be better to explicitly define geospatial statistics in the spec,

Re: [DISCUSS] Extensibility of Parquet

2024-05-28 Thread Gang Wu
I'm supportive of most of the points in this thread. For 2), making encodings pluggable does not eliminate the work on implementation and interoperability. If people are worried about the lengthy process to promote a new encoding to the spec, perhaps we can preserve an encoding type for each new

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-28 Thread Gang Wu
+1 on this. IIUC, I didn't see any objection to this in the discussion [1]. Perhaps we can directly proceed to a vote? Sorry, I had intended to initiate the vote but got distracted by other stuff. [1] https://lists.apache.org/thread/jf9wos3t6xxk6xdyx2dof1jlkbpkr56p Best, Gang On Wed, May

Re: BYTE_ARRAY vs binary in Parquet specification

2024-05-26 Thread Gang Wu
Hi Ed, Sorry for the late reply. I agree that we need to replace BINARY with BYTE_ARRAY to avoid confusion because FIXED_LENGTH_BYTE_ARRAY may also be regarded as BINARY. Best, Gang On Fri, May 24, 2024 at 2:01 AM Ed Seidl wrote: > Hi all, > > A question came up in the discussion of

Re: Repeated fields spec clarification

2024-05-21 Thread Gang Wu
BTW, it seems totally valid to create page index for a subset of all columns. Does it mean columns without page index can have their records spanning more than one page? Best, Gang On Tue, May 21, 2024 at 7:26 PM Gang Wu wrote: > I would like to ask if it is valid to create only ColumnIn

Re: Repeated fields spec clarification

2024-05-21 Thread Gang Wu
I would like to ask if it is valid to create only ColumnIndex but omit OffsetIndex? My answer is NO according to [1]. If agreed, my inclination is option 1. [1] https://github.com/apache/parquet-format/blob/079a2dff06e32b7d1ad8c9aa67f2e2128fb5ffa5/src/main/thrift/parquet.thrift#L1019-L1022 On

Re: [DISCUSS] Parquet C++ under which PMC?

2024-05-16 Thread Gang Wu
> > > > > > > On Tue, 14 May 2024 10:58:58 +0200 > > > Rok Mihevc wrote: > > >> Second Raphael's point. > > >> Would it be reasonable to say specification change requires > implementation > > >> in two parquet implementations

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-15 Thread Gang Wu
+1 on renaming the repo to reduce confusion. However, the java library still uses the "parquet-mr" prefix to write its application version [1] and it is consumed by downstream projects like parquet-cpp [2] as well. [1]

Re: [DISCUSSION] Introduce FIXED_SIZE_LIST logical type

2024-05-15 Thread Gang Wu
Hi Rok, Happy to see you here :) According to my past experience, it would be more helpful to open a PR against the parquet-format repository and post it here. Best, Gang On Wed, May 15, 2024 at 7:25 PM Rok Mihevc wrote: > Hi all, > > Arrow recently introduced FixedShapeTensor and

Re: Interest in Parquet V3

2024-05-14 Thread Gang Wu
> I would hazard that simply storing statistics separately might > be sufficient for the wide column use-cases, without requiring > switching to something like flatbuffers? I agree with Raphael. Column chunks and pages can be referenced by offset and length. To avoid compatibility issues, we can

Re: Better announcement message [Apache Parquet release 1.14.0]

2024-05-14 Thread Gang Wu
essage does not mention "mr" or "Java" at > all (except in the url, and that there are Java artifacts available). > > Cheers, > Joris > > On Wed, 8 May 2024 at 05:26, Gang Wu wrote: > > > > Hi, > > > > I'm pleased to announce the re

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Gang Wu
> implementation > > in two parquet implementations within Apache Parquet project? > > > > Rok > > > > On Tue, May 14, 2024 at 10:50 AM Gang Wu wrote: > > > > > IMHO, it looks more reasonable if a reference implementation is > required > >

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Gang Wu
IMHO, it looks more reasonable if a reference implementation is required to support most (not all) elements from the specification. Another question is: should we discuss (and vote for) each candidate one by one? We can start with parquet-mr, which is the most well-known implementation. Best, Gang

Re: Fwd: [C++] Parquet and Arrow overlap

2024-05-13 Thread Gang Wu
> > > > > > Thank you, that sounds great! On first glance some seem to be rather > > old > > > > and probably don't apply anymore. > > > > > > > > > BTW, do we really need to make a full copy of them to have a mirror > > in >

Re: Fwd: [C++] Parquet and Arrow overlap

2024-05-12 Thread Gang Wu
> Thank you, that sounds great! On first glance some seem to be rather > > old > > > > and probably don't apply anymore. > > > > > > > > > BTW, do we really need to make a full copy of them to have a mirror > > in > > > > the Arrow

Re: Interest in Parquet V3

2024-05-12 Thread Gang Wu
Hi Micah, I have also noticed the emergence of these new file formats which are challenging the popularity of Apache Parquet. It would always be good to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also proposing adding a new geometry type to the specs: [1]. This seems to

[DISCUSS] Add geometry logical type

2024-05-12 Thread Gang Wu
Hi, The Apache Iceberg community is proposing to add geospatial support [1]. It would be good if Apache Parquet can support a native geometry type to implement more efficient encoding, statistics and filtering. Therefore, I'd like to propose a format change to add a new geometry logical type: [2]. It

Re: [DISCUSS] Propose changing the default branch of the parquet-site repo

2024-05-12 Thread Gang Wu
+1 This makes sense. I was also confused when I had access to parquet-site for the first time. Thanks Andrew! Best, Gang On Sun, May 12, 2024 at 3:15 AM Vinoo Ganesh wrote: > +1, this would be great. It's something Xinli and I discussed when we first > made the website updates, but it ended

Re: [ANNOUNCE] New Parquet PMC Member: Gang Wu

2024-05-12 Thread Gang Wu
gt;>> On Sat, May 11, 2024 at 10:34 AM Andrew Lamb < > >> andrewlam...@gmail.com > >>>>>> wrote: > >>>>>> > >>>>>>> Congratulations Gang! That is very exciting. > >>>>>>> > >>>>>>&

Re: Archival of parquet-cpp repository

2024-05-11 Thread Gang Wu
Update: parquet-cpp has been archived by ASF via https://issues.apache.org/jira/browse/INFRA-25766 and now https://github.com/apache/parquet-cpp is read-only. On Sun, May 12, 2024 at 12:15 PM Micah Kornfield wrote: > I think this is a great idea, thanks for driving it Uwe. > > On Mon, May 6,

Re: Fwd: [C++] Parquet and Arrow overlap

2024-05-10 Thread Gang Wu
Thanks, > Jacob > > Arrow committer > > On 2024/04/25 05:31:18 Gang Wu wrote: > > I know we have some non-Java committers and PMCs. But after the > parquet-cpp > > donation, it seems that no one worked on Parquet from arrow (cpp, rust, > go, > > etc.) > > and

[ANNOUNCE] Apache Parquet release 1.14.0

2024-05-07 Thread Gang Wu
Hi, I'm pleased to announce the release of Apache Parquet 1.14.0! Parquet is a general-purpose columnar file format for nested data. It uses space-efficient encodings and a compressed and splittable structure for processing frameworks like Hadoop. Changes are listed at:

[VOTE][RESULT] Release Apache Parquet 1.14.0 RC1

2024-05-07 Thread Gang Wu
With three binding +1 votes and two additional +1 votes, this release vote passes. +1 votes: Fokko Driesprong Gang Wu Gábor Szádovszky (binding) Gidon Gershinsky (binding) Xinli Shang (binding) -1 votes: None Thank you all who have voted. Cheers, Gang

Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-07 Thread Gang Wu
> - ran with the Iceberg encryption code > > > > Cheers, Gidon > > > > > > On Tue, May 7, 2024 at 4:28 AM Gang Wu wrote: > > > > > Hi, > > > > > > It has been open for more than 72 hours already. We still need 2 more > >

Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-06 Thread Gang Wu
Since I've never used CHANGES.md to actually check a release content, I > > don't feel this issue is so crucial to fail this vote. I would let the > > other voters decide. > > +1 (binding) > > > > Gang Wu ezt írta (időpont: 2024. máj. 6., H, 3:33): > > > >

Re: Parquet feature matrix

2024-05-06 Thread Gang Wu
Hi, There was an effort on this: https://github.com/apache/parquet-site/pull/34 It would be good if we can have something like what Apache Arrow does: - https://arrow.apache.org/docs/status.html - https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features But I do have concern

Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-05 Thread Gang Wu
+1 (non-binding) Verified signature, checksum and build. Thanks Fokko for doing this! Let me take care of the rest. Best, Gang On Mon, May 6, 2024 at 4:36 AM Fokko Driesprong wrote: > Hey everyone, > > +1 (non-binding) > > - Checked against Trino and the RC1 runs cleanly >

[VOTE][RESULT] Release Apache Parquet 1.14.0 RC0

2024-05-03 Thread Gang Wu
Hi, The vote for the Parquet 1.14.0 RC0 release has FAILED due to a possible compatibility issue. We will fix the issue before preparing the next 1.14.0 RC1. Thanks everyone! Best regards, Gang

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-05-03 Thread Gang Wu
nd ran a few tests too > > > > > > > > > > > > > > > > > > On Tue, Apr 30, 2024 at 10:20 AM Xinli shang > > > wrote: > > > > > > > +1 (binding) > > > > > > > > Validated the KEY > > > >

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gang Wu
Thank you! On Tue, Apr 30, 2024 at 4:16 PM Gábor Szádovszky wrote: > By importing the KEYS file under [1] the check of the .asc file passed! > So, I went forward and updated the KEYS file under [2] with your new one. > > Giving +1 (binding) for the release > > Cheers, > G

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gang Wu
://dist.apache.org/repos/dist/release/parquet/KEYS On Tue, Apr 30, 2024 at 3:45 PM Gábor Szádovszky wrote: > Sure, please add your new public key to the referenced KEYS file then we > should be good. (The previous one would still be required to check the > previous releases, so do not remove it.) >

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gang Wu
. Could you double check if you signed it with the correct key? > No other issues were discovered, so no RC1 is required for now if you can > change the .asc file for the current tarball. > > Cheers, > Gabor > > Gang Wu ezt írta (időpont: 2024. ápr. 30., K, 7:45): > > >

[VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-29 Thread Gang Wu
Hi everyone, I propose the following RC to be released as the official Apache Parquet 1.14.0 release. The commit id is af0740229929337e1395fd24253a4ed787df2db3 * This corresponds to the tag: apache-parquet-1.14.0-rc0 *

Re: Parquet Sync meeting notes - April 23 2024

2024-04-25 Thread Gang Wu
Let me take a look at the exclusions of japicmp. Will try to remove them as much as possible. Best, Gang On Thu, Apr 25, 2024 at 10:01 PM Gábor Szádovszky wrote: > Sorry, I was not able to attend the meeting. Let me put some notes here: > > 2. We have been fighting with compatibility issues

Re: Fwd: [C++] Parquet and Arrow overlap

2024-04-24 Thread Gang Wu
r as Parquet > committers? > > We are doing this (speaking as a Parquet PMC who didn't work on > parquet-mr, but parquet-cpp). > > Best > Uwe > > On Wed, Apr 24, 2024, at 2:38 PM, Gang Wu wrote: > > +1 for moving parquet-cpp issues from Apache Jira to Arrow's GitHub >

Re: Fwd: [C++] Parquet and Arrow overlap

2024-04-24 Thread Gang Wu
+1 for moving parquet-cpp issues from Apache Jira to Arrow's GitHub issue. Besides, I want to echo Will's question in the thread. Should we consider Parquet developers from other projects than parquet-mr as Parquet committers? Currently apache/parquet-format and apache/parquet-testing repositories

Re: How to differentiate between Parquet V1 and V2

2024-04-23 Thread Gang Wu
As I have said in another thread, Parquet V2 is a concept which contains a lot of features. FWIW, what is defined in the specs [1] is finalized and some of it has been implemented in various implementations. Any file that contains one or more of those features can be considered v2 but the

Re: Parquet Sync meeting notes - April 23 2024

2024-04-23 Thread Gang Wu
I would expect so. parquet-mr has a complete implementation of all v2 encodings and some other Parquet implementations (e.g. Apache Arrow C++ and arrow-rs) have already supported most (if not all) v2 encodings for a long time. Best, Gang On Tue, Apr 23, 2024 at 11:02 PM Prem Sahoo wrote: > Are

Re: Next release date

2024-04-21 Thread Gang Wu
Hi David, There are already some discussions about the 1.14.0 release and it seems that many users are expecting it to be released soon. I will go through all the pending PRs this week and see if we can move forward to the release process. I will volunteer as the release manager and try to get it

Re: which version parquet is supported my parquet-mr 1.2.1

2024-04-16 Thread Gang Wu
Hi, The release note is https://github.com/apache/parquet-mr/blob/master/CHANGES.md, which would be helpful to check what feature is supported in each release. IMO, parquet v2 is a vague concept which contains a lot of features. Hope it helps. Best, Gang On Tue, Apr 16, 2024 at 6:26 AM Prem

Re: Re: [DISCUSS] Parquet 1.14.0 and looking forward

2024-04-11 Thread Gang Wu
especially SSD -more from the ability to do parallel block > reads > > > than anything else. What does that mean? use the hadoop raw local fS > and > > > you get it. It also means that any non-hadoop java code should use the > nio > > > read API directly.

Re: Reading corrupted parquet files

2024-04-03 Thread Gang Wu
Hi Cindy, From what I can tell, these were some discussions in the community on the next release: [1] and [2]. [1] https://lists.apache.org/thread/bgmpmrqqcsqlbgqd16cjryc0gvzj9kbx [2] https://lists.apache.org/thread/kttwbl5l7opz6nwb5bck2gghc2y3td0o Best, Gang On Wed, Apr 3, 2024 at 7:11 AM

Re: Removal of deprecated code in parquet-format

2024-03-27 Thread Gang Wu
Thanks for the effort! +1 for removing this deprecated code if there is no objection. I took a glimpse at the public downstream of parquet-format at [1]. It seems the risk is low for the removal. [1] https://mvnrepository.com/artifact/org.apache.parquet/parquet-format/usages Best, Gang On

Re: Selecting format_version=2.6 ?

2024-03-17 Thread Gang Wu
'../src/sys/xen_execute.cpp', 'L': '12414', 'R': 'pg_throw'} > > Is there any documentation on the configuration you mention below? Could > that have any impact on date columns? > > Any other suggestions welcome. > > Stephen > > > > > On Fri, 15 Mar 2024, 16:07 Gang Wu,

Re: Selecting format_version=2.6 ?

2024-03-15 Thread Gang Wu
Hi Stephen, Thanks for raising the issue! You are right that the version is always 1 written by parquet-mr. This is something we need to fix. However, IMHO, the community does not have a clear answer on the definition of parquet format v2. Which feature are you referring to specifically in the

Re: [VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-07 Thread Gang Wu
+1 (non-binding) Best, Gang On Fri, Mar 8, 2024 at 5:05 AM Edward Seidl wrote: > +1 (non-binding) > > Thanks for your work on this! > Ed > > From: Antoine Pitrou > Sent: Thursday, March 7, 2024 5:15 AM > To: d...@parquet.incubator.apache.org > Subject: [VOTE]

Re: parquet-format status

2024-03-05 Thread Gang Wu
Hi Vinoo, IMO, we cannot do this because the parquet-format repo serves as the dedicated place to hold the parquet specs, which includes the thrift definition file and a set of documents tagged for all versions. Some projects also directly reference the link of the markdown files, which will be

Re: newbie looking to learn the format

2024-03-05 Thread Gang Wu
Hi Mark, To answer the 1st question, you may want to take a look at the rewrite command in the parquet-cli [1] which concatenates a set of parquet files with the same schema into a larger one. For nginx access logs to parquet conversion, I don't know of any existing solution. We do have a

Re: Discrepancy in parquet format documentation

2024-03-04 Thread Gang Wu
posts for now > https://parquet.apache.org/blog/, but if that's not the best way to handle > versioned docs, we can explore adopting Iceberg's model. > > > > On Mon, Mar 4, 2024 at 8:50 PM Gang Wu wrote: > > > Hi Vinoo, > > > > Thanks for the reply! How do you want to embed t

Re: Discrepancy in parquet format documentation

2024-03-04 Thread Gang Wu
u for looking into this. Updating the description on > > > parquet.apache.org will save everyone searching for this information a > > > few hours of head scratching. It is unfortunate that the slightly > > > out-of-date spec features more prominently in Google results. > >

Re: Parquet Encoding - Enable DELTA_BINARY_PACKED

2024-02-27 Thread Gang Wu
Hi Ridha, DELTA_BINARY_PACKED is enabled for parquet v2 in the parquet-mr implementation. Have you tried to set `parquet.writer.version` [1] to PARQUET_2_0 in the Spark job? I'm not sure if this helps. [1]

Re: [DISCUSS] Parquet 1.14.0 and looking forward

2024-02-21 Thread Gang Wu
Hi, Thanks for bringing this up! For the 1.14.0 release, I think it would be good to include some open PRs, e.g. [1]. Thanks Gabor for the idea of new APIs! I agree that we need to clean up some misused APIs and remove the Hadoop dependencies. In the meanwhile, I actually have some concerns.

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

2024-02-19 Thread Gang Wu
iner.java#L390 > > >. > > ParquetRewriter’s version originally failed in my tests, it originally > had > > a different schema, I will try to reproduce it later, right now it > works, I > > need to go back and check the commit history. > > > >

Re: [WIP][Proposal] PARQUET-2430: Add parquet joiner

2024-02-17 Thread Gang Wu
Hi Max, Thanks for proposing the joiner! I simply took a glimpse of the PR and it looks promising to me. My general question is on the possibility of consolidating the work with ParquetRewriter, which shares a lot of common rewriting logic. Best, Gang On Tue, Feb 13, 2024 at 9:27 AM Max

Re: Possible bug in parquet-mr

2024-01-28 Thread Gang Wu
Hi Sky, Thanks for reporting the issue! Could you please open a PR to parquet-mr with a minimal reproducible test case? That would be a lot easier for the investigation. Best, Gang On Tue, Jan 23, 2024 at 4:23 PM Sky Brewer wrote: > Hi all, > > I found a possible bug in parquet-mr. I found

Re: Error building with IntelliJ

2024-01-18 Thread Gang Wu
Usually I will try to reload the maven project [1] after a full build. If it doesn't work, my last resort is to run the following steps: - mvn install -DskipTests All built jars will be installed in local ~/.m2/repository/xxx - mvn dependency:copy-dependencies All dependencies will be

Re: Discrepancy in parquet format documentation

2024-01-14 Thread Gang Wu
t.apache.org will save everyone searching for this information a > few hours of head scratching. It is unfortunate that the slightly > out-of-date spec features more prominently in Google results. > > Kind regards > > Kaili > ____ > From: Gang Wu

Re: Pitch for Pcodec Encoding in Parquet

2024-01-13 Thread Gang Wu
Hi Martin, Sorry for chiming in late. I have just read your blog post and the format specs. Below are my two cents: - The PCO spec is a good starting point with good explanation on the format definition. For people unfamiliar with the background, it would be good to also include the

Re: Guidelines for working on parquet-mr?

2024-01-11 Thread Gang Wu
Hi Antoine, I agree; I have suffered from the same thing while developing on parquet-mr. Usually I don't run the full build and test except for the release process. It would be much easier to use IntelliJ IDEA and run selected tests. Best, Gang On Fri, Jan 12, 2024 at 1:56 AM Antoine Pitrou

Re: Discrepancy in parquet format documentation

2024-01-09 Thread Gang Wu
Hi Kaili, You're right. Please refer to the parquet-format repo for specs. The site is unfortunately out of sync for a long time and there isn't any automatic process to update it. Let me update the site manually to be in sync with the latest format release. Best, Gang On Sun, Jan 7, 2024 at

Re: [Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY

2024-01-08 Thread Gang Wu
+1 > What should be the way forward? Should I submit a format update and then one or two implementations thereof? Based on my observation of recent format changes, it usually follows the steps below: (1) A PR for a format change. (2) Two PRs of PoC implementation for feature and

Re: Files with inconsistent num_rows and num_values?

2023-12-05 Thread Gang Wu
"parquet.avro.schema" key and a corresponding schema are present in > their metadata). > > Thanks, > Micah > > On Tue, Nov 28, 2023 at 6:30 PM Gang Wu wrote: > > > Hi Micah, > > > > Does the FileMetaData.version [1] provide any information abou

Re: Fast nullify of columns?

2023-12-05 Thread Gang Wu
Hi Paul, I agree there are better ways to do this, e.g. we can prepare encoded definition levels and repetition levels (if they exist) and directly write the page. However, we need to take care of other rewrite configurations including data page version (v1 or v2), compression, page statistics

Re: JIRA work log updates

2023-12-05 Thread Gang Wu
Hi Atour, Recently I tried to migrate notifications to different mailing lists by adding a customized asf.yaml file [1]. I may have added overly verbose settings to the jira_options key. Let me fix this. [1] https://github.com/apache/parquet-mr/commit/2d10c282f14b6b34a7f3d0a6bf227b881bef5ad5

[jira] [Created] (PARQUET-2408) Fix license header in .gitattributes

2023-12-04 Thread Gang Wu (Jira)
Gang Wu created PARQUET-2408: Summary: Fix license header in .gitattributes Key: PARQUET-2408 URL: https://issues.apache.org/jira/browse/PARQUET-2408 Project: Parquet Issue Type: Bug

[jira] [Created] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications

2023-12-04 Thread Gang Wu (Jira)
Gang Wu created PARQUET-2407: Summary: Add custom .asf.yaml for finer-grained control of email notifications Key: PARQUET-2407 URL: https://issues.apache.org/jira/browse/PARQUET-2407 Project: Parquet

[jira] [Resolved] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter

2023-12-03 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2385. -- Fix Version/s: 1.14.0 Assignee: Atour Mousavi Gourabi Resolution: Fixed > Do

[jira] [Resolved] (PARQUET-2400) Update Spotless command in PR prompt to include vector plugins

2023-12-03 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2400. -- Fix Version/s: 1.14.0 Assignee: Atour Mousavi Gourabi Resolution: Fixed > Upd

[jira] [Resolved] (PARQUET-2386) More consistent code style in parquet-mr

2023-11-30 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2386. -- Fix Version/s: 1.14.0 Resolution: Fixed > More consistent code style in parquet

Re: Files with inconsistent num_rows and num_values?

2023-11-28 Thread Gang Wu
Hi Micah, Does the FileMetaData.version [1] provide any information about the writer? What about the num_values in each page header? Is the actual number of values consistent with num_values in the ColumnMetaData? [1]

Re: [Request] Send automated notifications to a separate mailing-list

2023-11-23 Thread Gang Wu
PMC member to make this change? > > On Tue, Aug 29, 2023 at 7:27 PM Gang Wu wrote: > > > I think we can send a notification email to the dev@ so that > > people can know what is going on and subscribe to what they > > want after the split. We should also update the web

[jira] [Resolved] (PARQUET-2383) Bump parquet-format to 2.10.0

2023-11-21 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2383. -- Fix Version/s: 1.14.0 Resolution: Fixed > Bump parquet-format to 2.1

[jira] [Commented] (PARQUET-2378) Problem with a cat

2023-11-21 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788344#comment-17788344 ] Gang Wu commented on PARQUET-2378: -- Sorry for the late reply. I'm not sure if it is a good idea to add

[jira] [Created] (PARQUET-2383) Bump parquet-format to 2.10.0

2023-11-20 Thread Gang Wu (Jira)
Gang Wu created PARQUET-2383: Summary: Bump parquet-format to 2.10.0 Key: PARQUET-2383 URL: https://issues.apache.org/jira/browse/PARQUET-2383 Project: Parquet Issue Type: Improvement

[jira] [Resolved] (PARQUET-2380) Decouple RewriteOptions from Hadoop classes

2023-11-20 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2380. -- Fix Version/s: 1.14.0 Assignee: Atour Mousavi Gourabi Resolution: Fixed > Decou

[ANNOUNCE] Apache Parquet Format release 2.10.0

2023-11-20 Thread Gang Wu
for this release is available below: https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-2100 This release can be downloaded from: https://parquet.apache.org/blog/2023/11/20/2.10.0/ Java artifacts are available from Maven Central. Thanks to everyone for contributing! Best, Gang Wu

[VOTE][RESULT] Release Apache Parquet Format 2.10.0 RC0

2023-11-20 Thread Gang Wu
The vote passes with 7 +1 votes: - Xuwei Fu - Xinli Shang (binding) - Ed Seidl - Gidon Gershinsky (binding) - Gábor Szádovszky (binding) - Gang Wu - Fokko Driesprong and no -1 votes. Thank you all who have voted! Cheers, Gang

Re: [VOTE] Release Apache Parquet Format 2.10.0 RC0

2023-11-20 Thread Gang Wu
> > > +1 (binding) > > > > > > Verified the signature. Thanks Gang for leading the effort! > > > > > > On Thu, Nov 16, 2023 at 9:41 PM wish maple > wrote: > > > > > > > +1 (non-binding) > > > > > > > > Thanks

[jira] [Resolved] (PARQUET-2375) Extend vectorized bit unpacking benchmark for various bit sizes.

2023-11-16 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2375. -- Fix Version/s: 1.14.0 Assignee: JATIN BHATEJA Resolution: Fixed > Extend vectori

[jira] [Commented] (PARQUET-2378) Problem with a cat

2023-11-16 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787036#comment-17787036 ] Gang Wu commented on PARQUET-2378: -- Can we get rid of the schema conversion via AvroSchemaConverter

[jira] [Commented] (PARQUET-2378) Problem with a cat

2023-11-16 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786777#comment-17786777 ] Gang Wu commented on PARQUET-2378: -- Thanks for reporting the issue! I can reproduce it on my end. Let

[VOTE] Release Apache Parquet Format 2.10.0 RC0

2023-11-15 Thread Gang Wu
Hi everyone, I propose the following RC to be released as the official Apache Parquet Format 2.10.0 release. The commit id is b9c4fa81c3be13dc98760c92b037fa4dd465cef8 * This corresponds to the tag: apache-parquet-format-2.10.0-rc0 *

[jira] [Resolved] (PARQUET-2379) [Format] Update changelog for 2.10.0

2023-11-15 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2379. -- Fix Version/s: format-2.10.0 Resolution: Fixed > [Format] Update changelog for 2.1

[jira] [Updated] (PARQUET-2221) [Format] Encoding spec incorrect for dictionary fallback

2023-11-15 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated PARQUET-2221: - Fix Version/s: (was: format-2.10.0) > [Format] Encoding spec incorrect for dictionary fallb

[jira] [Resolved] (PARQUET-2313) Bump actions/setup-java from 1 to 3

2023-11-15 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2313. -- Assignee: Gang Wu Resolution: Fixed > Bump actions/setup-java from 1 t

[jira] [Resolved] (PARQUET-2344) Bump to Thirft 0.19.0

2023-11-15 Thread Gang Wu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu resolved PARQUET-2344. -- Resolution: Fixed > Bump to Thirft 0.19.0 > - > >
