CVE-2021-41561: Apache Parquet-MR potential DoS in case of malicious Parquet file

2021-12-20 Thread Gábor Szádovszky
Description: Improper Input Validation vulnerability in Parquet-MR of Apache Parquet allows an attacker to DoS by malicious Parquet files. This issue affects Apache Parquet-MR version 1.9.0 and later versions. This issue is being tracked as PARQUET-2094. Mitigation: 1.12.x users should

Re: [VOTE][Format] Add Float16 type to specification

2023-10-06 Thread Gábor Szádovszky
+1 About the naming. We already use INT_8, INT_16 etc. for logical types for integer values. What do you think about FLOAT_16 to be consistent? Cheers, Gabor On 2023/10/05 22:17:13 Ryan Blue wrote: > +1 > > I'm all for adding a 2-byte floating point representation since even 4-byte > floats

Re: [VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-08 Thread Gábor Szádovszky
+1 (binding) Cheers, Gabor On 2023/11/07 02:46:37 Xinli shang wrote: > +1 (binding) > > On Mon, Nov 6, 2023 at 4:56 PM Gang Wu wrote: > > > +1 (non-binding) > > > > Best, > > Gang > > > > On Tue, Nov 7, 2023 at 3:57 AM Ed Seidl wrote: > > > > > +1 (non-binding) > > > > > > Thanks! > > > Ed >

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-31 Thread Gábor Szádovszky
Sorry for the late response. Verified checksum and signature, diffed tarball and repo content, build/unit tests pass. I'm a bit confused about our current branching, though. We have one for parquet-1.12.x. Every 1.12 release should be built/tagged there. Meanwhile we have a separate branch for

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-03-31 Thread Gábor Szádovszky
l new commits after v1.12.3 release are in the master branch. >- I did check that commits in the v1.12.2 are included in the v1.12.3 >release (as well as the master branch) > > So I think we are good. > > Best, > Gang > > On Fri, Mar 31, 2023 at 9:49 PM Gábor

Re: [VOTE] Release Apache Parquet 1.13.0 RC0

2023-04-03 Thread Gábor Szádovszky
Verified checksum and signature, diffed tarball and repo content, build/unit tests pass. +1 (binding) for releasing this content as 1.13.0 NOTE: It is completely fine or even a good practice to release the first minor release from its separate branch (instead of master). Do not forget to merge
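The checksum part of such a verification can be sketched in Python as below. This is only an illustration, not the project's release tooling: the helper names `sha512_of` and `checksum_matches` are hypothetical, and the sketch assumes the published checksum file starts with the hex digest as its first whitespace-separated token, which may not match every release's file layout. Signature verification (the `.asc` file) is a separate step and is not covered here.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at path, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, checksum_text):
    """Compare the file's digest with the first token of the checksum text.

    Assumption: the checksum text looks like "<hex digest>  <file name>".
    """
    tokens = checksum_text.split()
    if not tokens:
        return False
    return sha512_of(path) == tokens[0].lower()
```

Reading in chunks keeps memory usage flat even for large source tarballs.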

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-04-01 Thread Gábor Szádovszky
> it takes to cherry-pick them to the 1.12.x branch. I would prefer the > option one. > > WDYT? > > Best, > Gang > > > On Fri, Mar 31, 2023 at 11:17 PM Gábor Szádovszky wrote: > > > I think we are about the release under the wrong number then. We s

Re: [VOTE] Release Apache Parquet 1.12.4 RC0

2023-04-02 Thread Gábor Szádovszky
to release a 1.12.4 version until we have received sufficient > feedback and requests from users. > > > On Sat, Apr 1, 2023 at 2:39 PM Gábor Szádovszky wrote: > > > In the past we did not backport every bugfix for previous branches only > > the serious ones that have no workarou

Re: [DISCUSS] Time to release parquet format 2.10.0?

2023-05-12 Thread Gábor Szádovszky
Thanks a lot for volunteering, Gang! Although it is indeed more than 2 years since the last release, I think the actual changes since then are more important. There are lots of additions/corrections in the spec docs and the thrift file comments which are very important but not tightly attached

Re: [DISCUSS] Parquet 1.14.0 and looking forward

2024-02-21 Thread Gábor Szádovszky
Thanks for bringing this up, Fokko. Unfortunately, I won't be able to join next week. (Hopefully I will be there at the one after.) So, let me write my thoughts here. I agree it is time to start preparing the next parquet-mr release. I have some thoughts: - We should check that parquet-mr

Re: Discrepancy in parquet format documentation

2024-01-15 Thread Gábor Szádovszky
Hey Gang, Kaili, I think the easiest way to solve this issue is to completely remove the spec from the site and add a reference to the parquet-format repo instead. We should probably add the release tag links when we make a release of parquet-format with a "latest" link. This way we would also

Re: [VOTE] Release Apache Parquet Format 2.10.0 RC0

2023-11-20 Thread Gábor Szádovszky
Thanks a lot Gang, for dealing with the release! Checked checksum and signature; content of the tarball looks good; unit tests pass +1 (binding) Cheers, Gabor On 2023/11/19 16:37:51 Gidon Gershinsky wrote: > +1 (binding). > > Thanks Gang. > > Cheers, Gidon > > > On Fri, Nov 17, 2023 at

Re: Question about read granularity in ParquetFileReader

2024-03-04 Thread Gábor Szádovszky
Hi Claire, I think you read it correctly. Your proposal sounds good to me but you need to make it a separate way of reading instead of rewriting the current behavior. The current implementation figures out the consecutive parts in the file (multiple pages or even column chunks written after each
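The idea of finding consecutive parts of a file can be sketched as follows. This is a minimal illustration in Python, not the parquet-mr implementation: the function name `coalesce` and the `(offset, length)` tuples are assumptions made for the example. Ranges that start exactly where the previous one ends are merged so that each merged range could be fetched with a single read.

```python
def coalesce(ranges):
    """Merge (offset, length) ranges that are directly adjacent in the file."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This range starts where the previous one ends: extend it.
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            # There is a gap before this range: start a new consecutive part.
            merged.append((offset, length))
    return merged


# Hypothetical page/column-chunk locations in a file.
parts = coalesce([(4, 100), (104, 50), (300, 20), (154, 10)])
```

Here the first three adjacent ranges collapse into one part and the range at offset 300 stays separate. Larger merged parts mean fewer IO calls but more bytes buffered at once.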

Re: Question about read granularity in ParquetFileReader

2024-03-05 Thread Gábor Szádovszky
for how many pages, or page bytes, to buffer at a time, so that users can > balance IO speed with memory usage. I'll try out a few approaches and aim > to update this thread when I have something. > > Best, > Claire > > > > On Tue, Mar 5, 2024 at 2:55 AM Gábor Szádovszky

Re: Newly-registered IANA Media Type for Parquet

2024-03-06 Thread Gábor Szádovszky
Thank you, Bryce, for working on this! Let me forward this to the private channel as well. @Xinli, @Julien, do you have access to the twitter account to spread this? Bryce Mecum wrote (on Tue, Mar 5, 2024, 20:38): > Hi all, the Parquet format now has an official IANA media type: >

Re: [VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-07 Thread Gábor Szádovszky
+1 (binding) - Not sure if "binding" matters for this case Thanks, Antoine, for working on this! Antoine Pitrou wrote (on Thu, Mar 7, 2024, 14:18): > > Hello, > > As discussed previously on this ML [1], I am proposing to expand > the types supported by the BYTE_STREAM_SPLIT encoding.

Re: parquet-format status

2024-03-07 Thread Gábor Szádovszky
There is a big difference between the repos of Arrow, Avro, Iceberg etc. and Parquet. The mentioned projects have everything in one repo including the different language bindings etc. so it is natural to have the specs there as well and having universal releases. Meanwhile Parquet has different

Re: How to differentiate between Parquet V1 and V2

2024-04-26 Thread Gábor Szádovszky
ween Parquet files written thru V2 or V1 , no > one in the community has a clear idea about this which is a bit > astonishing . > > if any one is aware , it will be highly appreciated. > > > > On Thu, Apr 25, 2024 at 10:32 AM Gábor Szádovszky > wrote: > > > I am n

Re: How to differentiate between Parquet V1 and V2

2024-04-25 Thread Gábor Szádovszky
I am not sure what "Parquet community V2 is not final yet" means. We are now at parquet-format 2.10.0. The current parquet-mr supports most (if not all) of its features. I agree the current mechanism in parquet-mr of setting the writer version PARQUET_1_0 and PARQUET_2_0 is not clear/misleading.

Re: INFO :: which version of Parquet jar supports Parquet V2 encoding

2024-04-25 Thread Gábor Szádovszky
y Spark + > Dremio). > > > In the last Parquet meeting, I brought up discussing / planning for a > parquet-mr 2.0 release which I think should at least establish a parquet-mr > release as the "formal implementation" of the standard (even if it's mostly > a vanity

Re: INFO :: which version of Parquet jar supports Parquet V2 encoding

2024-04-25 Thread Gábor Szádovszky
Hey, I don't think we should call Parquet v2.x features unstable. Since they were released officially, we maintain backward compatibility. So, from Parquet format point of view, these features are stable. It is another question whether a Parquet implementation supports all of these features or

Re: Parquet Sync meeting notes - April 23 2024

2024-04-25 Thread Gábor Szádovszky
Sorry, I was not able to attend the meeting. Let me put some notes here: 2. We have been fighting with compatibility issues for a while now. That's why we introduced japicmp. I can see many exclusions in the master pom. I think we should investigate if these exclusions cause any issues before the

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gábor Szádovszky
Hi Gang, Thank you for taking care of the release! Unfortunately, the .asc check fails for me even after importing the KEYS file. Could you double check if you signed it with the correct key? No other issues were discovered, so no RC1 is required for now if you can change the .asc file for the

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gábor Szádovszky
nse to add my new key to the KEYS file instead? > > Best, > Gang > > On Tue, Apr 30, 2024 at 3:11 PM Gábor Szádovszky wrote: > > > Hi Gang, > > > > Thank you for taking care of the release! > > > > Unfortunately, the .asc check fails for me even af

Re: [VOTE] Release Apache Parquet 1.14.0 RC0

2024-04-30 Thread Gábor Szádovszky
lease/parquet/KEYS > > On Tue, Apr 30, 2024 at 3:45 PM Gábor Szádovszky wrote: > > > Sure, please add your new public key to the referenced KEYS file then we > > should be good. (The previous one would still be required to check the > > previous releases, so do not remove

Re: [VOTE] Release Apache Parquet 1.14.0 RC1

2024-05-06 Thread Gábor Szádovszky
Thanks Fokko, Gang for working on this. I have some findings: * nit correction in the original mail: tag is apache-parquet-1.14.0-rc1 (not apache-parquet-1.4.0-rc1) * The CHANGES.md should have been updated with the one fix you've mentioned (PARQUET-2465) Since I've never used CHANGES.md to

Re: [DISCUSS] Arrow dropping Java 8 support

2024-05-27 Thread Gábor Szádovszky
Thanks a lot Weston for bringing this up. Last time we discussed a potential java upgrade, Hadoop was the one not allowing us to do so. Hadoop is still on java 8. If we want to keep Arrow on the latest version, we will need to upgrade to java 11. In this case we won't be able to support Hadoop

Re: [DISCUSS] Extension types in Parquet?

2024-05-28 Thread Gábor Szádovszky
Hi Antoine, One quick note about this. Parquet min/max statistics need a total ordering for each logical type. Without that we either use some default based on the primitive type (that might not be suitable for the related extension type) or we won't store min/max statistics for the related
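A small Python sketch can show why the choice of ordering matters for min/max statistics. It is an illustration only: the values, the `min_max` helper, and the "signed integer" extension type are hypothetical and not taken from the Parquet specification. The same raw values yield different min/max depending on whether they are compared with a default byte-wise ordering or with the ordering the extension type would actually need.

```python
def min_max(values, key):
    """Return (min, max) of values under the ordering induced by key."""
    return min(values, key=key), max(values, key=key)


# Hypothetical 2-byte values stored in a binary primitive column.
raw = [b"\x00\x01", b"\xff\xfe", b"\x7f\xff"]

# Default ordering based on the primitive type: unsigned byte-wise comparison.
bytewise = min_max(raw, key=lambda b: b)

# Ordering a hypothetical extension type might need: signed big-endian integers.
signed = min_max(raw, key=lambda b: int.from_bytes(b, "big", signed=True))
```

Under the byte-wise ordering `b"\xff\xfe"` is the maximum, while under the signed interpretation it is the minimum (it decodes to -2). A reader filtering on statistics computed with the wrong ordering could skip data it should have read.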