[jira] [Commented] (PARQUET-2346) Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9
[ https://issues.apache.org/jira/browse/PARQUET-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764774#comment-17764774 ] Steve Loughran commented on PARQUET-2346: - I don't know what happens, but do know that (a) dependabot is overaggressive and (b) hadoop 3.3.x is still using the 1.7 apis with reload4j, so a move is at risk of breaking everywhere > Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9 > - > > Key: PARQUET-2346 > URL: https://issues.apache.org/jira/browse/PARQUET-2346 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (PARQUET-2346) Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9
[ https://issues.apache.org/jira/browse/PARQUET-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764774#comment-17764774 ] Steve Loughran edited comment on PARQUET-2346 at 9/13/23 4:24 PM: -- I don't know what happens, but do know that (a) dependabot is overaggressive and (b) hadoop 3.3.x is still using the 1.7 apis with reload4j, so a move is at risk of breaking everywhere was (Author: ste...@apache.org): I don't know what happens, but do know that (a) dependabot is overaggressive and (b) hadoop 3.3.x is still using the 1.7 apis with reload4j, so a move is at risk of breaking everywhere > Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9 > - > > Key: PARQUET-2346 > URL: https://issues.apache.org/jira/browse/PARQUET-2346 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2346) Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9
[ https://issues.apache.org/jira/browse/PARQUET-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764248#comment-17764248 ] Steve Loughran commented on PARQUET-2346: - what is this going to do in terms of trying to use parquet in apps which aren't on the v2 apis themselves? > Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9 > - > > Key: PARQUET-2346 > URL: https://issues.apache.org/jira/browse/PARQUET-2346 > Project: Parquet > Issue Type: Improvement > Components: parquet-format >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2338) CVE-2022-25168 in hadoop-common
[ https://issues.apache.org/jira/browse/PARQUET-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756912#comment-17756912 ] Steve Loughran commented on PARQUET-2338: - pr #1065 did this in 53ea34ac7eb98432a72e3c37cd48e4f02baf65ea ; anything wrong with that commit? or is it just not the right branch? > CVE-2022-25168 in hadoop-common > --- > > Key: PARQUET-2338 > URL: https://issues.apache.org/jira/browse/PARQUET-2338 > Project: Parquet > Issue Type: Bug > Components: parquet-hadoop >Affects Versions: 1.13.1 >Reporter: jincongho >Priority: Major > > [CVE-2022-25168|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25168] > requires updating hadoop-common to 3.2.4 or 3.3.3. > Although `FileUtils.untar` isn't used in parquet-hadoop, will appreciate if we > can release a new parquet-hadoop soon with these newer versions. Otherwise > parquet-hadoop will be flagged as a security concern too. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2128) Bump Thrift to 0.16.0
[ https://issues.apache.org/jira/browse/PARQUET-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732509#comment-17732509 ] Steve Loughran commented on PARQUET-2128: - homebrew doesn't have anything < 0.18.0, which is java11+ only, so not something parquet can switch to. which means that we have to stop using homebrew here and take control of our build dependencies ourselves. I've already done that with maven and openjdk, as brew is too enthusiastic about breaking my workflow. None of us can rely on homebrew or use "homebrew doesn't have this" as a reason for reverting a change. All old thrift releases can be found at https://archive.apache.org/dist/thrift/ > Bump Thrift to 0.16.0 > - > > Key: PARQUET-2128 > URL: https://issues.apache.org/jira/browse/PARQUET-2128 > Project: Parquet > Issue Type: Improvement >Reporter: Vinoo Ganesh >Assignee: Vinoo Ganesh >Priority: Minor > Fix For: 1.12.3 > > > Thrift 0.16.0 has been released > https://github.com/apache/thrift/releases/tag/v0.16.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format
[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724339#comment-17724339 ] Steve Loughran commented on PARQUET-2171: - mukund, is there a PR up for this? even though it's not going to be merged, it needs to be shared for others to pick up > Implement vectored IO in parquet file format > > > Key: PARQUET-2171 > URL: https://issues.apache.org/jira/browse/PARQUET-2171 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Mukund Thakur >Priority: Major > > We recently added a new feature called vectored IO in Hadoop for improving > read performance for seek heavy readers. Spark Jobs and others which uses > parquet will greatly benefit from this api. Details can be found here > [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5] > https://issues.apache.org/jira/browse/HADOOP-18103 > https://issues.apache.org/jira/browse/HADOOP-11867 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5
[ https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718203#comment-17718203 ] Steve Loughran commented on PARQUET-2276: - [~a2l] really? hadoop 2.8? why haven't they upgraded yet? that is a long way behind on any form of security updates, and doesn't come with any guarantees of java8+ support etc. Even hadoop 2.9.x only gets CVE updates for hadoop's own code, so that those people running their own clusters with private hadoop-2 forks know what to pick up. The PARQUET-2134 patch did not break Hadoop 2 compatibility; it used APIs which were in the version of Hadoop that parquet compiled against. What it did do was "explicitly break compatibility with a version of hadoop older than the one parquet was built against". That patch may have been the one to show the problem, but the reality is there are many other places where incompatibilities could've surfaced. If you actually want to support hadoop-2.8.5 then the pom needs to be downgraded before anything else. You also need to worry about Java8/7 compatibility. We're already in a problem where some of the java.nio classes in the java8 SDKs you can get have added more overridden bytebuffer methods than were in the original Oracle Java8, and https algorithms have been another moving target. So even within "java8" there is "original java8" and the openjdk/corretto/azul versions. While you can get away with building a modern library with a recent OpenJDK build, if you really are planning on supporting hadoop 2.8 suddenly all these issues surface. (I know this as when I have to go near the hadoop-2 line I have to use a docker image with java7, and since moving to a macbook M1 I can't do that any more.) 
> ParquetReader reads do not work with Hadoop version 2.8.5 > - > > Key: PARQUET-2276 > URL: https://issues.apache.org/jira/browse/PARQUET-2276 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Atul Mohan >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0, 1.13.1 > > > {{ParquetReader.read() fails with the following exception on parquet-mr > version 1.13.0 when using hadoop version 2.8.5:}} > {code:java} > java.lang.NoSuchMethodError: 'boolean > org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' > at > org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74) > > at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) > at > org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69) > > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787) > > at > org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) > at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) > org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) > {code} > > > > From an initial investigation, it looks like HadoopStreams has started using > [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74] > but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop > 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2289) Avoid using hasCapability
[ https://issues.apache.org/jira/browse/PARQUET-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713993#comment-17713993 ] Steve Loughran commented on PARQUET-2289: - I'm not convinced here. the JIRA was about things not working against hadoop 2.8, but as it was built on 2.9.x, there's no guarantee anything will link. in fact, as hadoop 2.8.x doesn't officially support java 8 (kerberos, s3a+joda time, ...), the release would have to be built with java7 to claim compatibility. spark has just gone to hadoop 3.3+ only on their trunk; consider that too > Avoid using hasCapability > - > > Key: PARQUET-2289 > URL: https://issues.apache.org/jira/browse/PARQUET-2289 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5
[ https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712325#comment-17712325 ] Steve Loughran commented on PARQUET-2276: - hadoop 2.8 shipped 5 years ago. trying to build against any version of hadoop 2 cripples you, and hadoop 2.8 *doesn't even work reliably on java 8*. What exactly are you trying to run, and why? > ParquetReader reads do not work with Hadoop version 2.8.5 > - > > Key: PARQUET-2276 > URL: https://issues.apache.org/jira/browse/PARQUET-2276 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Atul Mohan >Priority: Major > > {{ParquetReader.read() fails with the following exception on parquet-mr > version 1.13.0 when using hadoop version 2.8.5:}} > {code:java} > java.lang.NoSuchMethodError: 'boolean > org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' > at > org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74) > > at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) > at > org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69) > > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787) > > at > org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) > at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) > org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) > {code} > > > > From an initial investigation, it looks like HadoopStreams has started using > [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74] > but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop > 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html]. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
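The NoSuchMethodError above is a link-time failure: the call site was compiled against a newer FSDataInputStream than the one on the runtime classpath. A generic defensive pattern for this class of failure is to probe for the method reflectively before relying on it. This is a sketch only (the actual parquet fix, PARQUET-2289, was to stop calling hasCapability); the helper name is illustrative:

```java
import java.lang.reflect.Method;

// Sketch: probe whether a method exists on a class at runtime, so a library
// built against a newer dependency can degrade gracefully on an older one
// instead of failing with NoSuchMethodError when it invokes the method.
class MethodProbe {
  static boolean hasMethod(Class<?> cls, String name, Class<?>... paramTypes) {
    try {
      cls.getMethod(name, paramTypes);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }
}
```

Note the caveat: the probe must guard the *call site* too, since the JVM resolves a direct method reference lazily at first invocation; keeping the optional call behind the boolean check (or in a separate class) is what avoids the error.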
[jira] [Commented] (PARQUET-2277) Bump hadoop.version from 3.2.3 to 3.3.5
[ https://issues.apache.org/jira/browse/PARQUET-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712323#comment-17712323 ] Steve Loughran commented on PARQUET-2277: - happy. Have you considered cutting hadoop 2 support entirely? Because with parquet building on 3.3.5 you are in a position to take up Mukund's vector IO patch PARQUET-2171 and see significant speedup in local IO reads (java nio at work) and on s3 through the s3a connector (parallel range requests) targeting hadoop 3.3.x only gives you an openfile call where you can skip all HEAD probes and ask for random IO too. cloud speedup all round. > Bump hadoop.version from 3.2.3 to 3.3.5 > --- > > Key: PARQUET-2277 > URL: https://issues.apache.org/jira/browse/PARQUET-2277 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1989) Deep verification of encrypted files
[ https://issues.apache.org/jira/browse/PARQUET-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711367#comment-17711367 ] Steve Loughran commented on PARQUET-1989: - you might want to have a design which can do the scan on a spark rdd, where the rdd is simply the deep listFiles(path) scan of the directory tree. This would give the best scale for a massive dataset compared to even some parallelised scan in a single process. I do have an RDD which can do line-by-line work, with locality of work determined on each file, which lets you schedule the work on the relevant hdfs nodes with the data; unfortunately it needs to be in the o.a.spark package to build https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala ...that could maybe be added to spark itself. > Deep verification of encrypted files > > > Key: PARQUET-1989 > URL: https://issues.apache.org/jira/browse/PARQUET-1989 > Project: Parquet > Issue Type: New Feature > Components: parquet-cli >Reporter: Gidon Gershinsky >Assignee: Maya Anderson >Priority: Major > Fix For: 1.14.0 > > > A tools that verifies encryption of parquet files in a given folder. Analyzes > the footer, and then every module (page headers, pages, column indexes, bloom > filters) - making sure they are encrypted (in relevant columns). Potentially > checking the encryption keys. > We'll start with a design doc, open for discussion. -- This message was sent by Atlassian Jira (v8.20.10#820010)
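As a baseline for comparison with the RDD approach, the single-process parallelised scan can be sketched with stdlib types. Only the footer-magic check reflects the real format (parquet files end with the 4-byte magic "PAR1" for plaintext footers, "PARE" for encrypted footers); the class and method names are illustrative, and a real verifier would inspect far more than the magic:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: deep-scan a directory tree in one process, flagging files whose
// footer magic is not "PARE" (i.e. not footer-encrypted). An RDD built from
// the same deep listing would distribute exactly this per-file check.
class FooterScan {
  static String footerMagic(Path file) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      if (raf.length() < 4) return "";
      raf.seek(raf.length() - 4);          // magic is the last 4 bytes
      byte[] magic = new byte[4];
      raf.readFully(magic);
      return new String(magic, StandardCharsets.US_ASCII);
    }
  }

  /** Recursively list regular files; return those not footer-encrypted. */
  static List<Path> unencrypted(Path root) throws IOException {
    try (Stream<Path> files = Files.walk(root)) {
      return files.filter(Files::isRegularFile)
          .parallel()
          .filter(p -> {
            try {
              return !"PARE".equals(footerMagic(p));
            } catch (IOException e) {
              return true;                 // unreadable file counts as suspect
            }
          })
          .collect(Collectors.toList());
    }
  }
}
```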
[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts
[ https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705940#comment-17705940 ] Steve Loughran commented on PARQUET-2224: - it's not spark, it's a cyclone/maven thing. ironically, i only hit it because spark builds were failing on my laptop; upgraded maven and got a version which wasn't compatible. The maven version in the hadoop docker builds is compatible. maybe the solution is to make this a profile, requiring -Psbom to enable it if you know you are on a compatible maven version and want the files? > Publish SBOM artifacts > -- > > Key: PARQUET-2224 > URL: https://issues.apache.org/jira/browse/PARQUET-2224 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts
[ https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705646#comment-17705646 ] Steve Loughran commented on PARQUET-2224: - +SPARK-42380 > Publish SBOM artifacts > -- > > Key: PARQUET-2224 > URL: https://issues.apache.org/jira/browse/PARQUET-2224 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts
[ https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705645#comment-17705645 ] Steve Loughran commented on PARQUET-2224: - HADOOP-18641. didn't actually break the build, just printed stack traces and didn't do the manifests > Publish SBOM artifacts > -- > > Key: PARQUET-2224 > URL: https://issues.apache.org/jira/browse/PARQUET-2224 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts
[ https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705327#comment-17705327 ] Steve Loughran commented on PARQUET-2224: - we had to roll this back from hadoop as the maven plugin didn't work with maven 3.3.9. is it better now? > Publish SBOM artifacts > -- > > Key: PARQUET-2224 > URL: https://issues.apache.org/jira/browse/PARQUET-2224 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2239) Replace log4j1 with reload4j
[ https://issues.apache.org/jira/browse/PARQUET-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684607#comment-17684607 ] Steve Loughran commented on PARQUET-2239: - good, but trickier than you think as you have to do lots of excludes of other imports of log4j, slf4j look at HADOOP-18088 as an example of what to do > Replace log4j1 with reload4j > > > Key: PARQUET-2239 > URL: https://issues.apache.org/jira/browse/PARQUET-2239 > Project: Parquet > Issue Type: Improvement >Reporter: Akshat Mathur >Priority: Major > Labels: pick-me-up > > Due to multiple CVE in log4j1, replace log4j dependency with reload4j. > More about reload4j: https://reload4j.qos.ch/ -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+
[ https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved PARQUET-2173. - Fix Version/s: 1.13.0 Resolution: Fixed > Fix parquet build against hadoop 3.3.3+ > --- > > Key: PARQUET-2173 > URL: https://issues.apache.org/jira/browse/PARQUET-2173 > Project: Parquet > Issue Type: Bug > Components: parquet-cli >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Major > Fix For: 1.13.0 > > > parquet won't build against hadoop 3.3.3+ because it swapped out log4j 1.17 > for reload4j, and this creates maven dependency problems in parquet cli > {code} > [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli > --- > [WARNING] Used undeclared dependencies found: > [WARNING]ch.qos.reload4j:reload4j:jar:1.2.22:provided > {code} > the hadoop common dependencies need to exclude this jar and any changed slf4j > ones. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2216) Parquet writer classes don't close underlying output stream in case of errors.
[ https://issues.apache.org/jira/browse/PARQUET-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642453#comment-17642453 ] Steve Loughran commented on PARQUET-2216: - * OutputFile may not implement Closeable, but the {{PositionOutputStream}} returned by the {{create()}} method does. * the writer close chain seems to go all the way through too. looking at the code, one place for improvement would be for {{ParquetFileWriter.end(Map extraMetaData)}} to close its output stream in a finally clause, so even if the write of the footer failed, the local fs client would do all it can to clean up, release connections etc. > Parquet writer classes don't close underlying output stream in case of errors. > -- > > Key: PARQUET-2216 > URL: https://issues.apache.org/jira/browse/PARQUET-2216 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.12.3 >Reporter: Andrei Lopukhov >Priority: Major > Attachments: TestExample.java > > > org.apache.parquet.io.OutputFile interface does not implement Closeable. > In my opinion it implies that created streams are fully managed by parquet-mr > classes. > Unfortunately opened stream will not be closed in case of IO or other failure. > There are two places I can find for this problem: > * During writer creation > (org.apache.parquet.hadoop.ParquetWriter.Builder#build()) - created stream > should be closed if writer creation fails. > * During writer close(org.apache.parquet.hadoop.ParquetWriter#close) - > underlying stream should be closed regardless of any faced failures. > Although I didn't examine ParquetReader that much. -- This message was sent by Atlassian Jira (v8.20.10#820010)
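The close-in-finally suggestion can be sketched with plain java.io types. FooterWriter and end() here are illustrative stand-ins, not parquet-mr's actual classes or signatures:

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch: write the footer, then close the stream in a finally clause so
// that a failed footer write still releases the underlying connection.
class FooterWriter {
  static void end(OutputStream out, byte[] footer) throws IOException {
    try {
      out.write(footer);
      out.flush();
    } finally {
      out.close();   // runs on both the success and the failure path
    }
  }
}
```

If close() itself can throw and mask the original failure, wrapping it in try-with-resources (or suppressing the secondary exception) keeps the footer-write error as the primary one.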
[jira] [Created] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+
Steve Loughran created PARQUET-2173: --- Summary: Fix parquet build against hadoop 3.3.3+ Key: PARQUET-2173 URL: https://issues.apache.org/jira/browse/PARQUET-2173 Project: Parquet Issue Type: Bug Components: parquet-cli Affects Versions: 1.13.0 Reporter: Steve Loughran parquet won't build against hadoop 3.3.3+ because it swapped out log4j 1.17 for reload4j, and this creates maven dependency problems in parquet cli {code} [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli --- [WARNING] Used undeclared dependencies found: [WARNING]ch.qos.reload4j:reload4j:jar:1.2.22:provided {code} the hadoop common dependencies need to exclude this jar and any changed slf4j ones. -- This message was sent by Atlassian Jira (v8.20.10#820010)
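The exclusions described would look roughly like this in the parquet-cli pom. This is a sketch: the exact artifact ids (here reload4j and the slf4j-reload4j binding) should be checked against the hadoop-common pom of the target hadoop version:

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <exclusion>
      <groupId>ch.qos.reload4j</groupId>
      <artifactId>reload4j</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-reload4j</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```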
[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format
[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578637#comment-17578637 ] Steve Loughran commented on PARQUET-2171: - bq. I have found ByteBuffer to impose a nontrivial amount of overhead, and you might want to consider providing array-based methods as well. mixed feelings. it's hard to work with but some libraries (parquet...) love it, which partly drove our use of it. if you use on-heap buffers it's just arrays with more hassle. FWIW, i was looking at some of the parquet read code and concluding that the s3a FS should implement read(ByteBuffer) as a single vectored IO read. currently the base class implementation reads into a temp byte array and so breaks prefetching... the s3a FS only sees the read(bytes) of the shorter array, not the full amount wanted > Implement vectored IO in parquet file format > > > Key: PARQUET-2171 > URL: https://issues.apache.org/jira/browse/PARQUET-2171 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Mukund Thakur >Priority: Major > > We recently added a new feature called vectored IO in Hadoop for improving > read performance for seek heavy readers. Spark Jobs and others which uses > parquet will greatly benefit from this api. Details can be found here > [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5] > https://issues.apache.org/jira/browse/HADOOP-18103 > https://issues.apache.org/jira/browse/HADOOP-11867 -- This message was sent by Atlassian Jira (v8.20.10#820010)
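The temp-array fallback described in the comment can be sketched with stdlib types to show why prefetching breaks: the underlying stream only ever sees small byte[] reads, never the full buf.remaining() the caller actually wants. The class name and buffer size here are illustrative, not parquet-mr's:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Sketch of the temp-array fallback for read(ByteBuffer). Each loop iteration
// issues a read of at most COPY_SIZE bytes, so the filesystem cannot tell the
// caller wants buf.remaining() bytes in one go -- which defeats any
// prefetch/vectored-read optimisation on the store side.
class ByteBufferFallback {
  private static final int COPY_SIZE = 8192;  // small temp buffer, per-call read size

  static void readFully(InputStream in, ByteBuffer buf) throws IOException {
    byte[] temp = new byte[COPY_SIZE];
    while (buf.hasRemaining()) {
      int want = Math.min(temp.length, buf.remaining());
      int read = in.read(temp, 0, want);
      if (read < 0) {
        throw new EOFException(buf.remaining() + " bytes still wanted at end of stream");
      }
      buf.put(temp, 0, read);
    }
  }
}
```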
[jira] [Resolved] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0
[ https://issues.apache.org/jira/browse/PARQUET-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved PARQUET-2158. - Fix Version/s: 1.13.0 Resolution: Fixed > Upgrade Hadoop dependency to version 3.2.0 > -- > > Key: PARQUET-2158 > URL: https://issues.apache.org/jira/browse/PARQUET-2158 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Major > Fix For: 1.13.0 > > > Parquet still builds against Hadoop 2.10. This is very out of date and does > not work with java 11, let alone later releases. > Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with > java 11, and lines up with active work on HADOOP-18287, _Provide a shim > library for modern FS APIs_ > This will significantly speed up access to columnar data, especially in > cloud stores. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (PARQUET-2150) parquet-protobuf to compile on mac M1
[ https://issues.apache.org/jira/browse/PARQUET-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved PARQUET-2150. - Resolution: Not A Problem with PARQUET-2155 this problem is implicitly fixed. > parquet-protobuf to compile on mac M1 > - > > Key: PARQUET-2150 > URL: https://issues.apache.org/jira/browse/PARQUET-2150 > Project: Parquet > Issue Type: Improvement > Components: parquet-protobuf >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Major > > parquet-protobuf module fails to compile on Mac M1 because the maven protoc > plugin cannot find the native osx-aarch_64:3.16.1 binary. > the build needs to be tweaked to pick up the x86 binaries -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (PARQUET-2165) remove deprecated PathGlobPattern and DeprecatedFieldProjectionFilter to compile on hadoop 3.2+
[ https://issues.apache.org/jira/browse/PARQUET-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated PARQUET-2165: Summary: remove deprecated PathGlobPattern and DeprecatedFieldProjectionFilter to compile on hadoop 3.2+ (was: remove deprecated PathGlobPattern to compile on hadoop 3.2+) > remove deprecated PathGlobPattern and DeprecatedFieldProjectionFilter to > compile on hadoop 3.2+ > --- > > Key: PARQUET-2165 > URL: https://issues.apache.org/jira/browse/PARQUET-2165 > Project: Parquet > Issue Type: Improvement > Components: parquet-thrift >Affects Versions: 1.12.3 >Reporter: Steve Loughran >Priority: Major > > remove the deprecated PathGlobPattern class and its uses from parquet-thrift > The return types from the hadoop GlobPattern code changed in HADOOP-12436; > in the class as is will not compile against hadoop 3.x > Parquet releases compiled against hadoop 2.x will not be able to instantiate > these classes on a hadoop 3 release, because things will not link. > Nobody appears to have complained about the linkage problem to the extent of > filing a JIRA. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (PARQUET-2165) remove deprecated PathGlobPattern to compile on hadoop 3.2+
Steve Loughran created PARQUET-2165: --- Summary: remove deprecated PathGlobPattern to compile on hadoop 3.2+ Key: PARQUET-2165 URL: https://issues.apache.org/jira/browse/PARQUET-2165 Project: Parquet Issue Type: Improvement Components: parquet-thrift Affects Versions: 1.12.3 Reporter: Steve Loughran remove the deprecated PathGlobPattern class and its uses from parquet-thrift The return types from the hadoop GlobPattern code changed in HADOOP-12436; the class as-is will not compile against hadoop 3.x Parquet releases compiled against hadoop 2.x will not be able to instantiate these classes on a hadoop 3 release, because things will not link. Nobody appears to have complained about the linkage problem to the extent of filing a JIRA. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0
[ https://issues.apache.org/jira/browse/PARQUET-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553552#comment-17553552 ] Steve Loughran commented on PARQUET-2158: - build is broken by HADOOP-12436 {code} Error: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project parquet-thrift: Compilation failure Error: /home/runner/work/parquet-mr/parquet-mr/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:[55,49] incompatible types: com.google.re2j.Pattern cannot be converted to java.util.regex.Pattern {code} That was a change to a class believed to be private; clearly not. > Upgrade Hadoop dependency to version 3.2.0 > -- > > Key: PARQUET-2158 > URL: https://issues.apache.org/jira/browse/PARQUET-2158 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Major > > Parquet still builds against Hadoop 2.10. This is very out of date and does > not work with java 11, let alone later releases. > Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with > java 11, and lines up with active work on HADOOP-18287, _Provide a shim > library for modern FS APIs_ > This will significantly speed up access to columnar data, especially in > cloud stores. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0
Steve Loughran created PARQUET-2158: --- Summary: Upgrade Hadoop dependency to version 3.2.0 Key: PARQUET-2158 URL: https://issues.apache.org/jira/browse/PARQUET-2158 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.13.0 Reporter: Steve Loughran Parquet still builds against Hadoop 2.10. This is very out of date and does not work with java 11, let alone later releases. Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with java 11, and lines up with active work on HADOOP-18287, _Provide a shim library for modern FS APIs_ This will significantly speed up access to columnar data, especially in cloud stores. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2151) Drop Hadoop 2 input stream reflection from parquet-hadoop
[ https://issues.apache.org/jira/browse/PARQUET-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated PARQUET-2151: Description: Parquet uses reflection to load a hadoop2 input stream, falling back to a hadoop-1 compatible client if not found. All hadoop 2.0.2+ releases work with H2SeekableInputStream, so the binding to H2SeekableInputStream can be reworked to avoid needing reflection. This would make it a lot easier to probe for/use the bytebuffer input, and line the code up for more recent hadoop releases. H1SeekableInputStream is still needed to handle streams without ByteBufferReadable. At some point support for ByteBufferPositionedReadable is needed, because that is really what parquet wants. that's where reflection will be needed was: Parquet uses reflection to load a hadoop2 input stream, falling back to a hadoop-1 compatible client if not found. All hadoop 2.0.2+ releases work with H2SeekableInputStream, so H1SeekableInputStream can be cut and the binding to H2SeekableInputStream reworked to avoid needing reflection. This would make it a lot easier to probe for/use the bytebuffer input, and line the code up for more recent hadoop releases. > Drop Hadoop 2 input stream reflection from parquet-hadoop > -- > > Key: PARQUET-2151 > URL: https://issues.apache.org/jira/browse/PARQUET-2151 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Minor > > Parquet uses reflection to load a hadoop2 input stream, falling back to a > hadoop-1 compatible client if not found. > All hadoop 2.0.2+ releases work with H2SeekableInputStream, so the binding to > H2SeekableInputStream can be reworked to avoid needing reflection. This would make > it a lot easier to probe for/use the bytebuffer input, and line the code up > for more recent hadoop releases. > H1SeekableInputStream is still needed to handle streams without > ByteBufferReadable. 
> At some poiint support for ByteBufferPositionedReadable is needed, because > that is really what parquet wants. that's where reflection will be needed -- This message was sent by Atlassian Jira (v8.20.7#820007)
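The non-reflective binding described above can be sketched as follows. This is a hypothetical illustration, not the actual parquet-hadoop code: the `ByteBufferReadable` interface below is a stand-in for `org.apache.hadoop.fs.ByteBufferReadable`, and `wrapperFor` just names the wrapper class that `HadoopStreams.wrap` would pick.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not the real parquet-hadoop code: once the Hadoop 1
// code path is gone, ByteBufferReadable is always on the classpath, so a
// plain instanceof check can replace the reflection-based probe.
class StreamBinding {

    // Stand-in for org.apache.hadoop.fs.ByteBufferReadable.
    interface ByteBufferReadable {
        int read(ByteBuffer buf);
    }

    // Returns the name of the wrapper that HadoopStreams.wrap would choose.
    static String wrapperFor(Object innerStream) {
        return (innerStream instanceof ByteBufferReadable)
            ? "H2SeekableInputStream"   // direct ByteBuffer read path
            : "H1SeekableInputStream";  // byte[]-copy fallback path
    }
}
```

The instanceof check is all that is left once the `Class.forName` probing for a Hadoop 2 class is no longer needed.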
[jira] [Updated] (PARQUET-2151) Drop Hadoop 2 input stream reflection from parquet-hadoop
[ https://issues.apache.org/jira/browse/PARQUET-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated PARQUET-2151: Summary: Drop Hadoop 2 input stream reflection from parquet-hadoop (was: Drop Hadoop 1 input stream support from parquet-hadoop ) > Drop Hadoop 2 input stream reflection from parquet-hadoop > -- > > Key: PARQUET-2151 > URL: https://issues.apache.org/jira/browse/PARQUET-2151 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Minor > > Parquet uses reflection to load a hadoop2 input stream, falling back to a > hadoop-1 compatible client if not found. > All hadoop 2.0.2+ releases work with H2SeekableInputStream, so > H1SeekableInputStream can be cut and the binding to H2SeekableInputStream > reworked to avoid needing reflection. This would make it a lot easier to > probe for/use the bytebuffer input, and line the code up for more recent > hadoop releases. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2151) Drop Hadoop 1 input stream support from parquet-hadoop
[ https://issues.apache.org/jira/browse/PARQUET-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated PARQUET-2151: Description: Parquet uses reflection to load a hadoop2 input stream, falling back to a hadoop-1 compatible client if not found. All hadoop 2.0.2+ releases work with H2SeekableInputStream, so H1SeekableInputStream can be cut and the binding to H2SeekableInputStream reworked to avoid needing reflection. This would make it a lot easier to probe for/use the bytebuffer input, and line the code up for more recent hadoop releases. was: Parquet uses reflection to load a hadoop2 input stream, falling back to a hadoop-1 compatible client if not found. All hadoop 2.0.2+ releases work with H2SeekableInputStream, so H1SeekableInputStream can be cut and the binding to H2SeekableInputStream reworked to avoid needing reflection. This would make it a lot easier to probe for/use the bytebuffer input, and line the code up for more recent hadoop releases. One thing H1SeekableInputStream does do is read into a temp array if the FSDataInputStream doesn't support ByteBuffer reads, that is, doesn't implement ByteBufferReadable. But FSDataInputStream simply forwards that to the inner stream, if it too implements ByteBufferReadable. Filesystems which don't (the cloud stores) can't be read through H2SeekableInputStream.read(ByteBuffer). If this is desired, H2SeekableInputStream will need to dynamically downgrade to DelegatingSeekableInputStream's base methods if a call to FSDataInputStream.read(ByteBuffer) fails. > Drop Hadoop 1 input stream support from parquet-hadoop > --- > > Key: PARQUET-2151 > URL: https://issues.apache.org/jira/browse/PARQUET-2151 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Minor > > Parquet uses reflection to load a hadoop2 input stream, falling back to a > hadoop-1 compatible client if not found. 
> All hadoop 2.0.2+ releases work with H2SeekableInputStream, so > H1SeekableInputStream can be cut and the binding to H2SeekableInputStream > reworked to avoid needing reflection. This would make it a lot easier to > probe for/use the bytebuffer input, and line the code up for more recent > hadoop releases. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (PARQUET-2151) Drop Hadoop 1 input stream support from parquet-hadoop
[ https://issues.apache.org/jira/browse/PARQUET-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated PARQUET-2151: Summary: Drop Hadoop 1 input stream support from parquet-hadoop (was: parquet-hadoop to drop Hadoop 1 input stream support) > Drop Hadoop 1 input stream support from parquet-hadoop > --- > > Key: PARQUET-2151 > URL: https://issues.apache.org/jira/browse/PARQUET-2151 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Minor > > Parquet uses reflection to load a hadoop2 input stream, falling back to a > hadoop-1 compatible client if not found. > All hadoop 2.0.2+ releases work with H2SeekableInputStream, so > H1SeekableInputStream can be cut and the binding to H2SeekableInputStream > reworked to avoid needing reflection. This would make it a lot easier to > probe for/use the bytebuffer input, and line the code up for more recent > hadoop releases. > One thing H1SeekableInputStream does do is read into a temp array if the > FSDataInputStream doesn't support ByteBuffer reads, that is, doesn't implement > ByteBufferReadable. > But FSDataInputStream simply forwards that to the inner stream, if it too > implements ByteBufferReadable. Filesystems which don't (the cloud stores) > can't be read through H2SeekableInputStream.read(ByteBuffer). If this > is desired, H2SeekableInputStream will need to dynamically downgrade to > DelegatingSeekableInputStream's base methods if a call to > FSDataInputStream.read(ByteBuffer) fails. -- This message was sent by Atlassian Jira (v8.20.7#820007)
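The "dynamically downgrade" idea in the description could look roughly like this. Everything here is an invented sketch (`RawStream` and `DowngradingReader` are hypothetical names, not Parquet classes): try the ByteBuffer read once, and on `UnsupportedOperationException` fall back permanently to the byte[]-copy path that `DelegatingSeekableInputStream` provides.

```java
import java.nio.ByteBuffer;

// Invented sketch of the dynamic-downgrade idea: attempt the ByteBuffer
// read, and if the inner stream throws UnsupportedOperationException,
// permanently fall back to reading into a temp array and copying, the way
// H1SeekableInputStream / DelegatingSeekableInputStream do.
class DowngradingReader {

    // Stand-in for the wrapped FSDataInputStream.
    interface RawStream {
        int read(byte[] b, int off, int len);  // always available
        int read(ByteBuffer buf);              // may throw UnsupportedOperationException
    }

    private final RawStream in;
    private boolean byteBufferReadWorks = true;

    DowngradingReader(RawStream in) { this.in = in; }

    int read(ByteBuffer buf) {
        if (byteBufferReadWorks) {
            try {
                return in.read(buf);
            } catch (UnsupportedOperationException e) {
                byteBufferReadWorks = false;   // downgrade once, remember it
            }
        }
        // Fallback path: temp array, then copy into the caller's buffer.
        byte[] tmp = new byte[buf.remaining()];
        int n = in.read(tmp, 0, tmp.length);
        if (n > 0) {
            buf.put(tmp, 0, n);
        }
        return n;
    }
}
```

The downgrade is remembered so that only the first read pays for the failed probe; every later call goes straight to the copy path.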
[jira] [Commented] (PARQUET-2134) Incorrect type checking in HadoopStreams.wrap
[ https://issues.apache.org/jira/browse/PARQUET-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550592#comment-17550592 ] Steve Loughran commented on PARQUET-2134: - Have you a full stack trace? As IMO the issue isn't type checking; it is handling streams from filesystem clients which don't support the ByteBufferReadable interface > Incorrect type checking in HadoopStreams.wrap > - > > Key: PARQUET-2134 > URL: https://issues.apache.org/jira/browse/PARQUET-2134 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.8.3, 1.10.1, 1.11.2, 1.12.2 >Reporter: Todd Gao >Priority: Minor > > The method > [HadoopStreams.wrap|https://github.com/apache/parquet-mr/blob/4d062dc37577e719dcecc666f8e837843e44a9be/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L51] > wraps an FSDataInputStream to a SeekableInputStream. > It checks whether the underlying stream of the passed FSDataInputStream > implements ByteBufferReadable: if true, wraps the FSDataInputStream to > H2SeekableInputStream; otherwise, wraps to H1SeekableInputStream. > In some cases, we may add another wrapper over FSDataInputStream. For > example, > {code:java} > class CustomDataInputStream extends FSDataInputStream { > public CustomDataInputStream(FSDataInputStream original) { > super(original); > } > } > {code} > Suppose we create an FSDataInputStream whose underlying stream does not > implement ByteBufferReadable, and then create a CustomDataInputStream with > it. If we use HadoopStreams.wrap to create a SeekableInputStream, we may get > an error like > {quote}java.lang.UnsupportedOperationException: Byte-buffer read unsupported > by input stream{quote} > We can fix this by performing recursive checks on the underlying stream of > FSDataInputStream. -- This message was sent by Atlassian Jira (v8.20.7#820007)
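The recursive check the reporter suggests can be sketched like this. `Wrapper`/`getWrapped` are hypothetical stand-ins for `FSDataInputStream`/`getWrappedStream`; the point is that a wrapper's own declared interfaces say nothing about the innermost stream, which is the one that must be inspected.

```java
// Hypothetical sketch of the suggested fix: unwrap nested wrappers until
// the innermost stream is reached, then check that stream's capability.
// Wrapper/getWrapped stand in for FSDataInputStream/getWrappedStream.
class UnwrapCheck {

    // Stand-in for org.apache.hadoop.fs.ByteBufferReadable.
    interface ByteBufferReadable { }

    // Stand-in for FSDataInputStream, which may wrap another wrapper.
    static class Wrapper {
        private final Object inner;
        Wrapper(Object inner) { this.inner = inner; }
        Object getWrapped() { return inner; }
    }

    static boolean isByteBufferReadable(Object stream) {
        // Wrappers may be nested arbitrarily, so walk down to the
        // innermost stream before testing for the capability.
        while (stream instanceof Wrapper) {
            stream = ((Wrapper) stream).getWrapped();
        }
        return stream instanceof ByteBufferReadable;
    }
}
```

A loop rather than recursion keeps the check safe for deeply nested wrappers, but the effect is the same as the recursive check proposed in the issue.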
[jira] [Created] (PARQUET-2151) parquet-hadoop to drop Hadoop 1 input stream support
Steve Loughran created PARQUET-2151: --- Summary: parquet-hadoop to drop Hadoop 1 input stream support Key: PARQUET-2151 URL: https://issues.apache.org/jira/browse/PARQUET-2151 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.13.0 Reporter: Steve Loughran Parquet uses reflection to load a hadoop2 input stream, falling back to a hadoop-1 compatible client if not found. All hadoop 2.0.2+ releases work with H2SeekableInputStream, so H1SeekableInputStream can be cut and the binding to H2SeekableInputStream reworked to avoid needing reflection. This would make it a lot easier to probe for/use the bytebuffer input, and line the code up for more recent hadoop releases. One thing H1SeekableInputStream does do is read into a temp array if the FSDataInputStream doesn't support ByteBuffer reads, that is, doesn't implement ByteBufferReadable. But FSDataInputStream simply forwards that to the inner stream, if it too implements ByteBufferReadable. Filesystems which don't (the cloud stores) can't be read through H2SeekableInputStream.read(ByteBuffer). If this is desired, H2SeekableInputStream will need to dynamically downgrade to DelegatingSeekableInputStream's base methods if a call to FSDataInputStream.read(ByteBuffer) fails. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (PARQUET-2150) parquet-protobuf to compile on mac M1
[ https://issues.apache.org/jira/browse/PARQUET-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541068#comment-17541068 ] Steve Loughran commented on PARQUET-2150: - same issue and solution as HADOOP-17939 > parquet-protobuf to compile on mac M1 > - > > Key: PARQUET-2150 > URL: https://issues.apache.org/jira/browse/PARQUET-2150 > Project: Parquet > Issue Type: Improvement > Components: parquet-protobuf >Affects Versions: 1.13.0 >Reporter: Steve Loughran >Priority: Major > > parquet-protobuf module fails to compile on Mac M1 because the maven protoc > plugin cannot find the native osx-aarch_64:3.16.1 binary. > the build needs to be tweaked to pick up the x86 binaries -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (PARQUET-2150) parquet-protobuf to compile on mac M1
Steve Loughran created PARQUET-2150: --- Summary: parquet-protobuf to compile on mac M1 Key: PARQUET-2150 URL: https://issues.apache.org/jira/browse/PARQUET-2150 Project: Parquet Issue Type: Improvement Components: parquet-protobuf Affects Versions: 1.13.0 Reporter: Steve Loughran parquet-protobuf module fails to compile on Mac M1 because the maven protoc plugin cannot find the native osx-aarch_64:3.16.1 binary. the build needs to be tweaked to pick up the x86 binaries -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (PARQUET-1615) getRecordWriter shouldn't hardcode CREAT mode when new ParquetFileWriter
[ https://issues.apache.org/jira/browse/PARQUET-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391699#comment-17391699 ] Steve Loughran commented on PARQUET-1615: - Just looking at this. Any specific reason not to make overwrite the default? It saves the overhead of an HTTP HEAD probe against any of the object stores, and for any job where the output committer has created a unique dir for that task attempt, there is no risk of conflict unless somehow the task really has decided to create two files with the same name. Has anyone ever encountered conflicts? > getRecordWriter shouldn't hardcode CREAT mode when new ParquetFileWriter > > > Key: PARQUET-1615 > URL: https://issues.apache.org/jira/browse/PARQUET-1615 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > getRecordWriter shouldn't hardcode CREAT mode when new ParquetFileWriter. -- This message was sent by Atlassian Jira (v8.3.4#803005)
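The cost being discussed can be modelled with a toy example. All names here are invented for illustration (this uses no Parquet or Hadoop API): create-mode semantics need an existence probe first, which against an object store is one HTTP HEAD per file, while overwrite semantics just write.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (invented names, no Parquet/Hadoop APIs) of why overwrite
// semantics save a round trip against object stores: CREATE must first
// probe for an existing object, which costs one HTTP HEAD per file.
class WriteModes {

    enum Mode { CREATE, OVERWRITE }

    static int headProbes = 0;                   // simulated HEAD requests
    static final Set<String> store = new HashSet<>();

    static void write(String path, Mode mode) {
        if (mode == Mode.CREATE) {
            headProbes++;                        // existence check = one HEAD
            if (store.contains(path)) {
                throw new IllegalStateException("file exists: " + path);
            }
        }
        store.add(path);                         // the actual write
    }
}
```

When the committer already guarantees unique per-task-attempt directories, the existence check buys nothing, which is the point of the comment above.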
[jira] [Comment Edited] (PARQUET-1984) Some tests fail on windows
[ https://issues.apache.org/jira/browse/PARQUET-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377466#comment-17377466 ] Steve Loughran edited comment on PARQUET-1984 at 7/8/21, 3:40 PM: -- FYI, this change stops the test building against Hadoop-3.3.x as that's dropped commons-lang from its transitive dependencies. (Note: the relevant import of a commons-lang class isn't actually used; removing the line fixes things) was (Author: ste...@apache.org): FYI, this change stops the test building against Hadoop-3.3.x as that's dropped commons-lang from its transitive dependencies. > Some tests fail on windows > -- > > Key: PARQUET-1984 > URL: https://issues.apache.org/jira/browse/PARQUET-1984 > Project: Parquet > Issue Type: Bug > Components: parquet-mr, parquet-thrift >Affects Versions: 1.12.0 > Environment: Windows 10 >Reporter: Felix Schmalzel >Assignee: Felix Schmalzel >Priority: Minor > Fix For: 1.12.0 > > > Reasons: > * Expecting \n and getting \r\n > * Unclosed streams preventing a temporary file from being deleted > * File layout differences \ and / > * No native library for brotli, because the brotli-codec dependency only > shadows macos and linux native libraries. > > I've already developed a patch that would fix all the problems excluding the > brotli one. For that one we would have to wait until this > [https://github.com/rdblue/brotli-codec/pull/2] request is merged. I will > link the merge request for the other problems in the next few days. > Is there a related ticket that I have overlooked? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1984) Some tests fail on windows
[ https://issues.apache.org/jira/browse/PARQUET-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377466#comment-17377466 ] Steve Loughran commented on PARQUET-1984: - FYI, this change stops the test building against Hadoop-3.3.x as that's dropped commons-lang from its transitive dependencies. > Some tests fail on windows > -- > > Key: PARQUET-1984 > URL: https://issues.apache.org/jira/browse/PARQUET-1984 > Project: Parquet > Issue Type: Bug > Components: parquet-mr, parquet-thrift >Affects Versions: 1.12.0 > Environment: Windows 10 >Reporter: Felix Schmalzel >Assignee: Felix Schmalzel >Priority: Minor > Fix For: 1.12.0 > > > Reasons: > * Expecting \n and getting \r\n > * Unclosed streams preventing a temporary file from being deleted > * File layout differences \ and / > * No native library for brotli, because the brotli-codec dependency only > shadows macos and linux native libraries. > > I've already developed a patch that would fix all the problems excluding the > brotli one. For that one we would have to wait until this > [https://github.com/rdblue/brotli-codec/pull/2] request is merged. I will > link the merge request for the other problems in the next few days. > Is there a related ticket that I have overlooked? -- This message was sent by Atlassian Jira (v8.3.4#803005)