[jira] [Commented] (PARQUET-2346) Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9

2023-09-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764774#comment-17764774
 ] 

Steve Loughran commented on PARQUET-2346:
-

I don't know what happens, but do know that (a) dependabot is overaggressive 
and (b) hadoop 3.3.x is still using the 1.7 apis with reload4j, so a move is at 
risk of breaking everywhere

> Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9
> -
>
> Key: PARQUET-2346
> URL: https://issues.apache.org/jira/browse/PARQUET-2346
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
> Fix For: format-2.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2346) Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9

2023-09-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764774#comment-17764774
 ] 

Steve Loughran edited comment on PARQUET-2346 at 9/13/23 4:24 PM:
--

I don't know what happens, but do know that (a) dependabot is overaggressive and 
(b) hadoop 3.3.x is still using the 1.7 apis with reload4j, so a move is at 
risk of breaking everywhere


was (Author: ste...@apache.org):
I don't know what happens, but do know that (a;) depedabot is overaggressive 
and (b) hadoop 3.3.x is still using the 1.7 apis with reload4j, so a move is at 
risk of breaking everywhere

> Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9
> -
>
> Key: PARQUET-2346
> URL: https://issues.apache.org/jira/browse/PARQUET-2346
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
> Fix For: format-2.10.0
>
>






[jira] [Commented] (PARQUET-2346) Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9

2023-09-12 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764248#comment-17764248
 ] 

Steve Loughran commented on PARQUET-2346:
-

what is this going to do in terms of trying to use parquet in apps which aren't 
on the v2 apis themselves?

> Bump org.slf4j:slf4j-api from 1.7.12 to 2.0.9
> -
>
> Key: PARQUET-2346
> URL: https://issues.apache.org/jira/browse/PARQUET-2346
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
> Fix For: format-2.10.0
>
>






[jira] [Commented] (PARQUET-2338) CVE-2022-25168 in hadoop-common

2023-08-21 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756912#comment-17756912
 ] 

Steve Loughran commented on PARQUET-2338:
-

pr #1065 did this in 53ea34ac7eb98432a72e3c37cd48e4f02baf65ea; anything wrong 
with that commit, or is it just not the right branch?

> CVE-2022-25168 in hadoop-common
> ---
>
> Key: PARQUET-2338
> URL: https://issues.apache.org/jira/browse/PARQUET-2338
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-hadoop
>Affects Versions: 1.13.1
>Reporter: jincongho
>Priority: Major
>
> [CVE-2022-25168|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25168]
>  requires updating hadoop-common to 3.2.4 or 3.3.3.
> Although `FileUtils.untar` isn't used in parquet-hadoop, we would appreciate a 
> new parquet-hadoop release soon with this newer version. Otherwise 
> parquet-hadoop will be flagged as a security concern too.





[jira] [Commented] (PARQUET-2128) Bump Thrift to 0.16.0

2023-06-14 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732509#comment-17732509
 ] 

Steve Loughran commented on PARQUET-2128:
-

homebrew doesn't have anything < 0.18.0, which is java11+ only, so not 
something parquet can switch to.

which means that we have to stop using homebrew here and take control of our 
build dependencies ourselves. I've already done that with maven and openjdk as 
brew is too enthusiastic about breaking my workflow.

none of us can rely on homebrew, or use "homebrew doesn't have this" as a 
reason for reverting a change.

All old thrift releases can be found at https://archive.apache.org/dist/thrift/

> Bump Thrift to 0.16.0
> -
>
> Key: PARQUET-2128
> URL: https://issues.apache.org/jira/browse/PARQUET-2128
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Vinoo Ganesh
>Assignee: Vinoo Ganesh
>Priority: Minor
> Fix For: 1.12.3
>
>
> Thrift 0.16.0 has been released 
> https://github.com/apache/thrift/releases/tag/v0.16.0





[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format

2023-05-19 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724339#comment-17724339
 ] 

Steve Loughran commented on PARQUET-2171:
-

mukund, is there a PR up for this? even though it's not going to be merged, it 
needs to be shared for others to pick up

> Implement vectored IO in parquet file format
> 
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Mukund Thakur
>Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving 
> read performance for seek-heavy readers. Spark Jobs and others which use 
> parquet will greatly benefit from this api. Details can be found here 
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867





[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-05-01 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17718203#comment-17718203
 ] 

Steve Loughran commented on PARQUET-2276:
-

[~a2l] really? hadoop 2.8? why haven't they upgraded yet? that is a long way 
behind on any form of security updates, doesn't come with any guarantees of 
java8+ support etc. Even hadoop 2.9.x only gets CVE updates for hadoop's own 
code so that those people running their own clusters with private hadoop-2 
forks know what to pick up.

The PARQUET-2134 patch did not break Hadoop 2 compatibility; it used APIs 
which were in the version of Hadoop that parquet compiled against. What it did 
do was "explicitly break compatibility with a version of hadoop older than the 
one parquet was built against". That patch may have been the one to show the 
problem, but the reality is there are many other places where incompatibilities 
could've surfaced.

If you actually want to support hadoop-2.8.5 then the pom needs to be 
downgraded before anything else.

You also need to worry about Java8/7 compatibility. We're already in a problem 
where some of the java.nio classes in the java8 SDKs you can get have added 
more overridden bytebuffer methods than were in the original Oracle Java8, and 
https algorithms have been another moving target. So even within "java8" there 
is "original java8" and the openjdk/corretto/azul versions. While you can get 
away with building a modern library with a recent openjdk build, if you really 
are planning on supporting hadoop 2.8 suddenly all these issues surface. (I 
know this as when I have to go near the hadoop-2 line I have to use a docker 
image with java7, and since moving to a macbook m1 I can't do that any more.)
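The bytebuffer override problem above can be made concrete with a small sketch (illustrative, not parquet code): Java 9 widened the return type of methods like ByteBuffer.flip() from Buffer to ByteBuffer, so bytecode compiled on a newer JDK without --release 8 throws NoSuchMethodError on a Java 8 runtime. Casting to Buffer pins the call site to the method that exists everywhere.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

public class BufferCompat {
    /**
     * Fill a buffer and flip it for reading. The cast to Buffer means the
     * compiled call site is Buffer.flip(), present on every JDK, rather than
     * the covariant ByteBuffer.flip() override added in Java 9.
     */
    static ByteBuffer fillAndFlip(byte[] data) {
        ByteBuffer buf = ByteBuffer.allocate(data.length);
        buf.put(data);
        ((Buffer) buf).flip(); // portable across java 8 and 9+ runtimes
        return buf;
    }

    public static void main(String[] args) {
        System.out.println(fillAndFlip(new byte[]{1, 2, 3}).remaining()); // 3
    }
}
```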


> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0, 1.13.1
>
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].





[jira] [Commented] (PARQUET-2289) Avoid using hasCapability

2023-04-19 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713993#comment-17713993
 ] 

Steve Loughran commented on PARQUET-2289:
-

I'm not convinced here. the JIRA was about things not working against hadoop 
2.8, but as it was built on 2.9.x, there's no guarantee anything will link. in 
fact, as hadoop 2.8.x doesn't officially support java 8 (kerberos, s3a+joda 
time, ...), the release would have to be built with java7 to claim compatibility.

spark has just gone to hadoop 3.3+ only on their trunk. consider doing that too
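One way to sidestep the linkage problem is to probe for hasCapability reflectively, so nothing links against a method that is absent on hadoop 2.8. This is a sketch under stated assumptions, not what parquet actually shipped; ModernStream is a hypothetical stand-in for a hadoop 3 stream, and the capability string is only an example.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class CapabilityProbe {
    /**
     * Reflective probe: call hasCapability(String) if the stream's class
     * declares it, else report false. No compile-time dependency on the
     * method, so this loads fine against old hadoop releases.
     */
    static boolean hasCapability(Object stream, String capability) {
        try {
            Method m = stream.getClass().getMethod("hasCapability", String.class);
            return (Boolean) m.invoke(stream, capability);
        } catch (NoSuchMethodException e) {
            return false; // hadoop <= 2.8: the method simply is not there
        } catch (IllegalAccessException | InvocationTargetException e) {
            return false; // treat any probe failure as "capability absent"
        }
    }

    /** Toy stand-in for a modern stream exposing the capability method. */
    public static class ModernStream {
        public boolean hasCapability(String c) {
            return "in:readbytebuffer".equals(c);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasCapability(new ModernStream(), "in:readbytebuffer")); // true
        System.out.println(hasCapability(new Object(), "in:readbytebuffer"));       // false
    }
}
```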

> Avoid using hasCapability
> -
>
> Key: PARQUET-2289
> URL: https://issues.apache.org/jira/browse/PARQUET-2289
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-2276) ParquetReader reads do not work with Hadoop version 2.8.5

2023-04-14 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712325#comment-17712325
 ] 

Steve Loughran commented on PARQUET-2276:
-

hadoop 2.8 shipped 5 years ago. trying to build against any version of hadoop 2 
cripples you, and hadoop 2.8 *doesn't even work reliably on java 8*.

What exactly are you trying to run, and why?



> ParquetReader reads do not work with Hadoop version 2.8.5
> -
>
> Key: PARQUET-2276
> URL: https://issues.apache.org/jira/browse/PARQUET-2276
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Atul Mohan
>Priority: Major
>
> {{ParquetReader.read() fails with the following exception on parquet-mr 
> version 1.13.0 when using hadoop version 2.8.5:}}
> {code:java}
>  java.lang.NoSuchMethodError: 'boolean 
> org.apache.hadoop.fs.FSDataInputStream.hasCapability(java.lang.String)' 
> at 
> org.apache.parquet.hadoop.util.HadoopStreams.isWrappedStreamByteBufferReadable(HadoopStreams.java:74)
>  
> at org.apache.parquet.hadoop.util.HadoopStreams.wrap(HadoopStreams.java:49) 
> at 
> org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:787)
>  
> at 
> org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) 
> at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:162) 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> {code}
>  
>  
>  
> From an initial investigation, it looks like HadoopStreams has started using 
> [FSDataInputStream.hasCapability|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L74]
>  but _FSDataInputStream_ does not have the _hasCapability_ API in [hadoop 
> 2.8.x|https://hadoop.apache.org/docs/r2.8.3/api/org/apache/hadoop/fs/FSDataInputStream.html].





[jira] [Commented] (PARQUET-2277) Bump hadoop.version from 3.2.3 to 3.3.5

2023-04-14 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712323#comment-17712323
 ] 

Steve Loughran commented on PARQUET-2277:
-

happy. Have you considered cutting hadoop 2 support entirely? Because with 
parquet building on 3.3.5 you are in a position to take up Mukund's vector IO 
patch PARQUET-2171 and see significant speedup in local IO reads (java nio at 
work) and on s3 through the s3a connector (parallel range requests)

targeting hadoop 3.3.x only gives you an openFile() call where you can skip all 
HEAD probes and ask for random IO too. cloud speedup all round.

> Bump hadoop.version from 3.2.3 to 3.3.5
> ---
>
> Key: PARQUET-2277
> URL: https://issues.apache.org/jira/browse/PARQUET-2277
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 1.14.0
>
>






[jira] [Commented] (PARQUET-1989) Deep verification of encrypted files

2023-04-12 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711367#comment-17711367
 ] 

Steve Loughran commented on PARQUET-1989:
-

you might want to have a design which can do the scan on a spark rdd, where the 
rdd is simply the deep listFiles(path) scan of the directory tree. This would 
give the best scale for a massive dataset compared to even some parallelised 
scan in a single process.

I do have an RDD which can do line-by-line work, with locality of work 
determined on each file, which lets you schedule the work on the relevant hdfs 
nodes with the data; unfortunately it needs to be in the o.a.spark package to 
build
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala

...that could maybe be added to spark itself.
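The deep listing that would seed such an RDD can be sketched with stdlib types (an analogy only: java.nio Files.walk stands in for hadoop's recursive listFiles(path, true), and the class and method names here are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeepListing {
    /**
     * Stand-in for hadoop's recursive listFiles(path, true): return every
     * regular file under the tree. In the spark design sketched above, this
     * flat file list becomes the RDD the verification work is mapped over.
     */
    static List<Path> deepListFiles(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("deep-listing");
        Files.createFile(dir.resolve("a.parquet"));
        Path sub = Files.createDirectory(dir.resolve("sub"));
        Files.createFile(sub.resolve("b.parquet"));
        System.out.println(deepListFiles(dir).size()); // 2
    }
}
```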



> Deep verification of encrypted files
> 
>
> Key: PARQUET-1989
> URL: https://issues.apache.org/jira/browse/PARQUET-1989
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cli
>Reporter: Gidon Gershinsky
>Assignee: Maya Anderson
>Priority: Major
> Fix For: 1.14.0
>
>
> A tools that verifies encryption of parquet files in a given folder. Analyzes 
> the footer, and then every module (page headers, pages, column indexes, bloom 
> filters) - making sure they are encrypted (in relevant columns). Potentially 
> checking the encryption keys.
> We'll start with a design doc, open for discussion.





[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-28 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705940#comment-17705940
 ] 

Steve Loughran commented on PARQUET-2224:
-

it's not spark, it's a CycloneDX/maven thing. ironically, i only hit it because 
spark builds were failing on my laptop; i upgraded maven and got a version which 
wasn't compatible. The maven version in the hadoop docker builds is compatible.

maybe the solution is to make this a profile and require -Psbom to enable it 
if you know you are on a compatible maven version and want the files?
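The profile idea could look roughly like this in the root pom (an illustrative fragment only; the coordinates and goal are from the CycloneDX maven plugin and should be checked against its current docs, and the phase is a guess):

```xml
<!-- Only generate SBOMs when -Psbom is passed, so builds on maven
     versions the plugin dislikes are unaffected. -->
<profile>
  <id>sbom</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.cyclonedx</groupId>
        <artifactId>cyclonedx-maven-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>makeBom</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```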

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705646#comment-17705646
 ] 

Steve Loughran commented on PARQUET-2224:
-

+SPARK-42380

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705645#comment-17705645
 ] 

Steve Loughran commented on PARQUET-2224:
-

HADOOP-18641. didn't actually break the build, just printed stack traces and 
didn't do the manifests

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2224) Publish SBOM artifacts

2023-03-27 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705327#comment-17705327
 ] 

Steve Loughran commented on PARQUET-2224:
-

we had to roll this back from hadoop as the maven plugin didn't work with maven 
3.3.9. is it better now?

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>






[jira] [Commented] (PARQUET-2239) Replace log4j1 with reload4j

2023-02-06 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684607#comment-17684607
 ] 

Steve Loughran commented on PARQUET-2239:
-

good, but trickier than you think as you have to do lots of excludes of other 
imports of log4j, slf4j

look at HADOOP-18088 as an example of what to do
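Following the HADOOP-18088 pattern, the excludes look roughly like this (an illustrative fragment; the exact artifact list to exclude varies with the hadoop version in use, and the version property is a placeholder):

```xml
<!-- keep log4j 1.x and its slf4j binding off the classpath,
     then declare reload4j directly -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <exclusions>
    <exclusion>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>ch.qos.reload4j</groupId>
  <artifactId>reload4j</artifactId>
  <version>${reload4j.version}</version>
</dependency>
```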

> Replace log4j1 with reload4j
> 
>
> Key: PARQUET-2239
> URL: https://issues.apache.org/jira/browse/PARQUET-2239
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Akshat Mathur
>Priority: Major
>  Labels: pick-me-up
>
> Due to multiple CVE in log4j1, replace log4j dependency with reload4j.
> More about reload4j: https://reload4j.qos.ch/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

2023-02-02 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved PARQUET-2173.
-
Fix Version/s: 1.13.0
   Resolution: Fixed

> Fix parquet build against hadoop 3.3.3+
> ---
>
> Key: PARQUET-2173
> URL: https://issues.apache.org/jira/browse/PARQUET-2173
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
> Fix For: 1.13.0
>
>
> parquet won't build against hadoop 3.3.3+ because it swapped out log4j 1.17 
> for reload4j, and this creates maven dependency problems in parquet cli
> {code}
> [INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli 
> ---
> [WARNING] Used undeclared dependencies found:
> [WARNING]ch.qos.reload4j:reload4j:jar:1.2.22:provided
> {code}
> the hadoop common dependencies need to exclude this jar and any changed slf4j 
> ones.





[jira] [Commented] (PARQUET-2216) Parquet writer classes don't close underlying output stream in case of errors.

2022-12-02 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17642453#comment-17642453
 ] 

Steve Loughran commented on PARQUET-2216:
-

* OutputFile may not implement Closeable, but the {{PositionOutputStream}} 
returned by the {{create()}} method does.
* the writer close chain seems to go all the way through too.

looking at the code, one place for improvement would be for 
{{ParquetFileWriter.end(Map extraMetaData)}} to close its 
output stream in a finally clause, so even if the write of the footer 
failed, the local fs client would do all it can to clean up, release connections 
etc.
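The close-in-finally pattern proposed above can be sketched with stdlib streams (a minimal sketch, not parquet's actual code; names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CloseOnFailure {
    /** Write the footer bytes; close the stream even when the write fails. */
    static void writeFooterAndClose(OutputStream out, byte[] footer) throws IOException {
        try {
            out.write(footer);
        } finally {
            out.close(); // runs on both the success and the failure path
        }
    }

    /** Returns true if the stream got closed despite a simulated write failure. */
    static boolean demoClosesOnFailure() {
        final boolean[] closed = {false};
        OutputStream failing = new FilterOutputStream(new ByteArrayOutputStream()) {
            @Override public void write(byte[] b) throws IOException {
                throw new IOException("simulated footer write failure");
            }
            @Override public void close() throws IOException {
                closed[0] = true;
                super.close();
            }
        };
        try {
            writeFooterAndClose(failing, new byte[]{0});
        } catch (IOException expected) {
            // the failure still propagates; the stream is closed first
        }
        return closed[0];
    }

    public static void main(String[] args) {
        System.out.println("closed=" + demoClosesOnFailure()); // closed=true
    }
}
```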

> Parquet writer classes don't close underlying output stream in case of errors.
> --
>
> Key: PARQUET-2216
> URL: https://issues.apache.org/jira/browse/PARQUET-2216
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Andrei Lopukhov
>Priority: Major
> Attachments: TestExample.java
>
>
> org.apache.parquet.io.OutputFile interface does not implement Closeable.
> In my opinion it implies that created streams are fully managed by parquet-mr 
> classes.
> Unfortunately opened stream will not be closed in case of IO or other failure.
> There are two places I can find for this problem:
> * During writer creation 
> (org.apache.parquet.hadoop.ParquetWriter.Builder#build()) - created stream 
> should be closed if writer creation fails.
> * During writer close(org.apache.parquet.hadoop.ParquetWriter#close) - 
> underlying stream should be closed regardless of any faced failures.
> Although I didn't examine ParquetReaded that much.





[jira] [Created] (PARQUET-2173) Fix parquet build against hadoop 3.3.3+

2022-08-16 Thread Steve Loughran (Jira)
Steve Loughran created PARQUET-2173:
---

 Summary: Fix parquet build against hadoop 3.3.3+
 Key: PARQUET-2173
 URL: https://issues.apache.org/jira/browse/PARQUET-2173
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cli
Affects Versions: 1.13.0
Reporter: Steve Loughran


parquet won't build against hadoop 3.3.3+ because it swapped out log4j 1.17 for 
reload4j, and this creates maven dependency problems in parquet cli


{code}
[INFO] --- maven-dependency-plugin:3.1.1:analyze-only (default) @ parquet-cli 
---
[WARNING] Used undeclared dependencies found:
[WARNING]ch.qos.reload4j:reload4j:jar:1.2.22:provided

{code}

the hadoop common dependencies need to exclude this jar and any changed slf4j 
ones.





[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format

2022-08-11 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578637#comment-17578637
 ] 

Steve Loughran commented on PARQUET-2171:
-

bq. I have found ByteBuffer to impose a nontrivial amount of overhead, and you 
might want to consider providing array-based methods as well.

mixed feelings. it's hard to work with but some libraries (parquet...) love it, 
which partly drove our use of it. if you use on-heap buffers it's just arrays 
with more hassle.

FWIW, i was looking at some of the parquet read code and concluding that the 
s3a FS should implement read(ByteBuffer) as a single vectored IO read. 
currently the base class implementation reads into a temp byte array and so 
breaks prefetching... the s3a FS only sees the read(bytes) of the shorter array, 
not the full amount wanted
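The base-class behaviour described can be sketched with stdlib types (hypothetical names; the real logic lives in hadoop's stream hierarchy): a generic read(ByteBuffer) that drains through a temporary array issues one plain read(), so the wrapped stream only ever sees the short read, never the full range the caller asked for.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class FallbackRead {
    /**
     * Generic fallback: copy through a temp array. The wrapped stream sees a
     * single read(byte[]) returning whatever it has in one call, which may be
     * far less than buf.remaining() -- defeating prefetch keyed on request size.
     */
    static int readIntoBuffer(InputStream in, ByteBuffer buf) throws IOException {
        byte[] tmp = new byte[buf.remaining()];
        int n = in.read(tmp, 0, tmp.length); // possibly a short read
        if (n > 0) {
            buf.put(tmp, 0, n);
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(8);
        int n = readIntoBuffer(new ByteArrayInputStream(new byte[5]), buf);
        System.out.println(n); // 5: the stream had 5 bytes, the caller wanted 8
    }
}
```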

> Implement vectored IO in parquet file format
> 
>
> Key: PARQUET-2171
> URL: https://issues.apache.org/jira/browse/PARQUET-2171
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Mukund Thakur
>Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving 
> read performance for seek-heavy readers. Spark Jobs and others which use 
> parquet will greatly benefit from this api. Details can be found here 
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867





[jira] [Resolved] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0

2022-08-01 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved PARQUET-2158.
-
Fix Version/s: 1.13.0
   Resolution: Fixed

> Upgrade Hadoop dependency to version 3.2.0
> --
>
> Key: PARQUET-2158
> URL: https://issues.apache.org/jira/browse/PARQUET-2158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
> Fix For: 1.13.0
>
>
> Parquet still builds against Hadoop 2.10. This is very out of date and does 
> not work with java 11, let alone later releases.
> Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with 
> java 11, and lines up with active work on  HADOOP-18287,  _Provide a shim 
> library for modern FS APIs_ 
> This will significantly speed up access to columnar data, especially  in 
> cloud stores.





[jira] [Resolved] (PARQUET-2150) parquet-protobuf to compile on mac M1

2022-07-19 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved PARQUET-2150.
-
Resolution: Not A Problem

with PARQUET-2155 this problem is implicitly fixed.

> parquet-protobuf to compile on mac M1
> -
>
> Key: PARQUET-2150
> URL: https://issues.apache.org/jira/browse/PARQUET-2150
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> parquet-protobuf module fails to compile on Mac M1 because the maven protoc 
> plugin cannot find the native osx-aarch_64:3.16.1  binary.
> the build needs to be tweaked to pick up the x86 binaries





[jira] [Updated] (PARQUET-2165) remove deprecated PathGlobPattern and DeprecatedFieldProjectionFilter to compile on hadoop 3.2+

2022-07-12 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated PARQUET-2165:

Summary: remove deprecated PathGlobPattern and 
DeprecatedFieldProjectionFilter to compile on hadoop 3.2+  (was: remove 
deprecated PathGlobPattern to compile on hadoop 3.2+)

> remove deprecated PathGlobPattern and DeprecatedFieldProjectionFilter to 
> compile on hadoop 3.2+
> ---
>
> Key: PARQUET-2165
> URL: https://issues.apache.org/jira/browse/PARQUET-2165
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-thrift
>Affects Versions: 1.12.3
>Reporter: Steve Loughran
>Priority: Major
>
> remove the deprecated PathGlobPattern class and its uses from parquet-thrift
> The return types from the hadoop GlobPattern code changed in HADOOP-12436; 
> the class as-is will not compile against hadoop 3.x
> Parquet releases compiled against hadoop 2.x will not be able to instantiate 
> these classes on a hadoop 3 release, because things will not link.
> Nobody appears to have complained about the linkage problem to the extent of 
> filing a JIRA. 





[jira] [Created] (PARQUET-2165) remove deprecated PathGlobPattern to compile on hadoop 3.2+

2022-07-12 Thread Steve Loughran (Jira)
Steve Loughran created PARQUET-2165:
---

 Summary: remove deprecated PathGlobPattern to compile on hadoop 
3.2+
 Key: PARQUET-2165
 URL: https://issues.apache.org/jira/browse/PARQUET-2165
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-thrift
Affects Versions: 1.12.3
Reporter: Steve Loughran


remove the deprecated PathGlobPattern class and its uses from parquet-thrift

The return types from the hadoop GlobPattern code changed in HADOOP-12436; the 
class as-is will not compile against hadoop 3.x

Parquet releases compiled against hadoop 2.x will not be able to instantiate 
these classes on a hadoop 3 release, because things will not link.

Nobody appears to have complained about the linkage problem to the extent of 
filing a JIRA. 







[jira] [Commented] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0

2022-06-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553552#comment-17553552
 ] 

Steve Loughran commented on PARQUET-2158:
-

build is broken by  HADOOP-12436


{code}
Error:  Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) 
on project parquet-thrift: Compilation failure
Error:  
/home/runner/work/parquet-mr/parquet-mr/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:[55,49]
 incompatible types: com.google.re2j.Pattern cannot be converted to 
java.util.regex.Pattern
{code}

That was a change to a class believed to be private; clearly not. 



> Upgrade Hadoop dependency to version 3.2.0
> --
>
> Key: PARQUET-2158
> URL: https://issues.apache.org/jira/browse/PARQUET-2158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> Parquet still builds against Hadoop 2.10. This is very out of date and does 
> not work with java 11, let alone later releases.
> Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with 
> java 11, and lines up with active work on  HADOOP-18287,  _Provide a shim 
> library for modern FS APIs_ 
> This will significantly speed up access to columnar data, especially  in 
> cloud stores.





[jira] [Created] (PARQUET-2158) Upgrade Hadoop dependency to version 3.2.0

2022-06-13 Thread Steve Loughran (Jira)
Steve Loughran created PARQUET-2158:
---

 Summary: Upgrade Hadoop dependency to version 3.2.0
 Key: PARQUET-2158
 URL: https://issues.apache.org/jira/browse/PARQUET-2158
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.13.0
Reporter: Steve Loughran



Parquet still builds against Hadoop 2.10. This is very out of date and does not 
work with Java 11, let alone later releases.

Upgrading the dependency to Hadoop 3.2.0 makes the release compatible with Java 
11, and lines up with active work on HADOOP-18287, _Provide a shim library 
for modern FS APIs_.

This will significantly speed up access to columnar data, especially in cloud 
stores.





[jira] [Updated] (PARQUET-2151) Drop Hadoop 2 input stream reflection from parquet-hadoop

2022-06-07 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated PARQUET-2151:

Description: 
Parquet uses reflection to load a hadoop2 input stream, falling back to a 
hadoop-1 compatible client if not found.

All Hadoop 2.0.2+ releases work with H2SeekableInputStream, so the binding to 
H2SeekableInputStream can be reworked to avoid needing reflection. This would make it 
a lot easier to probe for/use the ByteBuffer input, and line the code up with 
more recent Hadoop releases.

H1SeekableInputStream is still needed to handle streams without 
ByteBufferReadable.

At some point, support for ByteBufferPositionedReadable will be needed, because that 
is really what Parquet wants; that is where reflection will still be required.
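A minimal sketch of what the non-reflective binding could look like. The types below are local stand-ins for the Hadoop classes (`FSDataInputStream`, `ByteBufferReadable`) and the return value is purely illustrative; this is not the real parquet-hadoop API:

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

public class StreamBinding {

  // Stand-in for org.apache.hadoop.fs.ByteBufferReadable (an assumption
  // for this sketch, not the real interface).
  interface ByteBufferReadable {
    int read(ByteBuffer buf);
  }

  // Stand-in for FSDataInputStream: a wrapper exposing its inner stream.
  static class FSDataInputStream extends InputStream {
    final InputStream wrapped;

    FSDataInputStream(InputStream wrapped) {
      this.wrapped = wrapped;
    }

    InputStream getWrappedStream() {
      return wrapped;
    }

    @Override
    public int read() {
      return -1;
    }
  }

  /**
   * Choose the H2 (ByteBuffer) path iff the inner stream supports it:
   * a plain instanceof probe, with no Class.forName/reflection needed
   * once Hadoop 1 support is dropped.
   */
  static String wrap(FSDataInputStream s) {
    if (s.getWrappedStream() instanceof ByteBufferReadable) {
      return "H2SeekableInputStream"; // direct ByteBuffer reads
    }
    return "H1SeekableInputStream";   // copy through a byte[] buffer
  }
}
```

The point of the sketch is only that the capability probe becomes a compile-time type check rather than a reflective class load.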




  was:
Parquet uses reflection to load a hadoop2 input stream, falling back to a 
hadoop-1 compatible client if not found.

All hadoop 2.0.2+ releases work with H2SeekableInputStream, so 
H1SeekableInputStream can be cut and the binding to H2SeekableInputStream 
reworked to avoid needing reflection. This would make it a lot easier to probe 
for/use the bytebuffer input, and line the code up for more recent hadoop 
releases.





> Drop Hadoop 2 input stream reflection from parquet-hadoop 
> --
>
> Key: PARQUET-2151
> URL: https://issues.apache.org/jira/browse/PARQUET-2151
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Parquet uses reflection to load a hadoop2 input stream, falling back to a 
> hadoop-1 compatible client if not found.
> All Hadoop 2.0.2+ releases work with H2SeekableInputStream, so the binding to 
> H2SeekableInputStream can be reworked to avoid needing reflection. This would make 
> it a lot easier to probe for/use the ByteBuffer input, and line the code up 
> with more recent Hadoop releases.
> H1SeekableInputStream is still needed to handle streams without 
> ByteBufferReadable.
> At some point, support for ByteBufferPositionedReadable will be needed, because 
> that is really what Parquet wants; that is where reflection will still be required.





[jira] [Updated] (PARQUET-2151) Drop Hadoop 2 input stream reflection from parquet-hadoop

2022-06-07 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated PARQUET-2151:

Summary: Drop Hadoop 2 input stream reflection from parquet-hadoop   (was: 
Drop Hadoop 1 input stream support from parquet-hadoop )

> Drop Hadoop 2 input stream reflection from parquet-hadoop 
> --
>
> Key: PARQUET-2151
> URL: https://issues.apache.org/jira/browse/PARQUET-2151
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Parquet uses reflection to load a hadoop2 input stream, falling back to a 
> hadoop-1 compatible client if not found.
> All hadoop 2.0.2+ releases work with H2SeekableInputStream, so 
> H1SeekableInputStream can be cut and the binding to H2SeekableInputStream 
> reworked to avoid needing reflection. This would make it a lot easier to 
> probe for/use the bytebuffer input, and line the code up for more recent 
> hadoop releases.





[jira] [Updated] (PARQUET-2151) Drop Hadoop 1 input stream support from parquet-hadoop

2022-06-07 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated PARQUET-2151:

Description: 
Parquet uses reflection to load a hadoop2 input stream, falling back to a 
hadoop-1 compatible client if not found.

All hadoop 2.0.2+ releases work with H2SeekableInputStream, so 
H1SeekableInputStream can be cut and the binding to H2SeekableInputStream 
reworked to avoid needing reflection. This would make it a lot easier to probe 
for/use the bytebuffer input, and line the code up for more recent hadoop 
releases.




  was:
Parquet uses reflection to load a hadoop2 input stream, falling back to a 
hadoop-1 compatible client if not found.

All hadoop 2.0.2+ releases work with H2SeekableInputStream, so 
H1SeekableInputStream can be cut and the binding to H2SeekableInputStream 
reworked to avoid needing reflection. This would make it a lot easier to probe 
for/use the bytebuffer input, and line the code up for more recent hadoop 
releases.

One thing H1SeekableInputStream does do is read into a temp array if the 
FSDataInputStream doesn't support ByteBuffer reads, that is, doesn't implement 
ByteBufferReadable.
But FSDataInputStream simply forwards that call to the inner stream, if it too 
implements ByteBufferReadable. Filesystems which don't (the cloud stores) can't 
be read through H2SeekableInputStream.read(ByteBuffer). If this is 
desired, H2SeekableInputStream will need to dynamically downgrade to 
DelegatingSeekableInputStream's base methods if a call to 
FSDataInputStream.read(ByteBuffer) fails.




> Drop Hadoop 1 input stream support from parquet-hadoop 
> ---
>
> Key: PARQUET-2151
> URL: https://issues.apache.org/jira/browse/PARQUET-2151
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Parquet uses reflection to load a hadoop2 input stream, falling back to a 
> hadoop-1 compatible client if not found.
> All hadoop 2.0.2+ releases work with H2SeekableInputStream, so 
> H1SeekableInputStream can be cut and the binding to H2SeekableInputStream 
> reworked to avoid needing reflection. This would make it a lot easier to 
> probe for/use the bytebuffer input, and line the code up for more recent 
> hadoop releases.





[jira] [Updated] (PARQUET-2151) Drop Hadoop 1 input stream support from parquet-hadoop

2022-06-06 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated PARQUET-2151:

Summary: Drop Hadoop 1 input stream support from parquet-hadoop   (was: 
parquet-hadoop to drop Hadoop 1 input stream support)

> Drop Hadoop 1 input stream support from parquet-hadoop 
> ---
>
> Key: PARQUET-2151
> URL: https://issues.apache.org/jira/browse/PARQUET-2151
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Parquet uses reflection to load a hadoop2 input stream, falling back to a 
> hadoop-1 compatible client if not found.
> All hadoop 2.0.2+ releases work with H2SeekableInputStream, so 
> H1SeekableInputStream can be cut and the binding to H2SeekableInputStream 
> reworked to avoid needing reflection. This would make it a lot easier to 
> probe for/use the bytebuffer input, and line the code up for more recent 
> hadoop releases.
> One thing H1SeekableInputStream does do is read into a temp array if the 
> FSDataInputStream doesn't support ByteBuffer reads, that is, doesn't implement 
> ByteBufferReadable.
> But FSDataInputStream simply forwards that call to the inner stream, if it too 
> implements ByteBufferReadable. Filesystems which don't (the cloud stores) 
> can't be read through H2SeekableInputStream.read(ByteBuffer). If this is 
> desired, H2SeekableInputStream will need to dynamically downgrade to 
> DelegatingSeekableInputStream's base methods if a call to 
> FSDataInputStream.read(ByteBuffer) fails.





[jira] [Commented] (PARQUET-2134) Incorrect type checking in HadoopStreams.wrap

2022-06-06 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550592#comment-17550592
 ] 

Steve Loughran commented on PARQUET-2134:
-

Do you have a full stack trace? IMO the issue isn't type checking; it is 
handling streams from filesystem clients which don't support the 
ByteBufferReadable interface.
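The recursive check the ticket proposes might look like the following sketch. The types here are local stand-ins for the Hadoop classes, not the real API, so treat the names as assumptions:

```java
import java.io.InputStream;

public class UnwrapCheck {

  // Local stand-ins for org.apache.hadoop.fs.ByteBufferReadable and
  // FSDataInputStream (assumed shapes, for illustration only).
  interface ByteBufferReadable {}

  static class FSDataInputStream extends InputStream {
    final InputStream wrapped;

    FSDataInputStream(InputStream wrapped) {
      this.wrapped = wrapped;
    }

    InputStream getWrappedStream() {
      return wrapped;
    }

    @Override
    public int read() {
      return -1;
    }
  }

  /**
   * Recursively unwrap nested FSDataInputStream layers so that a
   * wrapper over a wrapper is judged by the innermost stream's
   * capabilities, not by whatever the first layer happens to implement.
   */
  static boolean isByteBufferReadable(InputStream s) {
    if (s instanceof FSDataInputStream) {
      return isByteBufferReadable(((FSDataInputStream) s).getWrappedStream());
    }
    return s instanceof ByteBufferReadable;
  }
}
```

With this probe, a CustomDataInputStream wrapping a non-ByteBufferReadable stream would be routed to the byte[]-copying path instead of failing at read time.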

> Incorrect type checking in HadoopStreams.wrap
> -
>
> Key: PARQUET-2134
> URL: https://issues.apache.org/jira/browse/PARQUET-2134
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.3, 1.10.1, 1.11.2, 1.12.2
>Reporter: Todd Gao
>Priority: Minor
>
> The method 
> [HadoopStreams.wrap|https://github.com/apache/parquet-mr/blob/4d062dc37577e719dcecc666f8e837843e44a9be/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L51]
>  wraps an FSDataInputStream to a SeekableInputStream. 
> It checks whether the underlying stream of the passed  FSDataInputStream 
> implements ByteBufferReadable: if true, wraps the FSDataInputStream to 
> H2SeekableInputStream; otherwise, wraps to H1SeekableInputStream.
> In some cases, we may add another wrapper over FSDataInputStream. For 
> example, 
> {code:java}
> class CustomDataInputStream extends FSDataInputStream {
> public CustomDataInputStream(FSDataInputStream original) {
> super(original);
> }
> }
> {code}
> When we create an FSDataInputStream whose underlying stream does not 
> implement ByteBufferReadable, and then create a CustomDataInputStream with 
> it, using HadoopStreams.wrap to create a SeekableInputStream may produce an 
> error like 
> {quote}java.lang.UnsupportedOperationException: Byte-buffer read unsupported 
> by input stream{quote}
> We can fix this by recursively checking the underlying stream of the 
> FSDataInputStream.





[jira] [Created] (PARQUET-2151) parquet-hadoop to drop Hadoop 1 input stream support

2022-06-06 Thread Steve Loughran (Jira)
Steve Loughran created PARQUET-2151:
---

 Summary: parquet-hadoop to drop Hadoop 1 input stream support
 Key: PARQUET-2151
 URL: https://issues.apache.org/jira/browse/PARQUET-2151
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.13.0
Reporter: Steve Loughran


Parquet uses reflection to load a hadoop2 input stream, falling back to a 
hadoop-1 compatible client if not found.

All hadoop 2.0.2+ releases work with H2SeekableInputStream, so 
H1SeekableInputStream can be cut and the binding to H2SeekableInputStream 
reworked to avoid needing reflection. This would make it a lot easier to probe 
for/use the bytebuffer input, and line the code up for more recent hadoop 
releases.

One thing H1SeekableInputStream does do is read into a temp array if the 
FSDataInputStream doesn't support ByteBuffer reads, that is, doesn't implement 
ByteBufferReadable.
But FSDataInputStream simply forwards that call to the inner stream, if it too 
implements ByteBufferReadable. Filesystems which don't (the cloud stores) can't 
be read through H2SeekableInputStream.read(ByteBuffer). If this is 
desired, H2SeekableInputStream will need to dynamically downgrade to 
DelegatingSeekableInputStream's base methods if a call to 
FSDataInputStream.read(ByteBuffer) fails.
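The dynamic downgrade described above could be sketched as follows. `readDirect` is a hypothetical placeholder for the `FSDataInputStream.read(ByteBuffer)` call (a plain InputStream has no ByteBuffer read, so the sketch always refuses), and the fallback mirrors the temp-array copy that DelegatingSeekableInputStream's base methods amount to:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class FallbackRead {

  /**
   * Try the ByteBuffer read path first; if the stream refuses (as
   * FSDataInputStream does when its inner stream lacks
   * ByteBufferReadable), downgrade to a heap-array copy.
   */
  static int read(InputStream in, ByteBuffer buf) throws IOException {
    try {
      return readDirect(in, buf); // fast path: stream fills the buffer itself
    } catch (UnsupportedOperationException e) {
      byte[] tmp = new byte[buf.remaining()]; // slow path: temp array copy
      int n = in.read(tmp, 0, tmp.length);
      if (n > 0) {
        buf.put(tmp, 0, n);
      }
      return n;
    }
  }

  // Placeholder for FSDataInputStream.read(ByteBuffer); always throws
  // here, which is exactly the failure mode the downgrade must absorb.
  static int readDirect(InputStream in, ByteBuffer buf) {
    throw new UnsupportedOperationException(
        "Byte-buffer read unsupported by input stream");
  }
}
```

A real implementation would presumably remember the downgrade after the first failure rather than paying for the exception on every read.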







[jira] [Commented] (PARQUET-2150) parquet-protobuf to compile on mac M1

2022-05-23 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541068#comment-17541068
 ] 

Steve Loughran commented on PARQUET-2150:
-

Same issue and solution as HADOOP-17939.

> parquet-protobuf to compile on mac M1
> -
>
> Key: PARQUET-2150
> URL: https://issues.apache.org/jira/browse/PARQUET-2150
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Affects Versions: 1.13.0
>Reporter: Steve Loughran
>Priority: Major
>
> parquet-protobuf module fails to compile on Mac M1 because the maven protoc 
> plugin cannot find the native osx-aarch_64:3.16.1 binary.
> The build needs to be tweaked to pick up the x86 binaries.





[jira] [Created] (PARQUET-2150) parquet-protobuf to compile on mac M1

2022-05-23 Thread Steve Loughran (Jira)
Steve Loughran created PARQUET-2150:
---

 Summary: parquet-protobuf to compile on mac M1
 Key: PARQUET-2150
 URL: https://issues.apache.org/jira/browse/PARQUET-2150
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-protobuf
Affects Versions: 1.13.0
Reporter: Steve Loughran


parquet-protobuf module fails to compile on Mac M1 because the maven protoc 
plugin cannot find the native osx-aarch_64:3.16.1 binary.

The build needs to be tweaked to pick up the x86 binaries.
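One common way such a tweak is done (an assumption about the fix, not the committed patch; plugin coordinates and the classifier below are illustrative) is to pin the protoc artifact to the x86_64 classifier so Apple silicon runs it under Rosetta 2:

{code:xml}
<plugin>
  <groupId>org.xolstice.maven.plugins</groupId>
  <artifactId>protobuf-maven-plugin</artifactId>
  <configuration>
    <!-- force the x86_64 protoc instead of ${os.detected.classifier},
         which resolves to the missing osx-aarch_64 binary on an M1 -->
    <protocArtifact>com.google.protobuf:protoc:3.16.1:exe:osx-x86_64</protocArtifact>
  </configuration>
</plugin>
{code}

Alternatively, upgrading to a protoc release that ships an osx-aarch_64 binary removes the need for the override.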





[jira] [Commented] (PARQUET-1615) getRecordWriter shouldn't hardcode CREAT mode when new ParquetFileWriter

2021-08-02 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391699#comment-17391699
 ] 

Steve Loughran commented on PARQUET-1615:
-

Just looking at this. Any specific reason not to make overwrite the default?

It saves the overhead of an HTTP HEAD probe against any of the object stores, 
and for any job where the output committer has created a unique directory for 
that task attempt, there is no risk of conflict unless somehow the task really 
has decided to create two files with the same name.

Has anyone ever encountered conflicts?
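For illustration, the two create modes can be mimicked with java.nio; this is only an analogy for the overwrite flag on Hadoop's FileSystem.create, not the Parquet code. Create-only semantics must check for an existing file first (the HEAD probe on an object store pays for that check), while overwrite semantics just write:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.CREATE_NEW;
import static java.nio.file.StandardOpenOption.TRUNCATE_EXISTING;
import static java.nio.file.StandardOpenOption.WRITE;

public class CreateModes {

  // CREATE-only semantics: fail if the file already exists. This is the
  // mode getRecordWriter hardcodes, and it implies an existence check.
  static void createNoOverwrite(Path p) throws IOException {
    Files.newOutputStream(p, CREATE_NEW, WRITE).close();
  }

  // OVERWRITE semantics: clobber any existing file; no existence check
  // is needed, which is why it is cheaper against cloud stores.
  static void createOverwrite(Path p) throws IOException {
    Files.newOutputStream(p, CREATE, TRUNCATE_EXISTING, WRITE).close();
  }
}
```

In a task-attempt directory that the committer guarantees is unique, the existence check that CREATE-only buys is arguably paying for a conflict that cannot happen.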

> getRecordWriter shouldn't hardcode CREAT mode when new ParquetFileWriter
> 
>
> Key: PARQUET-1615
> URL: https://issues.apache.org/jira/browse/PARQUET-1615
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> getRecordWriter shouldn't hardcode CREAT mode when new ParquetFileWriter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1984) Some tests fail on windows

2021-07-08 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377466#comment-17377466
 ] 

Steve Loughran edited comment on PARQUET-1984 at 7/8/21, 3:40 PM:
--

FYI, this change stops the test building against Hadoop 3.3.x, as that has 
dropped commons-lang from its transitive dependencies.

(Note: the relevant import of a commons-lang class isn't actually used; 
removing the line fixes things.)


was (Author: ste...@apache.org):
FYI, this change stops the test building against Hadoop 3.3.x, as that has 
dropped commons-lang from its transitive dependencies.

> Some tests fail on windows
> --
>
> Key: PARQUET-1984
> URL: https://issues.apache.org/jira/browse/PARQUET-1984
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr, parquet-thrift
>Affects Versions: 1.12.0
> Environment: Windows 10
>Reporter: Felix Schmalzel
>Assignee: Felix Schmalzel
>Priority: Minor
> Fix For: 1.12.0
>
>
> Reasons:
>  * Expecting \n and getting \r\n
>  * Unclosed streams preventing a temporary file from being deleted
>  * File layout differences \ and /
>  * No native library for brotli, because the brotli-codec dependency only 
> shadows macos and linux native libraries.
>  
> I've already developed a patch that would fix all the problems excluding the 
> brotli one. For that one we would have to wait until this 
> [https://github.com/rdblue/brotli-codec/pull/2] request is merged. I will 
> link the merge request for the other problems in the next few days.
> Is there a related ticket that i have overlooked?





[jira] [Commented] (PARQUET-1984) Some tests fail on windows

2021-07-08 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377466#comment-17377466
 ] 

Steve Loughran commented on PARQUET-1984:
-

FYI, this change stops the test building against Hadoop 3.3.x, as that has 
dropped commons-lang from its transitive dependencies.

> Some tests fail on windows
> --
>
> Key: PARQUET-1984
> URL: https://issues.apache.org/jira/browse/PARQUET-1984
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr, parquet-thrift
>Affects Versions: 1.12.0
> Environment: Windows 10
>Reporter: Felix Schmalzel
>Assignee: Felix Schmalzel
>Priority: Minor
> Fix For: 1.12.0
>
>
> Reasons:
>  * Expecting \n and getting \r\n
>  * Unclosed streams preventing a temporary file from being deleted
>  * File layout differences \ and /
>  * No native library for brotli, because the brotli-codec dependency only 
> shadows macos and linux native libraries.
>  
> I've already developed a patch that would fix all the problems excluding the 
> brotli one. For that one we would have to wait until this 
> [https://github.com/rdblue/brotli-codec/pull/2] request is merged. I will 
> link the merge request for the other problems in the next few days.
> Is there a related ticket that i have overlooked?


