[jira] [Commented] (PARQUET-2212) Add ByteBuffer api for decryptors to allow direct memory to be decrypted
[ https://issues.apache.org/jira/browse/PARQUET-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711641#comment-17711641 ]

ASF GitHub Bot commented on PARQUET-2212:
-----------------------------------------

wgtmac commented on PR #1008: URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1506231596

> Sure. What does one need to do? I believe all the comments are addressed and CI seems to be failing for unrelated reasons (is there a way to re-trigger the failed tests?).

Let's try rebasing and force-pushing to see whether the CI turns green.

> Add ByteBuffer api for decryptors to allow direct memory to be decrypted
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-2212
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2212
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.3
>            Reporter: Parth Chandra
>            Priority: Major
>
> The decrypt API in BlockCipher.Decryptor currently only provides a method that takes a byte array:
> {code:java}
> byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);{code}
> A Parquet reader that uses the DirectByteBufferAllocator has to incur the cost of copying the data into a byte array (and sometimes back into a DirectByteBuffer) to decrypt it.
> This proposes adding a new API that accepts a ByteBuffer as input and avoids the copy:
> {code:java}
> ByteBuffer decrypt(ByteBuffer from, byte[] AAD);{code}
> The decryption in ColumnChunkPageReadStore can also be updated to use the ByteBuffer-based API when the buffer is a DirectByteBuffer. If the buffer is a HeapByteBuffer, we can continue to use the byte-array API, since that incurs no copy when the underlying byte array is accessed.
> Also, some investigation has shown that decryption with ByteBuffers cannot use hardware acceleration in JVMs before JDK 17. In those cases, overall decryption is faster with byte arrays, even after the overhead of making a copy.
> The proposal, then, is to enable the ByteBuffer API for DirectByteBuffers only, and only if the JDK is 17 or higher or the user explicitly configures it.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
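The gating the proposal describes (ByteBuffer path only for direct buffers, and only on JDK 17+ or with an explicit opt-in) can be sketched as a small helper. This is a hypothetical illustration — the class, method, and parameter names below are invented for the example and are not part of parquet-mr:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the proposed gating logic; names are invented
// for illustration and do not exist in parquet-mr.
public class DecryptPathChooser {

    /**
     * Decide whether to take the ByteBuffer decrypt path.
     *
     * @param buf          the input buffer holding length + ciphertext
     * @param jdkMajor     the running JDK's major version
     * @param forceEnabled true if the user explicitly enabled the ByteBuffer path
     */
    static boolean useByteBufferDecrypt(ByteBuffer buf, int jdkMajor, boolean forceEnabled) {
        if (!buf.isDirect()) {
            // Heap buffers: the byte[] API is already copy-free via buf.array().
            return false;
        }
        // Direct buffers: ByteBuffer decryption only benefits from hardware
        // acceleration on JDK 17+, so gate on the version or an explicit opt-in.
        return jdkMajor >= 17 || forceEnabled;
    }
}
```

With this shape, a heap buffer always stays on the byte-array path, and a direct buffer uses the new API only when it is expected to be faster or when the user has asked for it.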
[GitHub] [parquet-mr] wgtmac commented on pull request #1008: PARQUET-2212: Add ByteBuffer api for decryptors to allow direct memory to be decrypted
wgtmac commented on PR #1008:
URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1506231596

> Sure. What does one need to do? I believe all the comments are addressed and CI seems to be failing for unrelated reasons (is there a way to re-trigger the failed tests?).

Let's try rebasing and force-pushing to see whether the CI turns green.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (PARQUET-2266) Fix support for files without ColumnIndexes
[ https://issues.apache.org/jira/browse/PARQUET-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711640#comment-17711640 ]

ASF GitHub Bot commented on PARQUET-2266:
-----------------------------------------

wgtmac commented on PR #1048: URL: https://github.com/apache/parquet-mr/pull/1048#issuecomment-1506230575

@richardkerr Could you please provide your email address and the user name you would like to use? I can create a JIRA account for you and assign the JIRA to you.

> Fix support for files without ColumnIndexes
> -------------------------------------------
>
>                 Key: PARQUET-2266
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2266
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.12.3
>            Reporter: Gang Wu
>            Priority: Major
>             Fix For: 1.12.4, 1.14.0, 1.13.1
>
> Fix a failure when writing ColumnChunks that do not have a ColumnIndex populated. This was introduced by PARQUET-2081.
[GitHub] [parquet-mr] wgtmac commented on pull request #1048: PARQUET-2266: Fix support for files without ColumnIndexes
wgtmac commented on PR #1048:
URL: https://github.com/apache/parquet-mr/pull/1048#issuecomment-1506230575

@richardkerr Could you please provide your email address and the user name you would like to use? I can create a JIRA account for you and assign the JIRA to you.
[jira] [Commented] (PARQUET-2266) Fix support for files without ColumnIndexes
[ https://issues.apache.org/jira/browse/PARQUET-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711638#comment-17711638 ]

ASF GitHub Bot commented on PARQUET-2266:
-----------------------------------------

wgtmac commented on PR #1048: URL: https://github.com/apache/parquet-mr/pull/1048#issuecomment-1506227667

@richardkerr Sorry for the bad experience! Unfortunately, I do not have permission to receive the email for JIRA account requests. Can you help, @gszadovszky @shangxinli @ggershinsky? I have created a new JIRA and updated the title. I will backport it into the 1.12.4 and 1.13.1 branches as well.
[GitHub] [parquet-mr] wgtmac commented on pull request #1048: PARQUET-2266: Fix support for files without ColumnIndexes
wgtmac commented on PR #1048:
URL: https://github.com/apache/parquet-mr/pull/1048#issuecomment-1506227667

@richardkerr Sorry for the bad experience! Unfortunately, I do not have permission to receive the email for JIRA account requests. Can you help, @gszadovszky @shangxinli @ggershinsky? I have created a new JIRA and updated the title. I will backport it into the 1.12.4 and 1.13.1 branches as well.
[jira] [Created] (PARQUET-2266) Fix support for files without ColumnIndexes
Gang Wu created PARQUET-2266:
--------------------------------

             Summary: Fix support for files without ColumnIndexes
                 Key: PARQUET-2266
                 URL: https://issues.apache.org/jira/browse/PARQUET-2266
             Project: Parquet
          Issue Type: Bug
    Affects Versions: 1.12.3
            Reporter: Gang Wu
             Fix For: 1.12.4, 1.14.0, 1.13.1

Fix a failure when writing ColumnChunks that do not have a ColumnIndex populated. This was introduced by PARQUET-2081.
[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader
[ https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711588#comment-17711588 ]

ASF GitHub Bot commented on PARQUET-2149:
-----------------------------------------

parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1506058505

FWIW, the Hadoop 3.3.5 vector IO changes might make this PR redundant.

> Implement async IO for Parquet file reader
> ------------------------------------------
>
>                 Key: PARQUET-2149
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2149
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Parth Chandra
>            Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified):
> - For every column -> read from storage in 8 MB blocks -> read all uncompressed pages into an output queue
> - From the output queues -> (downstream) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked until the data has been read. Because a large part of the time is spent waiting for data from storage, threads are idle and CPU utilization is very low.
> There is no reason why this cannot be made asynchronous _and_ parallel. So for column _i_: read one chunk at a time from storage until the end -> intermediate output queue -> read one uncompressed page at a time until the end -> output queue -> (downstream) decompression + decoding.
> Note that this can be made completely self-contained in ParquetFileReader, and downstream implementations like Iceberg and Spark will automatically be able to take advantage of it without code changes, as long as the ParquetFileReader APIs are not changed.
> In past work with async IO ([Drill - async page reader|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]), I have seen 2x-3x improvement in reading speed for Parquet files.
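The asynchronous flow the issue describes — an I/O thread filling a queue while a downstream thread drains it for decompression and decoding — can be sketched with a BlockingQueue. This is a simplified, hypothetical illustration using only stdlib types; it is not the actual ParquetFileReader code, and the in-memory byte array stands in for storage:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: an I/O thread reads fixed-size chunks into a queue
// while the consumer (standing in for decompression + decoding) drains it
// concurrently, so downstream work overlaps with reading.
public class AsyncChunkPipeline {

    static List<byte[]> readChunks(byte[] column, int chunkSize) {
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
        ExecutorService io = Executors.newSingleThreadExecutor();
        int nChunks = (column.length + chunkSize - 1) / chunkSize;

        // Producer: simulate reading the column chunk from storage in blocks.
        io.submit(() -> {
            for (int off = 0; off < column.length; off += chunkSize) {
                int len = Math.min(chunkSize, column.length - off);
                queue.add(Arrays.copyOfRange(column, off, off + len));
            }
        });

        // Consumer: downstream decompression/decoding would happen per chunk
        // here, instead of waiting for the whole column to be read first.
        List<byte[]> pages = new ArrayList<>();
        try {
            for (int i = 0; i < nChunks; i++) {
                pages.add(queue.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } finally {
            io.shutdown();
        }
        return pages;
    }
}
```

The point of the design is that the consumer starts work on the first chunk while later chunks are still in flight, which is where the reported CPU-utilization win comes from.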
[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader
parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1506058505

FWIW, the Hadoop 3.3.5 vector IO changes might make this PR redundant.
[jira] [Commented] (PARQUET-2212) Add ByteBuffer api for decryptors to allow direct memory to be decrypted
[ https://issues.apache.org/jira/browse/PARQUET-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711587#comment-17711587 ]

ASF GitHub Bot commented on PARQUET-2212:
-----------------------------------------

parthchandra commented on PR #1008: URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1506055435

Sure. What does one need to do? I believe all the comments are addressed and CI seems to be failing for unrelated reasons (is there a way to re-trigger the failed tests?).
[GitHub] [parquet-mr] parthchandra commented on pull request #1008: PARQUET-2212: Add ByteBuffer api for decryptors to allow direct memory to be decrypted
parthchandra commented on PR #1008:
URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1506055435

Sure. What does one need to do? I believe all the comments are addressed and CI seems to be failing for unrelated reasons (is there a way to re-trigger the failed tests?).
RE: [C++] Parquet and Arrow overlap
On 2023/02/01 19:27:22 Will Jones wrote:
> Hello,
>
> A while back, the Parquet C++ implementation was merged into the Apache Arrow monorepo [1]. As I understand it, this helped the development process immensely. However, I am noticing some governance issues because of it.
>
> First, it's not obvious where issues are supposed to be opened: in the Parquet Jira or in Arrow GitHub issues. Looking back at some of the original discussion, it looks like the intention was:
>
> * use PARQUET-XXX for issues relating to Parquet core
> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet core (e.g. changes that are in parquet/arrow right now)
>
> The README of the old parquet-cpp repo [3] states instead in its migration note:
>
> > JIRA issues should continue to be opened in the PARQUET JIRA project.
>
> Either way, it doesn't seem like this process is obvious to people. Perhaps we could clarify this and add notices to Arrow's GitHub issue templates?
>
> Second, committer status is a little unclear. I am a committer on Arrow, but not on Parquet right now. Does that mean I should only merge Parquet C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge Parquet changes at all?
>
> Also, are the contributions to Arrow C++ Parquet being actively reviewed for potential new committers?
>
> Best,
> Will Jones
>
> [1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
> [2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
> [3] https://github.com/apache/parquet-cpp

Personally, I think the Parquet Jira is de facto only for the Parquet format and Java Parquet. The C++/Rust Parquet implementations are discussed in their own repos now.

Best,
Xuwei Fu
[GitHub] [parquet-site] charlesmahler commented on pull request #30: add InfluxDB blog post link about Parquet catalog
charlesmahler commented on PR #30:
URL: https://github.com/apache/parquet-site/pull/30#issuecomment-1505370275

@wgtmac yes please, no worries about the delay
[jira] [Commented] (PARQUET-2081) Encryption translation tool - Parquet-hadoop
[ https://issues.apache.org/jira/browse/PARQUET-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711417#comment-17711417 ]

ASF GitHub Bot commented on PARQUET-2081:
-----------------------------------------

richardkerr commented on PR #1048: URL: https://github.com/apache/parquet-mr/pull/1048#issuecomment-1505354950

I started committing against PARQUET-2081 because it was still open at the time, and since this issue was introduced by those changes, it seemed reasonable to me to attribute the fix to that ticket. Otherwise, I put in a request for a JIRA account almost a week ago with no feedback since, so I am unable to create anything. Can you create the ticket for me in the meantime? Or approve the JIRA access request?

> Encryption translation tool - Parquet-hadoop
> --------------------------------------------
>
>                 Key: PARQUET-2081
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2081
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-mr
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.3
>
> This implements the core part of the encryption translation tool in parquet-hadoop. After this, we will have another Jira/PR for parquet-cli to integrate with key tools for encryption properties.
[GitHub] [parquet-mr] richardkerr commented on pull request #1048: PARQUET-2081: Fix support for files without ColumnIndexes
richardkerr commented on PR #1048:
URL: https://github.com/apache/parquet-mr/pull/1048#issuecomment-1505354950

I started committing against PARQUET-2081 because it was still open at the time, and since this issue was introduced by those changes, it seemed reasonable to me to attribute the fix to that ticket. Otherwise, I put in a request for a JIRA account almost a week ago with no feedback since, so I am unable to create anything. Can you create the ticket for me in the meantime? Or approve the JIRA access request?
[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader
[ https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711388#comment-17711388 ]

ASF GitHub Bot commented on PARQUET-2149:
-----------------------------------------

steveloughran commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1505269164

@hazelnutsgz Hadoop 3.3.5 supports vector IO on an S3 stream: async, parallel fetch of blocks, which also works on the local filesystem (with GCS and ABFS as TODO items). We see significant performance increases there. There is a PR for it, though as it is 3.3.5+ only, it will not be merged into the ASF Parquet branches unless they move to that release, or until we finish a shim library offering (serialized) support for the API on older releases.
[GitHub] [parquet-mr] steveloughran commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader
steveloughran commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1505269164

@hazelnutsgz Hadoop 3.3.5 supports vector IO on an S3 stream: async, parallel fetch of blocks, which also works on the local filesystem (with GCS and ABFS as TODO items). We see significant performance increases there. There is a PR for it, though as it is 3.3.5+ only, it will not be merged into the ASF Parquet branches unless they move to that release, or until we finish a shim library offering (serialized) support for the API on older releases.
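The vector IO idea described above — hand the filesystem a list of (offset, length) ranges and let it fetch them asynchronously in parallel — can be illustrated with stdlib types. This is a hypothetical sketch over an in-memory "file"; the real Hadoop 3.3.5 API (vectored reads on its input streams) additionally coalesces nearby ranges and issues the reads against the actual store:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of vectored reads: each requested range is fetched on
// its own async task, so a columnar reader can request every column-chunk
// range it needs at once instead of issuing serial seek+read calls.
public class VectoredReadSketch {

    static class FileRange {
        final long offset;
        final int length;
        final CompletableFuture<byte[]> data = new CompletableFuture<>();
        FileRange(long offset, int length) { this.offset = offset; this.length = length; }
    }

    // "Read" the ranges from an in-memory file; a real implementation would
    // read from object storage and could coalesce adjacent ranges first.
    static void readVectored(byte[] file, List<FileRange> ranges) {
        for (FileRange r : ranges) {
            CompletableFuture.runAsync(() ->
                r.data.complete(Arrays.copyOfRange(file, (int) r.offset, (int) r.offset + r.length)));
        }
    }
}
```

The caller then joins each range's future as it needs the bytes, which is why this overlaps well with downstream decompression and decoding.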
[jira] [Commented] (PARQUET-1989) Deep verification of encrypted files
[ https://issues.apache.org/jira/browse/PARQUET-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711367#comment-17711367 ]

Steve Loughran commented on PARQUET-1989:
-----------------------------------------

You might want a design that can run the scan on a Spark RDD, where the RDD is simply the deep listFiles(path) scan of the directory tree. This would give the best scale for a massive dataset, compared to even a parallelised scan in a single process. I do have an RDD that can do line-by-line work, with locality determined per file, which lets you schedule the work on the HDFS nodes holding the data; unfortunately it needs to be in the o.a.spark package to build: https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala ...that could maybe be added to Spark itself.

> Deep verification of encrypted files
> ------------------------------------
>
>                 Key: PARQUET-1989
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1989
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cli
>            Reporter: Gidon Gershinsky
>            Assignee: Maya Anderson
>            Priority: Major
>             Fix For: 1.14.0
>
> A tool that verifies the encryption of Parquet files in a given folder. It analyzes the footer, and then every module (page headers, pages, column indexes, bloom filters), making sure they are encrypted (in the relevant columns), and potentially checking the encryption keys.
> We'll start with a design doc, open for discussion.
[jira] [Commented] (PARQUET-2198) Vulnerabilities in jackson-databind
[ https://issues.apache.org/jira/browse/PARQUET-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711234#comment-17711234 ]

ASF GitHub Bot commented on PARQUET-2198:
-----------------------------------------

nikhilenr commented on PR #1005: URL: https://github.com/apache/parquet-mr/pull/1005#issuecomment-1504802025

Hi all, the new parquet-jackson version is released, and the reported CVEs are resolved with v1.13.0: https://mvnrepository.com/artifact/org.apache.parquet/parquet-jackson/1.13.0 Thanks @shangxinli for fixing.

> Vulnerabilities in jackson-databind
> -----------------------------------
>
>                 Key: PARQUET-2198
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2198
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.12.3
>            Reporter: Łukasz Dziedziul
>            Priority: Major
>              Labels: jackson-databind, security, vulnerabilities
>             Fix For: 1.13.0
>
> Update jackson-databind to mitigate CVEs:
> * [CVE-2022-42003|https://github.com/advisories/GHSA-jjjh-jjxp-wpff] - https://nvd.nist.gov/vuln/detail/CVE-2022-42003
> * [CVE-2022-42004|https://github.com/advisories/GHSA-rgv9-q543-rqg4] - [https://nvd.nist.gov/vuln/detail/CVE-2022-42004 (fixed in 2.13.4)|https://nvd.nist.gov/vuln/detail/CVE-2022-42004]
[GitHub] [parquet-mr] nikhilenr commented on pull request #1005: PARQUET-2198 : Updating jackson data bind version to fix CVEs
nikhilenr commented on PR #1005:
URL: https://github.com/apache/parquet-mr/pull/1005#issuecomment-1504802025

Hi all, the new parquet-jackson version is released, and the reported CVEs are resolved with v1.13.0: https://mvnrepository.com/artifact/org.apache.parquet/parquet-jackson/1.13.0 Thanks @shangxinli for fixing.