Re: [jira] [Created] (PARQUET-2135) Performance optimizations: Merged all LittleEndianDataInputStream functionality into ByteBufferInputStream

2022-04-05 Thread Miller, Tim
Hi, I was wondering if anyone had any concerns or things they wanted to discuss about this proposed patch. Would you like some benchmarking results? I'm currently running the whole TPCDS suite in Trino, where I'm comparing with and without this patch. Also, are there any bugs in the JIRA that

Don't want to spam the JIRA

2022-04-05 Thread Miller, Tim
Hi, I had meant to just discuss my PR on the mailing list, but the mailing list software evidently detected that the email was associated with my JIRA entry and posted my email as a comment. I don't want to spam the JIRA. I'd delete the comment, but I'm not sure that I can. Sorry about

Re: Writing min/max for rowgroup to Parquet

2022-04-15 Thread Miller, Tim
Can you provide more information about how you're writing the files? Also, does this help? https://stackoverflow.com/questions/41700231/spark-parquet-statisticsmin-max-integration On 4/14/22, 8:42 PM, "p_agar...@yahoo.com.INVALID" wrote: CAUTION: This email originated from outside of the

Re: Parquet-cli throws on reading UUID values

2022-04-22 Thread Miller, Tim
Hi, This reminds me of some similar problems I've seen in the bug tracker. I suggest creating a JIRA ticket, with some instructions, and attaching a parquet file for others to look at. Also include how you did the writing. If you're linking ParquetMR to your own code, please include minimal

Re: Any doc/wiki/contribution guide?

2022-04-26 Thread Miller, Tim
Also, using the API is a pain, because you have to use Hadoop. Various people have found work-arounds for this, such as: Comments on: https://issues.apache.org/jira/browse/PARQUET-1822 I also assembled a minimal reader myself (from code I found elsewhere on github, which I should add

Re: Any doc/wiki/contribution guide?

2022-04-26 Thread Miller, Tim
not exist and that is not the purpose of parquet-mr? Thanks On Tue, Apr 26, 2022 at 9:37 PM Miller, Tim wrote: > > Also, using the API is a pain, because you have to use Hadoop. Various people have found work-arounds for this, such as: > Comments on: https://issues.apa

Re: Forward & Backwards Compatibility

2022-05-31 Thread Miller, Tim
You might also consider looking for fallback options. For instance, in https://github.com/apache/parquet-mr/pull/957, I figured out a good spot to catch the exception and then fall back to a converted schema. On 5/29/22, 1:53 PM, "Micah Kornfield" wrote: CAUTION: This email originated

Re: Bit-packing decode optimization on Parquet-mr

2022-05-26 Thread Miller, Tim
In my own profiling of ParquetMR (as it is used by Trino), I have also found these bit-packing methods to be a performance bottleneck. Of the existing ones, the ones that take an array are faster than the ones that take a ByteBuffer. It sure would be nice to have even faster ones! From: "Xie,

Spin off CLI into separate project?

2022-05-27 Thread Miller, Tim
I just wanted to bounce an idea off of everyone. One thing I notice is that there are certain bugs that show up when using the parquet-cli that don't show up when using it as an SDK in a Java program, even when reading the same files. There appears to be some duplicated code between the CLI and

Design doc on performance optimizations for ParquetMR byte buffer I/O

2022-04-28 Thread Miller, Tim
Hi, everyone, I've been working on adding some performance improvements to ParquetMR. During the last sync meeting, I was asked to write up a design doc that describes my plans for PRs related to this. Please feel free to email me or add comments to the google doc.

Re: Today's sync meeting will start ~15 minutes late

2022-04-27 Thread Miller, Tim
Is this something anyone can join? How? Thanks. On 4/27/22, 11:19 AM, "Xinli shang" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi all, Sorry

[jira] [Commented] (PARQUET-2135) Performance optimizations: Merged all LittleEndianDataInputStream functionality into ByteBufferInputStream

2022-04-05 Thread Miller, Tim (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517681#comment-17517681 ] Miller, Tim commented on PARQUET-2135: -- Hi, I was wondering if anyone had any concerns or things