Missing sync-up this week; Bloom filter status

2017-08-14 Thread Jim Apple
I'll be missing the Google Meet sync-up this week, so I wanted to share briefly where I think we are on PARQUET-41: I think we have agreement on the form that the filters will take, including the hash functions, but I believe we still don't have a benchmark against dictionary filtering. Last I

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-20 Thread Jim Apple
, but -0 and +0 are >> implementation-dependent, as Zoltan Borok-Nagy pointed it out to me: "This >> function is not required to be sensitive to the sign of zero, although some >> implementations additionally enforce that if one argument is +0 and the >> other is -0, then +0

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Jim Apple
> We could have a similar problem > with not finding +0.0 values because a -0.0 is written to the max_value > field by some component that considers them the same. My hope is that the filtering would behave sanely, since -0.0 == +0.0 under the real-number-inspired ordering, which is distinguished
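To make the signed-zero point concrete, here is a minimal sketch in Python. It assumes a hypothetical `in_range` helper that stands in for a min/max statistics check; it is not taken from any Parquet implementation. It shows that -0.0 and +0.0 compare equal under ordinary numeric comparison even though their IEEE 754 bit patterns differ, so a reader that compares numerically still finds +0.0 when -0.0 was written as the maximum.

```python
import struct


def float_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def in_range(value: float, min_value: float, max_value: float) -> bool:
    """Hypothetical statistics check using ordinary numeric comparison."""
    return min_value <= value <= max_value


neg_zero, pos_zero = -0.0, 0.0

# Numerically the two zeros are equal ...
equal_numerically = neg_zero == pos_zero

# ... but their bit patterns differ in the sign bit.
same_bits = float_bits(neg_zero) == float_bits(pos_zero)

# A writer stored -0.0 as max_value; a numeric comparison still finds +0.0.
found = in_range(pos_zero, -1.0, neg_zero)
```

A reader that compared bit patterns or used a total order in which -0.0 sorts before +0.0 would behave differently, which is the risk the quoted message describes.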

Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

2018-02-16 Thread Jim Apple
On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy wrote: > I would just like to mention that the fmax() / fmin() functions in C/C++ > Math library follow the aforementioned IEEE 754-2008 min and max > specification: > http://en.cppreference.com/w/c/numeric/math/fmax > >

Re: [VOTE] Finalizing the design and moving forward to read/write implementation

2018-08-30 Thread Jim Apple
+0, non-binding. Junjie and I spent a lot of time getting the C++ code to where it is now, but all three patches (Java, -format, C++) could use some more work before I'm fully confident we're in a good place. In particular, the code for integrating the existing patches in with readers and

Re: BitWeaving in Parquet?

2018-10-14 Thread Jim Apple
On 2018/10/08 22:08:16, Julien Le Dem wrote: > it's a variation of bit packing. right? I looked into it on https://github.com/apache/parquet-format/blob/master/Encodings.md and I believe that the Horizontal Bit-Parallel encoding in the paper is a variant on bit packing. There are three
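For readers unfamiliar with bit packing, the following is a simplified sketch of the general idea: each value is stored in a fixed number of bits, back to back, with no padding between values. The function names `pack_bits` and `unpack_bits` are illustrative, and the least-significant-bit-first layout is an assumption of this sketch; consult Encodings.md for the exact layouts Parquet uses.

```python
def pack_bits(values, width):
    """Pack non-negative integers into bytes, `width` bits each, LSB first."""
    acc = 0
    nbits = 0
    for v in values:
        if not 0 <= v < (1 << width):
            raise ValueError(f"{v} does not fit in {width} bits")
        acc |= v << nbits  # place the value just above the bits written so far
        nbits += width
    return acc.to_bytes((nbits + 7) // 8, "little")


def unpack_bits(data, width, count):
    """Inverse of pack_bits: recover `count` values of `width` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


packed = pack_bits(range(8), 3)       # eight 3-bit values fit in 3 bytes
restored = unpack_bits(packed, 3, 8)  # round-trips to the original values
```

Here eight values that would occupy 32 bytes as 32-bit integers occupy 3 bytes once packed at 3 bits each.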

Re: BitWeaving in Parquet?

2018-10-23 Thread Jim Apple
> For Vertical Bit-Parallel (VBP), I think the reason why I didn't think it > would be useful for Parquet is that it is really expensive to produce and > really expensive to reconstruct values that aren't filtered out. Yes - you can see in Figure 12(a) that the aggregation time went up for the

Re: BitWeaving in Parquet?

2018-10-23 Thread Jim Apple
> For Vertical Bit-Parallel (VBP), I think the reason why I didn't think it > would be useful for Parquet is that it is really expensive to produce and > really expensive to reconstruct values that aren't filtered out. Julien, this would be a thing I think the list would love to hear from Jignesh

BitWeaving in Parquet?

2018-10-08 Thread Jim Apple
The BitWeaving paper from a few years ago demonstrates some large performance wins in predicate evaluation based partially on reconfiguring the storage layout: http://pages.cs.wisc.edu/~jignesh/publ/BitWeaving.pdf Is it technically possible for Parquet to support "Vertical Bit-Parallel"

Re: BitWeaving in Parquet?

2018-10-08 Thread Jim Apple
> That sounds like an interesting possibility. It's not that fresh in my mind > but I'd say from the storage perspective it's a variation of bit packing. > right? I'm not familiar with bit packing, so I'd have to look into that. I found the paper readable enough at the time that I didn't end up

Re: [VOTE] Finalizing the design and moving forward to read/write implementation

2018-08-31 Thread Jim Apple
On 2018/08/30 19:41:59, Ryan Blue wrote: > Jim, do you think that the implementation is going to make major changes to > the design of how bloom filters are stored in files? I don't foresee any problems with the current layout.

Re: [Discussion] How to build bloom filter in parquet

2019-03-04 Thread Jim Apple
> A bit of brainstorming (just some ideas that may or may not be useful): One > more thing to consider is whether some smart encoding of the bit vector > would help saving space. I expect the entropy of a nearly empty or nearly > full bloom filter to be relatively low, because they consist mostly
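The entropy claim in the quoted text can be checked with the binary entropy function. This is a small sketch under the simplifying assumption that each bit of the filter is set independently with probability p; `binary_entropy` is an illustrative name, not part of any Parquet library.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a single bit that is set with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a bit that is always 0 or always 1 carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


half_full = binary_entropy(0.5)      # maximal: one full bit per stored bit
nearly_empty = binary_entropy(0.01)  # far below one bit per stored bit
nearly_full = binary_entropy(0.99)   # same as nearly_empty, by symmetry
```

Under that independence assumption, an ideal compressor could shrink a filter with 1% of its bits set to under a tenth of its size, while a half-full filter is essentially incompressible.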

Re: [vote] Merge bloom-filter branch to master

2019-06-04 Thread Jim Apple
> For Wes's concern, I think since the implementation is not yet ready, only What part of the implementation is not yet ready? It's all checked in to the bloom-filter branches, right?

Re: Plan to merge bloom filter branch

2019-05-30 Thread Jim Apple
> 2 things need to happen first: > #1, we need to vote on the proposed format changes to adopt this as part of > the spec, and #2, we are trying to get a release out that writes the new > column index structures. > > If you think the proposal is ready, we could start a vote for it. I think the

Re: [vote] Merge bloom-filter branch to master

2019-06-07 Thread Jim Apple
On 2019/05/31 16:01:54, Ryan Blue wrote: > -1 > > Junjie, I think we need to vote to adopt the proposed spec before > committing code that implements it. Ryan, it seems like Junjie and I think that the spec has already been adopted and is in the repo:

Re: [vote] Merge bloom-filter branch to master

2019-06-03 Thread Jim Apple
I think this message and your last message to the list had no text in them. You can see at https://lists.apache.org/list.html?dev@parquet.apache.org On 2019/05/31 17:03:21, cjjnj...@gmail.com wrote: >

Re: [vote] Merge bloom-filter branch to master

2019-06-14 Thread Jim Apple
https://github.com/apache/parquet-format/commit/f0eab9d64c3563e14cf2c4959f345372e1ba0c8f is now merged. Todd's xxhash proposal hasn't received an update recently; I was convinced by Junjie's argument that extensibility allows us to add it later. Should we discuss a parquet-format release?

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

2019-06-24 Thread Jim Apple
> Actually there is a repo at https://github.com/apache/parquet-testing that > may be used for making sure that the Java, C++ and other implementations > are interoperable. Ah, yes, and it looks like a Bloom filter data file is present:

[DISCUSS] Prepare release for parquet-format 2.7.0?

2019-06-19 Thread Jim Apple
This is a thread for discussing a release of parquet-format. The last release appears to be 2.6.0 from September 2018: https://github.com/apache/parquet-format/releases The diff from then until now is

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

2019-06-20 Thread Jim Apple
> Regarding your question, I don't have an opinion on 1, but I think 2 is > very important. In the end, the parquet format is nothing more than a > couple of Thrift definitions. I would suggest writing good unit tests to > ensure that the bloom filters behave in the same manner. I agree, it is

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

2019-06-27 Thread Jim Apple
> Looks like we don't have any blocking issue since there is no update in the > Jira(https://jira.apache.org/jira/browse/PARQUET-1608) about one week. Can > we start a release vote? Even _starting_ a release vote appears to require a committer to do some prep work first.

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

2019-06-27 Thread Jim Apple
> I think we need to have a vote on the bloom filter > structures first. We need to make sure that the community has vetted the > design and is comfortable with adding this, just like we did with the > Parquet encryption design and the page index design. Thank you for the note, Ryan. Based on my

Re: [VOTE] Parquet Bloom filter spec sign-off

2019-08-28 Thread Jim Apple
We've got +1's from Zoltan and Gabor. Ryan, you've committed a few BF patches that were written in response to your feedback on this list. Are you in a position to vote +1 now, or do you have further concerns we could address? On 2019/07/31 02:17:15, 俊杰陈 wrote: > Dear Parquet developers > >

[DISCUSS] Release Apache Parquet Format Version 2.7.0

2019-09-16 Thread Jim Apple
Hello! The patches we were waiting for on format 2.7.0 are now in: Bloom filters (with modifications during voting process), encryption, and the release notes: https://github.com/apache/parquet-format/commit/6d91c8839a712a87161547b85d99344b87d2e394 I'd like to start a release discussion. Note

Re: [VOTE] Parquet Bloom filter spec sign-off

2019-09-18 Thread Jim Apple
On 2019/09/18 20:52:30, Wes McKinney wrote: > We have just worked to move almost all hashing in Apache Arrow to xxh3 > -- I may have lost it in the mix, but are we dropping murmur3? Yep:

[VOTE] Parquet Bloom filter spec sign-off

2019-08-04 Thread Jim Apple
On 2019/08/03 20:42:10, Ryan Blue wrote: >- Should the bloom filters support compression? If the strategy for a >lower false-positive rate is to under-fill the multi-block filter, then >would compression help reduce the storage cost? Eventually, it might be useful to support
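As a rough illustration of why under-filling lowers the false-positive rate, the sketch below uses the textbook approximation for a classic Bloom filter with m bits, k hash functions, and n inserted keys. This is an assumption for illustration only: the block-based filter discussed in this thread has a somewhat different false-positive rate, and the function names are hypothetical.

```python
import math


def fill_fraction(n: int, m: int, k: int) -> float:
    """Approximate fraction of bits set after inserting n keys (classic filter)."""
    return 1.0 - math.exp(-k * n / m)


def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false-positive rate: all k probed bits happen to be set."""
    return fill_fraction(n, m, k) ** k


# Same keys and hash count, but the second filter has four times the bits.
dense_fill = fill_fraction(1000, 8000, 8)
dense_fpr = false_positive_rate(1000, 8000, 8)
sparse_fill = fill_fraction(1000, 32000, 8)
sparse_fpr = false_positive_rate(1000, 32000, 8)
```

The larger filter has a smaller fraction of set bits and a much lower false-positive rate, at the cost of more storage, which is the trade-off the compression question is about.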

Re: [VOTE] Parquet Bloom filter spec sign-off

2019-08-06 Thread Jim Apple
On 2019/08/05 18:05:53, Ryan Blue wrote: > At least getting a compression union into the bloom filter header > will help us with compatibility later if we choose to add compression > schemes. That's very reasonable. I'll send a PR for that after

Re: Parquet Sync Meeting Notes

2019-07-19 Thread Jim Apple
I believe this has now been both voted on a few months ago and approved by Zoltan last week. If someone could merge it, that would get us one step closer to a parquet-format release. On 2019/07/17 17:59:06, Xinli shang wrote: > Gidon pointed out that the encryption parquet-format PR is the

[RESULT] Release Apache Parquet Format 2.7.0 RC0

2019-09-29 Thread Jim Apple
The vote to release parquet-format 2.7.0 passed with three binding +1 votes and no binding -1 votes or 0 votes: https://lists.apache.org/thread.html/ccd7db9fa89e5d34e18a84a74cf823f8eacd4474c25cadd07d30cdbe@%3Cdev.parquet.apache.org%3E I'll now be following

[VOTE] Release Apache Parquet Format 2.7.0 RC0

2019-09-25 Thread Jim Apple
Hi everyone, I propose the following RC to be released as the official Apache Parquet Format 2.7.0 release. The commit id is ee5cae066ed602bd969024eb308c5262c451b6cd * This corresponds to the tag: apache-parquet-format-2.7.0 *

Re: [DISCUSS] Prepare release for parquet-format 2.7.0?

2019-06-30 Thread Jim Apple
On 2019/06/28 16:43:23, Ryan Blue wrote: > I agree with Zoltan. Since we want to ensure compatibility, it would be > better to choose the best option now instead of making everyone support two > options forever. I'd guess there probably isn't a single best option. I suspect there's a tradeoff

Re: Parquet sync zoom - invalid meeting ID

2019-11-21 Thread Jim Apple
The same is happening to me. Additionally, one of the toll-free phone numbers did not pick up. No outages I see: https://statusgator.com/services/zoom, https://status.zoom.us/ On 2019/11/21 17:06:56, Gabor Szadovszky wrote: > Hi, > > Is it just me who cannot join to the meeting? It says

Most recent Bloom filter format changes: into parquet-mr

2019-12-15 Thread Jim Apple
Hello! Can a committer take a look at https://github.com/apache/parquet-mr/pull/686? This makes some changes to get the most recent changes to the Bloom filters that went into the 2.7.0 format release into parquet-mr. It LGTM, but a committer is needed here, and someone who knows the codebase

[ANNOUNCE] Apache Parquet format release 2.7.0

2019-10-04 Thread Jim Apple
I'm pleased to announce the release of Parquet format 2.7.0! Parquet is a general-purpose columnar file format for nested data. It uses space-efficient encodings and a compressed and splittable structure for processing frameworks like Hadoop. Changes are listed at:

[jira] [Commented] (PARQUET-319) Define the parquet bloom filter statistics in parquet format

2017-07-13 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086000#comment-16086000 ] Jim Apple commented on PARQUET-319: --- [~Ferd], there is also https://docs.google.com/document/d

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2017-07-19 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093432#comment-16093432 ] Jim Apple commented on PARQUET-41: -- We might want to consider "Cache-, Hash- and Space-Efficient

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2017-07-20 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095685#comment-16095685 ] Jim Apple commented on PARQUET-41: -- In response to your request for a benchmark, see https://github.com

[jira] [Commented] (PARQUET-1125) Add UUID logical type

2017-10-09 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197185#comment-16197185 ] Jim Apple commented on PARQUET-1125: Or maybe a 16-byte type, generally, not just for UUIDs. >

[jira] [Commented] (PARQUET-1125) Add UUID logical type

2017-10-09 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197218#comment-16197218 ] Jim Apple commented on PARQUET-1125: What kind of expectations would a hash digest type have

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2018-05-23 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486884#comment-16486884 ] Jim Apple commented on PARQUET-41: -- In response to [~junjie]'s question above, "Sure, it is fea

[jira] [Comment Edited] (PARQUET-41) Add bloom filters to parquet statistics

2018-06-15 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513883#comment-16513883 ] Jim Apple edited comment on PARQUET-41 at 6/15/18 2:28 PM: --- I took a look

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2018-06-19 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517298#comment-16517298 ] Jim Apple commented on PARQUET-41: -- Is there an updated PR for parquet-format that matches the open PRs

[jira] [Comment Edited] (PARQUET-41) Add bloom filters to parquet statistics

2018-06-19 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517298#comment-16517298 ] Jim Apple edited comment on PARQUET-41 at 6/19/18 8:19 PM: --- [~junjie

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2018-01-24 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338244#comment-16338244 ] Jim Apple commented on PARQUET-41: -- IIRC, there was a plan to create an end-to-end benchmark of an MR

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2018-01-24 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338680#comment-16338680 ] Jim Apple commented on PARQUET-41: -- Could you elaborate on "A column with large cardinality can not

[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

2018-01-24 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338709#comment-16338709 ] Jim Apple commented on PARQUET-41: -- Why not tweak that logic in parquet-mr to allow dictionary encoding

[jira] [Commented] (PARQUET-1222) Definition of float and double sort order is ambigious

2018-02-22 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373429#comment-16373429 ] Jim Apple commented on PARQUET-1222: I do not think the order proposed matches the IEEE 754

[jira] [Commented] (PARQUET-319) Define the parquet bloom filter statistics in parquet format

2019-06-25 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872878#comment-16872878 ] Jim Apple commented on PARQUET-319: --- [~rdblue], could you resolve this with the fix version "f

[jira] [Commented] (PARQUET-1630) Resolve Bloom filter spec concerns

2019-08-04 Thread Jim Apple (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899713#comment-16899713 ] Jim Apple commented on PARQUET-1630: The post in question:  [https://lists.apache.org/thread.html

[jira] [Created] (PARQUET-1674) The announcement email on the web site does not comply with ASF rules

2019-10-04 Thread Jim Apple (Jira)
Jim Apple created PARQUET-1674: -- Summary: The announcement email on the web site does not comply with ASF rules Key: PARQUET-1674 URL: https://issues.apache.org/jira/browse/PARQUET-1674 Project: Parquet

[jira] [Created] (PARQUET-1675) Switch to git for website

2019-10-04 Thread Jim Apple (Jira)
Jim Apple created PARQUET-1675: -- Summary: Switch to git for website Key: PARQUET-1675 URL: https://issues.apache.org/jira/browse/PARQUET-1675 Project: Parquet Issue Type: Improvement