I'll be missing the Google Meet sync-up this week, so I wanted to
share briefly where I think we are on PARQUET-41:
I think we have agreement on the form that the filters will take,
including the hash functions, but I believe we still don't have a
benchmark against dictionary filtering. Last I
, but -0 and +0 are
>> implementation-dependent, as Zoltan Borok-Nagy pointed out to me: "This
>> function is not required to be sensitive to the sign of zero, although some
>> implementations additionally enforce that if one argument is +0 and the
>> other is -0, then +0
> We could have a similar problem
> with not finding +0.0 values because a -0.0 is written to the max_value
> field by some component that considers them the same.
My hope is that the filtering would behave sanely, since -0.0 == +0.0
under the real-number-inspired ordering, which is distinguished
On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
wrote:
> I would just like to mention that the fmax() / fmin() functions in C/C++
> Math library follow the aforementioned IEEE 754-2008 min and max
> specification:
> http://en.cppreference.com/w/c/numeric/math/fmax
>
>
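To make the sign-of-zero concern in this thread concrete, here is a minimal sketch in Python (not taken from any Parquet implementation). The helper names sign_of_zero and may_contain are hypothetical, and Python's built-in max() stands in for a writer's max computation. It shows that the zero that ends up in a max_value field can depend on argument order, while a check based on ordinary numeric comparison still finds +0.0.

```python
import math
import struct


def sign_of_zero(x: float) -> str:
    """Return '+0' or '-0' for a zero value, based on its sign bit."""
    return "-0" if math.copysign(1.0, x) < 0 else "+0"


def may_contain(min_value: float, max_value: float, probe: float) -> bool:
    """Hypothetical statistics check using ordinary numeric comparison."""
    return min_value <= probe <= max_value


# The two zeros compare equal, but their 8-byte encodings differ.
zeros_equal = (-0.0 == 0.0)
encodings_equal = struct.pack("<d", -0.0) == struct.pack("<d", 0.0)

# Python's max() returns the first of several equal arguments, so the
# sign of the stored zero depends on the order in which values arrive.
stored_a = max(-0.0, 0.0)
stored_b = max(0.0, -0.0)

# A numeric range check finds +0.0 even when -0.0 was stored as the max.
found = may_contain(-1.0, -0.0, 0.0)
```

A reader that compared the encoded bytes instead of the numeric values would treat the two zeros as different, which is the failure mode described above.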
+0, non-binding.
Junjie and I spent a lot of time getting the C++ code to where it is now, but
all three patches (Java, -format, C++) could use some more work before I'm
fully confident we're in a good place. In particular, the code for integrating
the existing patches in with readers and
On 2018/10/08 22:08:16, Julien Le Dem wrote:
> it's a variation of bit packing. right?
I looked into it on
https://github.com/apache/parquet-format/blob/master/Encodings.md and I believe
that the Horizontal Bit-Parallel encoding in the paper is a variant on bit
packing. There are three
> For Vertical Bit-Parallel (VBP), I think the reason why I didn't think it
> would be useful for Parquet is that it is really expensive to produce and
> really expensive to reconstruct values that aren't filtered out.
Yes - you can see in Figure 12(a) that the aggregation time went up for the
> For Vertical Bit-Parallel (VBP), I think the reason why I didn't think it
> would be useful for Parquet is that it is really expensive to produce and
> really expensive to reconstruct values that aren't filtered out.
Julien, this would be a thing I think the list would love to hear from Jignesh
The BitWeaving paper from a few years ago demonstrates some large performance
wins in predicate evaluation based partially on reconfiguring the storage
layout:
http://pages.cs.wisc.edu/~jignesh/publ/BitWeaving.pdf
Is it technically possible for Parquet to support "Vertical Bit-Parallel"
> That sounds like an interesting possibility. It's not that fresh in my mind
> but I'd say from the storage perspective it's a variation of bit packing.
> right?
I'm not familiar with bit packing, so I'd have to look into that. I found the
paper readable enough at the time that I didn't end up
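For anyone else unfamiliar with the term, a minimal sketch of plain bit packing follows. It is a generic least-significant-bit-first packer written for illustration; the function names pack_bits and unpack_bits are hypothetical, and nothing here is claimed to reproduce the BitWeaving layouts or any Parquet library's code.

```python
def pack_bits(values, width):
    """Pack non-negative integers into bytes, `width` bits each, LSB first."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for v in values:
        if v < 0 or v >> width:
            raise ValueError(f"{v} does not fit in {width} bits")
        acc |= v << nbits
        nbits += width
        # Flush every complete byte.
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial byte, zero padded
    return bytes(out)


def unpack_bits(data, width, count):
    """Inverse of pack_bits: recover `count` values of `width` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


packed = pack_bits(range(8), 3)          # eight 3-bit values fit in 3 bytes
restored = unpack_bits(packed, 3, 8)
```

Here eight values that would take 32 bytes as 32-bit integers occupy 3 bytes, and each value stays contiguous in the bit stream.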
On 2018/08/30 19:41:59, Ryan Blue wrote:
> Jim, do you think that the implementation is going to make major changes to
> the design of how bloom filters are stored in files?
I don't foresee any problems with the current layout.
> A bit of brainstorming (just some ideas that may or may not be useful): One
> more thing to consider is whether some smart encoding of the bit vector
> would help saving space. I expect the entropy of a nearly empty or nearly
> full bloom filter to be relatively low, because they consist mostly
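A rough way to quantify the entropy intuition above, as a sketch under a simplifying assumption: if each bit of the filter is treated as independently set with probability p (the fill fraction), the information content per bit is the binary entropy H(p). The function name binary_entropy is hypothetical, and real filters only approximate the independence assumption.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of one bit that is set with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a constant bit carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Lower bound on the compressed size of a filter as a fraction of its
# raw size, for a few fill fractions, under the independence assumption.
fractions = {p: binary_entropy(p) for p in (0.01, 0.1, 0.5, 0.9, 0.99)}
```

Under this model a half-full filter is incompressible, while a filter that is 1% or 99% full could in principle shrink to less than a tenth of its raw size.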
> For Wes's concern, I think since the implementation is not yet ready, only
What part of the implementation is not yet ready? It's all checked in to the
bloom-filter branches, right?
> 2 things need to happen first:
> #1, we need to vote on the proposed format changes to adopt this as part of
> the spec, and #2, we are trying to get a release out that writes the new
> column index structures.
>
> If you think the proposal is ready, we could start a vote for it.
I think the
On 2019/05/31 16:01:54, Ryan Blue wrote:
> -1
>
> Junjie, I think we need to vote to adopt the proposed spec before
> committing code that implements it.
Ryan, it seems like Junjie and I think that the spec has already been adopted
and is in the repo:
I think this message and your last message to the list had no text in them. You
can see at https://lists.apache.org/list.html?dev@parquet.apache.org
On 2019/05/31 17:03:21, cjjnj...@gmail.com wrote:
>
https://github.com/apache/parquet-format/commit/f0eab9d64c3563e14cf2c4959f345372e1ba0c8f
is now merged. Todd's xxhash proposal hasn't received an update recently; I
was convinced by Junjie's argument that extensibility allows us to add it later.
Should we discuss a parquet-format release?
> Actually there is a repo at https://github.com/apache/parquet-testing that
> may be used for making sure that the Java, C++ and other implementations
> are interoperable.
Ah, yes, and it looks like a Bloom filter data file is present:
This is a thread for discussing a release of parquet-format. The last release
appears to be 2.6.0 from September 2018:
https://github.com/apache/parquet-format/releases
The diff from then until now is
> Regarding your question, I don't have an opinion on 1, but I think 2 is
> very important. In the end, the parquet format is nothing more than a
> couple of Thrift definitions. I would suggest writing good unit tests to
> ensure that the bloom filters behave in the same manner.
I agree, it is
> Looks like we don't have any blocking issue, since there has been no update in
> the Jira (https://jira.apache.org/jira/browse/PARQUET-1608) for about one week.
> Can we start a release vote?
Even _starting_ a release vote appears to require a committer to do some prep
work first.
> I think we need to have a vote on the bloom filter
> structures first. We need to make sure that the community has vetted the
> design and is comfortable with adding this, just like we did with the
> Parquet encryption design and the page index design.
Thank you for the note, Ryan. Based on my
We've got +1's from Zoltan and Gabor. Ryan, you've committed a few BF patches
that were written in response to your feedback on this list. Are you in a
position to vote +1 now, or do you have further concerns we could address?
On 2019/07/31 02:17:15, 俊杰陈 wrote:
> Dear Parquet developers
>
>
Hello! The patches we were waiting for on format 2.7.0 are now in: Bloom
filters (with modifications made during the voting process), encryption, and
the release notes:
https://github.com/apache/parquet-format/commit/6d91c8839a712a87161547b85d99344b87d2e394
I'd like to start a release discussion. Note
On 2019/09/18 20:52:30, Wes McKinney wrote:
> We have just worked to move almost all hashing in Apache Arrow to xxh3
> -- I may have lost it in the mix, but are we dropping murmur3?
Yep:
On 2019/08/03 20:42:10, Ryan Blue wrote:
>- Should the bloom filters support compression? If the strategy for a
>lower false-positive rate is to under-fill the multi-block filter, then
>would compression help reduce the storage cost?
Eventually, it might be useful to support
On 2019/08/05 18:05:53, Ryan Blue wrote:
> At least getting a compression union into the bloom filter header
> will help us with compatibility later if we choose to add compression
> schemes.
That's very reasonable. I'll send a PR for that after
I believe this was voted on a few months ago and approved by Zoltan last week.
If someone could merge it, that would get us one step closer to a
parquet-format release.
On 2019/07/17 17:59:06, Xinli shang wrote:
> Gidon pointed out that the encryption parquet-format PR is the
The vote to release parquet-format 2.7.0 passed with three binding +1 votes and
no binding -1 votes or 0 votes:
https://lists.apache.org/thread.html/ccd7db9fa89e5d34e18a84a74cf823f8eacd4474c25cadd07d30cdbe@%3Cdev.parquet.apache.org%3E
I'll now be following
Hi everyone,
I propose the following RC to be released as the official Apache Parquet Format
2.7.0 release.
The commit id is ee5cae066ed602bd969024eb308c5262c451b6cd
* This corresponds to the tag: apache-parquet-format-2.7.0
*
On 2019/06/28 16:43:23, Ryan Blue wrote:
> I agree with Zoltan. Since we want to ensure compatibility, it would be
> better to choose the best option now instead of making everyone support two
> options forever.
I'd guess there probably isn't a single best option. I suspect there's a
tradeoff
The same is happening to me. Additionally, one of the toll-free phone numbers
did not pick up.
No outages I see: https://statusgator.com/services/zoom, https://status.zoom.us/
On 2019/11/21 17:06:56, Gabor Szadovszky wrote:
> Hi,
>
> Is it just me who cannot join the meeting? It says
Hello! Can a committer take a look at
https://github.com/apache/parquet-mr/pull/686? This makes some changes to get
the most recent changes to the Bloom filters that went into the 2.7.0 format
release into parquet-mr.
It LGTM, but a committer is needed here, and someone who knows the codebase
I'm pleased to announce the release of Parquet format 2.7.0!
Parquet is a general-purpose columnar file format for nested data. It uses
space-efficient encodings and a compressed and splittable structure for
processing frameworks like Hadoop.
Changes are listed at:
[
https://issues.apache.org/jira/browse/PARQUET-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086000#comment-16086000
]
Jim Apple commented on PARQUET-319:
---
[~Ferd], there is also
https://docs.google.com/document/d
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093432#comment-16093432
]
Jim Apple commented on PARQUET-41:
--
We might want to consider "Cache-, Hash- and Space-Efficient
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095685#comment-16095685
]
Jim Apple commented on PARQUET-41:
--
In response to your request for a benchmark, see
https://github.com
[
https://issues.apache.org/jira/browse/PARQUET-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197185#comment-16197185
]
Jim Apple commented on PARQUET-1125:
Or maybe a 16-byte type, generally, not just for UUIDs.
>
[
https://issues.apache.org/jira/browse/PARQUET-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197218#comment-16197218
]
Jim Apple commented on PARQUET-1125:
What kind of expectations would a hash digest type have
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486884#comment-16486884
]
Jim Apple commented on PARQUET-41:
--
In response to [~junjie]'s question above, "Sure, it is fea
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513883#comment-16513883
]
Jim Apple edited comment on PARQUET-41 at 6/15/18 2:28 PM:
---
I took a look
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517298#comment-16517298
]
Jim Apple commented on PARQUET-41:
--
Is there an updated PR for parquet-format that matches the open PRs
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517298#comment-16517298
]
Jim Apple edited comment on PARQUET-41 at 6/19/18 8:19 PM:
---
[~junjie
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338244#comment-16338244
]
Jim Apple commented on PARQUET-41:
--
IIRC, there was a plan to create an end-to-end benchmark of an MR
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338680#comment-16338680
]
Jim Apple commented on PARQUET-41:
--
Could you elaborate on "A column with large cardinality can not
[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338709#comment-16338709
]
Jim Apple commented on PARQUET-41:
--
Why not tweak that logic in parquet-mr to allow dictionary encoding
[
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373429#comment-16373429
]
Jim Apple commented on PARQUET-1222:
I do not think the order proposed matches the IEEE 754
[
https://issues.apache.org/jira/browse/PARQUET-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872878#comment-16872878
]
Jim Apple commented on PARQUET-319:
---
[~rdblue], could you resolve this with the fix version "f
[
https://issues.apache.org/jira/browse/PARQUET-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899713#comment-16899713
]
Jim Apple commented on PARQUET-1630:
The post in question:
[https://lists.apache.org/thread.html
Jim Apple created PARQUET-1674:
--
Summary: The announcement email on the web site does not comply
with ASF rules
Key: PARQUET-1674
URL: https://issues.apache.org/jira/browse/PARQUET-1674
Project: Parquet
Jim Apple created PARQUET-1675:
--
Summary: Switch to git for website
Key: PARQUET-1675
URL: https://issues.apache.org/jira/browse/PARQUET-1675
Project: Parquet
Issue Type: Improvement