[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276914#comment-17276914 ]

ASF GitHub Bot commented on PARQUET-1950:

ggershinsky commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771442874

To add the Parquet encryption angle to this discussion: this feature protects the confidentiality and integrity of Parquet files (when they have columns with sensitive data). These security layers will make it difficult to support many of the legacy features mentioned above, such as external chunks or merging multiple files into a single master file (which interferes with the definition of file integrity). Reading encrypted data before file writing is finished is also difficult. None of these are impossible, but they are challenging and would require explicit scaffolding plus some Thrift format changes. If there is strong demand for using encryption with these legacy features despite their deprecation (or with some of the mentioned new features), we can plan this for future versions of parquet-format, parquet-mr, etc.

> Define core features / compliance level
> ---
> Key: PARQUET-1950
> URL: https://issues.apache.org/jira/browse/PARQUET-1950
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
>
> The Parquet format is gaining more and more features, while the different implementations cannot keep pace and are left behind, with some features implemented and some not. In many cases it is also not clear whether a given feature is mature enough to be used widely or is more of an experimental one.
> These are serious issues that make it hard to ensure interoperability between the different implementations.
> The following idea came up in a [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E].
> Create a new document in the parquet-format repository that lists the "core features". This document is versioned by the parquet-format releases; this way, a certain version of the "core features" defines a level of compatibility between the different implementations. The version number can be written to a new field (e.g. complianceLevel) in the footer. If an implementation writes a file with a version in this field, it must implement all the related "core features" (read and write) and must not use any other features at write time, because that would make the data unreadable by another implementation that only implements the same level of "core features".
> For example, if encoding A is listed in the version 1 "core features" but encoding B is not, then at "complianceLevel = 1" we can use encoding A but not encoding B, because it would make the related data unreadable.
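For readers wondering what the encryption layer discussed in the comment above looks like from parquet-mr, the sketch below shows roughly how a footer key and a per-column key are assembled into encryption properties. This is a minimal illustration, assuming the low-level FileEncryptionProperties/ColumnEncryptionProperties builders shipped with parquet-mr 1.12; the hard-coded keys and the "ssn" column name are placeholders (real deployments would fetch keys from a KMS), and the exact builder methods should be checked against the release.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.crypto.ColumnEncryptionProperties;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.hadoop.metadata.ColumnPath;

public class EncryptionPropsSketch {
  public static FileEncryptionProperties build() {
    // 16-byte AES keys, hard-coded only for illustration.
    byte[] footerKey = "0123456789012345".getBytes(StandardCharsets.UTF_8);
    byte[] columnKey = "1234567890123450".getBytes(StandardCharsets.UTF_8);

    // Give the sensitive column (hypothetical name "ssn") its own key.
    ColumnEncryptionProperties ssnProps =
        ColumnEncryptionProperties.builder("ssn").withKey(columnKey).build();
    Map<ColumnPath, ColumnEncryptionProperties> columns = new HashMap<>();
    columns.put(ColumnPath.get("ssn"), ssnProps);

    // The footer key is also what provides file integrity: the footer and
    // module metadata are protected with AES-GCM, which is why column chunks
    // living in external files are hard to reconcile with this model.
    return FileEncryptionProperties.builder(footerKey)
        .withEncryptedColumns(columns)
        .build();
  }
}
{code}

The resulting properties object is then handed to the file writer; parquet-mr 1.12 exposes an encryption hook on its writer builders for this purpose.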
[jira] [Updated] (PARQUET-1967) Upgrade Zstd-jni to 1.4.8-3
[ https://issues.apache.org/jira/browse/PARQUET-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated PARQUET-1967:
Summary: Upgrade Zstd-jni to 1.4.8-3 (was: Upgrade Zstd-jni to 1.4.8-2)

> Upgrade Zstd-jni to 1.4.8-3
> ---
> Key: PARQUET-1967
> URL: https://issues.apache.org/jira/browse/PARQUET-1967
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cli, parquet-mr
> Affects Versions: 1.13.0
> Reporter: Dongjoon Hyun
> Priority: Major
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276862#comment-17276862 ]

Nicholas Chammas commented on PARQUET-41:

Thanks for the link [~yumwang]. That [README|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop#readme] is what I was looking for. Are these docs published on the [documentation site|http://parquet.apache.org/documentation/latest/] anywhere, or is the README file on GitHub the canonical reference?

> Add bloom filters to parquet statistics
> ---
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Junjie Chen
> Priority: Major
> Labels: filter2, pull-request-available
> Fix For: format-2.7.0, 1.12.0
>
> For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups.
> Pull request: https://github.com/apache/parquet-mr/pull/215
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276854#comment-17276854 ]

Yuming Wang commented on PARQUET-41:

[~nchammas] You can check the related configuration parameters here: [https://github.com/apache/parquet-mr/tree/master/parquet-hadoop]. Here is an example:

{code:scala}
val numRows = 1024 * 1024 * 15
val df = spark.range(numRows).selectExpr(
  "id",
  "cast(id as string) as s",
  "cast(id as timestamp) as ts",
  "cast(cast(id as timestamp) as date) as td",
  "cast(id as decimal) as dec")

val benchmark = new org.apache.spark.benchmark.Benchmark(
  "Benchmark bloom filter write", numRows, minNumIters = 5)

benchmark.addCase("default") { _ =>
  withSQLConf() {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for ts column") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "false",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for ts and dec columns") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "false",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" -> "true",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#dec" -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for all columns") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.run()
{code}
[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276852#comment-17276852 ]

ASF GitHub Bot commented on PARQUET-1950:

timarmstrong commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771353955

+1 to @emkornfield's comment - the intent of this is to establish a clear baseline of what is supported widely in practice - there are a bunch of Parquet features that are in the standard but are hard to use in practice because they don't have read support from other implementations. I think it should ultimately make it easier to get adoption of new features, because the status of each feature will be clearer.
[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics
[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276842#comment-17276842 ]

Nicholas Chammas commented on PARQUET-41:

Where is the user documentation for all the bloom filter-related functionality that will be released as part of parquet-mr 1.12? I'm thinking of user settings like {{parquet.filter.bloom.enabled}} and {{parquet.bloom.filter.*}}, along with anything else a user might care about. For example, if a Spark user wants to use or configure bloom filters on their Parquet data, what documentation should they reference?
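Until such a page exists on the site, the knobs documented in the parquet-hadoop README can be set on the Hadoop configuration that the Parquet writer reads. A minimal sketch, reusing the ParquetOutputFormat constants from the benchmark earlier in this digest; the column name "ts" and the NDV value are illustrative, and the expected-NDV property name is an assumption to be verified against the README:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetOutputFormat;

public class BloomFilterConfSketch {
  public static Configuration bloomFilterConf() {
    Configuration conf = new Configuration();
    // Disable bloom filters globally, then enable them for one column via
    // the "property#column" suffix convention.
    conf.set(ParquetOutputFormat.BLOOM_FILTER_ENABLED, "false");
    conf.set(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts", "true");
    // Sizing hint: expected number of distinct values in the column
    // (assumed property name, per the parquet-hadoop README).
    conf.set("parquet.bloom.filter.expected.ndv#ts", "1000000");
    return conf;
  }
}
{code}

In Spark, these same keys can be set on the session's Hadoop configuration, which is effectively what withSQLConf does in the benchmark shown earlier in this digest.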
[jira] [Commented] (PARQUET-1969) Test by GithubAction
[ https://issues.apache.org/jira/browse/PARQUET-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276836#comment-17276836 ]

ASF GitHub Bot commented on PARQUET-1969:

wangyum commented on pull request #860: URL: https://github.com/apache/parquet-mr/pull/860#issuecomment-771332722

Tested by https://github.com/wangyum/parquet-mr/runs/1811695243?check_suite_focus=true
[jira] [Commented] (PARQUET-1969) Test by GithubAction
[ https://issues.apache.org/jira/browse/PARQUET-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276833#comment-17276833 ]

Yuming Wang commented on PARQUET-1969:

Travis has been broken for several days. I have tested GitHub Actions: https://github.com/wangyum/parquet-mr/actions/runs/529590762
[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276830#comment-17276830 ]

ASF GitHub Bot commented on PARQUET-1950:

emkornfield commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771331050

@raduteo the main driver for this PR is that there has been a lot of confusion about what is defined as needing core support. Once we finish this PR, I'm not fully opposed to the idea of supporting this field, but I think we need to go into greater detail in the specification about what supporting the individual files actually means (and I think being willing to help both Java and C++ support it can go a long way toward convincing people that it should become a core feature).
[jira] [Commented] (PARQUET-1967) Upgrade Zstd-jni to 1.4.8-2
[ https://issues.apache.org/jira/browse/PARQUET-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276761#comment-17276761 ]

ASF GitHub Bot commented on PARQUET-1967:

dongjoon-hyun commented on pull request #859: URL: https://github.com/apache/parquet-mr/pull/859#issuecomment-771263507

> Let's wait for a green build.

Do I need to rebase this PR to see the green build?
Updated invitation: Parquet Sync @ Monthly from 9am to 10am on the fourth Tuesday (PST) (dev@parquet.apache.org)
[invite.ics attachment: Google Calendar request for the recurring event "Parquet Sync", 9:00-10:00am on the fourth Tuesday of each month (America/Los_Angeles), organized by sha...@uber.com, held at https://uber.zoom.us/j/3523778975 (Meeting ID: 352 377 8975); dev@parquet.apache.org is on the attendee list.]
Updated invitation: Parquet Sync @ Monthly from 9am to 10am on the fourth Tuesday from Tue Jan 26 to Mon Feb 22 (PST) (dev@parquet.apache.org)
[invite.ics attachment: updated entry for the same "Parquet Sync" series, recurring on the fourth Tuesday from Tue Jan 26 until Feb 22, 2021, with the same organizer and Zoom details as above.]
[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276735#comment-17276735 ]

ASF GitHub Bot commented on PARQUET-1950:

raduteo commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827

@gszadovszky and @emkornfield it's highly coincidental that I was just looking into cleaning up apache/arrow#8130 when I noticed this thread. External column chunk support is one of the key features that attracted me to Parquet in the first place, and I would like the chance to lobby for keeping it and actually expanding its adoption - I already have the complete PR mentioned above, and I can help with supporting it across other implementations. There are a few major domains where I see this as a valuable component:

1. Allowing concurrent reads of fully flushed row groups while the Parquet file is still being appended to. A slight variant of this is allowing subsequent row group appends to a Parquet file without impacting potential readers.
2. Being able to aggregate multiple data sets in a master Parquet file. One scenario is cumulative recordings, like stock prices that get collected daily and need to be presented as one unified historical file; another is enrichment, where we want to add new columns to an existing data set.
3. Allowing for bi-temporal changes to a Parquet file. External column chunks allow one to apply small corrections by simply creating delta files and new footers that swap out the chunks that require changes and point to the new ones.

If the above use cases are addressed by other Parquet overlays, or they don't line up with the intended usage of Parquet, I can look elsewhere, but this seems like a huge opportunity, and the development cost of supporting it is quite minor by comparison.
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276664#comment-17276664 ]

Xinli Shang commented on PARQUET-1968:

Sure, will connect with you shortly.
[jira] [Commented] (PARQUET-1967) Upgrade Zstd-jni to 1.4.8-2
[ https://issues.apache.org/jira/browse/PARQUET-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276646#comment-17276646 ]

ASF GitHub Bot commented on PARQUET-1967:

dongjoon-hyun commented on pull request #859: URL: https://github.com/apache/parquet-mr/pull/859#issuecomment-771151698

Thank you for the reviews.
[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276640#comment-17276640 ]

ASF GitHub Bot commented on PARQUET-1950:

emkornfield commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771141210

> Because none of the ideas of external column chunks or the summary files spread across the different implementations (because of the lack of specification), I think we should not include the usage of the field file_path in this document, or even explicitly specify that this field is not supported.

Being explicit seems reasonable to me if others are OK with it.
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276548#comment-17276548 ]

Ryan Blue commented on PARQUET-1968:

Thank you! I'm not sure why it was no longer on my calendar. I have the invite now and I plan to attend the sync on the 23rd. If you'd like, we can also set up a time to talk about this integration specifically, since it may take a while.
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276533#comment-17276533 ]

Xinli Shang commented on PARQUET-1968:

Hi [~rdblue]. We didn't discuss it in last week's Parquet sync meeting since you were not there. The next Parquet sync is Feb 23rd at 9:00am. I just added you explicitly with your Netflix email account. Hopefully, you can join.
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276526#comment-17276526 ]

Ryan Blue commented on PARQUET-1968:

I would really like to see a new Parquet API that can support some of the additional features we needed for Iceberg. I proposed adopting Iceberg's filter expressions a year or two ago, so I'm glad to see that the idea has some support from other PMC members. This is one reason why the API is in a separate module. I think we were planning to talk about this at the next Parquet sync, although I'm not sure when that will be. FYI [~sha...@uber.com].
[jira] [Commented] (PARQUET-1969) Test by GithubAction
[ https://issues.apache.org/jira/browse/PARQUET-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276423#comment-17276423 ]

Gabor Szadovszky commented on PARQUET-1969:

[~yumwang], maybe it's only me who doesn't know much about GitHub Actions, but could you please describe why it is better than the already existing configuration for Travis?
[jira] [Commented] (PARQUET-1968) FilterApi support In predicate
[ https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276421#comment-17276421 ]

Gabor Szadovszky commented on PARQUET-1968:

This one sounds great. Meanwhile, we have been talking about the filtering APIs between Iceberg and Parquet with [~rdblue]. It seems that Iceberg's API already contains this feature, and it appears to be clearer and more usable than the one implemented in Parquet. It might be a good idea to separate out this filtering API from Iceberg and use/implement it in Parquet. (See https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/expressions/Expression.java for Iceberg's API.)
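For context, composing an IN predicate with Iceberg's expression DSL looks roughly like the sketch below. It assumes the org.apache.iceberg.expressions.Expressions factory methods available at the time; the column name and values are illustrative. Expressions stay unbound until resolved against a schema, which is part of what keeps the API engine-agnostic.

{code:java}
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

public class IcebergInSketch {
  public static Expression idIn() {
    // A set-membership predicate in one call, instead of a chain of ORs.
    return Expressions.in("id", 1L, 5L, 42L);
  }
}
{code}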
[jira] [Commented] (PARQUET-1966) Fix build with JDK11 for JDK8
[ https://issues.apache.org/jira/browse/PARQUET-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276413#comment-17276413 ]

ASF GitHub Bot commented on PARQUET-1966:

gszadovszky commented on pull request #858: URL: https://github.com/apache/parquet-mr/pull/858#issuecomment-770954301

Thanks, @dossett. I am not sure about the Travis status either. I've restarted the build; let's see what happens.

> Fix build with JDK11 for JDK8
> ---
> Key: PARQUET-1966
> URL: https://issues.apache.org/jira/browse/PARQUET-1966
> Project: Parquet
> Issue Type: Bug
> Affects Versions: 1.12.0
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Blocker
>
> Although the target is set to 1.8, it seems not to be enough: when built with JDK11, it fails at runtime with the following exception:
> {code:java}
> java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
>     at org.apache.parquet.bytes.CapacityByteArrayOutputStream.write(CapacityByteArrayOutputStream.java:197)
>     at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeOrAppendBitPackedRun(RunLengthBitPackingHybridEncoder.java:193)
>     at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeInt(RunLengthBitPackingHybridEncoder.java:179)
>     at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.getBytes(DictionaryValuesWriter.java:167)
>     at org.apache.parquet.column.values.fallback.FallbackValuesWriter.getBytes(FallbackValuesWriter.java:74)
>     at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:60)
>     at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:387)
>     at org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:235)
>     at org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:222)
>     at org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
>     at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:307)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:465)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:148)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
>     at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
> {code}
> To reproduce, execute the following:
> {code}
> export JAVA_HOME={the path to the JDK11 home}
> mvn clean install -Djvm={the path to the JRE8 java executable}
> {code}
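The NoSuchMethodError above is the classic symptom of compiling on JDK 9+ (where ByteBuffer.position(int) gained a covariant ByteBuffer return type) and then running on a Java 8 JRE, whose method still returns Buffer. The PR addresses this on the build side (a later message in this digest notes its use of Maven profiles); the usual source-level workaround, shown as a sketch here, is to pin such calls to the Buffer supertype:

{code:java}
import java.nio.Buffer;
import java.nio.ByteBuffer;

public class ByteBufferPositionSketch {
  public static void seek(ByteBuffer buf, int newPos) {
    // Compiled on JDK 11 without --release 8, a direct buf.position(newPos)
    // records the Java 9+ descriptor position(I)Ljava/nio/ByteBuffer; and
    // throws NoSuchMethodError on a Java 8 JRE. The cast pins the call to
    // the Java 8 descriptor position(I)Ljava/nio/Buffer;.
    ((Buffer) buf).position(newPos);
  }
}
{code}

Alternatively, compiling with javac's --release 8 flag (rather than only source/target 1.8) links against the Java 8 API signatures and avoids the problem outright, which is presumably what a profile-based fix arranges when a newer JDK is detected.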
[jira] [Commented] (PARQUET-1969) Test by GithubAction
[ https://issues.apache.org/jira/browse/PARQUET-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276403#comment-17276403 ]

ASF GitHub Bot commented on PARQUET-1969:

wangyum opened a new pull request #860: URL: https://github.com/apache/parquet-mr/pull/860

Make sure you have checked _all_ steps below.

### Jira
- [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-XXX
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

### Commits
- [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explain what it does
[jira] [Created] (PARQUET-1969) Test by GithubAction
Yuming Wang created PARQUET-1969: Summary: Test by GithubAction Key: PARQUET-1969 URL: https://issues.apache.org/jira/browse/PARQUET-1969 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.12.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1968) FilterApi support In predicate
Yuming Wang created PARQUET-1968: Summary: FilterApi support In predicate Key: PARQUET-1968 URL: https://issues.apache.org/jira/browse/PARQUET-1968 Project: Parquet Issue Type: Improvement Components: parquet-mr Affects Versions: 1.12.0 Reporter: Yuming Wang FilterApi should support native In predicate. Spark: https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605 Impala: https://issues.apache.org/jira/browse/IMPALA-3654 -- This message was sent by Atlassian Jira (v8.3.4#803005)
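For context, engines that push IN filters down to parquet-mr today typically expand them into a chain of OR'ed equality predicates over the existing FilterApi. A minimal sketch of that workaround (the `in` helper below is hypothetical and illustrative only, not part of FilterApi; it assumes at least one value):
{code:java}
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;

public class InPredicateSketch {

  // Hypothetical helper: expands "column IN (values)" into OR'ed eq()
  // predicates -- the fallback used while FilterApi lacks a native In.
  static FilterPredicate in(IntColumn column, int... values) {
    FilterPredicate pred = FilterApi.eq(column, values[0]);
    for (int i = 1; i < values.length; i++) {
      pred = FilterApi.or(pred, FilterApi.eq(column, values[i]));
    }
    return pred;
  }

  public static void main(String[] args) {
    IntColumn id = FilterApi.intColumn("id");
    // Keep rows where id is 1, 5 or 9:
    FilterPredicate pred = in(id, 1, 5, 9);
    System.out.println(pred);
  }
}
{code}
A native In predicate would avoid the deep OR trees this expansion produces for large value lists, which is the motivation behind the ticket.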
[jira] [Commented] (PARQUET-1966) Fix build with JDK11 for JDK8
[ https://issues.apache.org/jira/browse/PARQUET-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276323#comment-17276323 ] ASF GitHub Bot commented on PARQUET-1966: - dossett commented on pull request #858: URL: https://github.com/apache/parquet-mr/pull/858#issuecomment-770862284 That's a nice use of profiles. I'm (non-binding) +1. I don't see any logs for the Travis build failures; I don't know if they've expired or maybe the tests just failed to launch for some reason. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Fix build with JDK11 for JDK8 > - > > Key: PARQUET-1966 > URL: https://issues.apache.org/jira/browse/PARQUET-1966 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Blocker > > Although the target is set to 1.8, it seems not to be enough: when building > with JDK11, it fails at runtime with the following exception:
> {code:java}
> java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
>   at org.apache.parquet.bytes.CapacityByteArrayOutputStream.write(CapacityByteArrayOutputStream.java:197)
>   at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeOrAppendBitPackedRun(RunLengthBitPackingHybridEncoder.java:193)
>   at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder.writeInt(RunLengthBitPackingHybridEncoder.java:179)
>   at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter.getBytes(DictionaryValuesWriter.java:167)
>   at org.apache.parquet.column.values.fallback.FallbackValuesWriter.getBytes(FallbackValuesWriter.java:74)
>   at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:60)
>   at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:387)
>   at org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:235)
>   at org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:222)
>   at org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
>   at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:307)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:465)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:148)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138)
> {code}
> To reproduce, execute the following:
> {code}
> export JAVA_HOME={the path to the JDK11 home}
> mvn clean install -Djvm={the path to the JRE8 java executable}
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] dossett commented on pull request #858: PARQUET-1966: Fix build with JDK11 for JDK8
dossett commented on pull request #858: URL: https://github.com/apache/parquet-mr/pull/858#issuecomment-770862284 That's a nice use of profiles. I'm (non-binding) +1. I don't see any logs for the Travis build failures; I don't know if they've expired or maybe the tests just failed to launch for some reason. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
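For context on the exception quoted above: JDK 9 changed `ByteBuffer.position(int)` into a covariant override returning `ByteBuffer`, while on JDK 8 the method is inherited from `Buffer` and returns `Buffer`. Bytecode compiled on JDK 11 with only `-source`/`-target 1.8` therefore references a method descriptor that does not exist on a JDK 8 runtime. A minimal sketch of the usual source-level workaround (illustrative only; the PR itself addresses this at the build level with Maven profiles):
{code:java}
import java.nio.Buffer;
import java.nio.ByteBuffer;

public class ByteBufferCompat {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    // buf.position(4) compiled on JDK 11 links against the JDK 9+ descriptor
    // ByteBuffer.position(I)Ljava/nio/ByteBuffer; which a JDK 8 runtime cannot
    // resolve. Upcasting to Buffer pins the JDK 8-compatible descriptor
    // Buffer.position(I)Ljava/nio/Buffer; instead.
    ((Buffer) buf).position(4);
    System.out.println(buf.position()); // prints 4 on both JDK 8 and JDK 11
  }
}
{code}
Compiling with `javac --release 8` (rather than `-source`/`-target` alone) also avoids the problem, since it compiles against the JDK 8 class-file signatures.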
[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters
[ https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276312#comment-17276312 ] Yuming Wang commented on PARQUET-1805: -- Thank you [~gszadovszky] [~junjie] This is what I want:
{code:sql}
set parquet.bloom.filter.enabled=false;
set parquet.bloom.filter.enabled#ts=true;
set parquet.bloom.filter.enabled#dec=true;
{code}
Benchmark and benchmark result:
{code:scala}
val numRows = 1024 * 1024 * 15
val df = spark.range(numRows).selectExpr(
  "id",
  "cast(id as string) as s",
  "cast(id as timestamp) as ts",
  "cast(cast(id as timestamp) as date) as td",
  "cast(id as decimal) as dec")

val benchmark = new org.apache.spark.benchmark.Benchmark(
  "Benchmark bloom filter write", numRows, minNumIters = 5)

benchmark.addCase("default") { _ =>
  withSQLConf() {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for ts column") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "false",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for ts and dec column") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "false",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" -> "true",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#dec" -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.addCase("Build bloom filter for all column") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}

benchmark.run()
{code}
{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark bloom filter write:              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
default                                             5207          5314         72        3.0        331.1      1.0X
Build bloom filter for ts column                    5808          6065        245        2.7        369.2      0.9X
Build bloom filter for ts and dec column            6685          6776         79        2.4        425.0      0.8X
Build bloom filter for all column                   9077          9889        629        1.7        577.1      0.6X
{noformat}
cc [~dongjoon] > Refactor the configuration for bloom filters > > > Key: PARQUET-1805 > URL: https://issues.apache.org/jira/browse/PARQUET-1805 > Project: Parquet > Issue Type: Improvement >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > Refactor the hadoop configuration for bloom filters according to PARQUET-1784. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1899) [C++] Deprecated ReadBatchSpaced in parquet/column_reader
[ https://issues.apache.org/jira/browse/PARQUET-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved PARQUET-1899. - Fix Version/s: cpp-1.6.0 Resolution: Fixed Issue resolved by pull request 8015 [https://github.com/apache/arrow/pull/8015] > [C++] Deprecated ReadBatchSpaced in parquet/column_reader > - > > Key: PARQUET-1899 > URL: https://issues.apache.org/jira/browse/PARQUET-1899 > Project: Parquet > Issue Type: New Feature > Components: parquet-cpp >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.6.0 > > Time Spent: 2h > Remaining Estimate: 0h > > This method is not used anywhere outside of unit tests and doesn't space > elements properly in the context of deeply nested structures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276280#comment-17276280 ] ASF GitHub Bot commented on PARQUET-1950: - gszadovszky commented on a change in pull request #164: URL: https://github.com/apache/parquet-format/pull/164#discussion_r567767040 ## File path: CoreFeatures.md ## @@ -0,0 +1,178 @@ + + +# Parquet Core Features + +This document lists the core features for each parquet-format release. This +list is a subset of the features which parquet-format makes available. + +## Purpose + +The list of core features for a certain release makes a compliance level for +implementations. If a writer implementation claims that it is at a certain +compliance level then it must use only features from the *core feature list* of +that parquet-format release. If a reader implementation claims the same it must +implement all of the listed features. This way it is easier to ensure +compatibility between the different parquet implementations. + +We cannot and don't want to stop our clients from using any features that are not +on this list but it shall be highlighted that using these features might make +the written parquet files unreadable by other implementations. We can say that +the features available in a parquet-format release (and one of the +implementations of it) and not on the *core feature list* are experimental. + +## Versioning + +This document is versioned by the parquet-format releases which follow the +scheme of semantic versioning. It means that no feature will be deleted from +this document under the same major version. (We might deprecate some, though.) +Because of the semantic versioning if one implementation supports the core +features of the parquet-format release `a.b.x` it must be able to read any +parquet files written by implementations supporting the release `a.d.y` where +`b >= d`. + +If a parquet file is written according to a released version of this document +it might be a good idea to write this version into the field `compliance_level` +in the thrift object `FileMetaData`. + +## Adding new features + +The idea is to only include features which are specified correctly and proven +to be useful for everyone. Because of that we require to have at least two +different implementations that are released and widely tested. We also require +to implement interoperability tests for that feature to prove one +implementation can read the data written by the other one and vice versa. + +## Core feature list + +This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift) +where all the data structures we might use in a parquet file are defined. + +### File structure + +All of the required fields in the structure (and sub-structures) of +`FileMetaData` must be set according to the specification. +The following page types are supported: +* Data page V1 (see `DataPageHeader`) +* Dictionary page (see `DictionaryPageHeader`) + +**TODO**: list optional fields that must be filled properly. + +### Types + + Primitive types + +The following [primitive types](README.md#types) are supported +* `BOOLEAN` +* `INT32` +* `INT64` +* `FLOAT` +* `DOUBLE` +* `BYTE\_ARRAY` +* `FIXED\_LEN\_BYTE\_ARRAY` + +NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed +here. + + Logical types + +The [logical type](LogicalTypes.md)s are practically annotations helping to +understand the related primitive type (or structure). Originally we had +the `ConvertedType` enum in the thrift file representing all the possible +logical types. After a while we realized it is hard to extend and so introduced +the `LogicalType` union. For backward compatibility reasons we allow to use the +old `ConvertedType` values according to the specified rules but we expect that +the logical types in the file schema are defined with `LogicalType` objects. + +The following LogicalTypes are supported: +* `STRING` +* `MAP` +* `LIST` +* `ENUM` +* `DECIMAL` (for which primitives?) +* `DATE` +* `TIME`: **(Which unit, utc?)** +* `TIMESTAMP`: **(Which unit, utc?)** +* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)** +* `UNKNOWN` **(?)** +* `JSON` **(?)** +* `BSON` **(?)** +* `UUID` **(?)** + +NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes. +This is because `INTERVAL` is deprecated so we do not include it in this list. + +### Encodings + +The following encodings are supported: +* [PLAIN](Encodings.md#plain-plain--0) + parquet-mr: Basically all value types are written in this encoding in case of + V1 pages +* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) + **(?)** + parquet-mr: As per the spec this encoding is deprecated while we still use it + for V1 page dictionaries. +* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) + parquet-m
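To make the quoted versioning rule concrete: a reader that supports the core features of release `a.b.x` must be able to read files written at level `a.d.y` whenever `b >= d`. A hypothetical reader-side check (the method and the way a `compliance_level` value would be obtained are illustrative only, not part of the proposed specification):
{code:java}
public final class ComplianceCheck {
  // Returns true when a reader supporting core features of release
  // readerMajor.readerMinor.x can read a file written at compliance level
  // writerMajor.writerMinor.y: same major version, and the reader's minor
  // is at least the writer's (no feature is removed within a major version).
  static boolean canRead(int readerMajor, int readerMinor,
                         int writerMajor, int writerMinor) {
    return readerMajor == writerMajor && readerMinor >= writerMinor;
  }

  public static void main(String[] args) {
    System.out.println(canRead(2, 9, 2, 7)); // true:  2.9 reader, 2.7 writer
    System.out.println(canRead(2, 7, 2, 9)); // false: writer used newer core features
    System.out.println(canRead(3, 0, 2, 9)); // false: different major version
  }
}
{code}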
[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features
gszadovszky commented on a change in pull request #164: URL: https://github.com/apache/parquet-format/pull/164#discussion_r567767040 ## File path: CoreFeatures.md ## @@ -0,0 +1,178 @@ + + +# Parquet Core Features + +This document lists the core features for each parquet-format release. This +list is a subset of the features which parquet-format makes available. + +## Purpose + +The list of core features for a certain release makes a compliance level for +implementations. If a writer implementation claims that it is at a certain +compliance level then it must use only features from the *core feature list* of +that parquet-format release. If a reader implementation claims the same it must +implement all of the listed features. This way it is easier to ensure +compatibility between the different parquet implementations. + +We cannot and don't want to stop our clients from using any features that are not +on this list but it shall be highlighted that using these features might make +the written parquet files unreadable by other implementations. We can say that +the features available in a parquet-format release (and one of the +implementations of it) and not on the *core feature list* are experimental. + +## Versioning + +This document is versioned by the parquet-format releases which follow the +scheme of semantic versioning. It means that no feature will be deleted from +this document under the same major version. (We might deprecate some, though.) +Because of the semantic versioning if one implementation supports the core +features of the parquet-format release `a.b.x` it must be able to read any +parquet files written by implementations supporting the release `a.d.y` where +`b >= d`. + +If a parquet file is written according to a released version of this document +it might be a good idea to write this version into the field `compliance_level` +in the thrift object `FileMetaData`. + +## Adding new features + +The idea is to only include features which are specified correctly and proven +to be useful for everyone. Because of that we require to have at least two +different implementations that are released and widely tested. We also require +to implement interoperability tests for that feature to prove one +implementation can read the data written by the other one and vice versa. + +## Core feature list + +This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift) +where all the data structures we might use in a parquet file are defined. + +### File structure + +All of the required fields in the structure (and sub-structures) of +`FileMetaData` must be set according to the specification. +The following page types are supported: +* Data page V1 (see `DataPageHeader`) +* Dictionary page (see `DictionaryPageHeader`) + +**TODO**: list optional fields that must be filled properly. + +### Types + + Primitive types + +The following [primitive types](README.md#types) are supported +* `BOOLEAN` +* `INT32` +* `INT64` +* `FLOAT` +* `DOUBLE` +* `BYTE\_ARRAY` +* `FIXED\_LEN\_BYTE\_ARRAY` + +NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed +here. + + Logical types + +The [logical type](LogicalTypes.md)s are practically annotations helping to +understand the related primitive type (or structure). Originally we had +the `ConvertedType` enum in the thrift file representing all the possible +logical types. After a while we realized it is hard to extend and so introduced +the `LogicalType` union. For backward compatibility reasons we allow to use the +old `ConvertedType` values according to the specified rules but we expect that +the logical types in the file schema are defined with `LogicalType` objects. + +The following LogicalTypes are supported: +* `STRING` +* `MAP` +* `LIST` +* `ENUM` +* `DECIMAL` (for which primitives?) +* `DATE` +* `TIME`: **(Which unit, utc?)** +* `TIMESTAMP`: **(Which unit, utc?)** +* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)** +* `UNKNOWN` **(?)** +* `JSON` **(?)** +* `BSON` **(?)** +* `UUID` **(?)** + +NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes. +This is because `INTERVAL` is deprecated so we do not include it in this list. + +### Encodings + +The following encodings are supported: +* [PLAIN](Encodings.md#plain-plain--0) + parquet-mr: Basically all value types are written in this encoding in case of + V1 pages +* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) + **(?)** + parquet-mr: As per the spec this encoding is deprecated while we still use it + for V1 page dictionaries. +* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) + parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN + values in case of V2 pages +* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5) + **(?)** + parquet-mr: Used for V2 pages to encode INT32 and INT64 values. +*
[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters
[ https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276213#comment-17276213 ] Gabor Szadovszky commented on PARQUET-1805: --- Oh, I got it, thanks [~junjie]. I felt it was more logical this way. The "major" configuration is for all columns and the "column specific" one is to configure otherwise. Since the "major" one is false by default, you only need to enable the bloom filters for the columns one-by-one. You don't even need to set `parquet.bloom.filter.enabled`; setting the column-specific ones is enough. We've tried to describe this in the [README|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md]. > Refactor the configuration for bloom filters > > > Key: PARQUET-1805 > URL: https://issues.apache.org/jira/browse/PARQUET-1805 > Project: Parquet > Issue Type: Improvement >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > Refactor the hadoop configuration for bloom filters according to PARQUET-1784. -- This message was sent by Atlassian Jira (v8.3.4#803005)
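The same configuration expressed directly against a Hadoop `Configuration` object (property names as documented in the parquet-hadoop README linked above; the snippet itself is an illustrative sketch):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class BloomFilterConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The global switch stays off (its default), so no column gets a bloom
    // filter unless it is enabled explicitly...
    conf.set("parquet.bloom.filter.enabled", "false");
    // ...and only these columns opt in, via the "#<column.path>" suffix:
    conf.set("parquet.bloom.filter.enabled#ts", "true");
    conf.set("parquet.bloom.filter.enabled#dec", "true");
    System.out.println(conf.get("parquet.bloom.filter.enabled#ts")); // true
  }
}
{code}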
[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276204#comment-17276204 ] ASF GitHub Bot commented on PARQUET-1950: - gszadovszky commented on a change in pull request #164: URL: https://github.com/apache/parquet-format/pull/164#discussion_r567688108 ## File path: CoreFeatures.md ## @@ -0,0 +1,178 @@ + + +# Parquet Core Features + +This document lists the core features for each parquet-format release. This +list is a subset of the features which parquet-format makes available. + +## Purpose + +The list of core features for a certain release makes a compliance level for +implementations. If a writer implementation claims that it is at a certain +compliance level then it must use only features from the *core feature list* of +that parquet-format release. If a reader implementation claims the same it must +implement all of the listed features. This way it is easier to ensure +compatibility between the different parquet implementations. + +We cannot and don't want to stop our clients from using any features that are not +on this list but it shall be highlighted that using these features might make +the written parquet files unreadable by other implementations. We can say that +the features available in a parquet-format release (and one of the +implementations of it) and not on the *core feature list* are experimental. + +## Versioning + +This document is versioned by the parquet-format releases which follow the +scheme of semantic versioning. It means that no feature will be deleted from +this document under the same major version. (We might deprecate some, though.) +Because of the semantic versioning if one implementation supports the core +features of the parquet-format release `a.b.x` it must be able to read any +parquet files written by implementations supporting the release `a.d.y` where +`b >= d`. + +If a parquet file is written according to a released version of this document +it might be a good idea to write this version into the field `compliance_level` +in the thrift object `FileMetaData`. + +## Adding new features + +The idea is to only include features which are specified correctly and proven +to be useful for everyone. Because of that we require to have at least two +different implementations that are released and widely tested. We also require +to implement interoperability tests for that feature to prove one +implementation can read the data written by the other one and vice versa. + +## Core feature list + +This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift) +where all the data structures we might use in a parquet file are defined. + +### File structure + +All of the required fields in the structure (and sub-structures) of +`FileMetaData` must be set according to the specification. +The following page types are supported: +* Data page V1 (see `DataPageHeader`) +* Dictionary page (see `DictionaryPageHeader`) + +**TODO**: list optional fields that must be filled properly. + +### Types + + Primitive types + +The following [primitive types](README.md#types) are supported +* `BOOLEAN` +* `INT32` +* `INT64` +* `FLOAT` +* `DOUBLE` +* `BYTE\_ARRAY` +* `FIXED\_LEN\_BYTE\_ARRAY` + +NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed +here. + + Logical types + +The [logical type](LogicalTypes.md)s are practically annotations helping to +understand the related primitive type (or structure). Originally we had +the `ConvertedType` enum in the thrift file representing all the possible +logical types. After a while we realized it is hard to extend and so introduced +the `LogicalType` union. For backward compatibility reasons we allow to use the +old `ConvertedType` values according to the specified rules but we expect that +the logical types in the file schema are defined with `LogicalType` objects. + +The following LogicalTypes are supported: +* `STRING` +* `MAP` +* `LIST` +* `ENUM` +* `DECIMAL` (for which primitives?) +* `DATE` +* `TIME`: **(Which unit, utc?)** +* `TIMESTAMP`: **(Which unit, utc?)** +* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)** +* `UNKNOWN` **(?)** +* `JSON` **(?)** +* `BSON` **(?)** +* `UUID` **(?)** + +NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes. +This is because `INTERVAL` is deprecated so we do not include it in this list. + +### Encodings + +The following encodings are supported: +* [PLAIN](Encodings.md#plain-plain--0) + parquet-mr: Basically all value types are written in this encoding in case of + V1 pages +* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) + **(?)** + parquet-mr: As per the spec this encoding is deprecated while we still use it + for V1 page dictionaries. +* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) + parquet-m
[GitHub] [parquet-format] gszadovszky commented on a change in pull request #164: PARQUET-1950: Define core features
gszadovszky commented on a change in pull request #164: URL: https://github.com/apache/parquet-format/pull/164#discussion_r567688108 ## File path: CoreFeatures.md ## @@ -0,0 +1,178 @@ + + +# Parquet Core Features + +This document lists the core features for each parquet-format release. This +list is a subset of the features which parquet-format makes available. + +## Purpose + +The list of core features for a certain release makes a compliance level for +implementations. If a writer implementation claims that it is at a certain +compliance level then it must use only features from the *core feature list* of +that parquet-format release. If a reader implementation claims the same it must +implement all of the listed features. This way it is easier to ensure +compatibility between the different parquet implementations. + +We cannot and don't want to stop our clients from using any features that are not +on this list but it shall be highlighted that using these features might make +the written parquet files unreadable by other implementations. We can say that +the features available in a parquet-format release (and one of the +implementations of it) and not on the *core feature list* are experimental. + +## Versioning + +This document is versioned by the parquet-format releases which follow the +scheme of semantic versioning. It means that no feature will be deleted from +this document under the same major version. (We might deprecate some, though.) +Because of the semantic versioning if one implementation supports the core +features of the parquet-format release `a.b.x` it must be able to read any +parquet files written by implementations supporting the release `a.d.y` where +`b >= d`. + +If a parquet file is written according to a released version of this document +it might be a good idea to write this version into the field `compliance_level` +in the thrift object `FileMetaData`. + +## Adding new features + +The idea is to only include features which are specified correctly and proven +to be useful for everyone. Because of that we require to have at least two +different implementations that are released and widely tested. We also require +to implement interoperability tests for that feature to prove one +implementation can read the data written by the other one and vice versa. + +## Core feature list + +This list is based on the [parquet thrift file](src/main/thrift/parquet.thrift) +where all the data structures we might use in a parquet file are defined. + +### File structure + +All of the required fields in the structure (and sub-structures) of +`FileMetaData` must be set according to the specification. +The following page types are supported: +* Data page V1 (see `DataPageHeader`) +* Dictionary page (see `DictionaryPageHeader`) + +**TODO**: list optional fields that must be filled properly. + +### Types + + Primitive types + +The following [primitive types](README.md#types) are supported +* `BOOLEAN` +* `INT32` +* `INT64` +* `FLOAT` +* `DOUBLE` +* `BYTE\_ARRAY` +* `FIXED\_LEN\_BYTE\_ARRAY` + +NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed +here. + + Logical types + +The [logical type](LogicalTypes.md)s are practically annotations helping to +understand the related primitive type (or structure). Originally we had +the `ConvertedType` enum in the thrift file representing all the possible +logical types. After a while we realized it is hard to extend and so introduced +the `LogicalType` union. For backward compatibility reasons we allow to use the +old `ConvertedType` values according to the specified rules but we expect that +the logical types in the file schema are defined with `LogicalType` objects. + +The following LogicalTypes are supported: +* `STRING` +* `MAP` +* `LIST` +* `ENUM` +* `DECIMAL` (for which primitives?) +* `DATE` +* `TIME`: **(Which unit, utc?)** +* `TIMESTAMP`: **(Which unit, utc?)** +* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)** +* `UNKNOWN` **(?)** +* `JSON` **(?)** +* `BSON` **(?)** +* `UUID` **(?)** + +NOTE: The old ConvertedType `INTERVAL` has no representation in LogicalTypes. +This is because `INTERVAL` is deprecated so we do not include it in this list. + +### Encodings + +The following encodings are supported: +* [PLAIN](Encodings.md#plain-plain--0) + parquet-mr: Basically all value types are written in this encoding in case of + V1 pages +* [PLAIN\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8) + **(?)** + parquet-mr: As per the spec this encoding is deprecated while we still use it + for V1 page dictionaries. +* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) + parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN Review comment: The parts describing how the parquet-mr implementations work were not meant to be part of the final document. As I don't know too much about other implementations I'v
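For readers unfamiliar with the RLE encoding referenced in the list above, here is a minimal sketch of the RLE-run half of the hybrid encoding as specified in Encodings.md (bit-packed runs omitted; this is an illustration, not parquet-mr's encoder):
{code:java}
import java.io.ByteArrayOutputStream;

public class RleRunSketch {
  // Unsigned LEB128 varint, as used for the hybrid encoding's run headers.
  static void writeUnsignedVarInt(ByteArrayOutputStream out, int value) {
    while ((value & 0xFFFFFF80) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value & 0x7F);
  }

  // An RLE run: header = count << 1 (LSB 0 marks an RLE run, LSB 1 would mark
  // a bit-packed run), followed by the repeated value in ceil(bitWidth / 8)
  // little-endian bytes.
  static void writeRleRun(ByteArrayOutputStream out, int value, int count, int bitWidth) {
    writeUnsignedVarInt(out, count << 1);
    int numBytes = (bitWidth + 7) / 8;
    for (int i = 0; i < numBytes; i++) {
      out.write((value >>> (8 * i)) & 0xFF);
    }
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // A run of 1000 repetitions of the value 1 (e.g. a definition level of 1
    // for 1000 consecutive non-null values) at bit width 1:
    writeRleRun(out, 1, 1000, 1);
    System.out.println(out.toByteArray().length + " bytes"); // 3 bytes
  }
}
{code}
This run-length side is why RL/DL columns with long stretches of identical levels compress so well under the hybrid encoding.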
[jira] [Comment Edited] (PARQUET-1805) Refactor the configuration for bloom filters
[ https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276198#comment-17276198 ] Junjie Chen edited comment on PARQUET-1805 at 2/1/21, 9:43 AM: --- I think what [~yumwang] is concerned about is that we enable all columns' bloom filters when {{parquet.bloom.filter.enabled}} is set to true. That behaviour is a bit odd, considering we may have a table with a heap of columns. We could change to use {{parquet.bloom.filter.enabled#column.path}} to enable the bloom filter for a specific column after setting {{parquet.bloom.filter.enabled}}. was (Author: junjie): I think what [~yumwang] is concerned about is that we enable all columns' bloom filters when {{parquet.bloom.filter.enabled}} is set to true. That behaviour is a bit odd; we could change to use {{parquet.bloom.filter.enabled#column.path}} to enable the bloom filter for a specific column after setting {{parquet.bloom.filter.enabled}}. > Refactor the configuration for bloom filters > > > Key: PARQUET-1805 > URL: https://issues.apache.org/jira/browse/PARQUET-1805 > Project: Parquet > Issue Type: Improvement >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > Refactor the hadoop configuration for bloom filters according to PARQUET-1784. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters
[ https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276198#comment-17276198 ] Junjie Chen commented on PARQUET-1805: -- I think what [~yumwang] is concerned about is that we enable all columns' bloom filters when {{parquet.bloom.filter.enabled}} is set to true. That behaviour is a bit odd; we could change to use {{parquet.bloom.filter.enabled#column.path}} to enable the bloom filter for a specific column after setting {{parquet.bloom.filter.enabled}}. > Refactor the configuration for bloom filters > > > Key: PARQUET-1805 > URL: https://issues.apache.org/jira/browse/PARQUET-1805 > Project: Parquet > Issue Type: Improvement >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > Refactor the hadoop configuration for bloom filters according to PARQUET-1784. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1950) Define core features / compliance level
[ https://issues.apache.org/jira/browse/PARQUET-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276187#comment-17276187 ] ASF GitHub Bot commented on PARQUET-1950: - gszadovszky commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-770716726 @emkornfield, in parquet-mr there was another reason to use the `file_path` in the footer. The feature is called _summary files_. The idea was to have a separate file containing a summarized footer of several parquet files so you might do filtering and pruning without even checking a file's own footer. As far as I know this implementation exists in parquet-mr only and there is no specification for it in parquet-format. This feature is more or less abandoned, meaning that during the development of some newer features (e.g. column indexes, bloom filters) the related parts might not have been updated properly. There were a couple of discussions about this topic in the dev list: [here](https://lists.apache.org/thread.html/fb232d024d3ca0f3900b76fb884b55fad11dffafb182d6f336b37a69%40%3Cdev.parquet.apache.org%3E) and [here](https://lists.apache.org/thread.html/r2e539c50c1cc818304de2b7dc28a4109aaa529955a42664e3073f811%40%3Cdev.parquet.apache.org%3E). Because neither the idea of _external column chunks_ nor that of _summary files_ spread across the different implementations (because of the lack of specification) I think we should not include the usage of the field `file_path` in this document, or even explicitly specify that this field is not supported. I am open to specifying such features properly, and after the required demonstration we may include them in a later version of the core features. However, I think these requirements (e.g. snapshot API, summary files) are not necessarily needed by all of our clients or are already implemented in some ways (e.g. storing statistics in HMS, Iceberg). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Define core features / compliance level > --- > > Key: PARQUET-1950 > URL: https://issues.apache.org/jira/browse/PARQUET-1950 > Project: Parquet > Issue Type: New Feature > Components: parquet-format >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > > Parquet format is getting more and more features while the different > implementations cannot keep pace and are left behind, with some features > implemented and some not. In many cases it is also not clear if the > related feature is mature enough to be used widely or is more an experimental > one. > These are huge issues that make it hard to ensure interoperability between the > different implementations. > The following idea came up in a > [discussion|https://lists.apache.org/thread.html/rde5cba8443487bccd47593ddf5dfb39f69c729d260165cb936a1a289%40%3Cdev.parquet.apache.org%3E]. > Create a new document in the parquet-format repository that lists the "core > features". This document is versioned by the parquet-format releases. This > way a certain version of "core features" defines a level of compatibility > between the different implementations. This version number can be written to > a new field (e.g. complianceLevel) in the footer. If an implementation writes > a file with a version in the field it must implement all the related "core > features" (read and write) and must not use any other features at write time > because it makes the data unreadable by another implementation if only the > same level of "core features" is implemented. > For example if we have encoding A listed in the version 1 "core features" but > encoding B is not, then at "complianceLevel = 1" we can use encoding A but we > cannot use encoding B because it would make the related data unreadable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-format] gszadovszky commented on pull request #164: PARQUET-1950: Define core features
gszadovszky commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-770716726 @emkornfield, in parquet-mr there was another reason to use the `file_path` in the footer. The feature is called _summary files_. The idea was to have a separate file containing a summarized footer of several parquet files so you might do filtering and pruning without even checking a file's own footer. As far as I know this implementation exists in parquet-mr only and there is no specification for it in parquet-format. This feature is more or less abandoned, meaning that during the development of some newer features (e.g. column indexes, bloom filters) the related parts might not have been updated properly. There were a couple of discussions about this topic in the dev list: [here](https://lists.apache.org/thread.html/fb232d024d3ca0f3900b76fb884b55fad11dffafb182d6f336b37a69%40%3Cdev.parquet.apache.org%3E) and [here](https://lists.apache.org/thread.html/r2e539c50c1cc818304de2b7dc28a4109aaa529955a42664e3073f811%40%3Cdev.parquet.apache.org%3E). Because neither the idea of _external column chunks_ nor that of _summary files_ spread across the different implementations (because of the lack of specification) I think we should not include the usage of the field `file_path` in this document, or even explicitly specify that this field is not supported. I am open to specifying such features properly, and after the required demonstration we may include them in a later version of the core features. However, I think these requirements (e.g. snapshot API, summary files) are not necessarily needed by all of our clients or are already implemented in some ways (e.g. storing statistics in HMS, Iceberg). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters
[ https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276149#comment-17276149 ] Gabor Szadovszky commented on PARQUET-1805: --- [~yumwang], I think this performance issue is not related to this jira but to the whole bloom filter feature (PARQUET-41). If you turn on the writing of the bloom filters for all the columns it will impact writing performance. (You may check the related configuration parameters at https://github.com/apache/parquet-mr/tree/master/parquet-hadoop for details.) I am not an expert on this feature and maybe we can improve the writing performance, but generating bloom filters will have a performance impact. It is up to the user to decide whether this impact is worth the potential benefit at read time. That's why it is highly suggested to specify exactly which columns the bloom filters are required for and also to specify the other bloom filter parameters. [~junjie], any comments on this? > Refactor the configuration for bloom filters > > > Key: PARQUET-1805 > URL: https://issues.apache.org/jira/browse/PARQUET-1805 > Project: Parquet > Issue Type: Improvement >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > Refactor the hadoop configuration for bloom filters according to PARQUET-1784. -- This message was sent by Atlassian Jira (v8.3.4#803005)