[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532534#comment-16532534 ]

Al M commented on SPARK-13127:
------------------------------

It would be great to get this resolved in Spark 2.3.2, especially since Parquet 1.9 supports delta encoding: https://issues.apache.org/jira/browse/PARQUET-225

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
> ----------------------------------------------
>
>                 Key: SPARK-13127
>                 URL: https://issues.apache.org/jira/browse/SPARK-13127
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Justin Pihony
>            Priority: Major
>
> Currently, when you write a sorted DataFrame to Parquet and then read the
> data back, it is not sorted by default. [This is due to a bug in
> Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in
> 1.9.
> A workaround is to read the file back in using a file glob (filepath/*).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
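The glob workaround mentioned in the issue description might look roughly like this in PySpark. This is a minimal sketch, assuming `spark` is an already-created SparkSession; the helper name `read_parquet_via_glob` and the example path are hypothetical. It only shows how the glob path is built and passed to the reader; the claim that this read restores the sort order comes from the issue text and is not verified here.

```python
def read_parquet_via_glob(spark, path):
    """Read a Parquet output directory through a file glob (path/*).

    `spark` is assumed to be an existing SparkSession; `path` is the
    directory that the sorted DataFrame was written to.
    """
    # Strip a trailing slash so we never build "dir//*".
    pattern = path.rstrip("/") + "/*"
    # DataFrameReader.parquet accepts a path string, including glob patterns.
    return spark.read.parquet(pattern)


# Hypothetical usage (requires a running SparkSession named `spark`):
#   df.sort("id").write.parquet("/tmp/sorted_output")
#   restored = read_parquet_via_glob(spark, "/tmp/sorted_output")
```

The helper takes the session as a parameter rather than creating one, so it can be dropped into an existing job without changing session configuration.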
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367925#comment-16367925 ]

Li Jin commented on SPARK-13127:
--------------------------------

Hi all,

The status of this Jira is "Progress". I am wondering whether this is being actively worked on.
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305691#comment-16305691 ]

Dong Jiang commented on SPARK-13127:
------------------------------------

[~gaurav24], it looks like you, like me, are waiting for this ticket to be worked on. If you would like to help, please comment on this thread on the developer list to advocate for having this issue resolved in the Spark 2.3 release:
http://apache-spark-developers-list.1001551.n3.nabble.com/Timeline-for-Spark-2-3-td22793.html
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305612#comment-16305612 ]

Gaurav Shah commented on SPARK-13127:
-------------------------------------

I am surprised people haven't hit https://issues.apache.org/jira/browse/PARQUET-353; I constantly face OOM errors on a continuous streaming application. I am wondering whether we could get Parquet 1.9.1 and then upgrade Spark to use that: https://issues.apache.org/jira/browse/PARQUET-1027
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250512#comment-16250512 ]

Dong Jiang commented on SPARK-13127:
------------------------------------

[~igozali], I think you are referring to this Parquet ticket:
https://issues.apache.org/jira/browse/PARQUET-686

The Parquet ticket indicates the fix is in 1.9.0, so we still need Spark to upgrade Parquet to 1.9.0.

I have examined a Parquet file generated by Spark 2.2: the string column doesn't have min/max statistics in the footer. I believe they are disabled.
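For readers unfamiliar with the string comparison problem, my understanding of PARQUET-686 is that binary min/max statistics were computed by comparing bytes as signed values, which disagrees with the unsigned byte order of UTF-8 strings. The plain-Python sketch below illustrates only that mismatch; it is not Parquet's code, and the function names and sample values are made up for illustration.

```python
def signed_key(data: bytes):
    """Sort key that treats each byte as a signed 8-bit value (-128..127)."""
    return [b - 256 if b > 127 else b for b in data]


def unsigned_key(data: bytes):
    """Sort key that treats each byte as unsigned (0..255), as UTF-8 ordering does."""
    return list(data)


# Illustrative column values; "é" encodes to bytes above 0x7F in UTF-8.
values = ["a", "z", "é"]
encoded = [v.encode("utf-8") for v in values]

signed_min = min(encoded, key=signed_key)
signed_max = max(encoded, key=signed_key)
unsigned_min = min(encoded, key=unsigned_key)
unsigned_max = max(encoded, key=unsigned_key)

print("signed   min/max:", signed_min.decode("utf-8"), signed_max.decode("utf-8"))
print("unsigned min/max:", unsigned_min.decode("utf-8"), unsigned_max.decode("utf-8"))
```

With signed comparison the recorded maximum is "z", so a reader that trusts it could skip a row group that actually contains "é", which sorts after "z" in unsigned order. That kind of risk would explain why min/max for string columns might be left out of the footer.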
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248235#comment-16248235 ]

Ivan Gozali commented on SPARK-13127:
-------------------------------------

Hello! I was looking to find more information on Parquet string comparison issues and eventually found my way here.

I was just curious to see if this issue has been resolved by upgrading Parquet to 1.8.2, since that's what the PR referenced here seems to suggest? Are there still any benefits to be gained by upgrading to Parquet 1.9.0?
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749118#comment-15749118 ]

Apache Spark commented on SPARK-13127:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16281
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136236#comment-15136236 ]

Sean Owen commented on SPARK-13127:
-----------------------------------

[~JustinPihony] I suspect this is a good idea, but whenever someone suggests a dependency upgrade, the question is of course: Are there incompatible changes? Is it compatible with other dependencies? Does it work with all transitive dependencies?

Would you mind opening a PR with the change, which will entail running the dependency update scripts to check and declare the changed transitive dependencies, and then also reviewing the release notes to identify any breaking changes we should know about? For 2.0.0 we can tolerate most incompatibilities, but it is good to know.

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
> ----------------------------------------------
>
>                 Key: SPARK-13127
>                 URL: https://issues.apache.org/jira/browse/SPARK-13127
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Justin Pihony
>            Priority: Minor
>
> Currently, when you write a sorted DataFrame to Parquet and then read the
> data back, it is not sorted by default. [This is due to a bug in
> Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in
> 1.9.
> A workaround is to read the file back in using a file glob (filepath/*).