[jira] [Updated] (SPARK-15420) Repartition and sort before Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated SPARK-15420:
------------------------------
    Target Version/s:   (was: 2.4.0)

> Repartition and sort before Parquet writes
> ------------------------------------------
>
>                 Key: SPARK-15420
>                 URL: https://issues.apache.org/jira/browse/SPARK-15420
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Ryan Blue
>            Priority: Major
>
> Parquet requires buffering data in memory before writing a group of rows
> organized by column. This causes significant memory pressure when writing
> partitioned output, because each open file must buffer rows.
> Currently, Spark sorts data (spilling if necessary) in the
> {{WriterContainer}} to avoid keeping many files open at once. But this isn't
> a full solution, for a few reasons:
> * The final sort is always performed, even if the incoming data is already
> sorted correctly. For example, a global sort causes two sorts to happen, even
> when the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually add a
> repartition step. That step is also ignored by the sort within the writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}.
> The sort in {{WriterContainer}} makes sense to prevent problems, but it
> should detect whether the incoming data is already sorted. The
> {{DataFrameWriter}} should also expose the ability to repartition data before
> the write stage, and the query planner should expose an option to
> automatically insert repartition operations.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
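The manual workaround mentioned in the second bullet can be sketched as follows. This is a hypothetical example, not code from the issue: the DataFrame `df`, the partition column `dt`, and the input/output paths are all illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
import spark.implicits._

// Hypothetical input with an event-date partition column "dt".
val df = spark.read.parquet("/path/to/input")

// Repartition by the output partition column so all rows for a given
// partition land in the same task (avoiding many small files), then sort
// within partitions so each task writes one open Parquet file at a time
// instead of buffering row groups for every partition simultaneously.
df.repartition($"dt")
  .sortWithinPartitions($"dt")
  .write
  .partitionBy("dt")
  .parquet("/path/to/output")
```

As the description notes, the writer's own sort does not know about this user-added repartition step, which is part of the motivation for exposing repartitioning directly on {{DataFrameWriter}}.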
Earlier updates to the same issue (issue description identical to the above):

Sameer Agarwal updated SPARK-15420:
-----------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)
Michael Armbrust updated SPARK-15420:
-------------------------------------
    Target Version/s: 2.3.0  (was: 2.2.0)
Reynold Xin updated SPARK-15420:
--------------------------------
    Target Version/s: 2.2.0  (was: 2.1.0)
Reynold Xin updated SPARK-15420:
--------------------------------
    Target Version/s: 2.1.0