[jira] [Created] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation
Liwen Sun created SPARK-32776:
-------------------------------------

             Summary: Limit in streaming should not be optimized away by PropagateEmptyRelation
                 Key: SPARK-32776
                 URL: https://issues.apache.org/jira/browse/SPARK-32776
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.1.0
            Reporter: Liwen Sun


Right now, the limit operator in a streaming query may get optimized away when the relation is empty. This is problematic for stateful streaming: the empty batch writes no state store files, so the next batch fails with a file-not-found error when it tries to read them. We should not let PropagateEmptyRelation optimize away the Limit operator for streaming queries.

This ticket is intended as a small and safe fix to PropagateEmptyRelation. A fundamental fix that would prevent this from happening again, here and in other optimizer rules, is more desirable, but that is a much larger task.
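For concreteness, here is a minimal repro sketch of the failure mode described above. It is not taken from the ticket; the rate source, memory sink, and checkpoint path are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-limit-repro")
  .master("local[2]")
  .getOrCreate()

val stream = spark.readStream
  .format("rate")                 // built-in testing source
  .option("rowsPerSecond", "1")
  .load()

// A global limit on a stream is stateful: each micro-batch must write
// state store files that the next batch reads back.
stream.limit(10)
  .writeStream
  .format("memory")
  .queryName("limited")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/limit-repro-ckpt") // illustrative path
  .start()

// If PropagateEmptyRelation removes the Limit when a micro-batch is empty,
// that batch skips the state store write, and the following batch fails
// with a file-not-found error while loading its state.
{code}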
[jira] [Created] (SPARK-27938) Turn on LEGACY_PASS_PARTITION_BY_AS_OPTIONS by default
Liwen Sun created SPARK-27938:
-------------------------------------

             Summary: Turn on LEGACY_PASS_PARTITION_BY_AS_OPTIONS by default
                 Key: SPARK-27938
                 URL: https://issues.apache.org/jira/browse/SPARK-27938
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Liwen Sun


In SPARK-27453, we added the config {{LEGACY_PASS_PARTITION_BY_AS_OPTIONS}} for the 2.4.3 patch release, keeping its default as false so the change is not intrusive. We can turn it on by default for Spark 3.0.
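For reference, a hedged sketch of flipping the flag at runtime. The string key below is my reading of {{LEGACY_PASS_PARTITION_BY_AS_OPTIONS}} in {{SQLConf}} and should be verified against the Spark build in use:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()

// Assumed key for LEGACY_PASS_PARTITION_BY_AS_OPTIONS; verify in SQLConf.
spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")

// With the flag on, the partitionBy columns of DataFrameWriter are also
// forwarded to the underlying data source as write options.
spark.range(10)
  .selectExpr("id", "id % 2 AS bucket")
  .write
  .partitionBy("bucket")
  .format("parquet")
  .save("/tmp/partitioned-out")   // illustrative path
{code}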
[jira] [Created] (SPARK-25713) Implement copy() for ColumnarArray
Liwen Sun created SPARK-25713:
-------------------------------------

             Summary: Implement copy() for ColumnarArray
                 Key: SPARK-25713
                 URL: https://issues.apache.org/jira/browse/SPARK-25713
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Liwen Sun
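The ticket carries no description. As a purely hypothetical illustration of the shape such a method takes (not the actual Spark patch), {{ArrayData}} implementations define {{copy()}} to detach from backing storage that Spark may reuse across rows:

{code:scala}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}

// Hypothetical helper for an array of ints: materializing the values yields
// a copy that no longer references the underlying column vector, which may
// be overwritten when the next row is read.
def copyIntArray(arr: ArrayData): ArrayData =
  new GenericArrayData(arr.toIntArray())
{code}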
[jira] [Created] (SPARK-4502) Spark SQL unnecessarily reads the entire nested column from Parquet
Liwen Sun created SPARK-4502:
-------------------------------------

             Summary: Spark SQL unnecessarily reads the entire nested column from Parquet
                 Key: SPARK-4502
                 URL: https://issues.apache.org/jira/browse/SPARK-4502
             Project: Spark
          Issue Type: Improvement
    Affects Versions: 1.1.0
            Reporter: Liwen Sun


When reading a single field of a nested column from Parquet, Spark SQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column, and it may degrade performance significantly when the nested column has many fields.

For example, I loaded JSON tweets data into Spark SQL and ran the following query:

{{SELECT User.contributors_enabled FROM Tweets;}}

User is a nested structure that has 38 primitive fields (for the Tweets schema, see: https://dev.twitter.com/overview/api/tweets). Here is the log message:

{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}

For comparison, I also ran:

{{SELECT User FROM Tweets;}}

And here is the log message:

{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}

So both queries load 38 columns from Parquet, while the first query only needs 1 column.
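To illustrate the fine-grained field reads that Parquet supports, a reader can be handed a requested projection schema covering only the needed nested field. A hedged sketch using parquet-mr's {{parquet.read.schema}} read-support property; the schema text is illustrative:

{code:scala}
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Ask Parquet to assemble only User.contributors_enabled instead of the
// whole 38-field User struct.
conf.set("parquet.read.schema",
  """message tweets {
    |  optional group User {
    |    optional boolean contributors_enabled;
    |  }
    |}""".stripMargin)
{code}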
[jira] [Updated] (SPARK-4502) Spark SQL unnecessarily reads the entire nested column from Parquet
     [ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liwen Sun updated SPARK-4502:
-----------------------------
    Component/s: SQL

> Spark SQL unnecessarily reads the entire nested column from Parquet
> --------------------------------------------------------------------
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun
[jira] [Updated] (SPARK-4502) Spark SQL unnecessarily reads the entire nested column from Parquet
     [ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liwen Sun updated SPARK-4502:
-----------------------------
    Description: 
        When reading a single field of a nested column from Parquet, Spark SQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column, and it may degrade performance significantly when the nested column has many fields.

        For example, I loaded JSON tweets data into Spark SQL and ran the following query:

        {{SELECT User.contributors_enabled FROM Tweets;}}

        User is a nested structure that has 38 primitive fields (for the Tweets schema, see: https://dev.twitter.com/overview/api/tweets). Here is the log message:

        {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}

        For comparison, I also ran:

        {{SELECT User FROM Tweets;}}

        And here is the log message:

        {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}

        So both queries load 38 columns from Parquet, while the first query only needs 1 column.

        I also measured the bytes read within Parquet: in these two cases, the same number of bytes (99365194) were read.

> Spark SQL unnecessarily reads the entire nested column from Parquet
> --------------------------------------------------------------------
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun
[jira] [Updated] (SPARK-4502) Spark SQL reads unnecessary fields from Parquet
     [ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liwen Sun updated SPARK-4502:
-----------------------------
    Summary: Spark SQL reads unnecessary fields from Parquet  (was: Spark SQL unnecessarily reads the entire nested column from Parquet)

> Spark SQL reads unnecessary fields from Parquet
> ------------------------------------------------
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun
[jira] [Updated] (SPARK-4502) Spark SQL reads unnecessary fields from Parquet
     [ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liwen Sun updated SPARK-4502:
-----------------------------
    Description: 
        When reading a single field of a nested column from Parquet, Spark SQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column, and it may degrade performance significantly when the nested column has many fields.

        For example, I loaded JSON tweets data into Spark SQL and ran the following query:

        {{SELECT User.contributors_enabled FROM Tweets;}}

        User is a nested structure that has 38 primitive fields (for the Tweets schema, see: https://dev.twitter.com/overview/api/tweets). Here is the log message:

        {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}

        For comparison, I also ran:

        {{SELECT User FROM Tweets;}}

        And here is the log message:

        {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}

        So both queries load 38 columns from Parquet, while the first query only needs 1 column.

        I also measured the bytes read within Parquet: in these two cases, the same number of bytes (99365194) were read.

> Spark SQL reads unnecessary fields from Parquet
> ------------------------------------------------
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun