[jira] [Created] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-01 Thread Liwen Sun (Jira)
Liwen Sun created SPARK-32776:
-

 Summary: Limit in streaming should not be optimized away by 
PropagateEmptyRelation
 Key: SPARK-32776
 URL: https://issues.apache.org/jira/browse/SPARK-32776
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Liwen Sun


Right now, the limit operator in a streaming query may get optimized away when 
the relation is empty. This is problematic for stateful streaming: the empty 
batch will not write any state store files, and the next batch will then fail 
with a file-not-found error when it tries to read them.

We should not let PropagateEmptyRelation optimize away the Limit operator for 
streaming queries.

This ticket is intended as a small, safe fix to PropagateEmptyRelation. A 
fundamental fix that prevents this class of problem in other optimizer rules 
as well would be more desirable, but that is a much larger task.
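
For illustration, here is a minimal sketch of such a guard, written against 
Catalyst's optimizer API. This is not the actual Spark rule (the object name 
and the exact pattern are assumptions); it only shows the shape of the fix:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{Limit, LocalRelation, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch, not the committed patch: fold a Limit over an empty
// relation only for batch plans. For a streaming plan the (stateful) Limit
// must survive, so that even an empty micro-batch commits the state store
// files the next batch expects to read.
object StreamingSafePropagateEmptyRelation extends Rule[LogicalPlan] {
  private def isEmptyRelation(plan: LogicalPlan): Boolean = plan match {
    case LocalRelation(_, data, _) => data.isEmpty
    case _ => false
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // The !limit.isStreaming guard is the essence of the proposed fix.
    case limit @ Limit(_, child) if isEmptyRelation(child) && !limit.isStreaming =>
      LocalRelation(limit.output)
  }
}
{code}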

 






[jira] [Created] (SPARK-27938) Turn on LEGACY_PASS_PARTITION_BY_AS_OPTIONS by default

2019-06-03 Thread Liwen Sun (JIRA)
Liwen Sun created SPARK-27938:
-

 Summary: Turn on LEGACY_PASS_PARTITION_BY_AS_OPTIONS by default
 Key: SPARK-27938
 URL: https://issues.apache.org/jira/browse/SPARK-27938
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liwen Sun


In SPARK-27453, we added the config {{LEGACY_PASS_PARTITION_BY_AS_OPTIONS}} for 
patch release 2.4.3, keeping its default as false so the change is not 
intrusive. We can turn it on by default for Spark 3.0.
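
For reference, a hedged example of what this means for user code, assuming 
{{spark.sql.legacy.sources.write.passPartitionByAsOptions}} is the key behind 
this config entry (verify against your release):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark27938-demo").getOrCreate()

// On 2.4.x the flag must be enabled explicitly; this ticket proposes making
// true the default in 3.0, so this line would become unnecessary.
spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")

// With the flag on, the partitionBy columns are also forwarded to the
// underlying V1 data source as options (the behavior SPARK-27453 restored).
val df = spark.range(10).selectExpr("id", "id % 2 AS bucket")
df.write.format("json").partitionBy("bucket").save("/tmp/spark27938_demo")
{code}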






[jira] [Created] (SPARK-25713) Implement copy() for ColumnarArray

2018-10-11 Thread Liwen Sun (JIRA)
Liwen Sun created SPARK-25713:
-

 Summary: Implement copy() for ColumnarArray
 Key: SPARK-25713
 URL: https://issues.apache.org/jira/browse/SPARK-25713
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Liwen Sun
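
The ticket carries no description. A plausible reading of the summary is that 
{{ColumnarArray.copy()}} should materialize the wrapped column-vector slice 
into an independent {{ArrayData}} so the copy survives vector reuse; the 
sketch below is a hypothetical illustration of that idea, not the committed 
implementation:

{code:scala}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.vectorized.ColumnarArray

// Copy each element out of the backing ColumnVector into a GenericArrayData,
// so the result no longer aliases memory that may be overwritten when the
// vector is reused for the next batch.
def copyColumnarArray(arr: ColumnarArray, elementType: DataType): ArrayData = {
  val values = new Array[Any](arr.numElements())
  var i = 0
  while (i < arr.numElements()) {
    values(i) = if (arr.isNullAt(i)) null else arr.get(i, elementType)
    i += 1
  }
  new GenericArrayData(values)
}
{code}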









[jira] [Created] (SPARK-4502) Spark SQL unnecessarily reads the entire nested column from Parquet

2014-11-19 Thread Liwen Sun (JIRA)
Liwen Sun created SPARK-4502:


 Summary: Spark SQL unnecessarily reads the entire nested column 
from Parquet
 Key: SPARK-4502
 URL: https://issues.apache.org/jira/browse/SPARK-4502
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Liwen Sun


When reading a field of a nested column from Parquet, Spark SQL reads and 
assembles all the fields of that nested column. This is unnecessary, as Parquet 
supports fine-grained field reads out of a nested column. This can degrade 
performance significantly when a nested column has many fields. 

For example, I loaded JSON tweets data into Spark SQL and ran the following 
query:

{{SELECT User.contributors_enabled FROM Tweets;}}

User is a nested structure that has 38 primitive fields (for the Tweets schema, 
see: https://dev.twitter.com/overview/api/tweets). Here is the log message:

{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}

For comparison, I also ran:
{{SELECT User FROM Tweets;}}

And here is the log message:
{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}

So both queries load 38 columns from Parquet, while the first query needs only 
1 column. 
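
For context, Parquet itself can assemble a single leaf out of a nested group 
when handed a pruned requested schema; the sketch below uses parquet-mr 
directly (with current org.apache.parquet coordinates, and field names that 
mirror the Tweets example above) to show the read path Spark SQL is not yet 
exploiting:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.ReadSupport
import org.apache.parquet.schema.MessageTypeParser

// A requested projection that keeps only the one leaf the query selects.
val prunedSchema = MessageTypeParser.parseMessageType(
  """message Tweet {
    |  optional group User {
    |    optional boolean contributors_enabled;
    |  }
    |}""".stripMargin)

// Handing this schema to Parquet's read support makes the reader assemble a
// single column instead of all 38 leaves of User.
val conf = new Configuration()
conf.set(ReadSupport.PARQUET_READ_SCHEMA, prunedSchema.toString)
{code}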







[jira] [Updated] (SPARK-4502) Spark SQL unnecessarily reads the entire nested column from Parquet

2014-11-19 Thread Liwen Sun (JIRA)


Liwen Sun updated SPARK-4502:
-
Component/s: SQL







[jira] [Updated] (SPARK-4502) Spark SQL unnecessarily reads the entire nested column from Parquet

2014-11-19 Thread Liwen Sun (JIRA)


Liwen Sun updated SPARK-4502:
-
Description: 
When reading a field of a nested column from Parquet, Spark SQL reads and 
assembles all the fields of that nested column. This is unnecessary, as Parquet 
supports fine-grained field reads out of a nested column. This can degrade 
performance significantly when a nested column has many fields. 

For example, I loaded JSON tweets data into Spark SQL and ran the following 
query:

{{SELECT User.contributors_enabled FROM Tweets;}}

User is a nested structure that has 38 primitive fields (for the Tweets schema, 
see: https://dev.twitter.com/overview/api/tweets). Here is the log message:

{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}

For comparison, I also ran:
{{SELECT User FROM Tweets;}}

And here is the log message:
{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}

So both queries load 38 columns from Parquet, while the first query needs only 
1 column. I also measured the bytes read within Parquet; in both cases, the 
same number of bytes (99365194 bytes) was read. 

  was:
When reading a field of a nested column from Parquet, SparkSQL reads and 
assemble all the fields of that nested column. This is unnecessary, as Parquet 
supports fine-grained field reads out of a nested column. This may degrades the 
performance significantly when a nested column has many fields. 

For example, I loaded json tweets data into SparkSQL and ran the following 
query:

{{SELECT User.contributors_enabled from Tweets;}}

User is a nested structure that has 38 primitive fields (for Tweets schema, 
see: https://dev.twitter.com/overview/api/tweets), here is the log message:

{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}

For comparison, I also ran:
{{SELECT User FROM Tweets;}}

And here is the log message:
{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}

So both queries load 38 columns from Parquet, while the first query only need 1 
column. 









[jira] [Updated] (SPARK-4502) Spark SQL reads unnecessary fields from Parquet

2014-11-19 Thread Liwen Sun (JIRA)


Liwen Sun updated SPARK-4502:
-
Summary: Spark SQL reads unnecessary fields from Parquet  (was: Spark SQL 
unnecessarily reads the entire nested column from Parquet)







[jira] [Updated] (SPARK-4502) Spark SQL reads unnecessary fields from Parquet

2014-11-19 Thread Liwen Sun (JIRA)


Liwen Sun updated SPARK-4502:
-
Description: 
When reading a field of a nested column from Parquet, Spark SQL reads and 
assembles all the fields of that nested column. This is unnecessary, as Parquet 
supports fine-grained field reads out of a nested column. This can degrade 
performance significantly when a nested column has many fields. 

For example, I loaded JSON tweets data into Spark SQL and ran the following 
query:

{{SELECT User.contributors_enabled FROM Tweets;}}

User is a nested structure that has 38 primitive fields (for the Tweets schema, 
see: https://dev.twitter.com/overview/api/tweets). Here is the log message:

{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}

For comparison, I also ran:
{{SELECT User FROM Tweets;}}

And here is the log message:
{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}

So both queries load 38 columns from Parquet, while the first query needs only 
1 column. I also measured the bytes read within Parquet; in both cases, the 
same number of bytes (99365194 bytes) was read. 

  was:
When reading a field of a nested column from Parquet, SparkSQL reads and 
assemble all the fields of that nested column. This is unnecessary, as Parquet 
supports fine-grained field reads out of a nested column. This may degrades the 
performance significantly when a nested column has many fields. 

For example, I loaded json tweets data into SparkSQL and ran the following 
query:

{{SELECT User.contributors_enabled from Tweets;}}

User is a nested structure that has 38 primitive fields (for Tweets schema, 
see: https://dev.twitter.com/overview/api/tweets), here is the log message:

{{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}

For comparison, I also ran:
{{SELECT User FROM Tweets;}}

And here is the log message:
{{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}

So both queries load 38 columns from Parquet, while the first query only need 1 
column. I also measured the bytes read within Parquet. In these two cases, the 
same number of bytes (99365194 bytes) were read. 




