[jira] [Comment Edited] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude

2021-02-14 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284588#comment-17284588
 ] 

L. C. Hsieh edited comment on SPARK-34295 at 2/15/21, 7:39 AM:
---

To prevent further questions about the assignee: I have the changed code ready 
locally, but I don't have an environment to test it in. I'll let our customer test 
it internally. Once I get confirmation, I will submit the PR.


was (Author: viirya):
To prevent further questions about the assignee: I have the change ready 
locally, but I don't have an environment to test it in. I'll let our customer test 
it internally. Once I get confirmation, I will submit the PR.

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
> 
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from 
> certain hosts by specifying the hosts with configuration 
> mapreduce.job.hdfs-servers.token-renewal.exclude=<host1>,<host2>,...
> But Spark seems to lack a similar option, so job submission fails if YARN 
> fails to renew the DelegationToken for any of the remote HDFS clusters. The 
> DT renewal failure can happen for many reasons, e.g. the remote HDFS does not 
> trust the Kerberos identity of YARN.
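
A minimal sketch of what this could look like on the Spark side, assuming an 
analogous exclusion option existed. The exclusion option name below is only 
illustrative of what this issue proposes (it was not available when this thread 
was written); spark.kerberos.access.hadoopFileSystems is the existing option for 
obtaining tokens from remote clusters.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: obtain delegation tokens for a remote HDFS cluster, but exclude it
// from YARN's token renewal - mirroring MapReduce's
// mapreduce.job.hdfs-servers.token-renewal.exclude.
val spark = SparkSession.builder()
  .appName("multi-cluster-job")
  // Tokens are still obtained for the remote namenode (existing option)...
  .config("spark.kerberos.access.hadoopFileSystems", "hdfs://remote-nn:8020")
  // ...but YARN would be asked not to renew them (illustrative option name).
  .config("spark.yarn.kerberos.renewal.excludeHadoopFileSystems",
    "hdfs://remote-nn:8020")
  .getOrCreate()
{code}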



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude

2021-02-14 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284588#comment-17284588
 ] 

L. C. Hsieh commented on SPARK-34295:
-

To prevent further questions about the assignee: I have the change ready 
locally, but I don't have an environment to test it in. I'll let our customer test 
it internally. Once I get confirmation, I will submit the PR.

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
> 
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from 
> certain hosts by specifying the hosts with configuration 
> mapreduce.job.hdfs-servers.token-renewal.exclude=<host1>,<host2>,...
> But Spark seems to lack a similar option, so job submission fails if YARN 
> fails to renew the DelegationToken for any of the remote HDFS clusters. The 
> DT renewal failure can happen for many reasons, e.g. the remote HDFS does not 
> trust the Kerberos identity of YARN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34438) Python Driver is not correctly detected using presigned URLs

2021-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34438:


Assignee: (was: Apache Spark)

> Python Driver is not correctly detected using presigned URLs
> 
>
> Key: SPARK-34438
> URL: https://issues.apache.org/jira/browse/SPARK-34438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0
>Reporter: Julian Fleischer
>Priority: Minor
>
> In AWS one can generate so-called presigned URLs. spark-submit accepts URLs 
> for the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned 
> URL, however, carries a query string, e.g. 
> {{http://my-web-server/driver.py?signature}}.
> The check for whether the given URL is a Python driver simply tests whether 
> it ends in {{.py}} – which the presigned URL does not, as it ends in 
> {{signature}}.
> The relevant check is in {{SparkSubmit.scala}}, Line 1051 (commit tagged 
> {{v3.0.1}}):
> [https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]
>  
> Here is a more realistic example URL:
> {{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}
> A fix could be to parse the given path as a {{java.net.URI}} and check that 
> its path component ends in {{.py}} (as opposed to the whole string).
> To circumvent this issue I am currently appending a fragment that makes the 
> URL end in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}}, 
> which does work.
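
A minimal sketch of the fix proposed above, assuming a helper similar in spirit 
to the check in SparkSubmit.scala (the helper name is hypothetical, not the 
actual method in the PR):

{code:scala}
import java.net.URI

// Decide whether a primary resource is a Python driver by looking only at the
// URI's path component, so a presigned URL such as
// "https://host/driver.py?X-Amz-Signature=..." is still recognised.
def isPythonResource(resource: String): Boolean = {
  // Opaque or unparsable inputs fall back to the raw string.
  val path = Option(new URI(resource).getPath).getOrElse(resource)
  path.endsWith(".py")
}

// isPythonResource("http://my-web-server/driver.py")           // true
// isPythonResource("http://my-web-server/driver.py?signature") // true
// isPythonResource("http://my-web-server/driver.jar")          // false
{code}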



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34438) Python Driver is not correctly detected using presigned URLs

2021-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284535#comment-17284535
 ] 

Apache Spark commented on SPARK-34438:
--

User 'scravy' has created a pull request for this issue:
https://github.com/apache/spark/pull/31565

> Python Driver is not correctly detected using presigned URLs
> 
>
> Key: SPARK-34438
> URL: https://issues.apache.org/jira/browse/SPARK-34438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0
>Reporter: Julian Fleischer
>Priority: Minor
>
> In AWS one can generate so-called presigned URLs. spark-submit accepts URLs 
> for the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned 
> URL, however, carries a query string, e.g. 
> {{http://my-web-server/driver.py?signature}}.
> The check for whether the given URL is a Python driver simply tests whether 
> it ends in {{.py}} – which the presigned URL does not, as it ends in 
> {{signature}}.
> The relevant check is in {{SparkSubmit.scala}}, Line 1051 (commit tagged 
> {{v3.0.1}}):
> [https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]
>  
> Here is a more realistic example URL:
> {{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}
> A fix could be to parse the given path as a {{java.net.URI}} and check that 
> its path component ends in {{.py}} (as opposed to the whole string).
> To circumvent this issue I am currently appending a fragment that makes the 
> URL end in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}}, 
> which does work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34438) Python Driver is not correctly detected using presigned URLs

2021-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34438:


Assignee: Apache Spark

> Python Driver is not correctly detected using presigned URLs
> 
>
> Key: SPARK-34438
> URL: https://issues.apache.org/jira/browse/SPARK-34438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0
>Reporter: Julian Fleischer
>Assignee: Apache Spark
>Priority: Minor
>
> In AWS one can generate so-called presigned URLs. spark-submit accepts URLs 
> for the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned 
> URL, however, carries a query string, e.g. 
> {{http://my-web-server/driver.py?signature}}.
> The check for whether the given URL is a Python driver simply tests whether 
> it ends in {{.py}} – which the presigned URL does not, as it ends in 
> {{signature}}.
> The relevant check is in {{SparkSubmit.scala}}, Line 1051 (commit tagged 
> {{v3.0.1}}):
> [https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]
>  
> Here is a more realistic example URL:
> {{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}
> A fix could be to parse the given path as a {{java.net.URI}} and check that 
> its path component ends in {{.py}} (as opposed to the whole string).
> To circumvent this issue I am currently appending a fragment that makes the 
> URL end in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}}, 
> which does work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34438) Python Driver is not correctly detected using presigned URLs

2021-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284533#comment-17284533
 ] 

Apache Spark commented on SPARK-34438:
--

User 'scravy' has created a pull request for this issue:
https://github.com/apache/spark/pull/31565

> Python Driver is not correctly detected using presigned URLs
> 
>
> Key: SPARK-34438
> URL: https://issues.apache.org/jira/browse/SPARK-34438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0
>Reporter: Julian Fleischer
>Priority: Minor
>
> In AWS one can generate so-called presigned URLs. spark-submit accepts URLs 
> for the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned 
> URL, however, carries a query string, e.g. 
> {{http://my-web-server/driver.py?signature}}.
> The check for whether the given URL is a Python driver simply tests whether 
> it ends in {{.py}} – which the presigned URL does not, as it ends in 
> {{signature}}.
> The relevant check is in {{SparkSubmit.scala}}, Line 1051 (commit tagged 
> {{v3.0.1}}):
> [https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]
>  
> Here is a more realistic example URL:
> {{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}
> A fix could be to parse the given path as a {{java.net.URI}} and check that 
> its path component ends in {{.py}} (as opposed to the whole string).
> To circumvent this issue I am currently appending a fragment that makes the 
> URL end in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}}, 
> which does work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34438) Python Driver is not correctly detected using presigned URLs

2021-02-14 Thread Julian Fleischer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284531#comment-17284531
 ] 

Julian Fleischer commented on SPARK-34438:
--

I am proposing a patch here: https://github.com/apache/spark/pull/31565

> Python Driver is not correctly detected using presigned URLs
> 
>
> Key: SPARK-34438
> URL: https://issues.apache.org/jira/browse/SPARK-34438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0
>Reporter: Julian Fleischer
>Priority: Minor
>
> In AWS one can generate so-called presigned URLs. spark-submit accepts URLs 
> for the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned 
> URL, however, carries a query string, e.g. 
> {{http://my-web-server/driver.py?signature}}.
> The check for whether the given URL is a Python driver simply tests whether 
> it ends in {{.py}} – which the presigned URL does not, as it ends in 
> {{signature}}.
> The relevant check is in {{SparkSubmit.scala}}, Line 1051 (commit tagged 
> {{v3.0.1}}):
> [https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]
>  
> Here is a more realistic example URL:
> {{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}
> A fix could be to parse the given path as a {{java.net.URI}} and check that 
> its path component ends in {{.py}} (as opposed to the whole string).
> To circumvent this issue I am currently appending a fragment that makes the 
> URL end in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}}, 
> which does work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34438) Python Driver is not correctly detected using presigned URLs

2021-02-14 Thread Julian Fleischer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian Fleischer updated SPARK-34438:
-
Priority: Minor  (was: Major)

> Python Driver is not correctly detected using presigned URLs
> 
>
> Key: SPARK-34438
> URL: https://issues.apache.org/jira/browse/SPARK-34438
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0
>Reporter: Julian Fleischer
>Priority: Minor
>
> In AWS one can generate so-called presigned URLs. spark-submit accepts URLs 
> for the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned 
> URL, however, carries a query string, e.g. 
> {{http://my-web-server/driver.py?signature}}.
> The check for whether the given URL is a Python driver simply tests whether 
> it ends in {{.py}} – which the presigned URL does not, as it ends in 
> {{signature}}.
> The relevant check is in {{SparkSubmit.scala}}, Line 1051 (commit tagged 
> {{v3.0.1}}):
> [https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]
>  
> Here is a more realistic example URL:
> {{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}
> A fix could be to parse the given path as a {{java.net.URI}} and check that 
> its path component ends in {{.py}} (as opposed to the whole string).
> To circumvent this issue I am currently appending a fragment that makes the 
> URL end in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}}, 
> which does work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34438) Python Driver is not correctly detected using presigned URLs

2021-02-14 Thread Julian Fleischer (Jira)
Julian Fleischer created SPARK-34438:


 Summary: Python Driver is not correctly detected using presigned 
URLs
 Key: SPARK-34438
 URL: https://issues.apache.org/jira/browse/SPARK-34438
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 3.0.1, 3.0.0, 3.0.2, 3.1.0
Reporter: Julian Fleischer


In AWS one can generate so-called presigned URLs. spark-submit accepts URLs for 
the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned URL, 
however, carries a query string, e.g. {{http://my-web-server/driver.py?signature}}.

The check for whether the given URL is a Python driver simply tests whether it 
ends in {{.py}} – which the presigned URL does not, as it ends in 
{{signature}}.

The relevant check is in {{SparkSubmit.scala}}, Line 1051 (commit tagged 
{{v3.0.1}}):

[https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]
 

Here is a more realistic example URL:

{{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}

A fix could be to parse the given path as a {{java.net.URI}} and check that its 
path component ends in {{.py}} (as opposed to the whole string).

To circumvent this issue I am currently appending a fragment that makes the URL 
end in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}}, 
which does work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34434) Mention DS rebase options in SparkUpgradeException

2021-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34434:
-

Assignee: Maxim Gekk

> Mention DS rebase options in SparkUpgradeException 
> ---
>
> Key: SPARK-34434
> URL: https://issues.apache.org/jira/browse/SPARK-34434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Mention the DS options added by SPARK-34404 and SPARK-34377 in 
> SparkUpgradeException.
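
A minimal sketch of the kind of per-read rebase option the improved exception 
message could point users to. The option name and values are assumed from 
SPARK-34377/SPARK-34404 and are shown here only as an illustration:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rebase-example")
  .master("local[*]")
  .getOrCreate()

// Per-read datasource option for rebasing ancient datetime values, as an
// alternative to the session-wide SQL configs. "LEGACY", "CORRECTED" and
// "EXCEPTION" are the modes referred to in the related issues.
val df = spark.read
  .option("datetimeRebaseMode", "CORRECTED")
  .parquet("/data/written-by-spark-2.x")
{code}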



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34434) Mention DS rebase options in SparkUpgradeException

2021-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34434.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31562
[https://github.com/apache/spark/pull/31562]

> Mention DS rebase options in SparkUpgradeException 
> ---
>
> Key: SPARK-34434
> URL: https://issues.apache.org/jira/browse/SPARK-34434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Mention the DS options added by SPARK-34404 and SPARK-34377 in 
> SparkUpgradeException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34427) Session window support in SS

2021-02-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284507#comment-17284507
 ] 

Jungtaek Lim edited comment on SPARK-34427 at 2/15/21, 12:50 AM:
-

OK, I agree this is becoming a meaningless argument. I should have raised the 
discussion on the dev@ mailing list.
(EDIT: 
https://lists.apache.org/thread.html/r0802c6e8c5c4f51c0b781d137e6c62eb4e4105fbaea4d9743e8b6c51%40%3Cdev.spark.apache.org%3E)

Please don't get me wrong. My original concern is that you're trying to preempt 
two major efforts, each of which would take non-trivial time. There's no proof 
that there's ongoing work internally - you should have created a design doc or 
WIP PR if you had made meaningful progress internally, but you shared nothing, 
just assigned both issues to yourself and said "I'm working on both (or 
planning to work on both), so don't step on my toes". Sorry, but that's not 
something I can understand.

Again, I'm not "just" concerned about this because it conflicts with SPARK-10816. 
You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't 
ensure a design doc, perf test, etc. to keep the efforts on par. I just 
don't think you can take on multiple major efforts at once when none of them 
has even reached a (WIP) PR. I would have no argument if you just did these 
things one by one, leaving space for contributors to play with.
(Say, I'd have no concern if you let the RocksDB work be taken over by another 
contributor so you can focus on this one. Vice versa.)


was (Author: kabhwan):
OK, I agree this is becoming a meaningless argument. I should have raised the 
discussion on the dev@ mailing list.
(EDIT: 
https://lists.apache.org/thread.html/r0802c6e8c5c4f51c0b781d137e6c62eb4e4105fbaea4d9743e8b6c51%40%3Cdev.spark.apache.org%3E)

Please don't get me wrong. My original concern is that you're trying to preempt 
two major efforts, each of which would take non-trivial time. There's no proof 
that there's ongoing work internally - you should have created a design doc or 
WIP PR if you had made meaningful progress internally, but you shared nothing, 
just assigned both issues to yourself and said you're working on both. 
Sorry, but that's not something I can understand.

Again, I'm not "just" concerned about this because it conflicts with SPARK-10816. 
You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't 
ensure a design doc, perf test, etc. to keep the efforts on par. I just 
don't think you can take on multiple major efforts at once when none of them 
has even reached a (WIP) PR. I would have no argument if you just did these 
things one by one, leaving space for contributors to play with.
(Say, I'd have no concern if you let the RocksDB work be taken over by another 
contributor so you can focus on this one. Vice versa.)

> Session window support in SS
> 
>
> Key: SPARK-34427
> URL: https://issues.apache.org/jira/browse/SPARK-34427
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Structured Streaming supports two kinds of windows: tumbling 
> windows and sliding windows. Another useful window function is the session 
> window, which is not supported by SS. We have a user requirement for session 
> windows, and we'd like to have this support upstream.
> For some info about session windows, see: 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows.
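
A minimal sketch contrasting the window kinds mentioned above, assuming the 
built-in {{window}} function for tumbling and sliding windows; session windows 
have no built-in equivalent here (that is the gap this issue proposes to fill), 
so the last comment is only a note, not an API:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder()
  .appName("windows")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy streaming source with columns: timestamp, value.
val events = spark.readStream.format("rate").load()

// Supported today: tumbling window (fixed, non-overlapping 10-minute buckets).
val tumbling = events.groupBy(window($"timestamp", "10 minutes")).count()

// Supported today: sliding window (10-minute windows advancing every 5 minutes).
val sliding =
  events.groupBy(window($"timestamp", "10 minutes", "5 minutes")).count()

// A session window (gap-based, e.g. closing after 5 minutes of inactivity)
// would be the analogous grouping, but no such function exists at this point.
{code}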



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34427) Session window support in SS

2021-02-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284507#comment-17284507
 ] 

Jungtaek Lim edited comment on SPARK-34427 at 2/15/21, 12:49 AM:
-

OK, I agree this is becoming a meaningless argument. I should have raised the 
discussion on the dev@ mailing list.
(EDIT: 
https://lists.apache.org/thread.html/r0802c6e8c5c4f51c0b781d137e6c62eb4e4105fbaea4d9743e8b6c51%40%3Cdev.spark.apache.org%3E)

Please don't get me wrong. My original concern is that you're trying to preempt 
two major efforts, each of which would take non-trivial time. There's no proof 
that there's ongoing work internally - you should have created a design doc or 
WIP PR if you had made meaningful progress internally, but you shared nothing, 
just assigned both issues to yourself and said you're working on both. 
Sorry, but that's not something I can understand.

Again, I'm not "just" concerned about this because it conflicts with SPARK-10816. 
You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't 
ensure a design doc, perf test, etc. to keep the efforts on par. I just 
don't think you can take on multiple major efforts at once when none of them 
has even reached a (WIP) PR. I would have no argument if you just did these 
things one by one, leaving space for contributors to play with.
(Say, I'd have no concern if you let the RocksDB work be taken over by another 
contributor so you can focus on this one. Vice versa.)


was (Author: kabhwan):
OK, I agree this is becoming a meaningless argument. I should have raised the 
discussion on the dev@ mailing list. Will do.

Please don't get me wrong. My original concern is that you're trying to preempt 
two major efforts, each of which would take non-trivial time. There's no proof 
that there's ongoing work internally - you should have created a design doc or 
WIP PR if you had made meaningful progress internally, but you shared nothing, 
just assigned both issues to yourself and said you're working on both. 
Sorry, but that's not something I can understand.

Again, I'm not "just" concerned about this because it conflicts with SPARK-10816. 
You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't 
ensure a design doc, perf test, etc. to keep the efforts on par. I just 
don't think you can take on multiple major efforts at once when none of them 
has even reached a (WIP) PR. I would have no argument if you just did these 
things one by one, leaving space for contributors to play with.
(Say, I'd have no concern if you let the RocksDB work be taken over by another 
contributor so you can focus on this one. Vice versa.)

> Session window support in SS
> 
>
> Key: SPARK-34427
> URL: https://issues.apache.org/jira/browse/SPARK-34427
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Structured Streaming supports two kinds of windows: tumbling 
> windows and sliding windows. Another useful window function is the session 
> window, which is not supported by SS. We have a user requirement for session 
> windows, and we'd like to have this support upstream.
> For some info about session windows, see: 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34427) Session window support in SS

2021-02-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284507#comment-17284507
 ] 

Jungtaek Lim commented on SPARK-34427:
--

OK, I agree this is becoming a meaningless argument. I should have raised the 
discussion on the dev@ mailing list. Will do.

Please don't get me wrong. My original concern is that you're trying to preempt 
two major efforts, each of which would take non-trivial time. There's no proof 
that there's ongoing work internally - you should have created a design doc or 
WIP PR if you had made meaningful progress internally, but you shared nothing, 
just assigned both issues to yourself and said you're working on both. 
Sorry, but that's not something I can understand.

Again, I'm not "just" concerned about this because it conflicts with SPARK-10816. 
You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't 
ensure a design doc, perf test, etc. to keep the efforts on par. I just 
don't think you can take on multiple major efforts at once when none of them 
has even reached a (WIP) PR. I would have no argument if you just did these 
things one by one, leaving space for contributors to play with.
(Say, I'd have no concern if you let the RocksDB work be taken over by another 
contributor so you can focus on this one. Vice versa.)

> Session window support in SS
> 
>
> Key: SPARK-34427
> URL: https://issues.apache.org/jira/browse/SPARK-34427
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Structured Streaming supports two kinds of windows: tumbling 
> windows and sliding windows. Another useful window function is the session 
> window, which is not supported by SS. We have a user requirement for session 
> windows, and we'd like to have this support upstream.
> For some info about session windows, see: 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34427) Session window support in SS

2021-02-14 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284504#comment-17284504
 ] 

L. C. Hsieh commented on SPARK-34427:
-

Sigh... Did I ever say in my previous comments that I want to ignore 
SPARK-10816? Did I say I don't want to consider the existing effort? I just 
said (you can look at the previous comments, they are unchanged):

> From the code size, that (your) PR is much larger than the other. I'm not 
> sure if they are the same from a feature perspective. As the weekend comes, 
> I can take another look at the previous two PRs.
> From my side, I'd like to push this feature as we have a real use case and 
> requirement. But I'm not sure if we want to follow up on the previous PRs.

I was not aware of SPARK-10816 when I created this JIRA with an assignee. 
That's all. I don't know why this JIRA irritates you so much.

What I did is NOT that I created this SPARK-34427, then saw there is an 
existing SPARK-10816, then immediately assigned SPARK-34427 or SPARK-10816 to 
myself to occupy the issue and prevent others from working on it...

The assignee works like a placeholder to notify others that the issue is 
ongoing work or planned work. It is not strict, and as you did, it can easily 
be removed or changed. If I don't set it, other folks might think it is an 
open issue and put effort into working on it. That is what "not stepping on 
others' toes" means.

Once we figure out, through communication with all parties, the best way to 
implement the feature, we can definitely change the assignee.

I cannot accept your point that this assignee case is different. If I had 
assigned SPARK-10816 to myself, that would not be acceptable. But I just 
created a new JIRA, with an assignee, for work we plan to do. I don't know 
what is wrong with this usual practice. So sorry, but your point doesn't make 
sense to me. It is also not what I have seen over the past years, and now, in 
the Spark community.

I guess you are unhappy that I assigned this JIRA because you were working on 
it, and you think I am occupying it. But again, when I created this JIRA with 
an assignee, I did not know SPARK-10816 existed and that you had worked on it 
before. I don't mean to occupy the work you have done. Is that clear to you?

I don't really want to continue this argument. It is meaningless to me and 
wastes my weekend time. Let me be clear again:

I created this JIRA with an assignee because we plan to have this feature. 
Setting the assignee is to prevent others (especially contributors who are not 
familiar with the Spark community) from accidentally thinking it is open and 
putting their time into working on it.

We will respect existing efforts. I did not know there was an existing 
SPARK-10816. I need to take some time to look at the existing work (both are 
big changes). Note that there is more than one implementation even in 
SPARK-10816, and I don't see any cooperation between the two implementations. 
We can communicate among all parties involved and see what the best way is to 
deliver the feature.

I would like to focus on real work instead of arguing about this. If you are 
interested in continuing to push session windows, I think I need some time to 
look at the details of the design and code in SPARK-10816 and think about how 
to get the feature into the best shape.


> Session window support in SS
> 
>
> Key: SPARK-34427
> URL: https://issues.apache.org/jira/browse/SPARK-34427
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Structured Streaming supports two kinds of windows: tumbling 
> windows and sliding windows. Another useful window function is the session 
> window, which is not supported by SS. We have a user requirement for session 
> windows, and we'd like to have this support upstream.
> For some info about session windows, see: 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34416) Support avroSchemaUrl in addition to avroSchema

2021-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34416:
-

Assignee: Ohad Raviv

> Support avroSchemaUrl in addition to avroSchema
> ---
>
> Key: SPARK-34416
> URL: https://issues.apache.org/jira/browse/SPARK-34416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Ohad Raviv
>Assignee: Ohad Raviv
>Priority: Minor
> Fix For: 3.2.0
>
>
> We have a use case in which we read a huge table in Avro format, with about 
> 30k columns.
> Using the default Hive reader - `AvroGenericRecordReader` - it just hangs 
> forever; after 4 hours not even one task has finished.
> We tried instead to use 
> `spark.read.format("com.databricks.spark.avro").load(..)` but we failed on:
> ```
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data 
> schema
> ..
> at 
> org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
>  at 
> org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
>  ... 53 elided
> ```
>  
> because the files' schema contains duplicate column names (when compared 
> case-insensitively).
> So we wanted to provide a user schema with non-duplicated fields, but the 
> schema is huge - a few MBs - and it is not practical to provide it in JSON 
> format.
>  
> So we patched spark-avro to also accept `avroSchemaUrl` in addition to 
> `avroSchema`, and it worked perfectly.
>  
>  
>  
>  
>  
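
A minimal sketch of the difference between the existing inline option and the 
URL-based option described in this issue. The `avroSchema` option is the 
existing spark-avro option; `avroSchemaUrl` is the option this issue adds; 
paths are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-schema-url")
  .master("local[*]")
  .getOrCreate()

// Existing behaviour: the whole Avro schema must be passed inline as a JSON
// string - impractical when the schema is several MBs.
val inlineSchema = scala.io.Source.fromFile("/tmp/huge-schema.avsc").mkString
val dfInline = spark.read.format("avro")
  .option("avroSchema", inlineSchema)
  .load("/data/huge_avro_table")

// With the change described here, the schema can instead be referenced by URL
// (e.g. an HDFS path).
val dfByUrl = spark.read.format("avro")
  .option("avroSchemaUrl", "hdfs:///schemas/huge-schema.avsc")
  .load("/data/huge_avro_table")
{code}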



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34416) Support avroSchemaUrl in addition to avroSchema

2021-02-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34416.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31543
[https://github.com/apache/spark/pull/31543]

> Support avroSchemaUrl in addition to avroSchema
> ---
>
> Key: SPARK-34416
> URL: https://issues.apache.org/jira/browse/SPARK-34416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Ohad Raviv
>Priority: Minor
> Fix For: 3.2.0
>
>
> We have a use case in which we read a huge table in Avro format, with about 
> 30k columns.
> Using the default Hive reader - `AvroGenericRecordReader` - it just hangs 
> forever; after 4 hours not even one task has finished.
> We tried instead to use 
> `spark.read.format("com.databricks.spark.avro").load(..)` but we failed on:
> ```
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data 
> schema
> ..
> at 
> org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
>  at 
> org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
>  ... 53 elided
> ```
>  
> because the files' schema contains duplicate column names (when compared 
> case-insensitively).
> So we wanted to provide a user schema with non-duplicated fields, but the 
> schema is huge - a few MBs - and it is not practical to provide it in JSON 
> format.
>  
> So we patched spark-avro to also accept `avroSchemaUrl` in addition to 
> `avroSchema`, and it worked perfectly.
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34427) Session window support in SS

2021-02-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284496#comment-17284496
 ] 

Jungtaek Lim commented on SPARK-34427:
--

This assignee case is quite different from what I've seen committers doing, 
because these issues are not "new" (there have been existing efforts, just not 
at the right time) and the idea is quite well known, so many contributors can 
simply plan in parallel. E.g. in SPARK-34198 you'd realize one contributor at 
FB is also working on a solution in parallel.

I don't think we are happy with someone occupying a major feature without even 
providing a design doc or similar. No one knows about the plan - no one knows 
whether the effort has started or is even actually in a backlog. In parallel, 
someone may have made more progress. Stepping on others' toes has been normal 
in the Spark community, and setting an assignee never properly avoids it. It 
just makes an unfair competition between contributor and committer.

If you want to make the ownership of the major feature clear, then please 
prepare a SPIP and raise it on the dev@ mailing list. That ensures recognition 
that you're already making meaningful progress, and others can help with 
reviewing.
(Even in that case, if someone argues with another SPIP, then either 
collaboration or competition should happen. I don't think a committer can 
simply preempt.)

Also, I think we should try to find the JIRA issue which did the same or 
similar thing, and leverage it. There is a lot of information and history of 
effort we can leverage "even" if we take a different PR. Once you file a new 
JIRA issue and let the old one be ignored, those efforts are lost.

I don't think you can simply raise a PR for SPARK-34427 and ask for review, 
as from SPARK-10816 we found there are various ways to implement it, which 
requires a design doc to make sure the implementation considers these designs 
as well and picks the best one. The implementation should also run the 
performance test and ensure it's superior, or at least on par. That establishes 
the "minimum bar" for the effort. Before achieving that, consider my voice as 
-1 on the proposal. To make the comparison easier I think you should really 
continue your work in SPARK-10816, not here.

I'm happy to see that some other committer has finally found the feature 
necessary, but also unhappy that resurrecting the existing effort was not 
considered "at first", which would have saved a bunch of time for all of us. 
The existing effort wasn't discarded because of a technical issue; that said, 
the design and implementation are still valid. It just wasn't delivered at the 
right time.

> Session window support in SS
> 
>
> Key: SPARK-34427
> URL: https://issues.apache.org/jira/browse/SPARK-34427
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Structured Streaming supports two kinds of windows: tumbling 
> windows and sliding windows. Another useful window function is the session 
> window, which is not supported by SS. We have a user requirement for session 
> windows, and we'd like to have this support upstream.
> For some info about session windows, see: 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284491#comment-17284491
 ] 

Jungtaek Lim edited comment on SPARK-34198 at 2/14/21, 8:02 PM:


Thanks for considering it. I think it would be the best option for Apache Spark 
among these, if it makes sense to Databricks as well, simply because it has 
been in service for years with enterprise-level support. We can't expect that 
stability from the other options and may struggle with them for some period - 
it'd be best if we can avoid that.
(Worth noting that the second one may also come with enterprise-level support, 
but for less than a year, and I had 50+ review comments on the proposed PR and 
personally didn't feel the PR was super solid at that time. I mean, for me, the 
PR was not proposed with production-level quality at first.)


was (Author: kabhwan):
Thanks for considering it. I think it would be the best option for Apache Spark 
among these, if it makes sense to Databricks as well, simply because it has 
been in service for years with enterprise-level support. We can't expect that 
stability from the other options and may struggle with them for some period - 
it'd be best if we can avoid that.
(Worth noting that the second one may also come with enterprise-level support, 
but for less than a year, and I had 50+ review comments on the proposed PR and 
personally didn't feel the PR was super solid at that time. I mean, for me, the 
PR was not proposed with production quality at first.)

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require large 
> state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. To address the concern about adding RocksDB as a direct 
> dependency, our plan is to add this StateStore as an external module first.
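
A minimal sketch of how an external-module state store would plug in. 
Structured Streaming already lets a custom StateStoreProvider be selected via 
{{spark.sql.streaming.stateStore.providerClass}}; the RocksDB provider class 
name below is hypothetical, since the module this issue proposes did not exist 
yet:

{code:scala}
import org.apache.spark.sql.SparkSession

// Select a non-default StateStoreProvider by class name. The class below is a
// placeholder for the RocksDB-backed provider proposed as an external module.
val spark = SparkSession.builder()
  .appName("rocksdb-state-store")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
{code}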



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284491#comment-17284491
 ] 

Jungtaek Lim commented on SPARK-34198:
--

Thanks for considering it. I think it would be the best option for Apache Spark 
among these, if it makes sense to Databricks as well, simply because it has 
been in service for years with enterprise-level support. We can't expect that 
stability from the other options and may struggle with them for some period - 
it'd be best if we can avoid that.
(Worth noting that the second one may also come with enterprise-level support, 
but for less than a year, and I had 50+ review comments on the proposed PR and 
personally didn't feel the PR was super solid at that time. I mean, for me, the 
PR was not proposed with production quality at first.)

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require large 
> state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. To address the concern about adding RocksDB as a direct 
> dependency, our plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Enver Osmanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284489#comment-17284489
 ] 

Enver Osmanov commented on SPARK-34435:
---

[~ymajid], it is absolutely OK with me. If you have any questions, please let 
me know.

> ArrayIndexOutOfBoundsException when select in different case
> 
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: Enver Osmanov
>Priority: Trivial
>
> h5. Actual behavior:
> Selecting a column with a different case after remapping fails with an 
> ArrayIndexOutOfBoundsException.
> h5. Expected behavior:
> Spark shouldn't fail with an ArrayIndexOutOfBoundsException.
>  Spark is case-insensitive by default, so the select should return the 
> selected column.
> h5. Test case:
> {code:java}
> case class User(aA: String, bb: String)
> // ...
> val user = User("John", "Doe")
> val ds = Seq(user).toDS().map(identity)
> ds.select("aa").show(false)
> {code}
> h5. Additional notes:
> The test case is reproducible with Spark 3.0.1. There are no errors with 
> Spark 2.4.7.
> I believe the problem could be solved by changing the filter in 
> `SchemaPruning#pruneDataSchema` from this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
> {code}
> to this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => 
> dataSchemaFieldNames.contains(f.name.toLowerCase)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-14 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284479#comment-17284479
 ] 

Reynold Xin edited comment on SPARK-34198 at 2/14/21, 6:59 PM:
---

I don't know the intricate details of it, but I suspect it's a different one 
with many more features, because it existed long before those two.


was (Author: rxin):
I don't know the intricate details of it, but I suspect it's a different one 
because it existed long before those two.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require large 
> state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. To address the concern about adding RocksDB as a direct 
> dependency, our plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-14 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284479#comment-17284479
 ] 

Reynold Xin commented on SPARK-34198:
-

I don't know the intricate details of it, but I suspect it's a different one 
because it existed long before those two.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require large 
> state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. To address the concern about adding RocksDB as a direct 
> dependency, our plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-14 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284478#comment-17284478
 ] 

L. C. Hsieh commented on SPARK-34198:
-

Thanks [~rxin]. Is the implementation used in Databricks completely different 
from the other two implementations, or is it based on one of the two?

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require large 
> state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. To address the concern about adding RocksDB as a direct 
> dependency, our plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-14 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284474#comment-17284474
 ] 

Reynold Xin commented on SPARK-34198:
-

[~kabhwan] let me talk to the team that built our internal version of that 
about whether it'd make sense.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS only has one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require large 
> state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state usage. But 
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. To address the concern about adding RocksDB as a direct 
> dependency, our plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Yousif Majid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284434#comment-17284434
 ] 

Yousif Majid commented on SPARK-34435:
--

Hey [~Enverest], I would like to work on this if that's ok with you!

> ArrayIndexOutOfBoundsException when select in different case
> 
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: Enver Osmanov
>Priority: Trivial
>
> h5. Actual behavior:
> Selecting a column with a different case after remapping fails with an 
> ArrayIndexOutOfBoundsException.
> h5. Expected behavior:
> Spark shouldn't fail with an ArrayIndexOutOfBoundsException.
>  Spark is case-insensitive by default, so the select should return the 
> selected column.
> h5. Test case:
> {code:java}
> case class User(aA: String, bb: String)
> // ...
> val user = User("John", "Doe")
> val ds = Seq(user).toDS().map(identity)
> ds.select("aa").show(false)
> {code}
> h5. Additional notes:
> Test case is reproducible with Spark 3.0.1. There are no errors with Spark 
> 2.4.7.
> I believe the problem could be solved by changing the filter in 
> `SchemaPruning#pruneDataSchema` from this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
> {code}
> to this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => 
> dataSchemaFieldNames.contains(f.name.toLowerCase)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34432) add a java implementation for the simple writable data source

2021-02-14 Thread Kevin Pis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284424#comment-17284424
 ] 

Kevin Pis edited comment on SPARK-34432 at 2/14/21, 3:30 PM:
-

Hi [~cloud_fan]! Sorry to bother you, but could you help me review the 
following PR:

[https://github.com/apache/spark/pull/31560]


was (Author: kevinpis):
Hi [~cloud_fan]!   Sorry to bother you, but Could you help me to review the pr  
 https://github.com/apache/spark/pull/31560

> add a java implementation for the simple writable data source
> -
>
> Key: SPARK-34432
> URL: https://issues.apache.org/jira/browse/SPARK-34432
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.1
>Reporter: Kevin Pis
>Priority: Minor
>
> This is a follow-up of https://github.com/apache/spark/pull/19269
> In #19269, there is only a Scala implementation of the simple writable data 
> source in `DataSourceV2Suite`.
> This PR adds a Java implementation of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34432) add a java implementation for the simple writable data source

2021-02-14 Thread Kevin Pis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284424#comment-17284424
 ] 

Kevin Pis commented on SPARK-34432:
---

Hi [~cloud_fan]! Sorry to bother you, but could you help me review this PR: 
 https://github.com/apache/spark/pull/31560

> add a java implementation for the simple writable data source
> -
>
> Key: SPARK-34432
> URL: https://issues.apache.org/jira/browse/SPARK-34432
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.1
>Reporter: Kevin Pis
>Priority: Minor
>
> This is a follow-up of https://github.com/apache/spark/pull/19269
> In #19269, there is only a Scala implementation of the simple writable data 
> source in `DataSourceV2Suite`.
> This PR adds a Java implementation of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34437) Update Spark SQL guide about rebase DS options and SQL configs

2021-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34437:


Assignee: (was: Apache Spark)

> Update Spark SQL guide about rebase DS options and SQL configs
> --
>
> Key: SPARK-34437
> URL: https://issues.apache.org/jira/browse/SPARK-34437
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Describe the following SQL configs:
> * spark.sql.legacy.parquet.int96RebaseModeInWrite
> * spark.sql.legacy.parquet.datetimeRebaseModeInWrite
> * spark.sql.legacy.parquet.int96RebaseModeInRead
> * spark.sql.legacy.parquet.datetimeRebaseModeInRead
> * spark.sql.legacy.avro.datetimeRebaseModeInWrite
> * spark.sql.legacy.avro.datetimeRebaseModeInRead
> And Avro/Parquet options datetimeRebaseMode and int96RebaseMode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34437) Update Spark SQL guide about rebase DS options and SQL configs

2021-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284423#comment-17284423
 ] 

Apache Spark commented on SPARK-34437:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31564

> Update Spark SQL guide about rebase DS options and SQL configs
> --
>
> Key: SPARK-34437
> URL: https://issues.apache.org/jira/browse/SPARK-34437
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Describe the following SQL configs:
> * spark.sql.legacy.parquet.int96RebaseModeInWrite
> * spark.sql.legacy.parquet.datetimeRebaseModeInWrite
> * spark.sql.legacy.parquet.int96RebaseModeInRead
> * spark.sql.legacy.parquet.datetimeRebaseModeInRead
> * spark.sql.legacy.avro.datetimeRebaseModeInWrite
> * spark.sql.legacy.avro.datetimeRebaseModeInRead
> And Avro/Parquet options datetimeRebaseMode and int96RebaseMode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34437) Update Spark SQL guide about rebase DS options and SQL configs

2021-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34437:


Assignee: Apache Spark

> Update Spark SQL guide about rebase DS options and SQL configs
> --
>
> Key: SPARK-34437
> URL: https://issues.apache.org/jira/browse/SPARK-34437
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Describe the following SQL configs:
> * spark.sql.legacy.parquet.int96RebaseModeInWrite
> * spark.sql.legacy.parquet.datetimeRebaseModeInWrite
> * spark.sql.legacy.parquet.int96RebaseModeInRead
> * spark.sql.legacy.parquet.datetimeRebaseModeInRead
> * spark.sql.legacy.avro.datetimeRebaseModeInWrite
> * spark.sql.legacy.avro.datetimeRebaseModeInRead
> And Avro/Parquet options datetimeRebaseMode and int96RebaseMode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34432) add a java implementation for the simple writable data source

2021-02-14 Thread Kevin Pis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Pis updated SPARK-34432:
--
Affects Version/s: (was: 3.1.1)
   3.0.1

> add a java implementation for the simple writable data source
> -
>
> Key: SPARK-34432
> URL: https://issues.apache.org/jira/browse/SPARK-34432
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.1
>Reporter: Kevin Pis
>Priority: Minor
>
> This is a follow-up of https://github.com/apache/spark/pull/19269
> In #19269, there is only a Scala implementation of the simple writable data 
> source in `DataSourceV2Suite`.
> This PR adds a Java implementation of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34437) Update Spark SQL guide about rebase DS options and SQL configs

2021-02-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-34437:
--

 Summary: Update Spark SQL guide about rebase DS options and SQL 
configs
 Key: SPARK-34437
 URL: https://issues.apache.org/jira/browse/SPARK-34437
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Describe the following SQL configs:
* spark.sql.legacy.parquet.int96RebaseModeInWrite
* spark.sql.legacy.parquet.datetimeRebaseModeInWrite
* spark.sql.legacy.parquet.int96RebaseModeInRead
* spark.sql.legacy.parquet.datetimeRebaseModeInRead
* spark.sql.legacy.avro.datetimeRebaseModeInWrite
* spark.sql.legacy.avro.datetimeRebaseModeInRead

And Avro/Parquet options datetimeRebaseMode and int96RebaseMode.
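 
For the guide, a minimal sketch of how these knobs are typically set; the chosen mode 
and the file path below are illustrative assumptions only, not part of this ticket:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rebase-docs-sketch").getOrCreate()

// Session-wide SQL config (one of the keys listed above); accepted values are
// LEGACY, CORRECTED and EXCEPTION.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

// Per-read data source option mentioned above; "/tmp/old_dates.parquet" is a
// placeholder path used only for illustration.
val df = spark.read
  .option("datetimeRebaseMode", "CORRECTED")
  .parquet("/tmp/old_dates.parquet")
{code}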



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)

2021-02-14 Thread Abhay Dandekar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284416#comment-17284416
 ] 

Abhay Dandekar commented on SPARK-16745:


+1. Getting the same issue on standalone Spark 3.0.1.

The workaround is to pass a local address for the driver as follows:

$ ./bin/spark-shell --conf spark.driver.host=localhost

Can we please update the default option accordingly for standalone mode, especially 
when master == local?
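
Equivalently (a sketch assuming the same standalone/local setup), the property can be 
set programmatically instead of on the command line:

{code:java}
import org.apache.spark.sql.SparkSession

// Same workaround as the CLI flag above, applied from code.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("driver-host-workaround")
  .config("spark.driver.host", "localhost")
  .getOrCreate()
{code}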

> Spark job completed however have to wait for 13 mins (data size is small)
> -
>
> Key: SPARK-16745
> URL: https://issues.apache.org/jira/browse/SPARK-16745
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: Max OS X Yosemite, Terminal, MacBook Air Late 2014
>Reporter: Joe Chong
>Priority: Minor
>
> I submitted a job in the Scala spark-shell to show a DataFrame. The data size is 
> about 43K. The job was successful in the end, but took more than 13 minutes 
> to complete. Upon checking the log, there are multiple exceptions raised on 
> "Failed to check existence of class" with a java.net.ConnectException 
> message indicating a timeout trying to connect to port 52067, the REPL port 
> that Spark set up. Please assist in troubleshooting. Thanks. 
> Started Spark in standalone mode
> $ spark-shell --driver-memory 5g --master local[*]
> 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: 
> joechong
> 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(joechong); users 
> with modify permissions: Set(joechong)
> 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server
> 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:30 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:52067
> 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class 
> server' on port 52067.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: 
> joechong
> 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(joechong); users 
> with modify permissions: Set(joechong)
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' 
> on port 52072.
> 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/07/26 21:05:35 INFO Remoting: Starting remoting
> 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074]
> 16/07/26 21:05:35 INFO util.Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 52074.
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at 
> /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6
> 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity 
> 3.4 GB
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/07/26 21:05:36 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:4040
> 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on 
> port 4040.
> 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at 
> http://10.199.29.218:4040
> 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host 
> localhost
> 16/07/26 21:05:36 INFO executor.Executor: Using REPL class URI: 
> http://10.199.29.218:52067
> 16/07/26 21:05:36 INFO util.Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52075.
> 16/07/26 21:05:36 INFO netty.NettyBlockTransferService: Server created on 
> 52075
> 16/07/26 21:05:36 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/07/26 21:05:36 INFO 

[jira] [Assigned] (SPARK-34436) DPP support LIKE ANY/ALL

2021-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34436:


Assignee: (was: Apache Spark)

> DPP support LIKE ANY/ALL
> 
>
> Key: SPARK-34436
> URL: https://issues.apache.org/jira/browse/SPARK-34436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support this case:
> {code:sql}
> SELECT date_id, product_id FROM fact_sk f
> JOIN dim_store s
> ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34436) DPP support LIKE ANY/ALL

2021-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284414#comment-17284414
 ] 

Apache Spark commented on SPARK-34436:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/31563

> DPP support LIKE ANY/ALL
> 
>
> Key: SPARK-34436
> URL: https://issues.apache.org/jira/browse/SPARK-34436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support this case:
> {code:sql}
> SELECT date_id, product_id FROM fact_sk f
> JOIN dim_store s
> ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34436) DPP support LIKE ANY/ALL

2021-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34436:


Assignee: Apache Spark

> DPP support LIKE ANY/ALL
> 
>
> Key: SPARK-34436
> URL: https://issues.apache.org/jira/browse/SPARK-34436
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Support this case:
> {code:sql}
> SELECT date_id, product_id FROM fact_sk f
> JOIN dim_store s
> ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34436) DPP support LIKE ANY/ALL

2021-02-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34436:
---

 Summary: DPP support LIKE ANY/ALL
 Key: SPARK-34436
 URL: https://issues.apache.org/jira/browse/SPARK-34436
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


Support this case:
{code:sql}
SELECT date_id, product_id FROM fact_sk f
JOIN dim_store s
ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%')
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Enver Osmanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enver Osmanov updated SPARK-34435:
--
Description: 
h5. Actual behavior:

Selecting a column with a different case after remapping fails with 
ArrayIndexOutOfBoundsException.
h5. Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensitive by default, so the select should return the selected column.
h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
h5. Additional notes:

Test case is reproducible with Spark 3.0.1. There are no errors with Spark 
2.4.7.

I believe the problem could be solved by changing the filter in 
`SchemaPruning#pruneDataSchema` from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}

  was:
h5. Actual behavior:

Select column with different case after remapping fail with 
ArrayIndexOutOfBoundsException.
h5. Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensetive by default, so select should return selected column.
h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
h5. Additional notes:

Test case is reproducible with Spark 3.0.1. There are no errors with Spark 
2.4.7.

I belive problem could be solved by changing filter in pruneDataSchema method 
from SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}


> ArrayIndexOutOfBoundsException when select in different case
> 
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: Enver Osmanov
>Priority: Trivial
>
> h5. Actual behavior:
> Selecting a column with a different case after remapping fails with 
> ArrayIndexOutOfBoundsException.
> h5. Expected behavior:
> Spark shouldn't fail with ArrayIndexOutOfBoundsException.
>  Spark is case insensitive by default, so the select should return the selected 
> column.
> h5. Test case:
> {code:java}
> case class User(aA: String, bb: String)
> // ...
> val user = User("John", "Doe")
> val ds = Seq(user).toDS().map(identity)
> ds.select("aa").show(false)
> {code}
> h5. Additional notes:
> Test case is reproducible with Spark 3.0.1. There are no errors with Spark 
> 2.4.7.
> I believe the problem could be solved by changing the filter in 
> `SchemaPruning#pruneDataSchema` from this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
> {code}
> to this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => 
> dataSchemaFieldNames.contains(f.name.toLowerCase)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Enver Osmanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enver Osmanov updated SPARK-34435:
--
Description: 
h5. Actual behavior:

Selecting a column with a different case after remapping fails with 
ArrayIndexOutOfBoundsException.
h5. Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensitive by default, so the select should return the selected column.
h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
h5. Additional notes:

Test case is reproducible with Spark 3.0.1. There are no errors with Spark 2.4.7.

I believe the problem could be solved by changing the filter in the pruneDataSchema 
method of the SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}

  was:
h5. Actual behavior:
 Select column with different case after remapping fail with 
ArrayIndexOutOfBoundsException.

h5. Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensetive by default, so select should return selected column.

h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
h5. Additional notes:

Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7.

I belive problem could be solved by changing filter in pruneDataSchema method 
from SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}


> ArrayIndexOutOfBoundsException when select in different case
> 
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: Enver Osmanov
>Priority: Trivial
>
> h5. Actual behavior:
> Selecting a column with a different case after remapping fails with 
> ArrayIndexOutOfBoundsException.
> h5. Expected behavior:
> Spark shouldn't fail with ArrayIndexOutOfBoundsException.
>  Spark is case insensitive by default, so the select should return the selected 
> column.
> h5. Test case:
> {code:java}
> case class User(aA: String, bb: String)
> // ...
> val user = User("John", "Doe")
> val ds = Seq(user).toDS().map(identity)
> ds.select("aa").show(false)
> {code}
> h5. Additional notes:
> Test case is reproducible with Spark 3.0.1. There are no errors with Spark 
> 2.4.7.
> I believe the problem could be solved by changing the filter in the pruneDataSchema 
> method of the SchemaPruning object from this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
> {code}
> to this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => 
> dataSchemaFieldNames.contains(f.name.toLowerCase)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Enver Osmanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enver Osmanov updated SPARK-34435:
--
Description: 
h5. Actual behavior:

Selecting a column with a different case after remapping fails with 
ArrayIndexOutOfBoundsException.
h5. Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensitive by default, so the select should return the selected column.
h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
h5. Additional notes:

Test case is reproducible with Spark 3.0.1. There are no errors with Spark 
2.4.7.

I believe the problem could be solved by changing the filter in the pruneDataSchema 
method of the SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}

  was:
h5. Actual behavior:

Select column with different case after remapping fail with 
ArrayIndexOutOfBoundsException.
h5. Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensetive by default, so select should return selected column.
h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
h5. Additional notes:

Test case is reproducible with Spark 3.0.1. There is no errors with Spark 2.4.7.

I belive problem could be solved by changing filter in pruneDataSchema method 
from SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}


> ArrayIndexOutOfBoundsException when select in different case
> 
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: Enver Osmanov
>Priority: Trivial
>
> h5. Actual behavior:
> Selecting a column with a different case after remapping fails with 
> ArrayIndexOutOfBoundsException.
> h5. Expected behavior:
> Spark shouldn't fail with ArrayIndexOutOfBoundsException.
>  Spark is case insensitive by default, so the select should return the selected 
> column.
> h5. Test case:
> {code:java}
> case class User(aA: String, bb: String)
> // ...
> val user = User("John", "Doe")
> val ds = Seq(user).toDS().map(identity)
> ds.select("aa").show(false)
> {code}
> h5. Additional notes:
> Test case is reproducible with Spark 3.0.1. There are no errors with Spark 
> 2.4.7.
> I believe the problem could be solved by changing the filter in the pruneDataSchema 
> method of the SchemaPruning object from this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
> {code}
> to this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => 
> dataSchemaFieldNames.contains(f.name.toLowerCase)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Enver Osmanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enver Osmanov updated SPARK-34435:
--
Description: 
h5. Actual behavior:
 Selecting a column with a different case after remapping fails with 
ArrayIndexOutOfBoundsException.

h5. Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensitive by default, so the select should return the selected column.

h5. Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
h5. Additional notes:

Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7.

I believe the problem could be solved by changing the filter in the pruneDataSchema 
method of the SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}

  was:
Actual behavior:
 Select column with different case after remapping fail with 
ArrayIndexOutOfBoundsException.

Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensetive by default, so select should return selected column.

Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
Additional notes:

Test case is reproduceble with Spark 3.0.1. It works fine with Spark 2.4.7.

I belive problem could be solved by changing filter in pruneDataSchema method 
from SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}


> ArrayIndexOutOfBoundsException when select in different case
> 
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: Enver Osmanov
>Priority: Trivial
>
> h5. Actual behavior:
>  Selecting a column with a different case after remapping fails with 
> ArrayIndexOutOfBoundsException.
> h5. Expected behavior:
> Spark shouldn't fail with ArrayIndexOutOfBoundsException.
>  Spark is case insensitive by default, so the select should return the selected 
> column.
> h5. Test case:
> {code:java}
> case class User(aA: String, bb: String)
> // ...
> val user = User("John", "Doe")
> val ds = Seq(user).toDS().map(identity)
> ds.select("aa").show(false)
> {code}
> h5. Additional notes:
> Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7.
> I believe the problem could be solved by changing the filter in the pruneDataSchema 
> method of the SchemaPruning object from this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
> {code}
> to this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => 
> dataSchemaFieldNames.contains(f.name.toLowerCase)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Enver Osmanov (Jira)
Enver Osmanov created SPARK-34435:
-

 Summary: ArrayIndexOutOfBoundsException when select in different 
case
 Key: SPARK-34435
 URL: https://issues.apache.org/jira/browse/SPARK-34435
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 3.0.1
 Environment: Actual behavior:
Selecting a column with a different case after remapping fails with 
ArrayIndexOutOfBoundsException.

Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
Spark is case insensitive by default, so the select should return the selected column.

Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
Additional notes:

Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7.

I believe the problem could be solved by changing the filter in the pruneDataSchema 
method of the SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}
Reporter: Enver Osmanov






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Enver Osmanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enver Osmanov updated SPARK-34435:
--
Description: 
Actual behavior:
 Selecting a column with a different case after remapping fails with 
ArrayIndexOutOfBoundsException.

Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
 Spark is case insensitive by default, so the select should return the selected column.

Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
Additional notes:

Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7.

I believe the problem could be solved by changing the filter in the pruneDataSchema 
method of the SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code}

> ArrayIndexOutOfBoundsException when select in different case
> 
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: Enver Osmanov
>Priority: Trivial
>
> Actual behavior:
>  Selecting a column with a different case after remapping fails with 
> ArrayIndexOutOfBoundsException.
> Expected behavior:
> Spark shouldn't fail with ArrayIndexOutOfBoundsException.
>  Spark is case insensitive by default, so the select should return the selected 
> column.
> Test case:
> {code:java}
> case class User(aA: String, bb: String)
> // ...
> val user = User("John", "Doe")
> val ds = Seq(user).toDS().map(identity)
> ds.select("aa").show(false)
> {code}
> Additional notes:
> Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7.
> I believe the problem could be solved by changing the filter in the pruneDataSchema 
> method of the SchemaPruning object from this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
> {code}
> to this:
> {code:java}
> val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
> val mergedDataSchema =
>   StructType(mergedSchema.filter(f => 
> dataSchemaFieldNames.contains(f.name.toLowerCase)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case

2021-02-14 Thread Enver Osmanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enver Osmanov updated SPARK-34435:
--
Environment: (was: Actual behavior:
Select column with different case after remapping fail with 
ArrayIndexOutOfBoundsException.

Expected behavior:

Spark shouldn't fail with ArrayIndexOutOfBoundsException.
Spark is case insensetive by default, so select should return selected column.

Test case:
{code:java}
case class User(aA: String, bb: String)
// ...
val user = User("John", "Doe")

val ds = Seq(user).toDS().map(identity)

ds.select("aa").show(false)
{code}
Additional notes:

Test case is reproduceble with Spark 3.0.1. It works fine with Spark 2.4.7.

I belive problem could be solved by changing filter in pruneDataSchema method 
from SchemaPruning object from this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name)))
{code}
to this:
{code:java}
val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet
val mergedDataSchema =
  StructType(mergedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name.toLowerCase)))
{code})

> ArrayIndexOutOfBoundsException when select in different case
> 
>
> Key: SPARK-34435
> URL: https://issues.apache.org/jira/browse/SPARK-34435
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.0.1
>Reporter: Enver Osmanov
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-02-14 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284381#comment-17284381
 ] 

L. C. Hsieh commented on SPARK-34198:
-

I'd tend to take [https://github.com/qubole/spark-state-store] as the baseline, 
as we are experimenting with it internally and, based on the previous comments, it 
seems we are not the only ones using it. I think it is basically derived from the 
previous PR [https://github.com/apache/spark/pull/24922]. It looks newer than the 
first one and has a better structure.
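
As a rough sketch of how an external module could be plugged in without touching Spark 
core: Spark already resolves the state store provider by class name through the 
existing spark.sql.streaming.stateStore.providerClass config, so an external RocksDB 
module would only need to ship its provider class on the classpath. The class name 
below is a hypothetical placeholder, not an actual class from any of these projects:

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: the provider class name is a hypothetical stand-in for whatever
// an external RocksDB state store module (e.g. one derived from
// qubole/spark-state-store) would expose. The config key itself already exists.
val spark = SparkSession.builder()
  .appName("rocksdb-statestore-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
          "org.example.state.RocksDbStateStoreProvider")
  .getOrCreate()
{code}

Because the provider is loaded by class name at runtime, packaging it as an external 
module (an extra jar) would avoid adding RocksDB as a direct dependency of Spark core, 
which matches the plan in the description below.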

 

 

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS has only one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> more and more streaming applications appear, some of them need to keep 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large state usage. But Spark SS 
> still lacks a built-in state store that meets this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore to 
> Spark SS. Given the concern about adding RocksDB as a direct dependency, our 
> plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34434) Mention DS rebase options in SparkUpgradeException

2021-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284371#comment-17284371
 ] 

Apache Spark commented on SPARK-34434:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31562

> Mention DS rebase options in SparkUpgradeException 
> ---
>
> Key: SPARK-34434
> URL: https://issues.apache.org/jira/browse/SPARK-34434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Mention the DS options added by SPARK-34404 and SPARK-34377 in 
> SparkUpgradeException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34434) Mention DS rebase options in SparkUpgradeException

2021-02-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284370#comment-17284370
 ] 

Apache Spark commented on SPARK-34434:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31562

> Mention DS rebase options in SparkUpgradeException 
> ---
>
> Key: SPARK-34434
> URL: https://issues.apache.org/jira/browse/SPARK-34434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Mention the DS options added by SPARK-34404 and SPARK-34377 in 
> SparkUpgradeException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34434) Mention DS rebase options in SparkUpgradeException

2021-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34434:


Assignee: Apache Spark

> Mention DS rebase options in SparkUpgradeException 
> ---
>
> Key: SPARK-34434
> URL: https://issues.apache.org/jira/browse/SPARK-34434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Mention the DS options added by SPARK-34404 and SPARK-34377 in 
> SparkUpgradeException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34434) Mention DS rebase options in SparkUpgradeException

2021-02-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34434:


Assignee: (was: Apache Spark)

> Mention DS rebase options in SparkUpgradeException 
> ---
>
> Key: SPARK-34434
> URL: https://issues.apache.org/jira/browse/SPARK-34434
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Mention the DS options added by SPARK-34404 and SPARK-34377 in 
> SparkUpgradeException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34427) Session window support in SS

2021-02-14 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284368#comment-17284368
 ] 

L. C. Hsieh commented on SPARK-34427:
-

Please check the JIRA history; I don't think it is unconventional to 
assign a JIRA issue when there is ongoing work internally without a PR 
submitted. This has worked for many years in the Spark community.

Again, conventionally I do see committers assign JIRA issues to themselves 
or to other contributors because they are working on them (even if a PR is not 
submitted yet), or because they plan to do so. That is how the Spark community has 
operated in the past and still does. So again, if you are against the convention, 
please raise a discussion to disallow it. Otherwise I don't know why these issues 
are special for you.

We all need to plan what we want to do in the Spark community. Opening a JIRA issue 
early can help gather thoughts from others. If we don't assign it, we can 
easily step on others' toes. From your perspective, once a JIRA issue is created 
and we cannot assign it, it is open for anyone to work on. How does the plan 
work then? I think no one would be willing to create a JIRA issue before actually 
submitting a PR.

We are experimenting with the RocksDB work internally, so we created SPARK-34198 and 
assigned it. I don't know why that means we occupy major effort in parallel and 
block others. Can we only work on one JIRA issue at a time?

These issues have not been active in past years. I don't know why, now that we 
want to push them and work on them, we are suddenly blocking others.

I'm not saying that we definitely want to push our implementation for 
SPARK-10816 by abandoning the other two past efforts. But without any 
communication ahead of time, it sounds too harsh to me that after we put the feature 
on our plan explicitly, there comes the claim that we should leave the work, 
otherwise we are blocking others.



> Session window support in SS
> 
>
> Key: SPARK-34427
> URL: https://issues.apache.org/jira/browse/SPARK-34427
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently structured streaming supports two kinds of windows: tumbling windows 
> and sliding windows. Another useful window function is the session window, which 
> is not supported by SS. We have a user requirement for session windows, and we'd 
> like to have this support upstream.
> There is some info about session windows at: 
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows.
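
For context, a minimal sketch of the two window kinds SS supports today, using the 
existing window function on a rate source (the interval values are illustrative only):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("window-sketch").getOrCreate()
import spark.implicits._

// Any streaming source with an event-time column works; the built-in rate
// source emits a "timestamp" column.
val events = spark.readStream.format("rate").load()

// Tumbling window: fixed, non-overlapping 10-minute buckets.
val tumbling = events.groupBy(window($"timestamp", "10 minutes")).count()

// Sliding window: 10-minute windows that advance every 5 minutes (overlapping).
val sliding = events.groupBy(window($"timestamp", "10 minutes", "5 minutes")).count()
{code}

A session window, by contrast, closes after a gap of inactivity rather than at a fixed 
boundary, which is the behavior this ticket proposes to add.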



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34434) Mention DS rebase options in SparkUpgradeException

2021-02-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-34434:
--

 Summary: Mention DS rebase options in SparkUpgradeException 
 Key: SPARK-34434
 URL: https://issues.apache.org/jira/browse/SPARK-34434
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Mention the DS options added by SPARK-34404 and SPARK-34377 in 
SparkUpgradeException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org