[jira] [Comment Edited] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
[ https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284588#comment-17284588 ]

L. C. Hsieh edited comment on SPARK-34295 at 2/15/21, 7:39 AM:
---
To preempt further questions about the assignee: I have the changed code ready locally, but I don't have an environment to test it in. I'll let our customer test it internally; once I get their confirmation, I will submit the PR.

was (Author: viirya): To prevent other questioning about the assignee, I have the changed ready locally but I don't have the environment to test. I'd let our customer to test it internally. Once I get the confirmation, I will submit the PR.

> Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
>
> Key: SPARK-34295
> URL: https://issues.apache.org/jira/browse/SPARK-34295
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Affects Versions: 3.2.0
> Reporter: L. C. Hsieh
> Assignee: L. C. Hsieh
> Priority: Major
>
> MapReduce jobs can instruct YARN to skip renewal of tokens obtained from
> certain hosts by listing those hosts in the configuration
> mapreduce.job.hdfs-servers.token-renewal.exclude.
> Spark appears to lack a similar option, so job submission fails if YARN
> fails to renew the DelegationToken for any of the remote HDFS clusters.
> DT renewal can fail for many reasons, e.g. the remote HDFS does not trust
> the Kerberos identity of YARN.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
[ https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284588#comment-17284588 ]

L. C. Hsieh commented on SPARK-34295:
-
To preempt further questions about the assignee: I have the changed code ready locally, but I don't have an environment to test it in. I'll let our customer test it internally; once I get their confirmation, I will submit the PR.
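For context, the MapReduce option named in this issue takes a comma-separated list of hosts whose delegation tokens YARN should not attempt to renew. A minimal sketch of what such a setting looks like (the host names are placeholders, not values from the issue; in a real job the property would be set on the Hadoop Configuration or passed via -D):

```python
# Hypothetical illustration of the MapReduce-side setting described above.
# Host names are placeholders; YARN skips token renewal for each host in
# the comma-separated list.
conf = {
    "mapreduce.job.hdfs-servers.token-renewal.exclude":
        "remote-nn1.example.com,remote-nn2.example.com",
}

excluded_hosts = conf["mapreduce.job.hdfs-servers.token-renewal.exclude"].split(",")
print(excluded_hosts)
```

The issue asks for an analogous, Spark-side way to exclude untrusted remote clusters from renewal at submission time.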
[jira] [Assigned] (SPARK-34438) Python Driver is not correctly detected using presigned URLs
[ https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34438:
Assignee: (was: Apache Spark)

> Python Driver is not correctly detected using presigned URLs
>
> Key: SPARK-34438
> URL: https://issues.apache.org/jira/browse/SPARK-34438
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0
> Reporter: Julian Fleischer
> Priority: Minor
>
> In AWS one can generate so-called presigned URLs. spark-submit accepts URLs
> for the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned
> URL, however, carries a query string, e.g.
> {{http://my-web-server/driver.py?signature}}.
> The check for whether the given URL is a Python driver simply tests whether
> the whole URL ends in {{.py}} – which the presigned URL does not, as it
> ends in {{signature}}.
> The relevant check is in {{SparkSubmit.scala}}, line 1051 (commit tagged
> {{v3.0.1}}):
> [https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]
> Here is a more realistic example URL:
> {{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}
> A fix could be to parse the given path as a {{java.net.URI}} and check that
> the path component ends in {{.py}} (as opposed to the whole URL).
> To circumvent this issue I am currently appending a fragment so the URL
> ends in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}},
> which does work.
[jira] [Commented] (SPARK-34438) Python Driver is not correctly detected using presigned URLs
[ https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284535#comment-17284535 ]

Apache Spark commented on SPARK-34438:
--
User 'scravy' has created a pull request for this issue:
https://github.com/apache/spark/pull/31565
[jira] [Assigned] (SPARK-34438) Python Driver is not correctly detected using presigned URLs
[ https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34438:
Assignee: Apache Spark
[jira] [Commented] (SPARK-34438) Python Driver is not correctly detected using presigned URLs
[ https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284533#comment-17284533 ]

Apache Spark commented on SPARK-34438:
--
User 'scravy' has created a pull request for this issue:
https://github.com/apache/spark/pull/31565
[jira] [Commented] (SPARK-34438) Python Driver is not correctly detected using presigned URLs
[ https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284531#comment-17284531 ]

Julian Fleischer commented on SPARK-34438:
--
I am proposing a patch here: https://github.com/apache/spark/pull/31565
[jira] [Updated] (SPARK-34438) Python Driver is not correctly detected using presigned URLs
[ https://issues.apache.org/jira/browse/SPARK-34438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julian Fleischer updated SPARK-34438:
-
Priority: Minor (was: Major)
[jira] [Created] (SPARK-34438) Python Driver is not correctly detected using presigned URLs
Julian Fleischer created SPARK-34438:
----

Summary: Python Driver is not correctly detected using presigned URLs
Key: SPARK-34438
URL: https://issues.apache.org/jira/browse/SPARK-34438
Project: Spark
Issue Type: Bug
Components: Spark Submit
Affects Versions: 3.0.1, 3.0.0, 3.0.2, 3.1.0
Reporter: Julian Fleischer

In AWS one can generate so-called presigned URLs. spark-submit accepts URLs for the driver program, e.g. {{http://my-web-server/driver.py}}. A presigned URL, however, carries a query string, e.g. {{http://my-web-server/driver.py?signature}}.

The check for whether the given URL is a Python driver simply tests whether the whole URL ends in {{.py}} – which the presigned URL does not, as it ends in {{signature}}.

The relevant check is in {{SparkSubmit.scala}}, line 1051 (commit tagged {{v3.0.1}}):
[https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1051]

Here is a more realistic example URL:
{{https://bucket-name.s3.us-east-1.amazonaws.com/driver.py?X-Amz-Algorithm=AWS4-HMAC-SHA256=AKIATBNPKWPCNUMWMLUR%2F20210214%2Fus-east-1%2Fs3%2Faws4_request=20210214T062047Z=172800=host=49ef39b6bb7090001af9312692788892551916a6ac0ff6c961ce52efb9acc235}}

A fix could be to parse the given path as a {{java.net.URI}} and check that the path component ends in {{.py}} (as opposed to the whole URL).

To circumvent this issue I am currently appending a fragment so the URL ends in {{.py}}, i.e. {{http://my-web-server/driver.py?signature#.py}}, which does work.
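The proposed fix translates directly: parse the URL and test only its path component. A sketch in Python of the idea (Spark's actual check lives in SparkSubmit.scala and uses java.net.URI; this only illustrates the behavior being asked for):

```python
from urllib.parse import urlparse

def is_python_driver(url: str) -> bool:
    # Test only the path component, so a presigned URL's query string
    # (signature parameters etc.) does not defeat the ".py" check.
    return urlparse(url).path.endswith(".py")

# The naive check from the bug report fails on presigned URLs:
assert not "http://my-web-server/driver.py?signature".endswith(".py")
# Parsing first handles both plain and presigned URLs:
assert is_python_driver("http://my-web-server/driver.py")
assert is_python_driver("http://my-web-server/driver.py?signature")
assert not is_python_driver("http://my-web-server/app.jar?signature")
```

The reporter's workaround (appending `#.py`) works because the fragment makes the raw string end in `.py` without changing which object the URL fetches.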
[jira] [Assigned] (SPARK-34434) Mention DS rebase options in SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-34434:
-
Assignee: Maxim Gekk

> Mention DS rebase options in SparkUpgradeException
> ---
>
> Key: SPARK-34434
> URL: https://issues.apache.org/jira/browse/SPARK-34434
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
>
> Mention the DS options added by SPARK-34404 and SPARK-34377 in
> SparkUpgradeException.
[jira] [Resolved] (SPARK-34434) Mention DS rebase options in SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-34434.
---
Fix Version/s: 3.2.0
Resolution: Fixed

Issue resolved by pull request 31562
[https://github.com/apache/spark/pull/31562]
[jira] [Comment Edited] (SPARK-34427) Session window support in SS
[ https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284507#comment-17284507 ]

Jungtaek Lim edited comment on SPARK-34427 at 2/15/21, 12:50 AM:
-
OK, I agree this is becoming a meaningless argument. I should have raised the discussion on the dev@ mailing list. (EDIT: https://lists.apache.org/thread.html/r0802c6e8c5c4f51c0b781d137e6c62eb4e4105fbaea4d9743e8b6c51%40%3Cdev.spark.apache.org%3E)

Please don't get me wrong. My original concern is that you're trying to preempt two major efforts, each of which would take non-trivial time. There's no proof of ongoing work internally: you should have created a design doc or WIP PR if you had made meaningful progress, but you shared nothing, just assigned both issues to yourself and said "I'm working on both (or planning to work on both), so don't step on my toes." Sorry, but that's not something I can understand.

Again, I'm not "just" concerned about this because it conflicts with SPARK-10816. You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't ensure there is a design doc, perf tests, etc. to put the efforts on par. I just don't think you can take on multiple major efforts at once when none of them has even reached a (WIP) PR. I would have no argument if you did these things one by one, leaving space for contributors to play with. (Say, I would have no concern if you let the RocksDB work be taken over by another contributor so you can focus on this one. Vice versa.)

was (Author: kabhwan): OK I agree it's going to meaningless argue. I should have raised the discussion to dev@ mailing list. (EDIT: https://lists.apache.org/thread.html/r0802c6e8c5c4f51c0b781d137e6c62eb4e4105fbaea4d9743e8b6c51%40%3Cdev.spark.apache.org%3E) Please don't get me wrong. My origin concern is that you're trying to preempt major two efforts which would take non-trivial time for each one. There's no prove that there's ongoing work internally - you should have created a design doc or WIP PR if you made a meaningful progress internally, but you shared nothing and just assigned both issues to you and said I'm working on both. Sorry but that's not something I can understand. Again I'm not "just" concerned about this because it conflicts SPARK-10816. You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't ensure having design doc, perf test, etc. to make the efforts on par. Just I don't think you can take up multiple major efforts altogether even none of things don't reach the PR (even WIP). I would have no argument if you just do the thing one by one, leaving space for contributors to play with. (Say I have no concern if you let RocksDB stuff be taken over from other contributor to focus on this stuff. Vice versa.)

> Session window support in SS
>
> Key: SPARK-34427
> URL: https://issues.apache.org/jira/browse/SPARK-34427
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 3.2.0
> Reporter: L. C. Hsieh
> Priority: Major
>
> Currently structured streaming supports two kinds of windows: tumbling
> windows and sliding windows. Another useful window function is the session
> window, which is not supported by SS. We have a user requirement to use
> session windows and would like to have this support upstream.
> About session windows, there is some info:
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows.
[jira] [Comment Edited] (SPARK-34427) Session window support in SS
[ https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284507#comment-17284507 ]

Jungtaek Lim edited comment on SPARK-34427 at 2/15/21, 12:49 AM:
-
OK, I agree this is becoming a meaningless argument. I should have raised the discussion on the dev@ mailing list. (EDIT: https://lists.apache.org/thread.html/r0802c6e8c5c4f51c0b781d137e6c62eb4e4105fbaea4d9743e8b6c51%40%3Cdev.spark.apache.org%3E)

Please don't get me wrong. My original concern is that you're trying to preempt two major efforts, each of which would take non-trivial time. There's no proof of ongoing work internally: you should have created a design doc or WIP PR if you had made meaningful progress, but you shared nothing, just assigned both issues to yourself and said you're working on both. Sorry, but that's not something I can understand.

Again, I'm not "just" concerned about this because it conflicts with SPARK-10816. You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't ensure there is a design doc, perf tests, etc. to put the efforts on par. I just don't think you can take on multiple major efforts at once when none of them has even reached a (WIP) PR. I would have no argument if you did these things one by one, leaving space for contributors to play with. (Say, I would have no concern if you let the RocksDB work be taken over by another contributor so you can focus on this one. Vice versa.)

was (Author: kabhwan): OK I agree it's going to meaningless argue. I should have raised the discussion to dev@ mailing list. Will do. Please don't get me wrong. My origin concern is that you're trying to preempt major two efforts which would take non-trivial time for each one. There's no prove that there's ongoing work internally - you should have created a design doc or WIP PR if you made a meaningful progress internally, but you shared nothing and just assigned both issues to you and said I'm working on both. Sorry but that's not something I can understand. Again I'm not "just" concerned about this because it conflicts SPARK-10816. You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't ensure having design doc, perf test, etc. to make the efforts on par. Just I don't think you can take up multiple major efforts altogether even none of things don't reach the PR (even WIP). I would have no argument if you just do the thing one by one, leaving space for contributors to play with. (Say I have no concern if you let RocksDB stuff be taken over from other contributor to focus on this stuff. Vice versa.)
[jira] [Commented] (SPARK-34427) Session window support in SS
[ https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284507#comment-17284507 ]

Jungtaek Lim commented on SPARK-34427:
--
OK, I agree this is becoming a meaningless argument. I should have raised the discussion on the dev@ mailing list. Will do.

Please don't get me wrong. My original concern is that you're trying to preempt two major efforts, each of which would take non-trivial time. There's no proof of ongoing work internally: you should have created a design doc or WIP PR if you had made meaningful progress, but you shared nothing, just assigned both issues to yourself and said you're working on both. Sorry, but that's not something I can understand.

Again, I'm not "just" concerned about this because it conflicts with SPARK-10816. You want it? I can give up SPARK-10816 if you want it, though I'd -1 if you don't ensure there is a design doc, perf tests, etc. to put the efforts on par. I just don't think you can take on multiple major efforts at once when none of them has even reached a (WIP) PR. I would have no argument if you did these things one by one, leaving space for contributors to play with. (Say, I would have no concern if you let the RocksDB work be taken over by another contributor so you can focus on this one. Vice versa.)
[jira] [Commented] (SPARK-34427) Session window support in SS
[ https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284504#comment-17284504 ] L. C. Hsieh commented on SPARK-34427: - Sigh...do you ever see that I say I want to ignore SPARK-10816 in my previous comments? Do I say I don't want to consider the existing effort? I just said (you can look at previous comments, it is unchanged): > From the code size, that (yours) PR is much larger than another. I'm not sure > if from feature perspective they are the same. As it comes to the weekend, I > can take another look at the previous two PRs. > From my side, I'd like to push this feature as we have real use case and > requirement. But I'm not sure if we want to follow up with previous PRs. I am not aware of SPARK-10816 when I created this JIRA with assignee. That's all. I don't know why this JIRA irritates you so much. What I did is NOT that I created this SPARK-34427, then see there is an existing SPARK-10816, then I immediately assign SPARK-34427 or SPARK-10816 to myself to occupy the issue and prevent others working on it... The assignee works like a placholder to notify others the issue is ongoing work or a work on plan. It is not strict and as you did, it can be easier removed or changed. If I don't set it, then other folks might think it is open issue and put some efforts on working on it. That is so called not to step on others toes. Once we figure out from communication with all parties what is best way to have an implementation for the feature, we can definitely change the assignee. I cannot accept your point to explain this assignee case is different. If I am going to assign SPARK-10816 to myself, then it is not acceptable. But I just created a new JIRA we plan to do with assignee. I don't know what is wrong with this usual practice. So sorry, but your point doesn't make sense to me. It is also not what I saw in past years and now in the Spark community. 
I guess you are unhappy here as I assigned this JIRA because you was working on it, and you think I occupy it. But again, when I created this JIRA with assignee, I don't know there is SPARK-10816 and you worked on it before. I don't mean to occupy the work you have worked on it. Is it clear to you? I don't really want to continue this argument. It is meaningless to me and waste my weekend time. Let me to be clear again: I created this JIRA with assignee because we plan to have this feature. Setting assignee is to prevent others (especially the contributors who are not familiar with Spark community) accidentally think it is open and put their time working on it. We will respect existing efforts. I did not know there is existing SPARK-10816. I need take some time to look at the existing works (they are both big change). Note that there is not only one implementation even in SPARK-10816, and I don't see any cooperation between two implementations. We can have communication between all parties involved and see what is the best way to have the feature. I will like to focus on real work instead of arguing this stuff. If you are interested in continuing pushing the session window. I think I need some time taking look the details of design and code in SPARK-10816 and think how to have the feature in best shape. > Session window support in SS > > > Key: SPARK-34427 > URL: https://issues.apache.org/jira/browse/SPARK-34427 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently structured streaming supports two kinds of windows: tumbling window > and sliding window. Another useful window function is session window. Which > is not supported by SS. We have user requirement to use session window. We'd > like to have this support in the upstream. 
> Some information about session windows: > https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
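To make the feature request concrete, here is a sketch of what the request amounts to in Structured Streaming's DataFrame API. The tumbling-window call is the existing API; the session-window call is hypothetical at the time of this thread (the `session_window` name and gap-duration parameter are illustrative, not a committed API):

```scala
// Sketch only, assuming a socket source for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("windows").getOrCreate()
import spark.implicits._

val events = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
  .selectExpr("CAST(value AS STRING) AS user", "current_timestamp() AS ts")

// Supported today: tumbling window of fixed 10-minute size.
val tumbling = events
  .withWatermark("ts", "10 minutes")
  .groupBy(window($"ts", "10 minutes"), $"user")
  .count()

// Hypothetical session window: a window closes after a 5-minute gap
// with no events for the key (analogous to Flink's session windows).
// val sessions = events
//   .withWatermark("ts", "10 minutes")
//   .groupBy(session_window($"ts", "5 minutes"), $"user")
//   .count()
```

The key difference from tumbling/sliding windows is that session-window boundaries are data-driven (derived from gaps between events) rather than fixed, which is why the state management is harder and motivates the design discussion above.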
[jira] [Assigned] (SPARK-34416) Support avroSchemaUrl in addition to avroSchema
[ https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34416: - Assignee: Ohad Raviv > Support avroSchemaUrl in addition to avroSchema > --- > > Key: SPARK-34416 > URL: https://issues.apache.org/jira/browse/SPARK-34416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Ohad Raviv >Assignee: Ohad Raviv >Priority: Minor > Fix For: 3.2.0 > > > We have a use case in which we read a huge table in Avro format, with about 30k > columns. > Using the default Hive reader - `AvroGenericRecordReader` - it just hangs > forever; after 4 hours not even one task has finished. > We tried instead to use > `spark.read.format("com.databricks.spark.avro").load(..)` but failed on: > ``` > org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data > schema > .. > at > org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85) > at > org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174) > ... 53 elided > ``` > > because the file schema contains duplicate column names (when compared > case-insensitively). > So we wanted to provide a user schema with non-duplicated fields, but the > schema is huge (a few MBs), so it is not practical to provide it in JSON format. > > So we patched spark-avro to also accept `avroSchemaUrl` in addition > to `avroSchema`, and it worked perfectly. 
> > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34416) Support avroSchemaUrl in addition to avroSchema
[ https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34416. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31543 [https://github.com/apache/spark/pull/31543] > Support avroSchemaUrl in addition to avroSchema > --- > > Key: SPARK-34416 > URL: https://issues.apache.org/jira/browse/SPARK-34416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Ohad Raviv >Priority: Minor > Fix For: 3.2.0 > > > We have a use case in which we read a huge table in Avro format, with about 30k > columns. > Using the default Hive reader - `AvroGenericRecordReader` - it just hangs > forever; after 4 hours not even one task has finished. > We tried instead to use > `spark.read.format("com.databricks.spark.avro").load(..)` but failed on: > ``` > org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data > schema > .. > at > org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85) > at > org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174) > ... 53 elided > ``` > > because the file schema contains duplicate column names (when compared > case-insensitively). > So we wanted to provide a user schema with non-duplicated fields, but the > schema is huge (a few MBs), so it is not practical to provide it in JSON format. > > So we patched spark-avro to also accept `avroSchemaUrl` in addition > to `avroSchema`, and it worked perfectly. 
> > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
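For readers following the thread, this is roughly how the resolved feature would be used. The sketch assumes the `avroSchemaUrl` option named in the issue title (targeting Spark 3.2.0 per the Fix Version); the paths are illustrative only:

```scala
// Sketch, assuming the avroSchemaUrl option from SPARK-34416.
// Instead of inlining a multi-MB schema string via avroSchema,
// point the reader at a schema file:
val df = spark.read
  .format("avro")
  .option("avroSchemaUrl", "hdfs:///schemas/huge_table.avsc") // illustrative path
  .load("/data/huge_table") // illustrative path
```

Supplying an explicit user schema this way sidesteps the duplicate-column check on the inferred file schema described above, without having to paste a few-MB JSON schema into the job.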
[jira] [Commented] (SPARK-34427) Session window support in SS
[ https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284496#comment-17284496 ] Jungtaek Lim commented on SPARK-34427: -- This assignee case is quite different from what I've seen committers doing, because these issues are not "new" (there have been existing efforts, just not at the right time) and the idea is well enough known that many contributors can simply plan in parallel. e.g. In SPARK-34198 you'll see one contributor at FB is also working on the solution in parallel. I don't think we are happy with someone occupying a major feature without even providing a design doc or so. No one knows about the plan - no one knows whether the effort has started or is actually still in the backlog. In parallel, someone may have more progress. Stepping on others' toes has been normal in the Spark community and setting the assignee never properly avoids it. It just makes an unfair competition between contributor and committer. If you want to make the ownership of a major feature clear, then please prepare a SPIP and raise it on the dev@ mailing list. That ensures recognition that you're making meaningful progress already, and others can help with reviewing. (Even in that case, if someone counters with another SPIP, then either collaboration or competition should happen. I don't think a committer can simply preempt.) Also, I think we should try to find the JIRA issue which did the same or similar thing, and leverage that one. There's a lot of information and history of efforts which we can leverage "even" if we take a different PR. Once you file a new JIRA issue and let the old one be ignored, those efforts are lost. I don't think you could simply raise a PR for SPARK-34427 and ask for review, as from SPARK-10816 we found there are various ways to implement it, which requires a design doc to make sure the implementation considers these designs as well and picks the best one. 
The implementation should also be run through performance tests to ensure it's superior, or at least on par. That establishes the "minimum bar" for the efforts. Before achieving that, consider my voice as -1 on the proposal. To make the comparison easier I think you should really continue your work in SPARK-10816, not here. I'm happy to see some other committer finally found the feature necessary, but also unhappy that resurrecting the existing effort was not considered "at first", which would have saved a bunch of time for all of us. The existing effort wasn't discarded because of a technical issue; that is to say, the design and implementation are still valid. It just wasn't done at the right time. > Session window support in SS > > > Key: SPARK-34427 > URL: https://issues.apache.org/jira/browse/SPARK-34427 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently structured streaming supports two kinds of windows: tumbling window > and sliding window. Another useful window function is the session window, which > is not supported by SS. We have a user requirement to use session windows. We'd > like to have this support in the upstream. > Some information about session windows: > https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284491#comment-17284491 ] Jungtaek Lim edited comment on SPARK-34198 at 2/14/21, 8:02 PM: Thanks for considering it. I think it would be the best option for Apache Spark among these if it makes sense to Databricks as well, just because it has been in service for years with an enterprise level of support. We can't expect the same stability from the other options and may struggle with them for some period - it'd be best if we can avoid that. (Worth noting that the second one may also provide an enterprise level of support, but for less than a year, and I had 50+ review comments on the proposed PR and personally didn't feel the PR was super solid at that time. I mean, for me, the PR was not proposed with production level quality at first.) was (Author: kabhwan): Thanks for considering it. I think it would be the best option for Apache Spark among these if it makes sense to Databricks as well, just because it has been in service for years with an enterprise level of support. We can't expect the same stability from the other options and may struggle with them for some period - it'd be best if we can avoid that. (Worth noting that the second one may also provide an enterprise level of support, but for less than a year, and I had 50+ review comments on the proposed PR and personally didn't feel the PR was super solid at that time. I mean, for me, the PR was not proposed with production quality at first.) > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage. But > Spark SS still lacks a built-in state store for this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore into > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284491#comment-17284491 ] Jungtaek Lim commented on SPARK-34198: -- Thanks for considering it. I think it would be the best option for Apache Spark among these if it makes sense to Databricks as well, just because it has been in service for years with an enterprise level of support. We can't expect the same stability from the other options and may struggle with them for some period - it'd be best if we can avoid that. (Worth noting that the second one may also provide an enterprise level of support, but for less than a year, and I had 50+ review comments on the proposed PR and personally didn't feel the PR was super solid at that time. I mean, for me, the PR was not proposed with production quality at first.) > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage. But > Spark SS still lacks a built-in state store for this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore into > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
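For context on what "external module" would mean in practice: Structured Streaming already exposes a config for swapping the StateStore implementation, so an external RocksDB module could plug in without core changes. The provider class name below is an assumption (whatever class the external module would ship), not a committed API:

```scala
// Sketch: selecting a pluggable StateStore via the existing config key
// spark.sql.streaming.stateStore.providerClass. The class name is
// hypothetical here - it is whatever the external RocksDB module provides.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rocksdb-state-store")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider") // assumed name
  .getOrCreate()
```

Because the provider is resolved by class name at runtime, shipping the RocksDB-backed implementation as a separate module avoids adding RocksDB as a direct dependency of Spark core, which is exactly the concern raised in the issue description.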
[jira] [Commented] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
[ https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284489#comment-17284489 ] Enver Osmanov commented on SPARK-34435: --- [~ymajid], that is absolutely OK with me. If you have any questions, please let me know. > ArrayIndexOutOfBoundsException when select in different case > > > Key: SPARK-34435 > URL: https://issues.apache.org/jira/browse/SPARK-34435 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.1 >Reporter: Enver Osmanov >Priority: Trivial > > h5. Actual behavior: > Selecting a column with a different case after remapping fails with > ArrayIndexOutOfBoundsException. > h5. Expected behavior: > Spark shouldn't fail with ArrayIndexOutOfBoundsException. > Spark is case-insensitive by default, so the select should return the selected > column. > h5. Test case: > {code:java} > case class User(aA: String, bb: String) > // ... > val user = User("John", "Doe") > val ds = Seq(user).toDS().map(identity) > ds.select("aa").show(false) > {code} > h5. Additional notes: > The test case is reproducible with Spark 3.0.1. There are no errors with Spark > 2.4.7. > I believe the problem could be solved by changing the filter in > `SchemaPruning#pruneDataSchema` from this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) > {code} > to this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => > dataSchemaFieldNames.contains(f.name.toLowerCase))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
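The essence of the proposed fix can be seen without a Spark cluster: a case-sensitive set-membership check drops `"aa"` when the data schema declared `"aA"`, while lowercasing both sides keeps it. A minimal standalone sketch (field names taken from the issue's `User(aA, bb)` test case):

```scala
// Standalone illustration of the proposed fix in SchemaPruning#pruneDataSchema:
// compare field names case-insensitively when pruning the merged schema.
object CaseInsensitivePruneDemo extends App {
  val dataSchemaFields = Seq("aA", "bb") // fields as declared in User
  val mergedSchemaFields = Seq("aa")     // field as referenced in select("aa")

  // Original check: exact match, so "aa" is not found and gets pruned away,
  // which leads to the ArrayIndexOutOfBoundsException downstream.
  val exact = dataSchemaFields.toSet
  assert(!mergedSchemaFields.exists(exact.contains))

  // Proposed check: lowercase both sides, so "aa" matches "aA" and survives.
  val insensitive = dataSchemaFields.map(_.toLowerCase).toSet
  assert(mergedSchemaFields.forall(f => insensitive.contains(f.toLowerCase)))

  println("case-insensitive match keeps the column")
}
```

Note that a complete fix would presumably also need to honor `spark.sql.caseSensitive`, since lowercasing unconditionally would change behavior for users who enable case-sensitive resolution.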
[jira] [Comment Edited] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284479#comment-17284479 ] Reynold Xin edited comment on SPARK-34198 at 2/14/21, 6:59 PM: --- I don't know the intricate details of it, but I suspect it's a different one with many more features, because it existed long before those two. was (Author: rxin): I don't know the intricate details of it, but I suspect it's a different one because it existed long before those two. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage. But > Spark SS still lacks a built-in state store for this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore into > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284479#comment-17284479 ] Reynold Xin commented on SPARK-34198: - I don't know the intricate details of it, but I suspect it's a different one because it existed long before those two. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage. But > Spark SS still lacks a built-in state store for this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore into > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284478#comment-17284478 ] L. C. Hsieh commented on SPARK-34198: - Thanks [~rxin]. Is the implementation used in Databricks a completely different one from the other two implementations, or is it based on one of the two? > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage. But > Spark SS still lacks a built-in state store for this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore into > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284474#comment-17284474 ] Reynold Xin commented on SPARK-34198: - [~kabhwan] let me talk to the team that built our internal version of that about whether it'd make sense. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation, > HDFSBackedStateStore, which actually uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage. But > Spark SS still lacks a built-in state store for this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore into > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
[ https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284434#comment-17284434 ] Yousif Majid commented on SPARK-34435: -- Hey [~Enverest], I would like to work on this if that's ok with you! > ArrayIndexOutOfBoundsException when select in different case > > > Key: SPARK-34435 > URL: https://issues.apache.org/jira/browse/SPARK-34435 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.1 >Reporter: Enver Osmanov >Priority: Trivial > > h5. Actual behavior: > Selecting a column with a different case after remapping fails with > ArrayIndexOutOfBoundsException. > h5. Expected behavior: > Spark shouldn't fail with ArrayIndexOutOfBoundsException. > Spark is case-insensitive by default, so the select should return the selected > column. > h5. Test case: > {code:java} > case class User(aA: String, bb: String) > // ... > val user = User("John", "Doe") > val ds = Seq(user).toDS().map(identity) > ds.select("aa").show(false) > {code} > h5. Additional notes: > The test case is reproducible with Spark 3.0.1. There are no errors with Spark > 2.4.7. > I believe the problem could be solved by changing the filter in > `SchemaPruning#pruneDataSchema` from this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) > {code} > to this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => > dataSchemaFieldNames.contains(f.name.toLowerCase))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34432) add a java implementation for the simple writable data source
[ https://issues.apache.org/jira/browse/SPARK-34432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284424#comment-17284424 ] Kevin Pis edited comment on SPARK-34432 at 2/14/21, 3:30 PM: - Hi [~cloud_fan]! Sorry to bother you, but could you help me review the following PR: [https://github.com/apache/spark/pull/31560] was (Author: kevinpis): Hi [~cloud_fan]! Sorry to bother you, but could you help me review the PR https://github.com/apache/spark/pull/31560 > add a java implementation for the simple writable data source > - > > Key: SPARK-34432 > URL: https://issues.apache.org/jira/browse/SPARK-34432 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.1 >Reporter: Kevin Pis >Priority: Minor > > This is a followup of https://github.com/apache/spark/pull/19269 > In #19269, there is only a Scala implementation of the simple writable data > source in `DataSourceV2Suite`. > This PR adds a Java implementation of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34432) add a java implementation for the simple writable data source
[ https://issues.apache.org/jira/browse/SPARK-34432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284424#comment-17284424 ] Kevin Pis commented on SPARK-34432: --- Hi [~cloud_fan]! Sorry to bother you, but could you help me review the PR https://github.com/apache/spark/pull/31560 > add a java implementation for the simple writable data source > - > > Key: SPARK-34432 > URL: https://issues.apache.org/jira/browse/SPARK-34432 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.1 >Reporter: Kevin Pis >Priority: Minor > > This is a followup of https://github.com/apache/spark/pull/19269 > In #19269, there is only a Scala implementation of the simple writable data > source in `DataSourceV2Suite`. > This PR adds a Java implementation of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34437) Update Spark SQL guide about rebase DS options and SQL configs
[ https://issues.apache.org/jira/browse/SPARK-34437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34437: Assignee: (was: Apache Spark) > Update Spark SQL guide about rebase DS options and SQL configs > -- > > Key: SPARK-34437 > URL: https://issues.apache.org/jira/browse/SPARK-34437 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Describe the following SQL configs: > * spark.sql.legacy.parquet.int96RebaseModeInWrite > * spark.sql.legacy.parquet.datetimeRebaseModeInWrite > * spark.sql.legacy.parquet.int96RebaseModeInRead > * spark.sql.legacy.parquet.datetimeRebaseModeInRead > * spark.sql.legacy.avro.datetimeRebaseModeInWrite > * spark.sql.legacy.avro.datetimeRebaseModeInRead > And Avro/Parquet options datetimeRebaseMode and int96RebaseMode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34437) Update Spark SQL guide about rebase DS options and SQL configs
[ https://issues.apache.org/jira/browse/SPARK-34437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284423#comment-17284423 ] Apache Spark commented on SPARK-34437: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31564 > Update Spark SQL guide about rebase DS options and SQL configs > -- > > Key: SPARK-34437 > URL: https://issues.apache.org/jira/browse/SPARK-34437 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Describe the following SQL configs: > * spark.sql.legacy.parquet.int96RebaseModeInWrite > * spark.sql.legacy.parquet.datetimeRebaseModeInWrite > * spark.sql.legacy.parquet.int96RebaseModeInRead > * spark.sql.legacy.parquet.datetimeRebaseModeInRead > * spark.sql.legacy.avro.datetimeRebaseModeInWrite > * spark.sql.legacy.avro.datetimeRebaseModeInRead > And Avro/Parquet options datetimeRebaseMode and int96RebaseMode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34437) Update Spark SQL guide about rebase DS options and SQL configs
[ https://issues.apache.org/jira/browse/SPARK-34437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34437: Assignee: Apache Spark > Update Spark SQL guide about rebase DS options and SQL configs > -- > > Key: SPARK-34437 > URL: https://issues.apache.org/jira/browse/SPARK-34437 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Describe the following SQL configs: > * spark.sql.legacy.parquet.int96RebaseModeInWrite > * spark.sql.legacy.parquet.datetimeRebaseModeInWrite > * spark.sql.legacy.parquet.int96RebaseModeInRead > * spark.sql.legacy.parquet.datetimeRebaseModeInRead > * spark.sql.legacy.avro.datetimeRebaseModeInWrite > * spark.sql.legacy.avro.datetimeRebaseModeInRead > And Avro/Parquet options datetimeRebaseMode and int96RebaseMode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34432) add a java implementation for the simple writable data source
[ https://issues.apache.org/jira/browse/SPARK-34432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Pis updated SPARK-34432: -- Affects Version/s: (was: 3.1.1) 3.0.1 > add a java implementation for the simple writable data source > - > > Key: SPARK-34432 > URL: https://issues.apache.org/jira/browse/SPARK-34432 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.1 >Reporter: Kevin Pis >Priority: Minor > > This is a followup of https://github.com/apache/spark/pull/19269 > In #19269 , there is only a scala implementation of simple writable data > source in `DataSourceV2Suite`. > This PR adds a java implementation of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34437) Update Spark SQL guide about rebase DS options and SQL configs
Maxim Gekk created SPARK-34437: -- Summary: Update Spark SQL guide about rebase DS options and SQL configs Key: SPARK-34437 URL: https://issues.apache.org/jira/browse/SPARK-34437 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Describe the following SQL configs: * spark.sql.legacy.parquet.int96RebaseModeInWrite * spark.sql.legacy.parquet.datetimeRebaseModeInWrite * spark.sql.legacy.parquet.int96RebaseModeInRead * spark.sql.legacy.parquet.datetimeRebaseModeInRead * spark.sql.legacy.avro.datetimeRebaseModeInWrite * spark.sql.legacy.avro.datetimeRebaseModeInRead And Avro/Parquet options datetimeRebaseMode and int96RebaseMode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
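To ground the documentation request, here is a sketch of how the configs and options listed above are typically applied. The rebase mode values (LEGACY, CORRECTED, EXCEPTION) are assumed from the related datetime-rebase work and should be verified against the final guide text:

```scala
// Sketch: setting rebase behavior globally via SQL configs
// (values LEGACY / CORRECTED / EXCEPTION are assumed here).
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.avro.datetimeRebaseModeInRead", "LEGACY")

// The per-datasource options from the issue override the configs
// for a single read, e.g.:
val df = spark.read
  .option("datetimeRebaseMode", "CORRECTED")
  .option("int96RebaseMode", "CORRECTED")
  .parquet("/data/old_parquet") // illustrative path
```

The configs set session-wide defaults, while the `datetimeRebaseMode`/`int96RebaseMode` options scope the choice to one DataFrame read or write, which is presumably the distinction the guide update needs to spell out.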
[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)
[ https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284416#comment-17284416 ] Abhay Dandekar commented on SPARK-16745: +1. Getting the same issue on standalone Spark 3.0.1. A workaround is to bind the driver to a local address as follows: $ ./bin/spark-shell --conf spark.driver.host=localhost Can we please update the default option accordingly for standalone, especially when master == local? > Spark job completed however have to wait for 13 mins (data size is small) > - > > Key: SPARK-16745 > URL: https://issues.apache.org/jira/browse/SPARK-16745 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.1 > Environment: Max OS X Yosemite, Terminal, MacBook Air Late 2014 >Reporter: Joe Chong >Priority: Minor > > I submitted a job in the Scala Spark shell to show a DataFrame. The data size is > about 43K. The job was successful in the end, but took more than 13 minutes > to complete. Upon checking the log, there are multiple exceptions raised on > "Failed to check existence of class" with a java.net.ConnectException > message indicating a timeout trying to connect to port 52067, the REPL port > that Spark set up. Please assist in troubleshooting. Thanks. > Started Spark in standalone mode > $ spark-shell --driver-memory 5g --master local[*] > 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... 
using builtin-java classes where applicable > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server > 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:30 INFO server.AbstractConnector: Started > SocketConnector@0.0.0.0:52067 > 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class > server' on port 52067. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) > Type in expressions to have them evaluated. > Type :help for more information. > 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1 > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' > on port 52072. > 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started > 16/07/26 21:05:35 INFO Remoting: Starting remoting > 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074] > 16/07/26 21:05:35 INFO util.Utils: Successfully started service > 'sparkDriverActorSystem' on port 52074. 
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster > 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at > /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6 > 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity > 3.4 GB > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator > 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:36 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:4040 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on > port 4040. > 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at > http://10.199.29.218:4040 > 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host > localhost > 16/07/26 21:05:36 INFO executor.Executor: Using REPL class URI: > http://10.199.29.218:52067 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52075. > 16/07/26 21:05:36 INFO netty.NettyBlockTransferService: Server created on > 52075 > 16/07/26 21:05:36 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/07/26 21:05:36 INFO
[jira] [Assigned] (SPARK-34436) DPP support LIKE ANY/ALL
[ https://issues.apache.org/jira/browse/SPARK-34436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34436: Assignee: (was: Apache Spark) > DPP support LIKE ANY/ALL > > > Key: SPARK-34436 > URL: https://issues.apache.org/jira/browse/SPARK-34436 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > Support this case: > {code:sql} > SELECT date_id, product_id FROM fact_sk f > JOIN dim_store s > ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%') > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34436) DPP support LIKE ANY/ALL
[ https://issues.apache.org/jira/browse/SPARK-34436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284414#comment-17284414 ] Apache Spark commented on SPARK-34436: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/31563 > DPP support LIKE ANY/ALL > > > Key: SPARK-34436 > URL: https://issues.apache.org/jira/browse/SPARK-34436 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > Support this case: > {code:sql} > SELECT date_id, product_id FROM fact_sk f > JOIN dim_store s > ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%') > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34436) DPP support LIKE ANY/ALL
[ https://issues.apache.org/jira/browse/SPARK-34436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34436: Assignee: Apache Spark > DPP support LIKE ANY/ALL > > > Key: SPARK-34436 > URL: https://issues.apache.org/jira/browse/SPARK-34436 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Support this case: > {code:sql} > SELECT date_id, product_id FROM fact_sk f > JOIN dim_store s > ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%') > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34436) DPP support LIKE ANY/ALL
Yuming Wang created SPARK-34436: --- Summary: DPP support LIKE ANY/ALL Key: SPARK-34436 URL: https://issues.apache.org/jira/browse/SPARK-34436 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang Support this case: {code:sql} SELECT date_id, product_id FROM fact_sk f JOIN dim_store s ON f.store_id = s.store_id WHERE s.country LIKE ANY ('%D%E%', '%A%B%') {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
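For context, `LIKE ANY (p1, p2, ...)` is true when the value matches at least one pattern, i.e. it is equivalent to `LIKE p1 OR LIKE p2 OR ...`, which is what a DPP filter would need to evaluate against partition values. A minimal sketch of that semantics in plain Scala (the `likeToRegex` helper is hypothetical, not Spark's implementation):

```scala
object LikeAny {
  // Translate a SQL LIKE pattern to a Java regex:
  // '%' matches any sequence, '_' any single char; escape regex metachars.
  def likeToRegex(pattern: String): String =
    pattern.flatMap {
      case '%'                                        => ".*"
      case '_'                                        => "."
      case c if "\\.[]{}()*+-?^$|".contains(c)        => "\\" + c
      case c                                          => c.toString
    }

  // LIKE ANY: true if the value matches at least one of the patterns.
  def likeAny(value: String, patterns: Seq[String]): Boolean =
    patterns.exists(p => value.matches(likeToRegex(p)))
}
```

Because the disjunction only references the one column, it stays a valid pruning predicate — the same reason plain LIKE is already usable for DPP.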
[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
[ https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enver Osmanov updated SPARK-34435: -- Description: h5. Actual behavior: Selecting a column with a different case after remapping fails with ArrayIndexOutOfBoundsException. h5. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensitive by default, so the select should return the selected column. h5. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} h5. Additional notes: The test case is reproducible with Spark 3.0.1. There are no errors with Spark 2.4.7. I believe the problem could be solved by changing the filter in `SchemaPruning#pruneDataSchema` from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} was: h5. Actual behavior: Selecting a column with a different case after remapping fails with ArrayIndexOutOfBoundsException. h5. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensitive by default, so the select should return the selected column. h5. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} h5. Additional notes: The test case is reproducible with Spark 3.0.1. There are no errors with Spark 2.4.7. 
I believe the problem could be solved by changing the filter in the pruneDataSchema method of the SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} > ArrayIndexOutOfBoundsException when select in different case > > > Key: SPARK-34435 > URL: https://issues.apache.org/jira/browse/SPARK-34435 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.1 >Reporter: Enver Osmanov >Priority: Trivial > > h5. Actual behavior: > Selecting a column with a different case after remapping fails with > ArrayIndexOutOfBoundsException. > h5. Expected behavior: > Spark shouldn't fail with ArrayIndexOutOfBoundsException. > Spark is case insensitive by default, so the select should return the > selected column. > h5. Test case: > {code:java} > case class User(aA: String, bb: String) > // ... > val user = User("John", "Doe") > val ds = Seq(user).toDS().map(identity) > ds.select("aa").show(false) > {code} > h5. Additional notes: > The test case is reproducible with Spark 3.0.1. There are no errors with Spark > 2.4.7. 
> I believe the problem could be solved by changing the filter in > `SchemaPruning#pruneDataSchema` from this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) > {code} > to this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => > dataSchemaFieldNames.contains(f.name.toLowerCase))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
[ https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enver Osmanov updated SPARK-34435: -- Description: h5. Actual behavior: Select column with different case after remapping fail with ArrayIndexOutOfBoundsException. h5. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensetive by default, so select should return selected column. h5. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} h5. Additional notes: Test case is reproducible with Spark 3.0.1. There is no errors with Spark 2.4.7. I belive problem could be solved by changing filter in pruneDataSchema method from SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} was: h5. Actual behavior: Select column with different case after remapping fail with ArrayIndexOutOfBoundsException. h5. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensetive by default, so select should return selected column. h5. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} h5. Additional notes: Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7. 
I belive problem could be solved by changing filter in pruneDataSchema method from SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} > ArrayIndexOutOfBoundsException when select in different case > > > Key: SPARK-34435 > URL: https://issues.apache.org/jira/browse/SPARK-34435 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.1 >Reporter: Enver Osmanov >Priority: Trivial > > h5. Actual behavior: > Select column with different case after remapping fail with > ArrayIndexOutOfBoundsException. > h5. Expected behavior: > Spark shouldn't fail with ArrayIndexOutOfBoundsException. > Spark is case insensetive by default, so select should return selected > column. > h5. Test case: > {code:java} > case class User(aA: String, bb: String) > // ... > val user = User("John", "Doe") > val ds = Seq(user).toDS().map(identity) > ds.select("aa").show(false) > {code} > h5. Additional notes: > Test case is reproducible with Spark 3.0.1. There is no errors with Spark > 2.4.7. 
> I belive problem could be solved by changing filter in pruneDataSchema method > from SchemaPruning object from this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) > {code} > to this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => > dataSchemaFieldNames.contains(f.name.toLowerCase))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
[ https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enver Osmanov updated SPARK-34435: -- Description: h5. Actual behavior: Select column with different case after remapping fail with ArrayIndexOutOfBoundsException. h5. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensetive by default, so select should return selected column. h5. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} h5. Additional notes: Test case is reproducible with Spark 3.0.1. There are no errors with Spark 2.4.7. I belive problem could be solved by changing filter in pruneDataSchema method from SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} was: h5. Actual behavior: Select column with different case after remapping fail with ArrayIndexOutOfBoundsException. h5. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensetive by default, so select should return selected column. h5. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} h5. Additional notes: Test case is reproducible with Spark 3.0.1. There is no errors with Spark 2.4.7. 
I belive problem could be solved by changing filter in pruneDataSchema method from SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} > ArrayIndexOutOfBoundsException when select in different case > > > Key: SPARK-34435 > URL: https://issues.apache.org/jira/browse/SPARK-34435 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.1 >Reporter: Enver Osmanov >Priority: Trivial > > h5. Actual behavior: > Select column with different case after remapping fail with > ArrayIndexOutOfBoundsException. > h5. Expected behavior: > Spark shouldn't fail with ArrayIndexOutOfBoundsException. > Spark is case insensetive by default, so select should return selected > column. > h5. Test case: > {code:java} > case class User(aA: String, bb: String) > // ... > val user = User("John", "Doe") > val ds = Seq(user).toDS().map(identity) > ds.select("aa").show(false) > {code} > h5. Additional notes: > Test case is reproducible with Spark 3.0.1. There are no errors with Spark > 2.4.7. 
> I belive problem could be solved by changing filter in pruneDataSchema method > from SchemaPruning object from this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) > {code} > to this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => > dataSchemaFieldNames.contains(f.name.toLowerCase))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
[ https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enver Osmanov updated SPARK-34435: -- Description: h5. Actual behavior: Select column with different case after remapping fail with ArrayIndexOutOfBoundsException. h5. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensetive by default, so select should return selected column. h5. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} h5. Additional notes: Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7. I belive problem could be solved by changing filter in pruneDataSchema method from SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} was: Actual behavior: Select column with different case after remapping fail with ArrayIndexOutOfBoundsException. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensetive by default, so select should return selected column. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} Additional notes: Test case is reproduceble with Spark 3.0.1. It works fine with Spark 2.4.7. 
I belive problem could be solved by changing filter in pruneDataSchema method from SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} > ArrayIndexOutOfBoundsException when select in different case > > > Key: SPARK-34435 > URL: https://issues.apache.org/jira/browse/SPARK-34435 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.1 >Reporter: Enver Osmanov >Priority: Trivial > > h5. Actual behavior: > Select column with different case after remapping fail with > ArrayIndexOutOfBoundsException. > h5. Expected behavior: > Spark shouldn't fail with ArrayIndexOutOfBoundsException. > Spark is case insensetive by default, so select should return selected > column. > h5. Test case: > {code:java} > case class User(aA: String, bb: String) > // ... > val user = User("John", "Doe") > val ds = Seq(user).toDS().map(identity) > ds.select("aa").show(false) > {code} > h5. Additional notes: > Test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7. 
> I belive problem could be solved by changing filter in pruneDataSchema method > from SchemaPruning object from this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) > {code} > to this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => > dataSchemaFieldNames.contains(f.name.toLowerCase))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
Enver Osmanov created SPARK-34435: - Summary: ArrayIndexOutOfBoundsException when select in different case Key: SPARK-34435 URL: https://issues.apache.org/jira/browse/SPARK-34435 Project: Spark Issue Type: Bug Components: Optimizer, SQL Affects Versions: 3.0.1 Environment: Actual behavior: Selecting a column with a different case after remapping fails with ArrayIndexOutOfBoundsException. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensitive by default, so the select should return the selected column. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} Additional notes: The test case is reproducible with Spark 3.0.1. It works fine with Spark 2.4.7. I believe the problem could be solved by changing the filter in the pruneDataSchema method of the SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} Reporter: Enver Osmanov -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
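The proposed change can be illustrated in isolation. In the failing scenario the data schema keeps the original casing (`aA`) while the pruned/merged schema carries the analyzer's casing (`aa`), so the case-sensitive filter silently drops the field and the two schemas get out of sync — consistent with the reported out-of-bounds access downstream. A standalone sketch, where `Field` is a simplified stand-in for `StructField`:

```scala
// Simplified stand-in for StructField; models only the field name.
case class Field(name: String)

// Current (case-sensitive) filter, as quoted from the description above.
def pruneCaseSensitive(dataNames: Seq[String], merged: Seq[Field]): Seq[Field] = {
  val dataSchemaFieldNames = dataNames.toSet
  merged.filter(f => dataSchemaFieldNames.contains(f.name))
}

// Proposed (case-insensitive) filter from the description above.
def pruneCaseInsensitive(dataNames: Seq[String], merged: Seq[Field]): Seq[Field] = {
  val dataSchemaFieldNames = dataNames.map(_.toLowerCase).toSet
  merged.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))
}
```

With a data schema of `Seq("aA", "bb")` and a merged schema of `Field("aa"), Field("bb")`, the case-sensitive variant keeps only `bb` while the case-insensitive variant keeps both fields. One caveat: unconditionally lowercasing would also match when `spark.sql.caseSensitive=true`, so a complete fix would likely consult the session's resolver rather than hard-code `toLowerCase`.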
[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
[ https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enver Osmanov updated SPARK-34435: -- Description: Actual behavior: Select column with different case after remapping fail with ArrayIndexOutOfBoundsException. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensetive by default, so select should return selected column. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} Additional notes: Test case is reproduceble with Spark 3.0.1. It works fine with Spark 2.4.7. I belive problem could be solved by changing filter in pruneDataSchema method from SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code} > ArrayIndexOutOfBoundsException when select in different case > > > Key: SPARK-34435 > URL: https://issues.apache.org/jira/browse/SPARK-34435 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.1 >Reporter: Enver Osmanov >Priority: Trivial > > Actual behavior: > Select column with different case after remapping fail with > ArrayIndexOutOfBoundsException. > Expected behavior: > Spark shouldn't fail with ArrayIndexOutOfBoundsException. > Spark is case insensetive by default, so select should return selected > column. > Test case: > {code:java} > case class User(aA: String, bb: String) > // ... 
> val user = User("John", "Doe") > val ds = Seq(user).toDS().map(identity) > ds.select("aa").show(false) > {code} > Additional notes: > Test case is reproduceble with Spark 3.0.1. It works fine with Spark 2.4.7. > I belive problem could be solved by changing filter in pruneDataSchema method > from SchemaPruning object from this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) > {code} > to this: > {code:java} > val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet > val mergedDataSchema = > StructType(mergedSchema.filter(f => > dataSchemaFieldNames.contains(f.name.toLowerCase))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34435) ArrayIndexOutOfBoundsException when select in different case
[ https://issues.apache.org/jira/browse/SPARK-34435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enver Osmanov updated SPARK-34435: -- Environment: (was: Actual behavior: Select column with different case after remapping fail with ArrayIndexOutOfBoundsException. Expected behavior: Spark shouldn't fail with ArrayIndexOutOfBoundsException. Spark is case insensetive by default, so select should return selected column. Test case: {code:java} case class User(aA: String, bb: String) // ... val user = User("John", "Doe") val ds = Seq(user).toDS().map(identity) ds.select("aa").show(false) {code} Additional notes: Test case is reproduceble with Spark 3.0.1. It works fine with Spark 2.4.7. I belive problem could be solved by changing filter in pruneDataSchema method from SchemaPruning object from this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) {code} to this: {code:java} val dataSchemaFieldNames = dataSchema.fieldNames.map(_.toLowerCase).toSet val mergedDataSchema = StructType(mergedSchema.filter(f => dataSchemaFieldNames.contains(f.name.toLowerCase))) {code}) > ArrayIndexOutOfBoundsException when select in different case > > > Key: SPARK-34435 > URL: https://issues.apache.org/jira/browse/SPARK-34435 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 3.0.1 >Reporter: Enver Osmanov >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284381#comment-17284381 ] L. C. Hsieh commented on SPARK-34198: - I'd tend to take [https://github.com/qubole/spark-state-store] as the baseline, as we are experimenting with it internally, and based on the previous comments it seems we are not the only ones using it. I think it is basically derived from the previous PR [https://github.com/apache/spark/pull/24922]. It looks newer than the first one and has a better structure. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation, > HDFSBackedStateStore, which uses an in-memory map to store state rows. As > there are more and more streaming applications, some of them require large > state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management, so it is a proven choice for large-state usage. But Spark SS > still lacks a built-in state store for this requirement. > We would like to explore the possibility of adding a RocksDB-based StateStore to > Spark SS. Given the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34434) Mention DS rebase options in SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284371#comment-17284371 ] Apache Spark commented on SPARK-34434: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31562 > Mention DS rebase options in SparkUpgradeException > --- > > Key: SPARK-34434 > URL: https://issues.apache.org/jira/browse/SPARK-34434 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Mention the DS options added by SPARK-34404 and SPARK-34377 in > SparkUpgradeException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34434) Mention DS rebase options in SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284370#comment-17284370 ] Apache Spark commented on SPARK-34434: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31562 > Mention DS rebase options in SparkUpgradeException > --- > > Key: SPARK-34434 > URL: https://issues.apache.org/jira/browse/SPARK-34434 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Mention the DS options added by SPARK-34404 and SPARK-34377 in > SparkUpgradeException. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34434) Mention DS rebase options in SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34434:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-34434) Mention DS rebase options in SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-34434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34434:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-34427) Session window support in SS
[ https://issues.apache.org/jira/browse/SPARK-34427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284368#comment-17284368 ] L. C. Hsieh commented on SPARK-34427:
-
Please check the JIRA history: I don't think it is unconventional to assign a JIRA issue when there is ongoing work internally before a PR is submitted. This has worked for many years in the Spark community. Conventionally, I do see committers assign JIRA issues to themselves or to other contributors because they are working on them (even before a PR is submitted), or because they plan to do so. That is how the Spark community has operated in the past and still does. So again, if you are against the convention, please raise a discussion to disallow it; otherwise I don't know why these particular issues are special.

We all need to plan what we want to do in the Spark community. Opening a JIRA issue early helps gather thoughts from others, and if we don't assign it, we can easily step on each other's toes. From your perspective, once a JIRA issue is created and cannot be assigned, it is open for anyone to work on. How would planning work then? I think no one would be willing to create a JIRA issue before actually submitting a PR.

We are experimenting with the RocksDB work internally, so we created SPARK-34198 and assigned it. I don't see why that means we are occupying a major effort in parallel and blocking others. Can we only work on one JIRA issue at a time? These issues have not been active in past years, so I don't understand why, now that we want to push them forward, we are suddenly blocking others.

I'm not saying that we definitely want to push our implementation for SPARK-10816 while abandoning the two previous efforts. But without any communication beforehand, it sounds too harsh that, after we explicitly put the feature on our plan, there comes the claim that we should leave the work or else we are blocking others.
> Session window support in SS
>
>
> Key: SPARK-34427
> URL: https://issues.apache.org/jira/browse/SPARK-34427
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 3.2.0
> Reporter: L. C. Hsieh
> Priority: Major
>
> Currently Structured Streaming supports two kinds of windows: tumbling windows and sliding windows. Another useful window type is the session window, which is not supported by SS. We have a user requirement for session windows and would like to have this support upstream.
> For background on session windows, see:
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#session-windows
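To illustrate the semantics the ticket asks for: unlike tumbling and sliding windows, a session window has no fixed start or length; it groups events that arrive within a fixed gap of each other and closes once the gap elapses, which is why it needs dedicated support. A minimal, Spark-independent Python sketch of gap-based sessionization (the function name and the 5-unit gap are illustrative, not part of any Spark API):

```python
from typing import List, Tuple


def sessionize(timestamps: List[int], gap: int) -> List[Tuple[int, int]]:
    """Group event timestamps into session windows.

    A new session starts whenever the gap between consecutive events
    exceeds `gap`; each session spans [first_event, last_event + gap).
    """
    sessions: List[Tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls inside the current session: extend its end.
            start, _ = sessions[-1]
            sessions[-1] = (start, ts + gap)
        else:
            # Gap exceeded (or first event): open a new session.
            sessions.append((ts, ts + gap))
    return sessions


events = [1, 2, 3, 10, 11, 30]
print(sessionize(events, gap=5))  # -> [(1, 8), (10, 16), (30, 35)]
```

The dynamic window end (it grows as late-but-in-gap events arrive) is exactly what makes session windows harder to express with the fixed-boundary `window()` function SS already has.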
[jira] [Created] (SPARK-34434) Mention DS rebase options in SparkUpgradeException
Maxim Gekk created SPARK-34434:
--
Summary: Mention DS rebase options in SparkUpgradeException
Key: SPARK-34434
URL: https://issues.apache.org/jira/browse/SPARK-34434
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk

Mention the DS options added by SPARK-34404 and SPARK-34377 in SparkUpgradeException.
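For context, the point of mentioning these options in SparkUpgradeException is that the exception is how users first discover the Julian/Proleptic-Gregorian rebase problem on ancient dates, and the fix is a one-line setting. A hedged sketch of the session-wide SQL conf form (conf names as they appear in Spark 3.x documentation; treat exact names and availability per version as assumptions):

```properties
# spark-defaults.conf -- resolve SparkUpgradeException when reading ancient
# datetime values written by Spark 2.x or earlier.
# CORRECTED: read values as-is (Proleptic Gregorian calendar)
# LEGACY:    rebase values from the hybrid Julian calendar
spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED
spark.sql.parquet.int96RebaseModeInRead=CORRECTED
spark.sql.avro.datetimeRebaseModeInRead=CORRECTED
```

The per-read DS options from SPARK-34377 and SPARK-34404 cover the same choice for a single `spark.read` call, which is why surfacing them in the exception message is useful.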