[jira] [Created] (SPARK-47559) Codegen Support for variant parse_json
BingKun Pan created SPARK-47559: --- Summary: Codegen Support for variant parse_json Key: SPARK-47559 URL: https://issues.apache.org/jira/browse/SPARK-47559 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47559) Codegen Support for variant parse_json
[ https://issues.apache.org/jira/browse/SPARK-47559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47559: --- Labels: pull-request-available (was: ) > Codegen Support for variant parse_json > -- > > Key: SPARK-47559 > URL: https://issues.apache.org/jira/browse/SPARK-47559 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.
[ https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830804#comment-17830804 ] slankka edited comment on SPARK-21711 at 3/26/24 6:47 AM: -- Thanks to [~mahesh_ambule]. Good to know; it's annoying to see errors in the launcher output during submission on the client machine. {code:java} log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: /stdout (Permission denied) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:133) at org.apache.log4j.FileAppender.setFile(FileAppender.java:294) at org.apache.log4j.RollingFileAppender.setFile(RollingFileAppender.java:207){code} with a setting in log4j.properties like: {code:java} appender.file_appender.fileName=${spark.yarn.app.container.log.dir}/stdout{code} h3. Conclusion 1. SPARK_SUBMIT_OPTS solves the problem above: the client log is written to the correct directory. 2. Setting SPARK_SUBMIT_OPTS will of course NOT affect driver options or executor options. h3. Notes Modifying bin/spark-class as below won't work: {code:java} "$RUNNER" -Dlog4j.properties= -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" {code} because the real submit command is partially built by {code:java} org.apache.spark.launcher.AbstractCommandBuilder#buildJavaCommand {code} *buildJavaCommand* generates the command from the java executable through the classpath, before *org.apache.spark.deploy.SparkSubmit*. Logging and debugging: [Running Spark on YARN - Spark 3.5.1 Documentation (apache.org)|https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application] > spark-submit command should accept log4j configuration parameters for spark > client logging. > --- > > Key: SPARK-21711 > URL: https://issues.apache.org/jira/browse/SPARK-21711 > Project: Spark > Issue Type: Improvement > Components: Spark Submit > Affects Versions: 1.6.0, 2.1.0 > Reporter: Mahesh Ambule > Priority: Minor > Attachments: spark-submit client logs.txt > > > Currently, log4j properties can be specified in the Spark 'conf' directory in the > log4j.properties file. > The spark-submit command can override these log4j properties for the driver and > executors. > But it cannot override these log4j properties for the *spark client* > application. > The user should be able to pass log4j properties for the spark client using the > spark-submit command. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
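The SPARK_SUBMIT_OPTS conclusion above can be sketched as follows. This is a minimal illustration, not the project's documented method: the config path and submit arguments are hypothetical, and it assumes a log4j 1.x-style `-Dlog4j.configuration` override, matching the log4j 1.x errors quoted in the comment.

```python
import os

# Hypothetical client-side log4j 1.x config that appends to a writable path,
# instead of conf/log4j.properties (which also drives driver/executor logging
# and may point at ${spark.yarn.app.container.log.dir}, unset on the client).
client_log4j = "/tmp/client-log4j.properties"

env = os.environ.copy()
# bin/spark-class passes SPARK_SUBMIT_OPTS to the launcher JVM only, so this
# overrides client logging without touching driver or executor options.
env["SPARK_SUBMIT_OPTS"] = f"-Dlog4j.configuration=file:{client_log4j}"

cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", "app.py"]
# subprocess.run(cmd, env=env, check=True)  # launch with the patched environment
print(env["SPARK_SUBMIT_OPTS"])
```

Setting the property per-invocation this way avoids editing bin/spark-class, which, as noted above, is rebuilt into the real command by AbstractCommandBuilder#buildJavaCommand anyway.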
[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.
[ https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830804#comment-17830804 ] slankka commented on SPARK-21711: - Thanks to [~mahesh_ambule]. Good to know; it's annoying to see errors in the launcher output during submission on the client machine. {code:java} log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: /stdout (Permission denied) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:133) at org.apache.log4j.FileAppender.setFile(FileAppender.java:294) at org.apache.log4j.RollingFileAppender.setFile(RollingFileAppender.java:207){code} with a setting in log4j.properties like: {code:java} appender.file_appender.fileName=${spark.yarn.app.container.log.dir}/stdout{code} h3. Conclusion 1. SPARK_SUBMIT_OPTS solves the problem above: the client log is written to the correct directory. 2. Setting SPARK_SUBMIT_OPTS will of course NOT affect driver options or executor options. h3. Notes Modifying bin/spark-class as below won't work: {code:java} "$RUNNER" -Dlog4j.properties= -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" {code} because the real submit command is partially built by {code:java} org.apache.spark.launcher.AbstractCommandBuilder#buildJavaCommand {code} *buildJavaCommand* generates the command from the java executable through the classpath, before *org.apache.spark.deploy.SparkSubmit*. Logging and debugging: [Running Spark on YARN - Spark 3.5.1 Documentation (apache.org)|https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application] > spark-submit command should accept log4j configuration parameters for spark > client logging. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sundeep K resolved SPARK-47556. --- Fix Version/s: 3.3.0 Resolution: Fixed [https://github.com/apache/spark/commit/fe94bf07f9acec302e7d8becd7e576c777337331] and https://issues.apache.org/jira/browse/SPARK-36014 > [K8] Spark App ID collision resulting in deleting wrong resources > - > > Key: SPARK-47556 > URL: https://issues.apache.org/jira/browse/SPARK-47556 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core > Affects Versions: 3.1 > Reporter: Sundeep K > Priority: Major > Fix For: 3.3.0 > > > h3. Issue: > We noticed that sometimes K8s executor pods go into a crash loop with 'Error: > MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon > investigation we noticed that 2 Spark jobs had launched with the same > application ID, and when one of them finished first it deleted all of its > resources and the resources of the other job too. > -> The Spark application ID is created using this > [code|https://affirm.slack.com/archives/C06Q2GWLWKH/p1711132115304449?thread_ts=1711123500.783909&cid=C06Q2GWLWKH] > > "spark-application-" + System.currentTimeMillis > This means that if 2 applications launch in the same millisecond they can end > up with the same app ID. > -> The > [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23] > label is added to every resource created by the driver, and its value is the > application ID. The Kubernetes scheduler backend deletes all resources with the same > [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6] > upon termination. > This results in deletion of the config map and executor pods of the job that's still > running; the driver tries to relaunch the executor pods, but the config map is no longer > present, so they crash-loop. > h3. Context > We are using [Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] and launch > our Spark jobs using PySpark. We launch multiple Spark jobs within a given > k8s namespace. Each Spark job can be launched from different pods or from > different processes in a pod. Every time a job is launched it has a unique > app name. Here is how a job is launched (omitting irrelevant details): > {code:java} > # spark_conf has settings required for spark on k8s > sp = SparkSession.builder \ > .config(conf=spark_conf) \ > .appName('testapp') > sp.master(f'k8s://{kubernetes_host}') > session = sp.getOrCreate() > with session: > session.sql('SELECT 1'){code} > h3. Repro > Set the same app ID in the Spark config and run 2 different jobs, one that finishes > fast and one that runs slow. The slower job goes into a crash loop. > {code:java} > "spark.app.id": ""{code} > h3. Workaround > Set a unique spark.app.id for all the jobs that run on k8s, > e.g.: > {code:java} > "spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code} > h3. Fix > Add a unique hash at the end of the application ID: > [https://github.com/apache/spark/pull/45712] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
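The workaround quoted above (a unique spark.app.id per job, truncated to the 63-character limit for Kubernetes label values) can be sketched as below. This is a minimal illustration of the reporter's f-string pattern, not code from the ticket; the helper name `unique_app_id` is made up, and the commented-out SparkSession usage is illustrative:

```python
import time
import uuid

def unique_app_id(app_name: str) -> str:
    # App name + millisecond timestamp + random hex: two jobs launched in the
    # same millisecond still get different IDs, so the spark-app-selector
    # labels no longer collide. Kubernetes caps label values at 63 characters,
    # hence the slice.
    raw = f"{app_name}-{int(time.time() * 1000)}-{uuid.uuid4().hex}"
    return raw[:63]

app_id = unique_app_id("testapp")
# Hypothetical usage with the PySpark builder shown in the report:
# sp = (SparkSession.builder
#       .config(conf=spark_conf)
#       .config("spark.app.id", app_id)
#       .appName("testapp"))
```

Spark 3.3+ makes this unnecessary by appending a random suffix to the generated application ID itself, per the commit referenced in the resolution.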
[jira] [Commented] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830793#comment-17830793 ] Sundeep K commented on SPARK-47556: --- This is actually fixed in 3.3 and above: https://github.com/apache/spark/commit/fe94bf07f9acec302e7d8becd7e576c777337331 > [K8] Spark App ID collision resulting in deleting wrong resources -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556 ] Sundeep K deleted comment on SPARK-47556: --- was (Author: JIRAUSER304761): This seems to be fixed in 3.2 and above: https://github.com/Affirm/spark/commit/fe94bf07f9acec302e7d8becd7e576c777337331 > [K8] Spark App ID collision resulting in deleting wrong resources -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830792#comment-17830792 ] Sundeep K commented on SPARK-47556: --- This seems to be fixed in 3.2 and above: https://github.com/Affirm/spark/commit/fe94bf07f9acec302e7d8becd7e576c777337331 > [K8] Spark App ID collision resulting in deleting wrong resources -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sundeep K updated SPARK-47556: -- Affects Version/s: 3.1 (was: 3.3.2) (was: 3.5.1) > [K8] Spark App ID collision resulting in deleting wrong resources -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47558) [Arbitrary State Support] State TTL support - ValueState
Bhuwan Sahni created SPARK-47558: Summary: [Arbitrary State Support] State TTL support - ValueState Key: SPARK-47558 URL: https://issues.apache.org/jira/browse/SPARK-47558 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Bhuwan Sahni Add support for expiring state values based on TTL for ValueState in the transformWithState operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47450. -- Fix Version/s: 4.0.0 Assignee: Dongjoon Hyun Resolution: Fixed Reverted the revert https://github.com/apache/spark/commit/31db27d193fb79b022b7978ef9d0e715da8ade86 > Use R 4.3.3 in `windows` R GitHub Action job > > > Key: SPARK-47450 > URL: https://issues.apache.org/jira/browse/SPARK-47450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47450: --- Labels: pull-request-available (was: ) > Use R 4.3.3 in `windows` R GitHub Action job > > > Key: SPARK-47450 > URL: https://issues.apache.org/jira/browse/SPARK-47450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47509) Incorrect results for LambdaFunctions or HigherOrderFunctions in subquery expression plans
[ https://issues.apache.org/jira/browse/SPARK-47509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47509: --- Assignee: Daniel > Incorrect results for LambdaFunctions or HigherOrderFunctions in subquery > expression plans > -- > > Key: SPARK-47509 > URL: https://issues.apache.org/jira/browse/SPARK-47509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Labels: pull-request-available > > We can return an error for this case to fix the correctness bug. Later we can > look at supporting this query pattern as time allows. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47509) Incorrect results for LambdaFunctions or HigherOrderFunctions in subquery expression plans
[ https://issues.apache.org/jira/browse/SPARK-47509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47509. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45652 [https://github.com/apache/spark/pull/45652] > Incorrect results for LambdaFunctions or HigherOrderFunctions in subquery > expression plans > -- > > Key: SPARK-47509 > URL: https://issues.apache.org/jira/browse/SPARK-47509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We can return an error for this case to fix the correctness bug. Later we can > look at supporting this query pattern as time allows. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47557) Audit MySQL ENUM/SET Types
[ https://issues.apache.org/jira/browse/SPARK-47557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47557: --- Labels: pull-request-available (was: ) > Audit MySQL ENUM/SET Types > -- > > Key: SPARK-47557 > URL: https://issues.apache.org/jira/browse/SPARK-47557 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47557) Audit MySQL ENUM/SET Types
[ https://issues.apache.org/jira/browse/SPARK-47557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-47557: - Parent: SPARK-47361 Issue Type: Sub-task (was: Bug) > Audit MySQL ENUM/SET Types > -- > > Key: SPARK-47557 > URL: https://issues.apache.org/jira/browse/SPARK-47557 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-47450: - Fix Version/s: (was: 4.0.0) > Use R 4.3.3 in `windows` R GitHub Action job > > > Key: SPARK-47450 > URL: https://issues.apache.org/jira/browse/SPARK-47450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47557) Audit MySQL ENUM/SET Types
Kent Yao created SPARK-47557: Summary: Audit MySQL ENUM/SET Types Key: SPARK-47557 URL: https://issues.apache.org/jira/browse/SPARK-47557 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46822) Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to catalyst type in jdbc
[ https://issues.apache.org/jira/browse/SPARK-46822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-46822: - Parent: SPARK-47361 Issue Type: Sub-task (was: Bug) > Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to > catalyst type in jdbc > > > Key: SPARK-46822 > URL: https://issues.apache.org/jira/browse/SPARK-46822 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
Sundeep K created SPARK-47556: - Summary: [K8] Spark App ID collision resulting in deleting wrong resources Key: SPARK-47556 URL: https://issues.apache.org/jira/browse/SPARK-47556 Project: Spark Issue Type: Bug Components: Kubernetes, Spark Core Affects Versions: 3.5.1, 3.3.2 Reporter: Sundeep K h3. Issue We noticed that sometimes K8s executor pods go into a crash loop with 'Error: MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon investigation we found that two Spark jobs had launched with the same application ID, and when one of them finished first it deleted all of its resources and the resources of the other job too. -> The Spark application ID is created by this [code|https://affirm.slack.com/archives/C06Q2GWLWKH/p1711132115304449?thread_ts=1711123500.783909&cid=C06Q2GWLWKH]: "spark-application-" + System.currentTimeMillis. This means that if two applications launch in the same millisecond, they can end up with the same app ID. -> The [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23] label is added to every resource created by the driver, and its value is the application ID. The Kubernetes scheduler backend deletes all resources with the same [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6] upon termination. This deletes the config map and executor pods of the job that is still running; the driver tries to relaunch the executor pods, but the config map is no longer present, so the pods stay in a crash loop. h3. Context We are using [Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] and launch our Spark jobs using PySpark. We launch multiple Spark jobs within a given K8s namespace. 
Each Spark job can be launched from different pods or from different processes in a pod. Every time a job is launched it has a unique app name. Here is how a job is launched (omitting irrelevant details): {code:python}
# spark_conf has the settings required for Spark on K8s
sp = SparkSession.builder \
    .config(conf=spark_conf) \
    .appName('testapp')
sp.master(f'k8s://{kubernetes_host}')
session = sp.getOrCreate()
with session:
    session.sql('SELECT 1'){code} h3. Repro Set the same app ID in the Spark config and run 2 different jobs: one that finishes fast and one that runs slow. The slower job goes into a crash loop. {code:java} "spark.app.id": ""{code} h3. Workaround Set a unique spark.app.id for every job that runs on K8s, e.g.: {code:python} "spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code} h3. Fix Add a unique hash at the end of the application ID: [https://github.com/apache/spark/pull/45712] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
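The workaround above can be sketched in plain Python. The function name and the exact ID format here are illustrative (following the fragment in the ticket); the 63-character cap matches the Kubernetes limit on label values, which is what the spark-app-selector label must satisfy:

```python
import time
import uuid

def make_app_id(app_name: str) -> str:
    """Build a collision-resistant spark.app.id from the app name, the
    current time in milliseconds, and a random UUID, truncated to 63
    characters (the Kubernetes label-value limit)."""
    raw = f"{app_name}-{int(time.time() * 1000)}-{uuid.uuid4().hex}"
    return raw[:63]

# Two IDs generated in the same millisecond still differ thanks to the UUID.
a = make_app_id("testapp")
b = make_app_id("testapp")
```

The UUID component is what removes the millisecond-resolution collision window described in the issue; timestamp alone is not enough when many jobs launch concurrently in one namespace.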
[jira] [Resolved] (SPARK-47554) Upgrade `sbt-assembly` to `2.2.0` and `sbt-protoc` to `1.0.7`
[ https://issues.apache.org/jira/browse/SPARK-47554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47554. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45696 [https://github.com/apache/spark/pull/45696] > Upgrade `sbt-assembly` to `2.2.0` and `sbt-protoc` to `1.0.7` > - > > Key: SPARK-47554 > URL: https://issues.apache.org/jira/browse/SPARK-47554 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47554) Upgrade `sbt-assembly` to `2.2.0` and `sbt-protoc` to `1.0.7`
[ https://issues.apache.org/jira/browse/SPARK-47554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47554: --- Labels: pull-request-available (was: ) > Upgrade `sbt-assembly` to `2.2.0` and `sbt-protoc` to `1.0.7` > - > > Key: SPARK-47554 > URL: https://issues.apache.org/jira/browse/SPARK-47554 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47555) Record necessary raw exception log when loadTable
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47555: --- Labels: pull-request-available (was: ) > Record necessary raw exception log when loadTable > - > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: xleoken >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47555) Record necessary raw exception log when loadTable
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xleoken updated SPARK-47555: Summary: Record necessary raw exception log when loadTable (was: Print necessary raw exception log when loadTable) > Record necessary raw exception log when loadTable > - > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: xleoken >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47555) Print necessary raw exception log when loadTable
xleoken created SPARK-47555: --- Summary: Print necessary raw exception log when loadTable Key: SPARK-47555 URL: https://issues.apache.org/jira/browse/SPARK-47555 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: xleoken -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
[ https://issues.apache.org/jira/browse/SPARK-47549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47549: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Improvement) > Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts > --- > > Key: SPARK-47549 > URL: https://issues.apache.org/jira/browse/SPARK-47549 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47553) Add Java support tests for transformWithState operator
Anish Shrigondekar created SPARK-47553: -- Summary: Add Java support tests for transformWithState operator Key: SPARK-47553 URL: https://issues.apache.org/jira/browse/SPARK-47553 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Anish Shrigondekar Add Java support tests for transformWithState operator -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
[ https://issues.apache.org/jira/browse/SPARK-47549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47549: Assignee: Dongjoon Hyun > Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts > --- > > Key: SPARK-47549 > URL: https://issues.apache.org/jira/browse/SPARK-47549 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
[ https://issues.apache.org/jira/browse/SPARK-47549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47549. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45706 [https://github.com/apache/spark/pull/45706] > Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts > --- > > Key: SPARK-47549 > URL: https://issues.apache.org/jira/browse/SPARK-47549 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47552) Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s
[ https://issues.apache.org/jira/browse/SPARK-47552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47552: - Assignee: Dongjoon Hyun > Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s > --- > > Key: SPARK-47552 > URL: https://issues.apache.org/jira/browse/SPARK-47552 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > To suppress like HADOOP-19097 > {code} > 24/03/25 14:46:21 WARN ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 > ms instead > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47552) Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s
[ https://issues.apache.org/jira/browse/SPARK-47552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47552. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45710 [https://github.com/apache/spark/pull/45710] > Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s > --- > > Key: SPARK-47552 > URL: https://issues.apache.org/jira/browse/SPARK-47552 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > To suppress like HADOOP-19097 > {code} > 24/03/25 14:46:21 WARN ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 > ms instead > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47552) Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s
[ https://issues.apache.org/jira/browse/SPARK-47552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47552: --- Labels: pull-request-available (was: ) > Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s > --- > > Key: SPARK-47552 > URL: https://issues.apache.org/jira/browse/SPARK-47552 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > To suppress like HADOOP-19097 > {code} > 24/03/25 14:46:21 WARN ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 > ms instead > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47552) Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s
Dongjoon Hyun created SPARK-47552: - Summary: Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s Key: SPARK-47552 URL: https://issues.apache.org/jira/browse/SPARK-47552 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun To suppress warnings like the following (see HADOOP-19097): {code} 24/03/25 14:46:21 WARN ConfigurationHelper: Option fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 ms instead {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
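For users on releases that predate this change, the same setting can be applied manually; a spark-defaults.conf entry might look like this (the 30s value is the one proposed in the ticket, and the syntax follows standard Spark configuration files):

```
# spark-defaults.conf -- raise the S3A connection-establish timeout so the
# Hadoop ConfigurationHelper warning about a too-low value is not triggered
spark.hadoop.fs.s3a.connection.establish.timeout  30s
```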
[jira] [Updated] (SPARK-47550) Update kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47550: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Dependency upgrade) > Update kubernetes-client to 6.11.0 > -- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47550) Upgrade kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47550: -- Summary: Upgrade kubernetes-client to 6.11.0 (was: Update kubernetes-client to 6.11.0) > Upgrade kubernetes-client to 6.11.0 > --- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47550) Update kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47550. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45707 [https://github.com/apache/spark/pull/45707] > Update kubernetes-client to 6.11.0 > -- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Dependency upgrade > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47550) Update kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47550: - Assignee: Bjørn Jørgensen > Update kubernetes-client to 6.11.0 > -- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Dependency upgrade > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47551) Add variant_get expression.
Chenhao Li created SPARK-47551: -- Summary: Add variant_get expression. Key: SPARK-47551 URL: https://issues.apache.org/jira/browse/SPARK-47551 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Chenhao Li -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47550) Update kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47550: --- Labels: pull-request-available (was: ) > Update kubernetes-client to 6.11.0 > -- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Dependency upgrade > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47550) Update kubernetes-client to 6.11.0
Bjørn Jørgensen created SPARK-47550: --- Summary: Update kubernetes-client to 6.11.0 Key: SPARK-47550 URL: https://issues.apache.org/jira/browse/SPARK-47550 Project: Spark Issue Type: Dependency upgrade Components: k8s Affects Versions: 4.0.0 Reporter: Bjørn Jørgensen [Release notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47548) Remove unused `commons-beanutils` dependency
[ https://issues.apache.org/jira/browse/SPARK-47548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47548: - Assignee: Dongjoon Hyun > Remove unused `commons-beanutils` dependency > > > Key: SPARK-47548 > URL: https://issues.apache.org/jira/browse/SPARK-47548 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47548) Remove unused `commons-beanutils` dependency
[ https://issues.apache.org/jira/browse/SPARK-47548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47548. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45705 [https://github.com/apache/spark/pull/45705] > Remove unused `commons-beanutils` dependency > > > Key: SPARK-47548 > URL: https://issues.apache.org/jira/browse/SPARK-47548 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830618#comment-17830618 ] Gideon P commented on SPARK-47413: -- [~davidm-db] Are you sure you don't want me to take care of it? I would be more than happy to take care of this. [~uros-db] do you have another one for me, if David is taking this one over? > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *Substring* built-in string function in > Spark (including the *Right* and *Left* functions). First confirm what the > expected behaviour for these functions is when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how these functions should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMSs, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
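For intuition, what "supporting a collation" means for a matching function can be sketched in plain Python. Here casefold() stands in for a case-insensitive collation; real Spark support goes through ICU collators with locale-specific rules, so this is an illustrative simplification, not the actual implementation:

```python
def ci_startswith(s: str, prefix: str) -> bool:
    """Case-insensitive startsWith via Unicode case folding -- a rough
    stand-in for a case-insensitive collation. Real ICU collations also
    handle locale-specific rules, accent sensitivity, etc."""
    return s.casefold().startswith(prefix.casefold())

# Under the default binary collation "Spark" does not start with "SPA",
# but under a case-insensitive collation it does.
binary_match = "Spark".startswith("SPA")
ci_match = ci_startswith("Spark", "SPA")
```

The same question (binary comparison vs. collation-aware comparison) is what the ticket asks contributors to answer for Substring, Right, and Left before implementing.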
[jira] [Updated] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
[ https://issues.apache.org/jira/browse/SPARK-47549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47549: --- Labels: pull-request-available (was: ) > Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts > --- > > Key: SPARK-47549 > URL: https://issues.apache.org/jira/browse/SPARK-47549 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
Dongjoon Hyun created SPARK-47549: - Summary: Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts Key: SPARK-47549 URL: https://issues.apache.org/jira/browse/SPARK-47549 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47256) Assign error classes to FILTER expression errors
[ https://issues.apache.org/jira/browse/SPARK-47256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47256: Assignee: David Milicevic > Assign error classes to FILTER expression errors > > > Key: SPARK-47256 > URL: https://issues.apache.org/jira/browse/SPARK-47256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: David Milicevic >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_102[4-7]* > defined in {*}common/utils/src/main/resources/error/error-classes.json{*}. > The name should be short but complete (look at the example in > error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47256) Assign error classes to FILTER expression errors
[ https://issues.apache.org/jira/browse/SPARK-47256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47256. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45622 [https://github.com/apache/spark/pull/45622] > Assign error classes to FILTER expression errors > > > Key: SPARK-47256 > URL: https://issues.apache.org/jira/browse/SPARK-47256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: David Milicevic >Priority: Minor > Labels: pull-request-available, starter > Fix For: 4.0.0 > > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_102[4-7]* > defined in {*}common/utils/src/main/resources/error/error-classes.json{*}. > The name should be short but complete (look at the example in > error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-32350: --- Labels: pull-request-available (was: ) > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > Labels: pull-request-available > Fix For: 3.1.0 > > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduces HybridStore. HybridStore will write data to InMemoryStore at first > and use a background thread to dump data to LevelDB once the writing to > InMemoryStore is completed. In the comments section of > [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned > using batch writing can improve the performance of this dumping process and > he wrote the code of writeAll(). > I did the comparison of the HybridStore switching time between one-by-one > write and batch write on an HDD disk. When the disk is free, the batch-write > has around 25% improvement, and when the disk is 100% busy, the batch-write > has 7x - 10x improvement. 
> when the disk is at 0% utilization: > > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|16s|13s| > |265m, 400 jobs, 200 tasks per job|30s|23s| > |1.3g, 1000 jobs, 400 tasks per job|136s|108s| > > when the disk is at 100% utilization: > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|116s|17s| > |265m, 400 jobs, 200 tasks per job|251s|26s| > I also ran some write related benchmarking tests on LevelDBBenchmark.java and > measured the total time of writing 1024 objects. > when the disk is at 0% utilization: > > ||Benchmark test||with write(), ms||with writeAll(), ms || > |randomUpdatesIndexed|213.060|157.356| > |randomUpdatesNoIndex|57.869|35.439| > |randomWritesIndexed|298.854|229.274| > |randomWritesNoIndex|66.764|38.361| > |sequentialUpdatesIndexed|87.019|56.219| > |sequentialUpdatesNoIndex|61.851|41.942| > |sequentialWritesIndexed|94.044|56.534| > |sequentialWritesNoIndex|118.345|66.483| > > when the disk is at 50% utilization: > ||Benchmark test||with write(), ms||with writeAll(), ms|| > |randomUpdatesIndexed|230.386|180.817| > |randomUpdatesNoIndex|58.935|50.113| > |randomWritesIndexed|315.241|254.400| > |randomWritesNoIndex|96.709|41.164| > |sequentialUpdatesIndexed|89.971|70.387| > |sequentialUpdatesNoIndex|72.021|53.769| > |sequentialWritesIndexed|103.052|67.358| > |sequentialWritesNoIndex|76.194|99.037| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
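The write() vs. writeAll() gap in the numbers above comes down to paying per-call overhead (a disk sync/seek, worst on a busy disk) once per object versus once per batch. A minimal sketch of the pattern, where MockStore and its sync counter are hypothetical and stand in for the LevelDB-backed KVStore, not its real API:

```python
class MockStore:
    """Toy key-value store that counts 'disk syncs' to show why batching
    helps: write() pays one sync per object, write_all() one per batch."""

    def __init__(self):
        self.data = {}
        self.syncs = 0

    def write(self, key, value):
        self.data[key] = value
        self.syncs += 1          # one sync per object

    def write_all(self, items):
        self.data.update(items)
        self.syncs += 1          # one sync for the whole batch

one_by_one, batched = MockStore(), MockStore()
items = {f"k{i}": i for i in range(1024)}
for k, v in items.items():
    one_by_one.write(k, v)
batched.write_all(items)
```

Both stores end up with identical data, but the batched store performed a single "sync", which mirrors why the improvement grows (7x-10x) as disk contention makes each sync more expensive.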
[jira] [Updated] (SPARK-46743) Count bug introduced for scalar subquery when using TEMPORARY VIEW, as compared to using table
[ https://issues.apache.org/jira/browse/SPARK-46743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46743: -- Labels: correctness pull-request-available (was: pull-request-available) > Count bug introduced for scalar subquery when using TEMPORARY VIEW, as > compared to using table > -- > > Key: SPARK-46743 > URL: https://issues.apache.org/jira/browse/SPARK-46743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: correctness, pull-request-available > Fix For: 4.0.0 > > > Using the temp view reproduces the COUNT bug: it returns nulls instead of 0. > With a table: > {code:java} > scala> spark.sql("""CREATE TABLE outer_table USING parquet AS SELECT * FROM > VALUES > | (1, 1), > | (2, 1), > | (3, 3), > | (6, 6), > | (7, 7), > | (9, 9) AS inner_table(a, b)""") > val res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE TABLE null_table USING parquet AS SELECT CAST(null > AS int) AS a, CAST(null as int) AS b ;") > val res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("""SELECT ( SELECT COUNT(null_table.a) AS aggAlias FROM > null_table WHERE null_table.a = outer_table.a) FROM outer_table""").collect() > val res8: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], [0], [0], > [0]) {code} > With a view: > > {code:java} > spark.sql("CREATE TEMPORARY VIEW outer_view(a, b) AS VALUES (1, 1), (2, > 1),(3, 3), (6, 6), (7, 7), (9, 9);") > spark.sql("CREATE TEMPORARY VIEW null_view(a, b) AS SELECT CAST(null AS int), > CAST(null as int);") > spark.sql("""SELECT ( SELECT COUNT(null_view.a) AS aggAlias FROM null_view > WHERE null_view.a = outer_view.a) FROM outer_view""").collect() > val res2: Array[org.apache.spark.sql.Row] = Array([null], [null], [null], > [null], [null], [null]){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
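For reference, the semantics the table-based plan gets right: COUNT over an empty (or all-NULL) match set is 0, never NULL. A small Python sketch of the correlated subquery in the repro (a toy evaluator for illustration, not Spark code):

```python
def scalar_count_subquery(outer_a, inner_a):
    """Evaluate: SELECT (SELECT COUNT(i.a) FROM inner_t i WHERE i.a = o.a) FROM outer_t o."""
    results = []
    for o in outer_a:
        # COUNT(i.a) counts non-NULL matching values; an empty group yields 0, not NULL
        results.append(sum(1 for i in inner_a if i is not None and i == o))
    return results

# The repro above: null_table holds a single all-NULL row
print(scalar_count_subquery([1, 2, 3, 6, 7, 9], [None]))  # -> [0, 0, 0, 0, 0, 0]
```

The classic "COUNT bug" arises when decorrelation rewrites the subquery as an outer join but loses this count-of-empty-group-is-0 rule, so unmatched outer rows surface NULL instead of 0, as the temp-view plan does here.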
[jira] [Updated] (SPARK-46743) Count bug introduced for scalar subquery when using TEMPORARY VIEW, as compared to using table
[ https://issues.apache.org/jira/browse/SPARK-46743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46743: -- Component/s: SQL (was: Optimizer) > Count bug introduced for scalar subquery when using TEMPORARY VIEW, as > compared to using table > -- > > Key: SPARK-46743 > URL: https://issues.apache.org/jira/browse/SPARK-46743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Using the temp view reproduces COUNT bug, returns nulls instead of 0. > With a table: > {code:java} > scala> spark.sql("""CREATE TABLE outer_table USING parquet AS SELECT * FROM > VALUES > | (1, 1), > | (2, 1), > | (3, 3), > | (6, 6), > | (7, 7), > | (9, 9) AS inner_table(a, b)""") > val res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE TABLE null_table USING parquet AS SELECT CAST(null > AS int) AS a, CAST(null as int) AS b ;") > val res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("""SELECT ( SELECT COUNT(null_table.a) AS aggAlias FROM > null_table WHERE null_table.a = outer_table.a) FROM outer_table""").collect() > val res8: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], [0], [0], > [0]) {code} > With a view: > > {code:java} > spark.sql("CREATE TEMPORARY VIEW outer_view(a, b) AS VALUES (1, 1), (2, > 1),(3, 3), (6, 6), (7, 7), (9, 9);") > spark.sql("CREATE TEMPORARY VIEW null_view(a, b) AS SELECT CAST(null AS int), > CAST(null as int);") > spark.sql("""SELECT ( SELECT COUNT(null_view.a) AS aggAlias FROM null_view > WHERE null_view.a = outer_view.a) FROM outer_view""").collect() > val res2: Array[org.apache.spark.sql.Row] = Array([null], [null], [null], > [null], [null], [null]){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47548) Remove unused `commons-beanutils` dependency
[ https://issues.apache.org/jira/browse/SPARK-47548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47548: --- Labels: pull-request-available (was: ) > Remove unused `commons-beanutils` dependency > > > Key: SPARK-47548 > URL: https://issues.apache.org/jira/browse/SPARK-47548 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47548) Remove unused `commons-beanutils` dependency
Dongjoon Hyun created SPARK-47548: - Summary: Remove unused `commons-beanutils` dependency Key: SPARK-47548 URL: https://issues.apache.org/jira/browse/SPARK-47548 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830584#comment-17830584 ] David Milicevic commented on SPARK-47413: - Started working on this today. > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *Substring* built-in string function in > Spark (including the *Right* and *Left* functions). First confirm what the > expected behaviour for these functions is when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how these functions should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. 
Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
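Substring extraction itself is largely collation-agnostic; the work in this ticket is making the functions accept and propagate collated string types while staying codepoint-correct. As a rough reference for the base semantics, a Python sketch of SQL-style SUBSTRING/LEFT/RIGHT over codepoints (an approximation for illustration; Spark's UTF8String.substringSQL edge cases may differ):

```python
def substr(s: str, pos: int, length: int) -> str:
    """1-based SQL-style SUBSTRING over codepoints."""
    if pos > 0:
        start = pos - 1
    elif pos < 0:
        start = len(s) + pos   # negative pos counts from the end
    else:
        start = 0              # pos 0 treated like pos 1
    end = start + max(length, 0)
    return s[max(start, 0):max(end, 0)]

def left(s: str, n: int) -> str:
    return substr(s, 1, n)

def right(s: str, n: int) -> str:
    return "" if n <= 0 else substr(s, -n, n)

# Operating on codepoints, not bytes, keeps multi-byte characters intact:
print(substr("héllo", 2, 3))  # -> éll
```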
[jira] [Assigned] (SPARK-46743) Count bug introduced for scalar subquery when using TEMPORARY VIEW, as compared to using table
[ https://issues.apache.org/jira/browse/SPARK-46743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46743: --- Assignee: Andy Lam > Count bug introduced for scalar subquery when using TEMPORARY VIEW, as > compared to using table > -- > > Key: SPARK-46743 > URL: https://issues.apache.org/jira/browse/SPARK-46743 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.5.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: pull-request-available > > Using the temp view reproduces COUNT bug, returns nulls instead of 0. > With a table: > {code:java} > scala> spark.sql("""CREATE TABLE outer_table USING parquet AS SELECT * FROM > VALUES > | (1, 1), > | (2, 1), > | (3, 3), > | (6, 6), > | (7, 7), > | (9, 9) AS inner_table(a, b)""") > val res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE TABLE null_table USING parquet AS SELECT CAST(null > AS int) AS a, CAST(null as int) AS b ;") > val res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("""SELECT ( SELECT COUNT(null_table.a) AS aggAlias FROM > null_table WHERE null_table.a = outer_table.a) FROM outer_table""").collect() > val res8: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], [0], [0], > [0]) {code} > With a view: > > {code:java} > spark.sql("CREATE TEMPORARY VIEW outer_view(a, b) AS VALUES (1, 1), (2, > 1),(3, 3), (6, 6), (7, 7), (9, 9);") > spark.sql("CREATE TEMPORARY VIEW null_view(a, b) AS SELECT CAST(null AS int), > CAST(null as int);") > spark.sql("""SELECT ( SELECT COUNT(null_view.a) AS aggAlias FROM null_view > WHERE null_view.a = outer_view.a) FROM outer_view""").collect() > val res2: Array[org.apache.spark.sql.Row] = Array([null], [null], [null], > [null], [null], [null]){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46743) Count bug introduced for scalar subquery when using TEMPORARY VIEW, as compared to using table
[ https://issues.apache.org/jira/browse/SPARK-46743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46743. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45125 [https://github.com/apache/spark/pull/45125] > Count bug introduced for scalar subquery when using TEMPORARY VIEW, as > compared to using table > -- > > Key: SPARK-46743 > URL: https://issues.apache.org/jira/browse/SPARK-46743 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.5.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Using the temp view reproduces COUNT bug, returns nulls instead of 0. > With a table: > {code:java} > scala> spark.sql("""CREATE TABLE outer_table USING parquet AS SELECT * FROM > VALUES > | (1, 1), > | (2, 1), > | (3, 3), > | (6, 6), > | (7, 7), > | (9, 9) AS inner_table(a, b)""") > val res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE TABLE null_table USING parquet AS SELECT CAST(null > AS int) AS a, CAST(null as int) AS b ;") > val res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("""SELECT ( SELECT COUNT(null_table.a) AS aggAlias FROM > null_table WHERE null_table.a = outer_table.a) FROM outer_table""").collect() > val res8: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], [0], [0], > [0]) {code} > With a view: > > {code:java} > spark.sql("CREATE TEMPORARY VIEW outer_view(a, b) AS VALUES (1, 1), (2, > 1),(3, 3), (6, 6), (7, 7), (9, 9);") > spark.sql("CREATE TEMPORARY VIEW null_view(a, b) AS SELECT CAST(null AS int), > CAST(null as int);") > spark.sql("""SELECT ( SELECT COUNT(null_view.a) AS aggAlias FROM null_view > WHERE null_view.a = outer_view.a) FROM outer_view""").collect() > val res2: Array[org.apache.spark.sql.Row] = Array([null], [null], [null], > [null], [null], [null]){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42452) Remove hadoop-2 profile from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-42452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830566#comment-17830566 ] Dongjoon Hyun commented on SPARK-42452: --- This was resolved via https://github.com/apache/spark/pull/40788 > Remove hadoop-2 profile from Apache Spark > - > > Key: SPARK-42452 > URL: https://issues.apache.org/jira/browse/SPARK-42452 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > SPARK-40651 Drop Hadoop2 binary distribution from release process and > SPARK-42447 Remove Hadoop 2 GitHub Action job > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47411) StringInstr, FindInSet (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830561#comment-17830561 ] Milan Dankovic commented on SPARK-47411: I am working on this > StringInstr, FindInSet (all collations) > --- > > Key: SPARK-47411 > URL: https://issues.apache.org/jira/browse/SPARK-47411 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringInstr* and *FindInSet* built-in > string functions in Spark. First confirm what the expected behaviour for > these functions is when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how these functions should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMS, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringInstr* and > *FindInSet* functions so that they support all collation types currently > supported in Spark. To understand what changes were introduced in order to > enable full collation support for other existing functions in Spark, take a > look at the Spark PRs and Jira tickets for completed tasks in this parent > (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
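To make these functions collation-aware, the position search has to compare strings under the collation rather than byte-for-byte, which is what ICU's StringSearch provides. As a rough illustration only, Python's casefold() can stand in for a case-insensitive (UTF8_LCASE-like) collation; note that casefolding can change string lengths (e.g. 'ß' -> 'ss'), which is precisely why the real implementation needs ICU rather than this trick:

```python
def instr(haystack: str, needle: str, case_insensitive: bool = False) -> int:
    """1-based position of needle in haystack, 0 if absent (toy collation via casefold)."""
    if case_insensitive:
        haystack, needle = haystack.casefold(), needle.casefold()
    return haystack.find(needle) + 1  # str.find returns -1 when absent, hence 0

def find_in_set(s: str, csv: str, case_insensitive: bool = False) -> int:
    """1-based index of s within a comma-separated list, 0 if absent."""
    norm = (lambda x: x.casefold()) if case_insensitive else (lambda x: x)
    for i, item in enumerate(csv.split(","), start=1):
        if norm(item) == norm(s):
            return i
    return 0
```

Under a binary collation `instr("SparkSQL", "sql")` is 0; under the case-insensitive stand-in it is 6, mirroring how results should vary with the collation of the inputs.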
[jira] [Updated] (SPARK-47503) Spark history server fails to display query for cached JDBC relation named in quotes
[ https://issues.apache.org/jira/browse/SPARK-47503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47503: -- Fix Version/s: 3.4.3 > Spark history server fails to display query for cached JDBC relation named in > quotes > --- > > Key: SPARK-47503 > URL: https://issues.apache.org/jira/browse/SPARK-47503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 4.0.0 >Reporter: alexey >Assignee: alexey >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > Attachments: Screenshot_11.png, eventlog_v2_local-1711020585149.rar > > > The Spark history server fails to display a query for a cached JDBC relation (or a > calculation derived from it) named in quotes > (Screenshot and generated history in attachments) > How to reproduce: > {code:java} > val ticketsDf = spark.read.jdbc("jdbc:postgresql://localhost:5432/demo", """ "test-schema".tickets """.trim, properties) > val bookingDf = spark.read.parquet("path/bookings") > ticketsDf.cache().count() > val resultDf = bookingDf.join(ticketsDf, Seq("book_ref")) > resultDf.write.mode(SaveMode.Overwrite).parquet("path/result") {code} > > The problem is in the SparkPlanGraphNode class, which creates a dot node. When > there are no metrics to display, it simply returns the tagged name, and in this case > the name contains quotes, which corrupts the dot file. > The suggested solution is to escape the name string > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
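The suggested fix amounts to escaping the relation name before embedding it in a quoted Graphviz DOT label. A hypothetical sketch of such a helper (the function name and node layout are illustrative, not the actual patch):

```python
def escape_dot_label(name: str) -> str:
    """Escape a string for use inside a double-quoted Graphviz DOT label."""
    # Escape backslashes first so the quotes escaped next are not double-escaped
    return name.replace("\\", "\\\\").replace('"', '\\"')

# The quoted JDBC relation from the repro would otherwise terminate the label early
node_name = '"test-schema".tickets'
dot_node = f'node0 [label="{escape_dot_label(node_name)}"];'
print(dot_node)  # -> node0 [label="\"test-schema\".tickets"];
```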
[jira] [Updated] (SPARK-47537) Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J
[ https://issues.apache.org/jira/browse/SPARK-47537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47537: -- Fix Version/s: 3.4.3 > Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J > -- > > Key: SPARK-47537 > URL: https://issues.apache.org/jira/browse/SPARK-47537 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1, 4.0.0, 3.5.2 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47537) Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J
[ https://issues.apache.org/jira/browse/SPARK-47537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47537: -- Affects Version/s: 3.5.1 (was: 3.5.2) > Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J > -- > > Key: SPARK-47537 > URL: https://issues.apache.org/jira/browse/SPARK-47537 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1, 4.0.0, 3.5.1 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47537) Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J
[ https://issues.apache.org/jira/browse/SPARK-47537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47537: -- Fix Version/s: 3.5.2 > Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J > -- > > Key: SPARK-47537 > URL: https://issues.apache.org/jira/browse/SPARK-47537 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1, 4.0.0, 3.5.2 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47476) StringReplace (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830542#comment-17830542 ] Milan Dankovic commented on SPARK-47476: I am working on this > StringReplace (all collations) > -- > > Key: SPARK-47476 > URL: https://issues.apache.org/jira/browse/SPARK-47476 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *StringReplace* built-in string function in > Spark. First confirm what the expected behaviour for this function is when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringReplace* function so > it supports all collation types currently supported in Spark. To understand > what changes were introduced in order to enable full collation support for > other existing functions in Spark, take a look at the Spark PRs and Jira > tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
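A rough sketch of what collation-aware replace means, again with case-insensitivity standing in for a UTF8_LCASE-like collation purely for illustration (`re.IGNORECASE` is an approximation; real collation matching via ICU StringSearch can differ where casefolding changes string lengths):

```python
import re

def collated_replace(src: str, search: str, replacement: str,
                     case_insensitive: bool = False) -> str:
    """Replace every occurrence of search in src; an empty search is a no-op, as in SQL replace()."""
    if not search:
        return src
    flags = re.IGNORECASE if case_insensitive else 0
    # A callable replacement sidesteps re.sub's backslash-escape handling
    return re.sub(re.escape(search), lambda _: replacement, src, flags=flags)

print(collated_replace("ABCabc", "abc", "x"))        # binary match: only the exact-case hit
print(collated_replace("ABCabc", "abc", "x", True))  # collation-aware: both occurrences
```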
[jira] [Comment Edited] (SPARK-47476) StringReplace (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830542#comment-17830542 ] Milan Dankovic edited comment on SPARK-47476 at 3/25/24 3:45 PM: - I am working on this was (Author: JIRAUSER304529): I am working on this > StringReplace (all collations) > -- > > Key: SPARK-47476 > URL: https://issues.apache.org/jira/browse/SPARK-47476 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *StringReplace* built-in string function in > Spark. First confirm what the expected behaviour for this function is when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringReplace* function so > it supports all collation types currently supported in Spark. To understand > what changes were introduced in order to enable full collation support for > other existing functions in Spark, take a look at the Spark PRs and Jira > tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47476) StringReplace (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47476: --- Labels: pull-request-available (was: ) > StringReplace (all collations) > -- > > Key: SPARK-47476 > URL: https://issues.apache.org/jira/browse/SPARK-47476 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *StringReplace* built-in string function in > Spark. First confirm what the expected behaviour for this function is when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringReplace* function so > it supports all collation types currently supported in Spark. To understand > what changes were introduced in order to enable full collation support for > other existing functions in Spark, take a look at the Spark PRs and Jira > tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47547) observed false positive rate in bloom filter is greater than expected for large n
Nathan Conroy created SPARK-47547: - Summary: observed false positive rate in bloom filter is greater than expected for large n Key: SPARK-47547 URL: https://issues.apache.org/jira/browse/SPARK-47547 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Nathan Conroy When creating a bloom filter out of a large number of elements (>400 million or so) with an fpp (false positive rate) of 1% in Spark, the observed false positive rate appears to be much higher, as much as 20%. This is demonstrated below in this spark shell: {noformat} __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.5.0-amzn-0 /_/ Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 17.0.10) Type in expressions to have them evaluated. Type :help for more information. scala> import java.security.MessageDigest import java.security.MessageDigest scala> import scala.util.Random import scala.util.Random scala> import org.apache.spark.util.sketch.BloomFilter import org.apache.spark.util.sketch.BloomFilter scala> scala> // Function to generate a random SHA1 hash scala> def generateRandomSha1(): String = { | val randomString = Random.alphanumeric.take(20).mkString | val sha1 = MessageDigest.getInstance("SHA-1") | sha1.update(randomString.getBytes("UTF-8")) | val digest = sha1.digest | digest.map("%02x".format(_)).mkString | } generateRandomSha1: ()String scala> scala> // Generate a DataFrame with 500 million rows of random SHA1 hashes scala> val df = spark.range(500000000L).map(_ => generateRandomSha1()).toDF("Hash") df: org.apache.spark.sql.DataFrame = [Hash: string] scala> // Create a bloom filter out of this collection of strings. 
scala> val bloom_filter = df.stat.bloomFilter("Hash", 500000000L, 0.01) bloom_filter: org.apache.spark.util.sketch.BloomFilter = org.apache.spark.util.sketch.BloomFilterImpl@a14c0ba9 scala> // Generate another 10,000 random hashes scala> val random_sha1s = List.fill(10000)(generateRandomSha1()) random_sha1s: List[String] = List(f3cbfd9bd836ea917ebc0dfc5330135cfde322a3, 4bff8d58799e517a1ba78236db9b52353dd39b56, 775bdd9d138a79eeae7308617f5c0d1d0e1c1697, abbd761b7768f3cbadbffc0c7947185856c4943d, 343692fe61c552f73ad6bc2d2d3072cc456da1db, faf4430055c528c9a00a46e9fae7dc25047ffaf3, 255b5d56c39bfba861647fff67704e6bc758d683, dae8e0910a368f034958ae232aa5f5285486a8ac, 3680dbd34437ca661592a7e4d39782c9c77fb4ba, f5b43f7a77c9d9ea28101a1848d8b1a1c0a65b82, 5bda825102026bc0da731dc84d56a499ccff0fe1, 158d7b3ce949422de421d5e110e3f6903af4f8e1, 2efcae5cb10273a0f5e89ae34fa3156238ab0555, 8d241012d42097f80f30e8ead227d75ab77086d2, 307495c98ae5f25026b91e60cf51d4f9f1ad7f4b, 8fc2f55563ab67d4ec87ff7b04a4a01e821814a3, b413572d14ee16c6c575ca3472adff62a8cbfa3d, 9219233b0e8afe57d7d5cb6... scala> // Check how many of these random hashes return a positive result when passed into mightContain scala> random_sha1s.map(c => bloom_filter.mightContain(c)).count(_ == true) res0: Int = 2153 {noformat} I believe this is the result of the bloom filter implementation using 32-bit hashes. Since the maximum value that can be returned by the k hash functions is ~2.14 billion (the max integer value in Java), bloom filters with m > ~2.14 billion have degraded performance resulting from not using any bits at indices greater than ~2.14 billion. This was a known bug in Guava that was fixed several years ago, but it looks like the fix was never ported to Spark. See [https://github.com/google/guava/issues/1119] Of course, using a different hash function strategy would break existing uses of this code, so we should tread with caution here. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
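The observed numbers line up with the 32-bit-index explanation. Using the standard Guava-style sizing formulas (Spark's sketch implementation derives from Guava's), the requested (n, fpp) needs more bits than a 32-bit index can address, and capping the usable bits at 2^31 predicts roughly the ~21% false positive rate seen in the shell transcript:

```python
import math

n, p = 500_000_000, 0.01
m = math.ceil(-n * math.log(p) / math.log(2) ** 2)   # optimal number of bits (~4.8e9)
k = max(1, round(m / n * math.log(2)))               # optimal number of hash functions

assert m > 2**31 - 1  # more bits requested than a 32-bit hash index can reach

# If hash-derived indices are 32-bit, only the first 2^31 bits are ever touched,
# so the achievable false positive rate follows from m_eff = 2^31 instead of m:
m_eff = 2**31
effective_fpp = (1 - math.exp(-k * n / m_eff)) ** k
print(f"effective fpp ~ {effective_fpp:.1%}")  # ~21.7%, vs. the observed 2153/10000
```

So the degraded rate is not noise: it is what the standard fpp formula predicts once the addressable bit array silently shrinks to 2^31 bits.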
[jira] [Updated] (SPARK-47545) [Connect] DF observe support for the scala client
[ https://issues.apache.org/jira/browse/SPARK-47545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47545: --- Labels: pull-request-available (was: ) > [Connect] DF observe support for the scala client > - > > Key: SPARK-47545 > URL: https://issues.apache.org/jira/browse/SPARK-47545 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 4.0.0 >Reporter: Pengfei Xu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47545) [Connect] DF observe support for the scala client
Pengfei Xu created SPARK-47545: -- Summary: [Connect] DF observe support for the scala client Key: SPARK-47545 URL: https://issues.apache.org/jira/browse/SPARK-47545 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 4.0.0 Reporter: Pengfei Xu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense
[ https://issues.apache.org/jira/browse/SPARK-47544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47544: --- Labels: pull-request-available (was: ) > [Pyspark] SparkSession builder method is incompatible with vs code > intellisense > --- > > Key: SPARK-47544 > URL: https://issues.apache.org/jira/browse/SPARK-47544 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Niranjan Jayakar >Priority: Major > Labels: pull-request-available > Attachments: old.mov > > > VS code's intellisense is unable to recognize the methods under > `SparkSession.builder`. > > See attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense
[ https://issues.apache.org/jira/browse/SPARK-47544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-47544: - Attachment: old.mov > [Pyspark] SparkSession builder method is incompatible with vs code > intellisense > --- > > Key: SPARK-47544 > URL: https://issues.apache.org/jira/browse/SPARK-47544 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Niranjan Jayakar >Priority: Major > Attachments: old.mov > > > VS code's intellisense is unable to recognize the methods under > `SparkSession.builder`. > > See attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense
Niranjan Jayakar created SPARK-47544: Summary: [Pyspark] SparkSession builder method is incompatible with vs code intellisense Key: SPARK-47544 URL: https://issues.apache.org/jira/browse/SPARK-47544 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 4.0.0 Reporter: Niranjan Jayakar VS code's intellisense is unable to recognize the methods under `SparkSession.builder`. See attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47543) Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.
[ https://issues.apache.org/jira/browse/SPARK-47543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47543: --- Labels: pull-request-available (was: ) > Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame > creation. > > > Key: SPARK-47543 > URL: https://issues.apache.org/jira/browse/SPARK-47543 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Currently the PyArrow infers the Pandas dictionary field as StructType > instead of MapType, so Spark can't handle the schema properly: > {code:java} > >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, > >>> 'second': 0.3}]}) > >>> pa.Schema.from_pandas(pdf) > str_col: string > dict_col: struct > child 0, first: double > child 1, second: double > {code} > We cannot handle this case since we use PyArrow for schema creation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47543) Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.
Haejoon Lee created SPARK-47543: --- Summary: Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. Key: SPARK-47543 URL: https://issues.apache.org/jira/browse/SPARK-47543 Project: Spark Issue Type: Bug Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Haejoon Lee Currently, PyArrow infers a Pandas dictionary field as StructType instead of MapType, so Spark can't handle the schema properly:
{code:java}
>>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, 'second': 0.3}]})
>>> pa.Schema.from_pandas(pdf)
str_col: string
dict_col: struct<first: double, second: double>
  child 0, first: double
  child 1, second: double
{code}
We cannot handle this case since we use PyArrow for schema creation.
[jira] [Commented] (SPARK-47256) Assign error classes to FILTER expression errors
[ https://issues.apache.org/jira/browse/SPARK-47256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830443#comment-17830443 ] David Milicevic commented on SPARK-47256: - Working on this ticket in https://github.com/apache/spark/pull/45622. > Assign error classes to FILTER expression errors > > > Key: SPARK-47256 > URL: https://issues.apache.org/jira/browse/SPARK-47256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_102[4-7]* > defined in {*}common/utils/src/main/resources/error/error-classes.json{*}. > The name should be short but complete (look at the example in > error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47504) Resolve AbstractDataType simpleStrings for StringTypeCollated
[ https://issues.apache.org/jira/browse/SPARK-47504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47504: --- Labels: pull-request-available (was: ) > Resolve AbstractDataType simpleStrings for StringTypeCollated > - > > Key: SPARK-47504 > URL: https://issues.apache.org/jira/browse/SPARK-47504 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *SPARK-47296* introduced a change to fail all unsupported functions. Because > of this change expected *inputTypes* in *ExpectsInputTypes* had to be > changed. This change introduced a change on user side which will print > *"STRING_ANY_COLLATION"* in places where before we printed *"STRING"* when an > error occurred. Concretely if we get an input of Int where > *StringTypeAnyCollation* was expected, we will throw this faulty message for > users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use Spark's JDBC connector to pull data from Oracle, the query will not hit an index if the pushed-down filter's column type in Oracle is DATE.

Here is my scenario. First I create a DataFrame that reads from Oracle:
{code:java}
val df = spark.read.format("jdbc").
  option("url", url).
  option("driver", driver).
  option("user", user).
  option("password", passwd).
  option("dbtable", "select * from foobar.tbl1").
  load()
{code}
Then I apply a filter to the DataFrame like this:
{code:java}
df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', 'yyyy-MM-dd HH:mm:ss')""").count()
{code}
This will not hit the index on the update_time column.

Reason: the update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is one second). When the filter is pushed down to Oracle, it goes through the following code in org.apache.spark.sql.jdbc.OracleDialect:
{code:java}
override def compileValue(value: Any): Any = value match {
  // The JDBC drivers support date literals in SQL statements written in the
  // format: {d 'yyyy-mm-dd'} and timestamp literals in SQL statements written
  // in the format: {ts 'yyyy-mm-dd hh:mm:ss.f...'}. For details, see
  // 'Oracle Database JDBC Developer's Guide and Reference, 11g Release 1 (11.1)'
  // Appendix A Reference Information.
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "{ts '" + timestampValue + "'}"
  case dateValue: Date => "{d '" + dateValue + "'}"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}
{code}
As a result, the condition update_time >= {ts '2024-03-12 06:18:17'} will never hit the index.

In my case, as a workaround, I changed the Timestamp case to:
{code:java}
case timestampValue: Timestamp =>
  s"to_date('${dateFormat.format(timestampValue)}', 'yyyy-MM-dd HH:mi:ss')"
{code}
After this modification, it worked well.
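The difference between the two literal forms is easy to see in isolation. The sketch below is a plain-Python illustration of the idea only (the function names are mine, and the `'YYYY-MM-DD HH24:MI:SS'` mask is my choice of Oracle format-mask syntax, not the reporter's exact patch): the default path emits a JDBC `{ts ...}` escape that Oracle treats as a TIMESTAMP literal, forcing an implicit conversion on the DATE column, while the workaround emits `to_date(...)`, which yields a DATE-typed literal and lets the index be used.

```python
from datetime import datetime

TS = datetime(2024, 3, 12, 6, 18, 17)

def compile_timestamp_default(ts: datetime) -> str:
    # Mirrors the OracleDialect behavior quoted above: a JDBC timestamp
    # escape. Oracle sees a TIMESTAMP literal, so a DATE index is skipped.
    return "{ts '" + ts.strftime("%Y-%m-%d %H:%M:%S") + "'}"

def compile_timestamp_workaround(ts: datetime) -> str:
    # Sketch of the reporter's workaround: emit to_date(...) so the
    # literal is DATE-typed and comparable to the indexed column.
    return "to_date('%s', 'YYYY-MM-DD HH24:MI:SS')" % ts.strftime("%Y-%m-%d %H:%M:%S")

assert compile_timestamp_default(TS) == "{ts '2024-03-12 06:18:17'}"
assert compile_timestamp_workaround(TS) == \
    "to_date('2024-03-12 06:18:17', 'YYYY-MM-DD HH24:MI:SS')"
```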
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')" After this modification, it worked well. was: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{\{color:#9876aa}dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')" After
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{\{color:#9876aa}dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')" After this modification, it worked well. was: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{{color:#9876aa}dateF
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{{color:#9876aa}dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')"{color} After this modification, it worked well. was: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{c
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color} { {color:#9876aa} dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')"{color} then it worked well. was: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } and this "update_time >= \{ts '2024-03-12 06:18:17'}" will never hit the index. 
In my case, as a work around, I just change the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date(
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use Spark's JDBC source to pull data from Oracle, the query will not hit the index if the pushed filter's column type in Oracle is DATE. Here is my scenario: first I created a dataframe that reads from Oracle:
{code:java}
val df = spark.read.format("jdbc").
  option("url", url).
  option("driver", driver).
  option("user", user).
  option("password", passwd).
  option("dbtable", "select * from foobar.tbl1").
  load(){code}
then I apply a filter to the dataframe like this:
{code:java}
df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', 'yyyy-MM-dd HH:mm:ss')""").count(){code}
this will not hit the index on the update_time column. Reason: the update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of Oracle's DATE is one second). When Spark pushes the filter down to Oracle, it goes through the following code:
{code:java}
// class is org.apache.spark.sql.jdbc.OracleDialect
override def compileValue(value: Any): Any = value match {
  // The JDBC drivers support date literals in SQL statements written in the
  // format: {d 'yyyy-mm-dd'} and timestamp literals in SQL statements written
  // in the format: {ts 'yyyy-mm-dd hh:mm:ss.f...'}. For details, see
  // 'Oracle Database JDBC Developer's Guide and Reference, 11g Release 1 (11.1)'
  // Appendix A Reference Information.
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "{ts '" + timestampValue + "'}"
  case dateValue: Date => "{d '" + dateValue + "'}"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}{code}
and the resulting predicate "update_time >= {ts '2024-03-12 06:18:17'}" will never hit the index.
In my case, as a workaround, I just changed that case to:
{code:java}
case timestampValue: Timestamp =>
  s"to_date('${dateFormat.format(timestampValue)}', 'yyyy-MM-dd HH:mi:ss')"{code}
and then it worked well.
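The reported fix boils down to rendering Timestamp literals as an Oracle to_date(...) expression instead of the JDBC {ts '...'} escape, so that comparing against a DATE column stays index-friendly. A minimal, self-contained sketch of that literal-compiling step (this is not Spark's actual OracleDialect; the object name, the dateFormat pattern, and the HH24:MI:SS format model are assumptions for illustration):

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

// Sketch of the workaround's Timestamp case: emit an Oracle to_date(...)
// expression instead of the JDBC escape {ts '...'}, which Oracle treats as a
// TIMESTAMP literal and so implicitly converts the DATE column, defeating its index.
object TimestampLiteralSketch {
  private val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  def compileValue(value: Any): Any = value match {
    case timestampValue: Timestamp =>
      s"to_date('${dateFormat.format(timestampValue)}', 'yyyy-MM-dd HH24:MI:SS')"
    case other => other // other literal types left as-is in this sketch
  }
}
```

With this in place, a pushed predicate would render as, e.g., update_time >= to_date('2024-03-12 06:18:17', 'yyyy-MM-dd HH24:MI:SS'), which Oracle can evaluate directly against the DATE index.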
[jira] [Updated] (SPARK-47541) Collated strings in complex types supporting operations reverse, array_join, concat, map
[ https://issues.apache.org/jira/browse/SPARK-47541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47541: --- Labels: pull-request-available (was: )
> Collated strings in complex types supporting operations reverse, array_join, concat, map
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-47541
> URL: https://issues.apache.org/jira/browse/SPARK-47541
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikola Mandic
> Priority: Major
> Labels: pull-request-available
>
> Add proper support for complex types containing collated strings in
> operations reverse, array_join, concat, map (create). Examples:
> {code:java}
> select reverse('abc' collate utf8_binary_lcase);
> select reverse(array('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase));
> select array_join(array('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase), ', ' collate utf8_binary_lcase);
> select concat('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase);
> select map('a' collate utf8_binary_lcase, 1, 'A' collate utf8_binary_lcase, 2);{code}
[jira] [Created] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
Danke Liu created SPARK-47542: - Summary: spark cannot hit oracle's index when column type is DATE Key: SPARK-47542 URL: https://issues.apache.org/jira/browse/SPARK-47542 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.4 Reporter: Danke Liu
When I use Spark's JDBC source to pull data from Oracle, the query will not hit the index if the pushed filter's column type in Oracle is DATE. Here is my scenario: first I created a dataframe that reads from Oracle:
{code:java}
val df = spark.read.format("jdbc").
  option("url", url).
  option("driver", driver).
  option("user", user).
  option("password", passwd).
  option("dbtable", "select * from foobar.tbl1").
  load(){code}
then I applied a filter to the dataframe like this:
{code:java}
df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', 'yyyy-MM-dd HH:mm:ss')""").count(){code}
this will not hit the index on the update_time column. Reason: the update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of Oracle's DATE is one second), and when the filter is pushed to Oracle it goes through the code below:
{code:java}
// class is org.apache.spark.sql.jdbc.OracleDialect
override def compileValue(value: Any): Any = value match {
  // The JDBC drivers support date literals in SQL statements written in the
  // format: {d 'yyyy-mm-dd'} and timestamp literals in SQL statements written
  // in the format: {ts 'yyyy-mm-dd hh:mm:ss.f...'}. For details, see
  // 'Oracle Database JDBC Developer's Guide and Reference, 11g Release 1 (11.1)'
  // Appendix A Reference Information.
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "{ts '" + timestampValue + "'}"
  case dateValue: Date => "{d '" + dateValue + "'}"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}{code}
and the resulting predicate "update_time >= {ts '2024-03-12 06:18:17'}" will never hit the index.
In my case, as a workaround, I just changed the code to this:
{code:java}
case timestampValue: Timestamp =>
  s"to_date('${dateFormat.format(timestampValue)}', 'yyyy-MM-dd HH:mi:ss')"{code}
and then it worked well.
[jira] [Assigned] (SPARK-47539) Make the return value of method `castToString` be `Any => UTF8String`
[ https://issues.apache.org/jira/browse/SPARK-47539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47539: Assignee: BingKun Pan
> Make the return value of method `castToString` be `Any => UTF8String`
> ---------------------------------------------------------------------
>
> Key: SPARK-47539
> URL: https://issues.apache.org/jira/browse/SPARK-47539
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47539) Make the return value of method `castToString` be `Any => UTF8String`
[ https://issues.apache.org/jira/browse/SPARK-47539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47539. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45688 [https://github.com/apache/spark/pull/45688]
> Make the return value of method `castToString` be `Any => UTF8String`
> ---------------------------------------------------------------------
>
> Key: SPARK-47539
> URL: https://issues.apache.org/jira/browse/SPARK-47539
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
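The SPARK-47539 change makes castToString return a function value of type Any => UTF8String: the converter is chosen once per source type and then applied per row. A rough sketch of that shape in plain Scala, with a stand-in UTF8String since the real class lives in org.apache.spark.unsafe.types (the object name and the type cases here are illustrative assumptions, not Spark's actual code):

```scala
// Stand-in for org.apache.spark.unsafe.types.UTF8String; only fromString is
// needed for this sketch.
final case class UTF8String(s: String) { override def toString: String = s }
object UTF8String { def fromString(s: String): UTF8String = UTF8String(s) }

// A castToString-style factory: resolves the converter once for a given source
// type and returns it as a reusable Any => UTF8String function value.
object CastSketch {
  def castToString(fromType: String): Any => UTF8String = fromType match {
    case "boolean" => v => UTF8String.fromString(if (v.asInstanceOf[Boolean]) "true" else "false")
    case "int"     => v => UTF8String.fromString(v.asInstanceOf[Int].toString)
    case _         => v => UTF8String.fromString(String.valueOf(v))
  }
}
```

The point of the `Any => UTF8String` return type is that the per-type dispatch happens when the function is built, not on every value converted.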