[jira] [Created] (SPARK-47559) Codegen Support for variant parse_json
BingKun Pan created SPARK-47559: --- Summary: Codegen Support for variant parse_json Key: SPARK-47559 URL: https://issues.apache.org/jira/browse/SPARK-47559 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47559) Codegen Support for variant parse_json
[ https://issues.apache.org/jira/browse/SPARK-47559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47559: --- Labels: pull-request-available (was: ) > Codegen Support for variant parse_json > -- > > Key: SPARK-47559 > URL: https://issues.apache.org/jira/browse/SPARK-47559 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.
[ https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830804#comment-17830804 ] slankka edited comment on SPARK-21711 at 3/26/24 6:47 AM: -- Thanks to [~mahesh_ambule]. Good to know; it's annoying to see errors in the launcher output during submission on the client machine. {code:java} log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: /stdout (Permission denied) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:133) at org.apache.log4j.FileAppender.setFile(FileAppender.java:294) at org.apache.log4j.RollingFileAppender.setFile(RollingFileAppender.java:207){code} with a setting in log4j.properties like: {code:java} appender.file_appender.fileName=${spark.yarn.app.container.log.dir}/stdout{code} h3. Conclusion 1. SPARK_SUBMIT_OPTS solves the problem above: the client log is written to the correct directory. 2. Setting SPARK_SUBMIT_OPTS will of course NOT affect driver options or executor options. h3. Notes Modifying bin/spark-class as below won't work: {code:java} "$RUNNER" -Dlog4j.properties= -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" {code} because the real submit command is partially built by {code:java} org.apache.spark.launcher.AbstractCommandBuilder#buildJavaCommand {code} *buildJavaCommand* generates the command from the java executable through the classpath, before *org.apache.spark.deploy.SparkSubmit*. Logging and debugging: [Running Spark on YARN - Spark 3.5.1 Documentation (apache.org)|https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application] > spark-submit command should accept log4j configuration parameters for spark > client logging. > --- > > Key: SPARK-21711 > URL: https://issues.apache.org/jira/browse/SPARK-21711 > Project: Spark > Issue Type: Improvement > Components: Spark Submit > Affects Versions: 1.6.0, 2.1.0 > Reporter: Mahesh Ambule > Priority: Minor > Attachments: spark-submit client logs.txt > > > Currently, log4j properties can be specified in the Spark 'conf' directory in the > log4j.properties file. > The spark-submit command can override these log4j properties for the driver and > executors. > But it cannot override these log4j properties for the *spark client* > application. > The user should be able to pass log4j properties for the spark client using the > spark-submit command. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
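The SPARK_SUBMIT_OPTS conclusion above can be sketched as follows. This is a minimal illustration, not the project's documented method: the config path and submit arguments are hypothetical, and it assumes a log4j 1.x-style `-Dlog4j.configuration` override, matching the log4j 1.x errors quoted in the comment.

```python
import os

# Hypothetical client-side log4j 1.x config that appends to a writable path,
# instead of conf/log4j.properties (which also drives driver/executor logging
# and may point at ${spark.yarn.app.container.log.dir}, unset on the client).
client_log4j = "/tmp/client-log4j.properties"

env = os.environ.copy()
# bin/spark-class passes SPARK_SUBMIT_OPTS to the launcher JVM only, so this
# overrides client logging without touching driver or executor options.
env["SPARK_SUBMIT_OPTS"] = f"-Dlog4j.configuration=file:{client_log4j}"

cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", "app.py"]
# subprocess.run(cmd, env=env, check=True)  # launch with the patched environment
print(env["SPARK_SUBMIT_OPTS"])
```

Setting the property per-invocation this way avoids editing bin/spark-class, which, as noted above, is rebuilt into the real command by AbstractCommandBuilder#buildJavaCommand anyway.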
[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.
[ https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830804#comment-17830804 ] slankka commented on SPARK-21711: - Thanks to [~mahesh_ambule]. Good to know; it's annoying to see errors in the launcher output during submission on the client machine. {code:java} log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: /stdout (Permission denied) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.(FileOutputStream.java:213) at java.io.FileOutputStream.(FileOutputStream.java:133) at org.apache.log4j.FileAppender.setFile(FileAppender.java:294) at org.apache.log4j.RollingFileAppender.setFile(RollingFileAppender.java:207){code} with a setting in log4j.properties like: {code:java} appender.file_appender.fileName=${spark.yarn.app.container.log.dir}/stdout{code} h3. Conclusion 1. SPARK_SUBMIT_OPTS solves the problem above: the client log is written to the correct directory. 2. Setting SPARK_SUBMIT_OPTS will of course NOT affect driver options or executor options. h3. Notes Modifying bin/spark-class as below won't work: {code:java} "$RUNNER" -Dlog4j.properties= -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" {code} because the real submit command is partially built by {code:java} org.apache.spark.launcher.AbstractCommandBuilder#buildJavaCommand {code} *buildJavaCommand* generates the command from the java executable through the classpath, before *org.apache.spark.deploy.SparkSubmit*. Logging and debugging: [Running Spark on YARN - Spark 3.5.1 Documentation (apache.org)|https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application] > spark-submit command should accept log4j configuration parameters for spark > client logging. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sundeep K resolved SPARK-47556. --- Fix Version/s: 3.3.0 Resolution: Fixed [https://github.com/apache/spark/commit/fe94bf07f9acec302e7d8becd7e576c777337331] and https://issues.apache.org/jira/browse/SPARK-36014 > [K8] Spark App ID collision resulting in deleting wrong resources > - > > Key: SPARK-47556 > URL: https://issues.apache.org/jira/browse/SPARK-47556 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core > Affects Versions: 3.1 > Reporter: Sundeep K > Priority: Major > Fix For: 3.3.0 > > > h3. Issue: > We noticed that sometimes K8s executor pods go into a crash loop with 'Error: > MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon > investigation we noticed that 2 Spark jobs had launched with the same > application ID, and when one of them finished first it deleted all of its > resources and the resources of the other job too. > -> The Spark application ID is created using this > [code|https://affirm.slack.com/archives/C06Q2GWLWKH/p1711132115304449?thread_ts=1711123500.783909&cid=C06Q2GWLWKH] > > "spark-application-" + System.currentTimeMillis > This means that if 2 applications launch in the same millisecond they can end > up with the same app ID. > -> The > [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23] > label is added to every resource created by the driver, and its value is the > application ID. The Kubernetes scheduler backend deletes all resources with the same > [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6] > upon termination. > This results in deletion of the config map and executor pods of the job that's still > running; the driver tries to relaunch the executor pods, but the config map is no longer > present, so they crash-loop. > h3. Context > We are using [Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] and launch > our Spark jobs using PySpark. We launch multiple Spark jobs within a given > k8s namespace. Each Spark job can be launched from different pods or from > different processes in a pod. Every time a job is launched it has a unique > app name. Here is how a job is launched (omitting irrelevant details): > {code:java} > # spark_conf has settings required for spark on k8s > sp = SparkSession.builder \ > .config(conf=spark_conf) \ > .appName('testapp') > sp.master(f'k8s://{kubernetes_host}') > session = sp.getOrCreate() > with session: > session.sql('SELECT 1'){code} > h3. Repro > Set the same app ID in the Spark config and run 2 different jobs, one that finishes > fast and one that runs slow. The slower job goes into a crash loop. > {code:java} > "spark.app.id": ""{code} > h3. Workaround > Set a unique spark.app.id for all the jobs that run on k8s, > e.g.: > {code:java} > "spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code} > h3. Fix > Add a unique hash at the end of the application ID: > [https://github.com/apache/spark/pull/45712] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
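The workaround quoted above (a unique spark.app.id per job, truncated to the 63-character limit for Kubernetes label values) can be sketched as below. This is a minimal illustration of the reporter's f-string pattern, not code from the ticket; the helper name `unique_app_id` is made up, and the commented-out SparkSession usage is illustrative:

```python
import time
import uuid

def unique_app_id(app_name: str) -> str:
    # App name + millisecond timestamp + random hex: two jobs launched in the
    # same millisecond still get different IDs, so the spark-app-selector
    # labels no longer collide. Kubernetes caps label values at 63 characters,
    # hence the slice.
    raw = f"{app_name}-{int(time.time() * 1000)}-{uuid.uuid4().hex}"
    return raw[:63]

app_id = unique_app_id("testapp")
# Hypothetical usage with the PySpark builder shown in the report:
# sp = (SparkSession.builder
#       .config(conf=spark_conf)
#       .config("spark.app.id", app_id)
#       .appName("testapp"))
```

Spark 3.3+ makes this unnecessary by appending a random suffix to the generated application ID itself, per the commit referenced in the resolution.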
[jira] [Commented] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830793#comment-17830793 ] Sundeep K commented on SPARK-47556: --- This is actually fixed in 3.3 and above: https://github.com/apache/spark/commit/fe94bf07f9acec302e7d8becd7e576c777337331 > [K8] Spark App ID collision resulting in deleting wrong resources -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556 ] Sundeep K deleted comment on SPARK-47556: --- was (Author: JIRAUSER304761): This seems to be fixed in 3.2 and above: https://github.com/Affirm/spark/commit/fe94bf07f9acec302e7d8becd7e576c777337331 > [K8] Spark App ID collision resulting in deleting wrong resources -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830792#comment-17830792 ] Sundeep K commented on SPARK-47556: --- This seems to be fixed in 3.2 and above: https://github.com/Affirm/spark/commit/fe94bf07f9acec302e7d8becd7e576c777337331 > [K8] Spark App ID collision resulting in deleting wrong resources -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
[ https://issues.apache.org/jira/browse/SPARK-47556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sundeep K updated SPARK-47556: -- Affects Version/s: 3.1 (was: 3.3.2) (was: 3.5.1) > [K8] Spark App ID collision resulting in deleting wrong resources -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47558) [Arbitrary State Support] State TTL support - ValueState
Bhuwan Sahni created SPARK-47558: Summary: [Arbitrary State Support] State TTL support - ValueState Key: SPARK-47558 URL: https://issues.apache.org/jira/browse/SPARK-47558 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Bhuwan Sahni Add support for expiring state values based on TTL for ValueState in the transformWithState operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47450. -- Fix Version/s: 4.0.0 Assignee: Dongjoon Hyun Resolution: Fixed Reverted the revert https://github.com/apache/spark/commit/31db27d193fb79b022b7978ef9d0e715da8ade86 > Use R 4.3.3 in `windows` R GitHub Action job > > > Key: SPARK-47450 > URL: https://issues.apache.org/jira/browse/SPARK-47450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47450: --- Labels: pull-request-available (was: ) > Use R 4.3.3 in `windows` R GitHub Action job > > > Key: SPARK-47450 > URL: https://issues.apache.org/jira/browse/SPARK-47450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47509) Incorrect results for LambdaFunctions or HigherOrderFunctions in subquery expression plans
[ https://issues.apache.org/jira/browse/SPARK-47509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47509: --- Assignee: Daniel > Incorrect results for LambdaFunctions or HigherOrderFunctions in subquery > expression plans > -- > > Key: SPARK-47509 > URL: https://issues.apache.org/jira/browse/SPARK-47509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Labels: pull-request-available > > We can return an error for this case to fix the correctness bug. Later we can > look at supporting this query pattern as time allows. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47509) Incorrect results for LambdaFunctions or HigherOrderFunctions in subquery expression plans
[ https://issues.apache.org/jira/browse/SPARK-47509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47509. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45652 [https://github.com/apache/spark/pull/45652] > Incorrect results for LambdaFunctions or HigherOrderFunctions in subquery > expression plans > -- > > Key: SPARK-47509 > URL: https://issues.apache.org/jira/browse/SPARK-47509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We can return an error for this case to fix the correctness bug. Later we can > look at supporting this query pattern as time allows. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47557) Audit MySQL ENUM/SET Types
[ https://issues.apache.org/jira/browse/SPARK-47557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47557: --- Labels: pull-request-available (was: ) > Audit MySQL ENUM/SET Types > -- > > Key: SPARK-47557 > URL: https://issues.apache.org/jira/browse/SPARK-47557 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47557) Audit MySQL ENUM/SET Types
[ https://issues.apache.org/jira/browse/SPARK-47557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-47557: - Parent: SPARK-47361 Issue Type: Sub-task (was: Bug) > Audit MySQL ENUM/SET Types > -- > > Key: SPARK-47557 > URL: https://issues.apache.org/jira/browse/SPARK-47557 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47450) Use R 4.3.3 in `windows` R GitHub Action job
[ https://issues.apache.org/jira/browse/SPARK-47450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-47450: - Fix Version/s: (was: 4.0.0) > Use R 4.3.3 in `windows` R GitHub Action job > > > Key: SPARK-47450 > URL: https://issues.apache.org/jira/browse/SPARK-47450 > Project: Spark > Issue Type: Sub-task > Components: Project Infra, R >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47557) Audit MySQL ENUM/SET Types
Kent Yao created SPARK-47557: Summary: Audit MySQL ENUM/SET Types Key: SPARK-47557 URL: https://issues.apache.org/jira/browse/SPARK-47557 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46822) Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to catalyst type in jdbc
[ https://issues.apache.org/jira/browse/SPARK-46822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-46822: - Parent: SPARK-47361 Issue Type: Sub-task (was: Bug) > Respect spark.sql.legacy.charVarcharAsString when casting jdbc type to > catalyst type in jdbc > > > Key: SPARK-46822 > URL: https://issues.apache.org/jira/browse/SPARK-46822 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47556) [K8] Spark App ID collision resulting in deleting wrong resources
Sundeep K created SPARK-47556: - Summary: [K8] Spark App ID collision resulting in deleting wrong resources Key: SPARK-47556 URL: https://issues.apache.org/jira/browse/SPARK-47556 Project: Spark Issue Type: Bug Components: Kubernetes, Spark Core Affects Versions: 3.5.1, 3.3.2 Reporter: Sundeep K h3. Issue We noticed that sometimes K8s executor pods go into a crash loop with 'Error: MountVolume.SetUp failed for volume "spark-conf-volume-exec"'. Upon investigation we found that two Spark jobs had launched with the same application ID, and when one of them finished first it deleted all of its resources and the resources of the other job too. -> The Spark application ID is created by this [code|https://affirm.slack.com/archives/C06Q2GWLWKH/p1711132115304449?thread_ts=1711123500.783909&cid=C06Q2GWLWKH]: "spark-application-" + System.currentTimeMillis. This means that if two applications launch in the same millisecond, they can end up with the same app ID. -> The [spark-app-selector|https://github.com/apache/spark/blob/93f98c0a61ddb66eb777c3940fbf29fc58e2d79b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L23] label is added to every resource created by the driver, and its value is the application ID. The Kubernetes scheduler backend deletes all resources with the same [label|https://github.com/apache/spark/blob/2a8bb5cdd3a5a2d63428b82df5e5066a805ce878/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L162C1-L172C6] upon termination. This deletes the config map and executor pods of the job that is still running; the driver tries to relaunch the executor pods, but the config map is no longer present, so the pods stay in a crash loop. h3. Context We are using [Spark on Kubernetes|https://spark.apache.org/docs/latest/running-on-kubernetes.html] and launch our Spark jobs using PySpark. We launch multiple Spark jobs within a given K8s namespace. 
Each Spark job can be launched from different pods or from different processes in a pod. Every time a job is launched it has a unique app name. Here is how a job is launched (omitting irrelevant details): {code:python}
# spark_conf has the settings required for Spark on K8s
sp = SparkSession.builder \
    .config(conf=spark_conf) \
    .appName('testapp')
sp.master(f'k8s://{kubernetes_host}')
session = sp.getOrCreate()
with session:
    session.sql('SELECT 1'){code} h3. Repro Set the same app ID in the Spark config and run 2 different jobs: one that finishes fast and one that runs slow. The slower job goes into a crash loop. {code:java} "spark.app.id": ""{code} h3. Workaround Set a unique spark.app.id for every job that runs on K8s, e.g.: {code:python} "spark.app.id": f'{AppName}-{CurrTimeInMilliSecs}-{UUId}'[:63]{code} h3. Fix Add a unique hash at the end of the application ID: [https://github.com/apache/spark/pull/45712] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
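The workaround above can be sketched in plain Python. The function name and the exact ID format here are illustrative (following the fragment in the ticket); the 63-character cap matches the Kubernetes limit on label values, which is what the spark-app-selector label must satisfy:

```python
import time
import uuid

def make_app_id(app_name: str) -> str:
    """Build a collision-resistant spark.app.id from the app name, the
    current time in milliseconds, and a random UUID, truncated to 63
    characters (the Kubernetes label-value limit)."""
    raw = f"{app_name}-{int(time.time() * 1000)}-{uuid.uuid4().hex}"
    return raw[:63]

# Two IDs generated in the same millisecond still differ thanks to the UUID.
a = make_app_id("testapp")
b = make_app_id("testapp")
```

The UUID component is what removes the millisecond-resolution collision window described in the issue; timestamp alone is not enough when many jobs launch concurrently in one namespace.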
[jira] [Resolved] (SPARK-47554) Upgrade `sbt-assembly` to `2.2.0` and `sbt-protoc` to `1.0.7`
[ https://issues.apache.org/jira/browse/SPARK-47554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47554. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45696 [https://github.com/apache/spark/pull/45696] > Upgrade `sbt-assembly` to `2.2.0` and `sbt-protoc` to `1.0.7` > - > > Key: SPARK-47554 > URL: https://issues.apache.org/jira/browse/SPARK-47554 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47554) Upgrade `sbt-assembly` to `2.2.0` and `sbt-protoc` to `1.0.7`
[ https://issues.apache.org/jira/browse/SPARK-47554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47554: --- Labels: pull-request-available (was: ) > Upgrade `sbt-assembly` to `2.2.0` and `sbt-protoc` to `1.0.7` > - > > Key: SPARK-47554 > URL: https://issues.apache.org/jira/browse/SPARK-47554 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47555) Record necessary raw exception log when loadTable
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47555: --- Labels: pull-request-available (was: ) > Record necessary raw exception log when loadTable > - > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: xleoken >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47555) Record necessary raw exception log when loadTable
[ https://issues.apache.org/jira/browse/SPARK-47555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xleoken updated SPARK-47555: Summary: Record necessary raw exception log when loadTable (was: Print necessary raw exception log when loadTable) > Record necessary raw exception log when loadTable > - > > Key: SPARK-47555 > URL: https://issues.apache.org/jira/browse/SPARK-47555 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: xleoken >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47555) Print necessary raw exception log when loadTable
xleoken created SPARK-47555: --- Summary: Print necessary raw exception log when loadTable Key: SPARK-47555 URL: https://issues.apache.org/jira/browse/SPARK-47555 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: xleoken -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
[ https://issues.apache.org/jira/browse/SPARK-47549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47549: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Improvement) > Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts > --- > > Key: SPARK-47549 > URL: https://issues.apache.org/jira/browse/SPARK-47549 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47553) Add Java support tests for transformWithState operator
Anish Shrigondekar created SPARK-47553: -- Summary: Add Java support tests for transformWithState operator Key: SPARK-47553 URL: https://issues.apache.org/jira/browse/SPARK-47553 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Anish Shrigondekar Add Java support tests for transformWithState operator -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
[ https://issues.apache.org/jira/browse/SPARK-47549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47549: Assignee: Dongjoon Hyun > Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts > --- > > Key: SPARK-47549 > URL: https://issues.apache.org/jira/browse/SPARK-47549 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
[ https://issues.apache.org/jira/browse/SPARK-47549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47549. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45706 [https://github.com/apache/spark/pull/45706] > Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts > --- > > Key: SPARK-47549 > URL: https://issues.apache.org/jira/browse/SPARK-47549 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47552) Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s
[ https://issues.apache.org/jira/browse/SPARK-47552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47552: - Assignee: Dongjoon Hyun > Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s > --- > > Key: SPARK-47552 > URL: https://issues.apache.org/jira/browse/SPARK-47552 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > To suppress like HADOOP-19097 > {code} > 24/03/25 14:46:21 WARN ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 > ms instead > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47552) Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s
[ https://issues.apache.org/jira/browse/SPARK-47552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47552. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45710 [https://github.com/apache/spark/pull/45710] > Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s > --- > > Key: SPARK-47552 > URL: https://issues.apache.org/jira/browse/SPARK-47552 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > To suppress like HADOOP-19097 > {code} > 24/03/25 14:46:21 WARN ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 > ms instead > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47552) Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s
[ https://issues.apache.org/jira/browse/SPARK-47552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47552: --- Labels: pull-request-available (was: ) > Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s > --- > > Key: SPARK-47552 > URL: https://issues.apache.org/jira/browse/SPARK-47552 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > To suppress like HADOOP-19097 > {code} > 24/03/25 14:46:21 WARN ConfigurationHelper: Option > fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 > ms instead > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47552) Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s
Dongjoon Hyun created SPARK-47552: - Summary: Set spark.hadoop.fs.s3a.connection.establish.timeout to 30s Key: SPARK-47552 URL: https://issues.apache.org/jira/browse/SPARK-47552 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun To suppress warnings like the following (see HADOOP-19097): {code} 24/03/25 14:46:21 WARN ConfigurationHelper: Option fs.s3a.connection.establish.timeout is too low (5,000 ms). Setting to 15,000 ms instead {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
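For users on releases that predate this change, the same setting can be applied manually; a spark-defaults.conf entry might look like this (the 30s value is the one proposed in the ticket, and the syntax follows standard Spark configuration files):

```
# spark-defaults.conf -- raise the S3A connection-establish timeout so the
# Hadoop ConfigurationHelper warning about a too-low value is not triggered
spark.hadoop.fs.s3a.connection.establish.timeout  30s
```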
[jira] [Updated] (SPARK-47550) Update kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47550: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Dependency upgrade) > Update kubernetes-client to 6.11.0 > -- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47550) Upgrade kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47550: -- Summary: Upgrade kubernetes-client to 6.11.0 (was: Update kubernetes-client to 6.11.0) > Upgrade kubernetes-client to 6.11.0 > --- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Sub-task > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47550) Update kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47550. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45707 [https://github.com/apache/spark/pull/45707] > Update kubernetes-client to 6.11.0 > -- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Dependency upgrade > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47550) Update kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47550: - Assignee: Bjørn Jørgensen > Update kubernetes-client to 6.11.0 > -- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Dependency upgrade > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47551) Add variant_get expression.
Chenhao Li created SPARK-47551: -- Summary: Add variant_get expression. Key: SPARK-47551 URL: https://issues.apache.org/jira/browse/SPARK-47551 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Chenhao Li -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47550) Update kubernetes-client to 6.11.0
[ https://issues.apache.org/jira/browse/SPARK-47550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47550: --- Labels: pull-request-available (was: ) > Update kubernetes-client to 6.11.0 > -- > > Key: SPARK-47550 > URL: https://issues.apache.org/jira/browse/SPARK-47550 > Project: Spark > Issue Type: Dependency upgrade > Components: k8s >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > [Release > notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47550) Update kubernetes-client to 6.11.0
Bjørn Jørgensen created SPARK-47550: --- Summary: Update kubernetes-client to 6.11.0 Key: SPARK-47550 URL: https://issues.apache.org/jira/browse/SPARK-47550 Project: Spark Issue Type: Dependency upgrade Components: k8s Affects Versions: 4.0.0 Reporter: Bjørn Jørgensen [Release notes|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.11.0] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47548) Remove unused `commons-beanutils` dependency
[ https://issues.apache.org/jira/browse/SPARK-47548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47548: - Assignee: Dongjoon Hyun > Remove unused `commons-beanutils` dependency > > > Key: SPARK-47548 > URL: https://issues.apache.org/jira/browse/SPARK-47548 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47548) Remove unused `commons-beanutils` dependency
[ https://issues.apache.org/jira/browse/SPARK-47548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47548. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45705 [https://github.com/apache/spark/pull/45705] > Remove unused `commons-beanutils` dependency > > > Key: SPARK-47548 > URL: https://issues.apache.org/jira/browse/SPARK-47548 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830618#comment-17830618 ] Gideon P commented on SPARK-47413: -- [~davidm-db] Are you sure you don't want me to take care of it? I would be more than happy to take care of this. [~uros-db] do you have another one for me, if David is taking this one over? > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *Substring* built-in string function in > Spark (including the *Right* and *Left* functions). First confirm what the > expected behaviour for these functions is when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how these functions should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMSs, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
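For intuition, what "supporting a collation" means for a matching function can be sketched in plain Python. Here casefold() stands in for a case-insensitive collation; real Spark support goes through ICU collators with locale-specific rules, so this is an illustrative simplification, not the actual implementation:

```python
def ci_startswith(s: str, prefix: str) -> bool:
    """Case-insensitive startsWith via Unicode case folding -- a rough
    stand-in for a case-insensitive collation. Real ICU collations also
    handle locale-specific rules, accent sensitivity, etc."""
    return s.casefold().startswith(prefix.casefold())

# Under the default binary collation "Spark" does not start with "SPA",
# but under a case-insensitive collation it does.
binary_match = "Spark".startswith("SPA")
ci_match = ci_startswith("Spark", "SPA")
```

The same question (binary comparison vs. collation-aware comparison) is what the ticket asks contributors to answer for Substring, Right, and Left before implementing.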
[jira] [Updated] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
[ https://issues.apache.org/jira/browse/SPARK-47549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47549: --- Labels: pull-request-available (was: ) > Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts > --- > > Key: SPARK-47549 > URL: https://issues.apache.org/jira/browse/SPARK-47549 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47549) Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts
Dongjoon Hyun created SPARK-47549: - Summary: Remove Spark 3.0~3.2 pyspark/version.py workaround from release scripts Key: SPARK-47549 URL: https://issues.apache.org/jira/browse/SPARK-47549 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47256) Assign error classes to FILTER expression errors
[ https://issues.apache.org/jira/browse/SPARK-47256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47256: Assignee: David Milicevic > Assign error classes to FILTER expression errors > > > Key: SPARK-47256 > URL: https://issues.apache.org/jira/browse/SPARK-47256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: David Milicevic >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_102[4-7]* > defined in {*}common/utils/src/main/resources/error/error-classes.json{*}. > The name should be short but complete (look at the example in > error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47256) Assign error classes to FILTER expression errors
[ https://issues.apache.org/jira/browse/SPARK-47256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47256. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45622 [https://github.com/apache/spark/pull/45622] > Assign error classes to FILTER expression errors > > > Key: SPARK-47256 > URL: https://issues.apache.org/jira/browse/SPARK-47256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: David Milicevic >Priority: Minor > Labels: pull-request-available, starter > Fix For: 4.0.0 > > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_102[4-7]* > defined in {*}common/utils/src/main/resources/error/error-classes.json{*}. > The name should be short but complete (look at the example in > error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-32350: --- Labels: pull-request-available (was: ) > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > Labels: pull-request-available > Fix For: 3.1.0 > > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduces HybridStore. HybridStore will write data to InMemoryStore at first > and use a background thread to dump data to LevelDB once the writing to > InMemoryStore is completed. In the comments section of > [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned > using batch writing can improve the performance of this dumping process and > he wrote the code of writeAll(). > I did the comparison of the HybridStore switching time between one-by-one > write and batch write on an HDD disk. When the disk is free, the batch-write > has around 25% improvement, and when the disk is 100% busy, the batch-write > has 7x - 10x improvement. 
> when the disk is at 0% utilization: > > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|16s|13s| > |265m, 400 jobs, 200 tasks per job|30s|23s| > |1.3g, 1000 jobs, 400 tasks per job|136s|108s| > > when the disk is at 100% utilization: > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|116s|17s| > |265m, 400 jobs, 200 tasks per job|251s|26s| > I also ran some write related benchmarking tests on LevelDBBenchmark.java and > measured the total time of writing 1024 objects. > when the disk is at 0% utilization: > > ||Benchmark test||with write(), ms||with writeAll(), ms || > |randomUpdatesIndexed|213.060|157.356| > |randomUpdatesNoIndex|57.869|35.439| > |randomWritesIndexed|298.854|229.274| > |randomWritesNoIndex|66.764|38.361| > |sequentialUpdatesIndexed|87.019|56.219| > |sequentialUpdatesNoIndex|61.851|41.942| > |sequentialWritesIndexed|94.044|56.534| > |sequentialWritesNoIndex|118.345|66.483| > > when the disk is at 50% utilization: > ||Benchmark test||with write(), ms||with writeAll(), ms|| > |randomUpdatesIndexed|230.386|180.817| > |randomUpdatesNoIndex|58.935|50.113| > |randomWritesIndexed|315.241|254.400| > |randomWritesNoIndex|96.709|41.164| > |sequentialUpdatesIndexed|89.971|70.387| > |sequentialUpdatesNoIndex|72.021|53.769| > |sequentialWritesIndexed|103.052|67.358| > |sequentialWritesNoIndex|76.194|99.037| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
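The write() vs. writeAll() gap in the numbers above comes down to paying per-call overhead (a disk sync/seek, worst on a busy disk) once per object versus once per batch. A minimal sketch of the pattern, where MockStore and its sync counter are hypothetical and stand in for the LevelDB-backed KVStore, not its real API:

```python
class MockStore:
    """Toy key-value store that counts 'disk syncs' to show why batching
    helps: write() pays one sync per object, write_all() one per batch."""

    def __init__(self):
        self.data = {}
        self.syncs = 0

    def write(self, key, value):
        self.data[key] = value
        self.syncs += 1          # one sync per object

    def write_all(self, items):
        self.data.update(items)
        self.syncs += 1          # one sync for the whole batch

one_by_one, batched = MockStore(), MockStore()
items = {f"k{i}": i for i in range(1024)}
for k, v in items.items():
    one_by_one.write(k, v)
batched.write_all(items)
```

Both stores end up with identical data, but the batched store performed a single "sync", which mirrors why the improvement grows (7x-10x) as disk contention makes each sync more expensive.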
[jira] [Updated] (SPARK-46743) Count bug introduced for scalar subquery when using TEMPORARY VIEW, as compared to using table
[ https://issues.apache.org/jira/browse/SPARK-46743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46743: -- Labels: correctness pull-request-available (was: pull-request-available) > Count bug introduced for scalar subquery when using TEMPORARY VIEW, as > compared to using table > -- > > Key: SPARK-46743 > URL: https://issues.apache.org/jira/browse/SPARK-46743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: correctness, pull-request-available > Fix For: 4.0.0 > > > Using the temp view reproduces the COUNT bug: it returns nulls instead of 0. > With a table: > {code:java} > scala> spark.sql("""CREATE TABLE outer_table USING parquet AS SELECT * FROM > VALUES > | (1, 1), > | (2, 1), > | (3, 3), > | (6, 6), > | (7, 7), > | (9, 9) AS inner_table(a, b)""") > val res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE TABLE null_table USING parquet AS SELECT CAST(null > AS int) AS a, CAST(null as int) AS b ;") > val res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("""SELECT ( SELECT COUNT(null_table.a) AS aggAlias FROM > null_table WHERE null_table.a = outer_table.a) FROM outer_table""").collect() > val res8: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], [0], [0], > [0]) {code} > With a view: > > {code:java} > spark.sql("CREATE TEMPORARY VIEW outer_view(a, b) AS VALUES (1, 1), (2, > 1),(3, 3), (6, 6), (7, 7), (9, 9);") > spark.sql("CREATE TEMPORARY VIEW null_view(a, b) AS SELECT CAST(null AS int), > CAST(null as int);") > spark.sql("""SELECT ( SELECT COUNT(null_view.a) AS aggAlias FROM null_view > WHERE null_view.a = outer_view.a) FROM outer_view""").collect() > val res2: Array[org.apache.spark.sql.Row] = Array([null], [null], [null], > [null], [null], [null]){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
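For reference, the semantics the table-based plan gets right: COUNT over an empty (or all-NULL) match set is 0, never NULL. A small Python sketch of the correlated subquery in the repro (a toy evaluator for illustration, not Spark code):

```python
def scalar_count_subquery(outer_a, inner_a):
    """Evaluate: SELECT (SELECT COUNT(i.a) FROM inner_t i WHERE i.a = o.a) FROM outer_t o."""
    results = []
    for o in outer_a:
        # COUNT(i.a) counts non-NULL matching values; an empty group yields 0, not NULL
        results.append(sum(1 for i in inner_a if i is not None and i == o))
    return results

# The repro above: null_table holds a single all-NULL row
print(scalar_count_subquery([1, 2, 3, 6, 7, 9], [None]))  # -> [0, 0, 0, 0, 0, 0]
```

The classic "COUNT bug" arises when decorrelation rewrites the subquery as an outer join but loses this count-of-empty-group-is-0 rule, so unmatched outer rows surface NULL instead of 0, as the temp-view plan does here.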
[jira] [Updated] (SPARK-46743) Count bug introduced for scalar subquery when using TEMPORARY VIEW, as compared to using table
[ https://issues.apache.org/jira/browse/SPARK-46743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46743: -- Component/s: SQL (was: Optimizer) > Count bug introduced for scalar subquery when using TEMPORARY VIEW, as > compared to using table > -- > > Key: SPARK-46743 > URL: https://issues.apache.org/jira/browse/SPARK-46743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Using the temp view reproduces COUNT bug, returns nulls instead of 0. > With a table: > {code:java} > scala> spark.sql("""CREATE TABLE outer_table USING parquet AS SELECT * FROM > VALUES > | (1, 1), > | (2, 1), > | (3, 3), > | (6, 6), > | (7, 7), > | (9, 9) AS inner_table(a, b)""") > val res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE TABLE null_table USING parquet AS SELECT CAST(null > AS int) AS a, CAST(null as int) AS b ;") > val res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("""SELECT ( SELECT COUNT(null_table.a) AS aggAlias FROM > null_table WHERE null_table.a = outer_table.a) FROM outer_table""").collect() > val res8: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], [0], [0], > [0]) {code} > With a view: > > {code:java} > spark.sql("CREATE TEMPORARY VIEW outer_view(a, b) AS VALUES (1, 1), (2, > 1),(3, 3), (6, 6), (7, 7), (9, 9);") > spark.sql("CREATE TEMPORARY VIEW null_view(a, b) AS SELECT CAST(null AS int), > CAST(null as int);") > spark.sql("""SELECT ( SELECT COUNT(null_view.a) AS aggAlias FROM null_view > WHERE null_view.a = outer_view.a) FROM outer_view""").collect() > val res2: Array[org.apache.spark.sql.Row] = Array([null], [null], [null], > [null], [null], [null]){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47548) Remove unused `commons-beanutils` dependency
[ https://issues.apache.org/jira/browse/SPARK-47548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47548: --- Labels: pull-request-available (was: ) > Remove unused `commons-beanutils` dependency > > > Key: SPARK-47548 > URL: https://issues.apache.org/jira/browse/SPARK-47548 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47548) Remove unused `commons-beanutils` dependency
Dongjoon Hyun created SPARK-47548: - Summary: Remove unused `commons-beanutils` dependency Key: SPARK-47548 URL: https://issues.apache.org/jira/browse/SPARK-47548 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47413) Substring, Right, Left (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830584#comment-17830584 ] David Milicevic commented on SPARK-47413: - Started working on this today. > Substring, Right, Left (all collations) > --- > > Key: SPARK-47413 > URL: https://issues.apache.org/jira/browse/SPARK-47413 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *Substring* built-in string function in > Spark (including the *Right* and *Left* functions). First confirm what the > expected behaviour for these functions is when given collated strings, then move > on to the implementation that would enable handling strings of all collation > types. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how these functions should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the {*}Substring{*}, > {*}Right{*}, and *Left* functions so that they support all collation types > currently supported in Spark. To understand what changes were introduced in > order to enable full collation support for other existing functions in Spark, > take a look at the Spark PRs and Jira tickets for completed tasks in this > parent (for example: Contains, StartsWith, EndsWith). > > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class. 
Also, refer to the Unicode Technical > Standard for > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
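Substring extraction itself is largely collation-agnostic; the work in this ticket is making the functions accept and propagate collated string types while staying codepoint-correct. As a rough reference for the base semantics, a Python sketch of SQL-style SUBSTRING/LEFT/RIGHT over codepoints (an approximation for illustration; Spark's UTF8String.substringSQL edge cases may differ):

```python
def substr(s: str, pos: int, length: int) -> str:
    """1-based SQL-style SUBSTRING over codepoints."""
    if pos > 0:
        start = pos - 1
    elif pos < 0:
        start = len(s) + pos   # negative pos counts from the end
    else:
        start = 0              # pos 0 treated like pos 1
    end = start + max(length, 0)
    return s[max(start, 0):max(end, 0)]

def left(s: str, n: int) -> str:
    return substr(s, 1, n)

def right(s: str, n: int) -> str:
    return "" if n <= 0 else substr(s, -n, n)

# Operating on codepoints, not bytes, keeps multi-byte characters intact:
print(substr("héllo", 2, 3))  # -> éll
```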
[jira] [Assigned] (SPARK-46743) Count bug introduced for scalar subquery when using TEMPORARY VIEW, as compared to using table
[ https://issues.apache.org/jira/browse/SPARK-46743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46743: --- Assignee: Andy Lam > Count bug introduced for scalar subquery when using TEMPORARY VIEW, as > compared to using table > -- > > Key: SPARK-46743 > URL: https://issues.apache.org/jira/browse/SPARK-46743 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.5.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: pull-request-available > > Using the temp view reproduces COUNT bug, returns nulls instead of 0. > With a table: > {code:java} > scala> spark.sql("""CREATE TABLE outer_table USING parquet AS SELECT * FROM > VALUES > | (1, 1), > | (2, 1), > | (3, 3), > | (6, 6), > | (7, 7), > | (9, 9) AS inner_table(a, b)""") > val res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE TABLE null_table USING parquet AS SELECT CAST(null > AS int) AS a, CAST(null as int) AS b ;") > val res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("""SELECT ( SELECT COUNT(null_table.a) AS aggAlias FROM > null_table WHERE null_table.a = outer_table.a) FROM outer_table""").collect() > val res8: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], [0], [0], > [0]) {code} > With a view: > > {code:java} > spark.sql("CREATE TEMPORARY VIEW outer_view(a, b) AS VALUES (1, 1), (2, > 1),(3, 3), (6, 6), (7, 7), (9, 9);") > spark.sql("CREATE TEMPORARY VIEW null_view(a, b) AS SELECT CAST(null AS int), > CAST(null as int);") > spark.sql("""SELECT ( SELECT COUNT(null_view.a) AS aggAlias FROM null_view > WHERE null_view.a = outer_view.a) FROM outer_view""").collect() > val res2: Array[org.apache.spark.sql.Row] = Array([null], [null], [null], > [null], [null], [null]){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46743) Count bug introduced for scalar subquery when using TEMPORARY VIEW, as compared to using table
[ https://issues.apache.org/jira/browse/SPARK-46743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46743. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45125 [https://github.com/apache/spark/pull/45125] > Count bug introduced for scalar subquery when using TEMPORARY VIEW, as > compared to using table > -- > > Key: SPARK-46743 > URL: https://issues.apache.org/jira/browse/SPARK-46743 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.5.0 >Reporter: Andy Lam >Assignee: Andy Lam >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Using the temp view reproduces COUNT bug, returns nulls instead of 0. > With a table: > {code:java} > scala> spark.sql("""CREATE TABLE outer_table USING parquet AS SELECT * FROM > VALUES > | (1, 1), > | (2, 1), > | (3, 3), > | (6, 6), > | (7, 7), > | (9, 9) AS inner_table(a, b)""") > val res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE TABLE null_table USING parquet AS SELECT CAST(null > AS int) AS a, CAST(null as int) AS b ;") > val res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("""SELECT ( SELECT COUNT(null_table.a) AS aggAlias FROM > null_table WHERE null_table.a = outer_table.a) FROM outer_table""").collect() > val res8: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], [0], [0], > [0]) {code} > With a view: > > {code:java} > spark.sql("CREATE TEMPORARY VIEW outer_view(a, b) AS VALUES (1, 1), (2, > 1),(3, 3), (6, 6), (7, 7), (9, 9);") > spark.sql("CREATE TEMPORARY VIEW null_view(a, b) AS SELECT CAST(null AS int), > CAST(null as int);") > spark.sql("""SELECT ( SELECT COUNT(null_view.a) AS aggAlias FROM null_view > WHERE null_view.a = outer_view.a) FROM outer_view""").collect() > val res2: Array[org.apache.spark.sql.Row] = Array([null], [null], [null], > [null], [null], [null]){code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42452) Remove hadoop-2 profile from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-42452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830566#comment-17830566 ] Dongjoon Hyun commented on SPARK-42452: --- This was resolved via https://github.com/apache/spark/pull/40788 > Remove hadoop-2 profile from Apache Spark > - > > Key: SPARK-42452 > URL: https://issues.apache.org/jira/browse/SPARK-42452 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > SPARK-40651 Drop Hadoop2 binary distribution from release process and > SPARK-42447 Remove Hadoop 2 GitHub Action job > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47411) StringInstr, FindInSet (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830561#comment-17830561 ] Milan Dankovic commented on SPARK-47411: I am working on this > StringInstr, FindInSet (all collations) > --- > > Key: SPARK-47411 > URL: https://issues.apache.org/jira/browse/SPARK-47411 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *StringInstr* and *FindInSet* built-in > string functions in Spark. First confirm what the expected behaviour for > these functions is when given collated strings, and then move on to > implementation and testing. One way to go about this is to consider using > {_}StringSearch{_}, an efficient ICU service for string matching. Implement > the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests > (CollationSuite) to reflect how these functions should be used with collation > in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMS, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringInstr* and > *FindInSet* functions so that they support all collation types currently > supported in Spark. To understand what changes were introduced in order to > enable full collation support for other existing functions in Spark, take a > look at the Spark PRs and Jira tickets for completed tasks in this parent > (for example: Contains, StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
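To make these functions collation-aware, the position search has to compare strings under the collation rather than byte-for-byte, which is what ICU's StringSearch provides. As a rough illustration only, Python's casefold() can stand in for a case-insensitive (UTF8_LCASE-like) collation; note that casefolding can change string lengths (e.g. 'ß' -> 'ss'), which is precisely why the real implementation needs ICU rather than this trick:

```python
def instr(haystack: str, needle: str, case_insensitive: bool = False) -> int:
    """1-based position of needle in haystack, 0 if absent (toy collation via casefold)."""
    if case_insensitive:
        haystack, needle = haystack.casefold(), needle.casefold()
    return haystack.find(needle) + 1  # str.find returns -1 when absent, hence 0

def find_in_set(s: str, csv: str, case_insensitive: bool = False) -> int:
    """1-based index of s within a comma-separated list, 0 if absent."""
    norm = (lambda x: x.casefold()) if case_insensitive else (lambda x: x)
    for i, item in enumerate(csv.split(","), start=1):
        if norm(item) == norm(s):
            return i
    return 0
```

Under a binary collation `instr("SparkSQL", "sql")` is 0; under the case-insensitive stand-in it is 6, mirroring how results should vary with the collation of the inputs.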
[jira] [Updated] (SPARK-47503) Spark history server fails to display query for cached JDBC relation named in quotes
[ https://issues.apache.org/jira/browse/SPARK-47503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47503: -- Fix Version/s: 3.4.3 > Spark history server fails to display query for cached JDBC relation named in > quotes > --- > > Key: SPARK-47503 > URL: https://issues.apache.org/jira/browse/SPARK-47503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 4.0.0 >Reporter: alexey >Assignee: alexey >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > Attachments: Screenshot_11.png, eventlog_v2_local-1711020585149.rar > > > The Spark history server fails to display a query for a cached JDBC relation (or a > calculation derived from it) named in quotes > (Screenshot and generated history in attachments) > How to reproduce: > {code:java} > val ticketsDf = spark.read.jdbc("jdbc:postgresql://localhost:5432/demo", """ "test-schema".tickets """.trim, properties) > val bookingDf = spark.read.parquet("path/bookings") > ticketsDf.cache().count() > val resultDf = bookingDf.join(ticketsDf, Seq("book_ref")) > resultDf.write.mode(SaveMode.Overwrite).parquet("path/result") {code} > > The problem is in the SparkPlanGraphNode class, which creates a dot node. When > there are no metrics to display, it simply returns the tagged name, and in this case > the name contains quotes, which corrupts the dot file. > The suggested solution is to escape the name string > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
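The suggested fix amounts to escaping the relation name before embedding it in a quoted Graphviz DOT label. A hypothetical sketch of such a helper (the function name and node layout are illustrative, not the actual patch):

```python
def escape_dot_label(name: str) -> str:
    """Escape a string for use inside a double-quoted Graphviz DOT label."""
    # Escape backslashes first so the quotes escaped next are not double-escaped
    return name.replace("\\", "\\\\").replace('"', '\\"')

# The quoted JDBC relation from the repro would otherwise terminate the label early
node_name = '"test-schema".tickets'
dot_node = f'node0 [label="{escape_dot_label(node_name)}"];'
print(dot_node)  # -> node0 [label="\"test-schema\".tickets"];
```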
[jira] [Updated] (SPARK-47537) Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J
[ https://issues.apache.org/jira/browse/SPARK-47537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47537: -- Fix Version/s: 3.4.3 > Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J > -- > > Key: SPARK-47537 > URL: https://issues.apache.org/jira/browse/SPARK-47537 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1, 4.0.0, 3.5.2 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47537) Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J
[ https://issues.apache.org/jira/browse/SPARK-47537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47537: -- Affects Version/s: 3.5.1 (was: 3.5.2) > Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J > -- > > Key: SPARK-47537 > URL: https://issues.apache.org/jira/browse/SPARK-47537 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1, 4.0.0, 3.5.1 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47537) Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J
[ https://issues.apache.org/jira/browse/SPARK-47537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47537: -- Fix Version/s: 3.5.2 > Use MySQL Connector/J for MySQL DB instead of MariaDB Connector/J > -- > > Key: SPARK-47537 > URL: https://issues.apache.org/jira/browse/SPARK-47537 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.1, 4.0.0, 3.5.2 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47476) StringReplace (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830542#comment-17830542 ] Milan Dankovic commented on SPARK-47476: I am working on this > StringReplace (all collations) > -- > > Key: SPARK-47476 > URL: https://issues.apache.org/jira/browse/SPARK-47476 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *StringReplace* built-in string function in > Spark. First confirm what the expected behaviour for this function is when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringReplace* function so > it supports all collation types currently supported in Spark. To understand > what changes were introduced in order to enable full collation support for > other existing functions in Spark, take a look at the Spark PRs and Jira > tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
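A rough sketch of what collation-aware replace means, again with case-insensitivity standing in for a UTF8_LCASE-like collation purely for illustration (`re.IGNORECASE` is an approximation; real collation matching via ICU StringSearch can differ where casefolding changes string lengths):

```python
import re

def collated_replace(src: str, search: str, replacement: str,
                     case_insensitive: bool = False) -> str:
    """Replace every occurrence of search in src; an empty search is a no-op, as in SQL replace()."""
    if not search:
        return src
    flags = re.IGNORECASE if case_insensitive else 0
    # A callable replacement sidesteps re.sub's backslash-escape handling
    return re.sub(re.escape(search), lambda _: replacement, src, flags=flags)

print(collated_replace("ABCabc", "abc", "x"))        # binary match: only the exact-case hit
print(collated_replace("ABCabc", "abc", "x", True))  # collation-aware: both occurrences
```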
[jira] [Comment Edited] (SPARK-47476) StringReplace (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830542#comment-17830542 ] Milan Dankovic edited comment on SPARK-47476 at 3/25/24 3:45 PM: - I am working on this was (Author: JIRAUSER304529): I am working on this > StringReplace (all collations) > -- > > Key: SPARK-47476 > URL: https://issues.apache.org/jira/browse/SPARK-47476 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *StringReplace* built-in string function in > Spark. First confirm what the expected behaviour for this function is when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringReplace* function so > it supports all collation types currently supported in Spark. To understand > what changes were introduced in order to enable full collation support for > other existing functions in Spark, take a look at the Spark PRs and Jira > tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47476) StringReplace (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47476: --- Labels: pull-request-available (was: ) > StringReplace (all collations) > -- > > Key: SPARK-47476 > URL: https://issues.apache.org/jira/browse/SPARK-47476 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > Enable collation support for the *StringReplace* built-in string function in > Spark. First confirm what the expected behaviour for this function is when > given collated strings, and then move on to implementation and testing. One > way to go about this is to consider using {_}StringSearch{_}, an efficient > ICU service for string matching. Implement the corresponding unit tests > (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect > how this function should be used with collation in SparkSQL, and feel free to > use your chosen Spark SQL Editor to experiment with the existing functions to > learn more about how they work. In addition, look into the possible use-cases > and implementation of similar functions within other open-source DBMS, > such as [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *StringReplace* function so > it supports all collation types currently supported in Spark. To understand > what changes were introduced in order to enable full collation support for > other existing functions in Spark, take a look at the Spark PRs and Jira > tickets for completed tasks in this parent (for example: Contains, > StartsWith, EndsWith). 
> > Read more about ICU [Collation Concepts|http://example.com/] and > [Collator|http://example.com/] class, as well as _StringSearch_ using the > [ICU user > guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] > and [ICU > docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. > Also, refer to the Unicode Technical Standard for string > [searching|https://www.unicode.org/reports/tr10/#Searching] and > [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47547) observed false positive rate in bloom filter is greater than expected for large n
Nathan Conroy created SPARK-47547: - Summary: observed false positive rate in bloom filter is greater than expected for large n Key: SPARK-47547 URL: https://issues.apache.org/jira/browse/SPARK-47547 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Nathan Conroy When creating a bloom filter out of a large number of elements (>400 million or so) with an fpp (false positive rate) of 1% in Spark, the observed false positive rate appears to be much higher, as much as 20%. This is demonstrated below in this spark shell: {noformat} __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.5.0-amzn-0 /_/ Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 17.0.10) Type in expressions to have them evaluated. Type :help for more information. scala> import java.security.MessageDigest import java.security.MessageDigest scala> import scala.util.Random import scala.util.Random scala> import org.apache.spark.util.sketch.BloomFilter import org.apache.spark.util.sketch.BloomFilter scala> scala> // Function to generate a random SHA1 hash scala> def generateRandomSha1(): String = { | val randomString = Random.alphanumeric.take(20).mkString | val sha1 = MessageDigest.getInstance("SHA-1") | sha1.update(randomString.getBytes("UTF-8")) | val digest = sha1.digest | digest.map("%02x".format(_)).mkString | } generateRandomSha1: ()String scala> scala> // Generate a DataFrame with 500 million rows of random SHA1 hashes scala> val df = spark.range(500000000L).map(_ => generateRandomSha1()).toDF("Hash") df: org.apache.spark.sql.DataFrame = [Hash: string] scala> // Create a bloom filter out of this collection of strings. 
scala> val bloom_filter = df.stat.bloomFilter("Hash", 500000000L, 0.01) bloom_filter: org.apache.spark.util.sketch.BloomFilter = org.apache.spark.util.sketch.BloomFilterImpl@a14c0ba9 scala> // Generate another 10,000 random hashes scala> val random_sha1s = List.fill(10000)(generateRandomSha1()) random_sha1s: List[String] = List(f3cbfd9bd836ea917ebc0dfc5330135cfde322a3, 4bff8d58799e517a1ba78236db9b52353dd39b56, 775bdd9d138a79eeae7308617f5c0d1d0e1c1697, abbd761b7768f3cbadbffc0c7947185856c4943d, 343692fe61c552f73ad6bc2d2d3072cc456da1db, faf4430055c528c9a00a46e9fae7dc25047ffaf3, 255b5d56c39bfba861647fff67704e6bc758d683, dae8e0910a368f034958ae232aa5f5285486a8ac, 3680dbd34437ca661592a7e4d39782c9c77fb4ba, f5b43f7a77c9d9ea28101a1848d8b1a1c0a65b82, 5bda825102026bc0da731dc84d56a499ccff0fe1, 158d7b3ce949422de421d5e110e3f6903af4f8e1, 2efcae5cb10273a0f5e89ae34fa3156238ab0555, 8d241012d42097f80f30e8ead227d75ab77086d2, 307495c98ae5f25026b91e60cf51d4f9f1ad7f4b, 8fc2f55563ab67d4ec87ff7b04a4a01e821814a3, b413572d14ee16c6c575ca3472adff62a8cbfa3d, 9219233b0e8afe57d7d5cb6... scala> // Check how many of these random hashes return a positive result when passed into mightContain scala> random_sha1s.map(c => bloom_filter.mightContain(c)).count(_ == true) res0: Int = 2153 {noformat} I believe this is the result of the bloom filter implementation using 32-bit hashes. Since the maximum value that can be returned by the k hash functions is ~2.14 billion (the max integer value in Java), bloom filters with m > ~2.14 billion have degraded performance resulting from not using any bits at indices greater than ~2.14 billion. This was a known bug in Guava that was fixed several years ago, but it looks like the fix was never ported to Spark. See [https://github.com/google/guava/issues/1119] Of course, using a different hash function strategy would break existing uses of this code, so we should tread with caution here. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
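The observed numbers line up with the 32-bit-index explanation. Using the standard Guava-style sizing formulas (Spark's sketch implementation derives from Guava's), the requested (n, fpp) needs more bits than a 32-bit index can address, and capping the usable bits at 2^31 predicts roughly the ~21% false positive rate seen in the shell transcript:

```python
import math

n, p = 500_000_000, 0.01
m = math.ceil(-n * math.log(p) / math.log(2) ** 2)   # optimal number of bits (~4.8e9)
k = max(1, round(m / n * math.log(2)))               # optimal number of hash functions

assert m > 2**31 - 1  # more bits requested than a 32-bit hash index can reach

# If hash-derived indices are 32-bit, only the first 2^31 bits are ever touched,
# so the achievable false positive rate follows from m_eff = 2^31 instead of m:
m_eff = 2**31
effective_fpp = (1 - math.exp(-k * n / m_eff)) ** k
print(f"effective fpp ~ {effective_fpp:.1%}")  # ~21.7%, vs. the observed 2153/10000
```

So the degraded rate is not noise: it is what the standard fpp formula predicts once the addressable bit array silently shrinks to 2^31 bits.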
[jira] [Updated] (SPARK-47545) [Connect] DF observe support for the scala client
[ https://issues.apache.org/jira/browse/SPARK-47545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47545: --- Labels: pull-request-available (was: ) > [Connect] DF observe support for the scala client > - > > Key: SPARK-47545 > URL: https://issues.apache.org/jira/browse/SPARK-47545 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 4.0.0 >Reporter: Pengfei Xu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47545) [Connect] DF observe support for the scala client
Pengfei Xu created SPARK-47545: -- Summary: [Connect] DF observe support for the scala client Key: SPARK-47545 URL: https://issues.apache.org/jira/browse/SPARK-47545 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 4.0.0 Reporter: Pengfei Xu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense
[ https://issues.apache.org/jira/browse/SPARK-47544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47544: --- Labels: pull-request-available (was: ) > [Pyspark] SparkSession builder method is incompatible with vs code > intellisense > --- > > Key: SPARK-47544 > URL: https://issues.apache.org/jira/browse/SPARK-47544 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Niranjan Jayakar >Priority: Major > Labels: pull-request-available > Attachments: old.mov > > > VS code's intellisense is unable to recognize the methods under > `SparkSession.builder`. > > See attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense
[ https://issues.apache.org/jira/browse/SPARK-47544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-47544: - Attachment: old.mov > [Pyspark] SparkSession builder method is incompatible with vs code > intellisense > --- > > Key: SPARK-47544 > URL: https://issues.apache.org/jira/browse/SPARK-47544 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Niranjan Jayakar >Priority: Major > Attachments: old.mov > > > VS code's intellisense is unable to recognize the methods under > `SparkSession.builder`. > > See attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47544) [Pyspark] SparkSession builder method is incompatible with vs code intellisense
Niranjan Jayakar created SPARK-47544: Summary: [Pyspark] SparkSession builder method is incompatible with vs code intellisense Key: SPARK-47544 URL: https://issues.apache.org/jira/browse/SPARK-47544 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 4.0.0 Reporter: Niranjan Jayakar VS code's intellisense is unable to recognize the methods under `SparkSession.builder`. See attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47543) Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.
[ https://issues.apache.org/jira/browse/SPARK-47543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47543: --- Labels: pull-request-available (was: ) > Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame > creation. > > > Key: SPARK-47543 > URL: https://issues.apache.org/jira/browse/SPARK-47543 > Project: Spark > Issue Type: Bug > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Currently the PyArrow infers the Pandas dictionary field as StructType > instead of MapType, so Spark can't handle the schema properly: > {code:java} > >>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, > >>> 'second': 0.3}]}) > >>> pa.Schema.from_pandas(pdf) > str_col: string > dict_col: struct > child 0, first: double > child 1, second: double > {code} > We cannot handle this case since we use PyArrow for schema creation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47543) Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation.
Haejoon Lee created SPARK-47543: --- Summary: Inferring `dict` as `MapType` from Pandas DataFrame to allow DataFrame creation. Key: SPARK-47543 URL: https://issues.apache.org/jira/browse/SPARK-47543 Project: Spark Issue Type: Bug Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Haejoon Lee Currently, PyArrow infers a Pandas dictionary field as StructType instead of MapType, so Spark can't handle the schema properly:
{code:java}
>>> pdf = pd.DataFrame({"str_col": ['second'], "dict_col": [{'first': 0.7, 'second': 0.3}]})
>>> pa.Schema.from_pandas(pdf)
str_col: string
dict_col: struct<first: double, second: double>
  child 0, first: double
  child 1, second: double
{code}
We cannot handle this case since we use PyArrow for schema creation.
[jira] [Commented] (SPARK-47256) Assign error classes to FILTER expression errors
[ https://issues.apache.org/jira/browse/SPARK-47256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830443#comment-17830443 ] David Milicevic commented on SPARK-47256: - Working on this ticket in https://github.com/apache/spark/pull/45622. > Assign error classes to FILTER expression errors > > > Key: SPARK-47256 > URL: https://issues.apache.org/jira/browse/SPARK-47256 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_102[4-7]* > defined in {*}common/utils/src/main/resources/error/error-classes.json{*}. > The name should be short but complete (look at the example in > error-classes.json). > Add a test which triggers the error from user code if such test still doesn't > exist. Check exception fields by using {*}checkError(){*}. The last function > checks valuable error fields only, and avoids dependencies from error text > message. In this way, tech editors can modify error format in > error-classes.json, and don't worry of Spark's internal tests. Migrate other > tests that might trigger the error onto checkError(). > If you cannot reproduce the error from user space (using SQL query), replace > the error by an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current is not > clear. Propose a solution to users how to avoid and fix such kind of errors. > Please, look at the PR below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47504) Resolve AbstractDataType simpleStrings for StringTypeCollated
[ https://issues.apache.org/jira/browse/SPARK-47504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47504: --- Labels: pull-request-available (was: ) > Resolve AbstractDataType simpleStrings for StringTypeCollated > - > > Key: SPARK-47504 > URL: https://issues.apache.org/jira/browse/SPARK-47504 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *SPARK-47296* introduced a change to fail all unsupported functions. Because > of this change expected *inputTypes* in *ExpectsInputTypes* had to be > changed. This change introduced a change on user side which will print > *"STRING_ANY_COLLATION"* in places where before we printed *"STRING"* when an > error occurred. Concretely if we get an input of Int where > *StringTypeAnyCollation* was expected, we will throw this faulty message for > users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use Spark's JDBC connector to pull data from Oracle, the query will not hit an index if the pushed-down filter's column type in Oracle is DATE.

Here is my scenario. First I create a DataFrame that reads from Oracle:
{code:java}
val df = spark.read.format("jdbc").
  option("url", url).
  option("driver", driver).
  option("user", user).
  option("password", passwd).
  option("dbtable", "select * from foobar.tbl1").
  load()
{code}
Then I apply a filter to the DataFrame like this:
{code:java}
df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', 'yyyy-MM-dd HH:mm:ss')""").count()
{code}
This will not hit the index on the update_time column.

Reason: the update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is one second). When the filter is pushed down to Oracle, it goes through the following code in org.apache.spark.sql.jdbc.OracleDialect:
{code:java}
override def compileValue(value: Any): Any = value match {
  // The JDBC drivers support date literals in SQL statements written in the
  // format: {d 'yyyy-mm-dd'} and timestamp literals in SQL statements written
  // in the format: {ts 'yyyy-mm-dd hh:mm:ss.f...'}. For details, see
  // 'Oracle Database JDBC Developer's Guide and Reference, 11g Release 1 (11.1)'
  // Appendix A Reference Information.
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "{ts '" + timestampValue + "'}"
  case dateValue: Date => "{d '" + dateValue + "'}"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}
{code}
As a result, the condition update_time >= {ts '2024-03-12 06:18:17'} will never hit the index.

In my case, as a workaround, I changed the Timestamp case to:
{code:java}
case timestampValue: Timestamp =>
  s"to_date('${dateFormat.format(timestampValue)}', 'yyyy-MM-dd HH:mi:ss')"
{code}
After this modification, it worked well.
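The difference between the two literal forms is easy to see in isolation. The sketch below is a plain-Python illustration of the idea only (the function names are mine, and the `'YYYY-MM-DD HH24:MI:SS'` mask is my choice of Oracle format-mask syntax, not the reporter's exact patch): the default path emits a JDBC `{ts ...}` escape that Oracle treats as a TIMESTAMP literal, forcing an implicit conversion on the DATE column, while the workaround emits `to_date(...)`, which yields a DATE-typed literal and lets the index be used.

```python
from datetime import datetime

TS = datetime(2024, 3, 12, 6, 18, 17)

def compile_timestamp_default(ts: datetime) -> str:
    # Mirrors the OracleDialect behavior quoted above: a JDBC timestamp
    # escape. Oracle sees a TIMESTAMP literal, so a DATE index is skipped.
    return "{ts '" + ts.strftime("%Y-%m-%d %H:%M:%S") + "'}"

def compile_timestamp_workaround(ts: datetime) -> str:
    # Sketch of the reporter's workaround: emit to_date(...) so the
    # literal is DATE-typed and comparable to the indexed column.
    return "to_date('%s', 'YYYY-MM-DD HH24:MI:SS')" % ts.strftime("%Y-%m-%d %H:%M:%S")

assert compile_timestamp_default(TS) == "{ts '2024-03-12 06:18:17'}"
assert compile_timestamp_workaround(TS) == \
    "to_date('2024-03-12 06:18:17', 'YYYY-MM-DD HH24:MI:SS')"
```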
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')" After this modification, it worked well. was: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{\{color:#9876aa}dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')" After
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{\{color:#9876aa}dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')" After this modification, it worked well. was: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{{color:#9876aa}dateF
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color}{{color:#9876aa}dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')"{color} After this modification, it worked well. was: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{c
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } As a result, the condition "update_time >= \{ts '2024-03-12 06:18:17'} will never hit the index. 
In my case, as a workaround, I changed the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date({color} { {color:#9876aa} dateFormat.format(timestampValue)},'-MM-dd HH:mi:ss')"{color} then it worked well. was: When I use spark's jdbc to pull data from oracle, it will not hit the index if the pushed filter's type in oralce is DATE. Here is my scenario: first I created a dataframe that reads from oracle: val df = spark.read.format("jdbc"). option("url", url). option("driver", driver). option("user", user). option("password", passwd). option("dbtable", "select * from foobar.tbl1") .load() then I apply a filter to the dataframe like this: df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', '-MM-dd HH:mm:ss') """).count() this will not hit the index on update_time column. Reason: The update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of DATE in Oracle is second). When I push a filter to Oracle, it triggers the following code in org.apache.spark.sql.jdbc.OracleDialect: // class is org.apache.spark.sql.jdbc.OracleDialect override def compileValue(value: Any): Any = value match { // The JDBC drivers support date literals in SQL statements written in the // format: {d '-mm-dd'} and timestamp literals in SQL statements written // in the format: \{ts '-mm-dd hh:mm:ss.f...'}. For details, see // 'Oracle Database JDBC Developer’s Guide and Reference, 11g Release 1 (11.1)' // Appendix A Reference Information. case stringValue: String => s"'${escapeSql(stringValue)}'" case timestampValue: Timestamp => "\{ts '" + timestampValue + "'}" case dateValue: Date => "\{d '" + dateValue + "'}" case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ") case _ => value } and this "update_time >= \{ts '2024-03-12 06:18:17'}" will never hit the index. 
In my case, as a work around, I just change the code to this: {color:#cc7832}case {color}timestampValue: Timestamp =>{color:#6a8759}s"{color}{color:#6a8759}to_date(
[jira] [Updated] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
[ https://issues.apache.org/jira/browse/SPARK-47542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danke Liu updated SPARK-47542: -- Description: When I use Spark's JDBC source to pull data from Oracle, the query will not hit the index if the pushed filter's column type in Oracle is DATE. Here is my scenario: first I created a dataframe that reads from Oracle:
{code:java}
val df = spark.read.format("jdbc").
  option("url", url).
  option("driver", driver).
  option("user", user).
  option("password", passwd).
  option("dbtable", "select * from foobar.tbl1").
  load(){code}
then I apply a filter to the dataframe like this:
{code:java}
df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', 'yyyy-MM-dd HH:mm:ss')""").count(){code}
this will not hit the index on the update_time column. Reason: the update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of Oracle's DATE is one second). When Spark pushes the filter down to Oracle, it goes through the following code:
{code:java}
// class is org.apache.spark.sql.jdbc.OracleDialect
override def compileValue(value: Any): Any = value match {
  // The JDBC drivers support date literals in SQL statements written in the
  // format: {d 'yyyy-mm-dd'} and timestamp literals in SQL statements written
  // in the format: {ts 'yyyy-mm-dd hh:mm:ss.f...'}. For details, see
  // 'Oracle Database JDBC Developer's Guide and Reference, 11g Release 1 (11.1)'
  // Appendix A Reference Information.
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "{ts '" + timestampValue + "'}"
  case dateValue: Date => "{d '" + dateValue + "'}"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}{code}
and the resulting predicate "update_time >= {ts '2024-03-12 06:18:17'}" will never hit the index.
In my case, as a workaround, I just changed that case to:
{code:java}
case timestampValue: Timestamp =>
  s"to_date('${dateFormat.format(timestampValue)}', 'yyyy-MM-dd HH:mi:ss')"{code}
and then it worked well.
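The reported fix boils down to rendering Timestamp literals as an Oracle to_date(...) expression instead of the JDBC {ts '...'} escape, so that comparing against a DATE column stays index-friendly. A minimal, self-contained sketch of that literal-compiling step (this is not Spark's actual OracleDialect; the object name, the dateFormat pattern, and the HH24:MI:SS format model are assumptions for illustration):

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

// Sketch of the workaround's Timestamp case: emit an Oracle to_date(...)
// expression instead of the JDBC escape {ts '...'}, which Oracle treats as a
// TIMESTAMP literal and so implicitly converts the DATE column, defeating its index.
object TimestampLiteralSketch {
  private val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  def compileValue(value: Any): Any = value match {
    case timestampValue: Timestamp =>
      s"to_date('${dateFormat.format(timestampValue)}', 'yyyy-MM-dd HH24:MI:SS')"
    case other => other // other literal types left as-is in this sketch
  }
}
```

With this in place, a pushed predicate would render as, e.g., update_time >= to_date('2024-03-12 06:18:17', 'yyyy-MM-dd HH24:MI:SS'), which Oracle can evaluate directly against the DATE index.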
[jira] [Updated] (SPARK-47541) Collated strings in complex types supporting operations reverse, array_join, concat, map
[ https://issues.apache.org/jira/browse/SPARK-47541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47541: --- Labels: pull-request-available (was: )
> Collated strings in complex types supporting operations reverse, array_join, concat, map
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-47541
> URL: https://issues.apache.org/jira/browse/SPARK-47541
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Nikola Mandic
> Priority: Major
> Labels: pull-request-available
>
> Add proper support for complex types containing collated strings in
> operations reverse, array_join, concat, map (create). Examples:
> {code:java}
> select reverse('abc' collate utf8_binary_lcase);
> select reverse(array('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase));
> select array_join(array('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase), ', ' collate utf8_binary_lcase);
> select concat('a' collate utf8_binary_lcase, 'b' collate utf8_binary_lcase);
> select map('a' collate utf8_binary_lcase, 1, 'A' collate utf8_binary_lcase, 2);{code}
[jira] [Created] (SPARK-47542) spark cannot hit oracle's index when column type is DATE
Danke Liu created SPARK-47542: - Summary: spark cannot hit oracle's index when column type is DATE Key: SPARK-47542 URL: https://issues.apache.org/jira/browse/SPARK-47542 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.4 Reporter: Danke Liu
When I use Spark's JDBC source to pull data from Oracle, the query will not hit the index if the pushed filter's column type in Oracle is DATE. Here is my scenario: first I created a dataframe that reads from Oracle:
{code:java}
val df = spark.read.format("jdbc").
  option("url", url).
  option("driver", driver).
  option("user", user).
  option("password", passwd).
  option("dbtable", "select * from foobar.tbl1").
  load(){code}
then I applied a filter to the dataframe like this:
{code:java}
df.filter("""`update_time` >= to_date('2024-03-12 06:18:17', 'yyyy-MM-dd HH:mm:ss')""").count(){code}
this will not hit the index on the update_time column. Reason: the update_time column in Oracle is of type DATE, which is mapped to Timestamp in Spark (because the precision of Oracle's DATE is one second), and when the filter is pushed to Oracle it goes through the code below:
{code:java}
// class is org.apache.spark.sql.jdbc.OracleDialect
override def compileValue(value: Any): Any = value match {
  // The JDBC drivers support date literals in SQL statements written in the
  // format: {d 'yyyy-mm-dd'} and timestamp literals in SQL statements written
  // in the format: {ts 'yyyy-mm-dd hh:mm:ss.f...'}. For details, see
  // 'Oracle Database JDBC Developer's Guide and Reference, 11g Release 1 (11.1)'
  // Appendix A Reference Information.
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "{ts '" + timestampValue + "'}"
  case dateValue: Date => "{d '" + dateValue + "'}"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}{code}
and the resulting predicate "update_time >= {ts '2024-03-12 06:18:17'}" will never hit the index.
In my case, as a workaround, I just changed the code to this:
{code:java}
case timestampValue: Timestamp =>
  s"to_date('${dateFormat.format(timestampValue)}', 'yyyy-MM-dd HH:mi:ss')"{code}
and then it worked well.
[jira] [Assigned] (SPARK-47539) Make the return value of method `castToString` be `Any => UTF8String`
[ https://issues.apache.org/jira/browse/SPARK-47539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47539: Assignee: BingKun Pan
> Make the return value of method `castToString` be `Any => UTF8String`
> ---------------------------------------------------------------------
>
> Key: SPARK-47539
> URL: https://issues.apache.org/jira/browse/SPARK-47539
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
[jira] [Resolved] (SPARK-47539) Make the return value of method `castToString` be `Any => UTF8String`
[ https://issues.apache.org/jira/browse/SPARK-47539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47539. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45688 [https://github.com/apache/spark/pull/45688]
> Make the return value of method `castToString` be `Any => UTF8String`
> ---------------------------------------------------------------------
>
> Key: SPARK-47539
> URL: https://issues.apache.org/jira/browse/SPARK-47539
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Assignee: BingKun Pan
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
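The SPARK-47539 change makes castToString return a function value of type Any => UTF8String: the converter is chosen once per source type and then applied per row. A rough sketch of that shape in plain Scala, with a stand-in UTF8String since the real class lives in org.apache.spark.unsafe.types (the object name and the type cases here are illustrative assumptions, not Spark's actual code):

```scala
// Stand-in for org.apache.spark.unsafe.types.UTF8String; only fromString is
// needed for this sketch.
final case class UTF8String(s: String) { override def toString: String = s }
object UTF8String { def fromString(s: String): UTF8String = UTF8String(s) }

// A castToString-style factory: resolves the converter once for a given source
// type and returns it as a reusable Any => UTF8String function value.
object CastSketch {
  def castToString(fromType: String): Any => UTF8String = fromType match {
    case "boolean" => v => UTF8String.fromString(if (v.asInstanceOf[Boolean]) "true" else "false")
    case "int"     => v => UTF8String.fromString(v.asInstanceOf[Int].toString)
    case _         => v => UTF8String.fromString(String.valueOf(v))
  }
}
```

The point of the `Any => UTF8String` return type is that the per-type dispatch happens when the function is built, not on every value converted.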