[jira] [Updated] (SPARK-47766) Extend spark 3.5.1 to support hadoop-client-api 3.4.0, hadoop-client-runtime-3.4.0
[ https://issues.apache.org/jira/browse/SPARK-47766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-47766:

Description:

We have some HIGH CVEs coming from hadoop-client-runtime 3.3.4, and hence we need to address them:

* com.fasterxml.jackson.core:jackson-databind causing *CVE-2022-42003* and *CVE-2022-42004* (org.apache.hadoop_hadoop-client-runtime-3.3.4.jar)
* com.google.protobuf:protobuf-java causing *CVE-2021-22569*, *CVE-2021-22570*, *CVE-2022-3509* and *CVE-2022-3510* (org.apache.hadoop_hadoop-client-runtime-3.3.4.jar)
* net.minidev:json-smart causing *CVE-2021-31684* and *CVE-2023-1370* (org.apache.hadoop_hadoop-client-runtime-3.3.4.jar)
* org.apache.avro:avro causing *CVE-2023-39410* (org.apache.hadoop_hadoop-client-runtime-3.3.4.jar)
* org.apache.commons:commons-compress causing *CVE-2024-25710* and *CVE-2024-26308* (org.apache.hadoop_hadoop-client-runtime-3.3.4.jar)

Most of these are fixed in hadoop-client-runtime 3.4.0. Is there a plan to support Hadoop 3.4.0?

was:

I have a data pipeline set up in such a way that it reads data from a Kafka source, does some transformation on the data using pyspark, then writes the output into a sink (Kafka, Redis, etc). My entire pipeline is written in SQL, so I wish to use the .sql() method to execute SQL on my streaming source directly. However, I'm running into the issue where my watermark is not being recognized by the downstream query via the .sql() method.

```
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> print(pyspark.__version__)
3.5.1
>>> from pyspark.sql import SparkSession
>>> session = SparkSession.builder \
...     .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1") \
...     .getOrCreate()
>>> from pyspark.sql.functions import col, from_json
>>> from pyspark.sql.types import StructField, StructType, TimestampType, LongType, DoubleType, IntegerType
>>> schema = StructType(
...     [
...         StructField('createTime', TimestampType(), True),
...         StructField('orderId', LongType(), True),
...         StructField('payAmount', DoubleType(), True),
...         StructField('payPlatform', IntegerType(), True),
...         StructField('provinceId', IntegerType(), True),
...     ])
>>> streaming_df = session.readStream \
...     .format("kafka") \
...     .option("kafka.bootstrap.servers", "localhost:9092") \
...     .option("subscribe", "payment_msg") \
...     .option("startingOffsets", "earliest") \
...     .load() \
...     .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
...     .select("parsed_value.*") \
...     .withWatermark("createTime", "10 seconds")
>>> streaming_df.createOrReplaceTempView("streaming_df")
>>> session.sql("""
...     SELECT
...         window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
...     FROM streaming_df
...     GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
...     ORDER BY window.start
... """) \
...     .writeStream \
...     .format("kafka") \
...     .option("checkpointLocation", "checkpoint") \
...     .option("kafka.bootstrap.servers", "localhost:9092") \
...     .option("topic", "sink") \
...     .start()
```

This throws the exception:

```
pyspark.errors.exceptions.captured.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark; line 6 pos 4;
```

> Extend spark 3.5.1 to support hadoop-client-api 3.4.0,
> hadoop-client-runtime-3.4.0
> --
>
> Key: SPARK-47766
> URL: https://issues.apache.org/jira/browse/SPARK-47766
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.1
> Reporter: Ramakrishna
> Priority: Blocker
> Labels: pull-request-available
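For anyone wanting to try the requested combination locally: Spark's Maven build lets you override the Hadoop client artifact version at build time. A minimal sketch, assuming a Spark 3.5.1 source checkout (whether 3.5.1 actually compiles and passes tests against Hadoop 3.4.0 is exactly what this ticket asks; the command only shows how the override is expressed):

```shell
# Rebuild Spark against Hadoop 3.4.0 client artifacts (run from the Spark
# source root). -Dhadoop.version selects hadoop-client-api/-runtime 3.4.0.
./build/mvn -DskipTests -Dhadoop.version=3.4.0 clean package
```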
[jira] [Created] (SPARK-47766) Extend spark 3.5.1 to support hadoop-client-api 3.4.0, hadoop-client-runtime-3.4.0
Ramakrishna created SPARK-47766:
---
Summary: Extend spark 3.5.1 to support hadoop-client-api 3.4.0, hadoop-client-runtime-3.4.0
Key: SPARK-47766
URL: https://issues.apache.org/jira/browse/SPARK-47766
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.5.1
Reporter: Ramakrishna

I have a data pipeline set up in such a way that it reads data from a Kafka source, does some transformation on the data using pyspark, then writes the output into a sink (Kafka, Redis, etc). My entire pipeline is written in SQL, so I wish to use the .sql() method to execute SQL on my streaming source directly. However, I'm running into the issue where my watermark is not being recognized by the downstream query via the .sql() method.

```
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> print(pyspark.__version__)
3.5.1
>>> from pyspark.sql import SparkSession
>>> session = SparkSession.builder \
...     .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1") \
...     .getOrCreate()
>>> from pyspark.sql.functions import col, from_json
>>> from pyspark.sql.types import StructField, StructType, TimestampType, LongType, DoubleType, IntegerType
>>> schema = StructType(
...     [
...         StructField('createTime', TimestampType(), True),
...         StructField('orderId', LongType(), True),
...         StructField('payAmount', DoubleType(), True),
...         StructField('payPlatform', IntegerType(), True),
...         StructField('provinceId', IntegerType(), True),
...     ])
>>> streaming_df = session.readStream \
...     .format("kafka") \
...     .option("kafka.bootstrap.servers", "localhost:9092") \
...     .option("subscribe", "payment_msg") \
...     .option("startingOffsets", "earliest") \
...     .load() \
...     .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
...     .select("parsed_value.*") \
...     .withWatermark("createTime", "10 seconds")
>>> streaming_df.createOrReplaceTempView("streaming_df")
>>> session.sql("""
...     SELECT
...         window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
...     FROM streaming_df
...     GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
...     ORDER BY window.start
... """) \
...     .writeStream \
...     .format("kafka") \
...     .option("checkpointLocation", "checkpoint") \
...     .option("kafka.bootstrap.servers", "localhost:9092") \
...     .option("topic", "sink") \
...     .start()
```

This throws the exception:

```
pyspark.errors.exceptions.captured.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark; line 6 pos 4;
```

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
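A possible workaround for the AnalysisException above, sketched under the assumption that the watermark set via .withWatermark() is simply not carried through the temp view into session.sql(): keep the aggregation in the DataFrame API, where the watermark stays attached, and drop the ORDER BY (sorting a streaming aggregation is only supported in complete output mode). Untested here, since it needs a live Kafka broker; `streaming_df` is the watermarked stream from the snippet above.

```python
# Sketch: same aggregation as the SQL query, expressed on the watermarked
# DataFrame directly so append mode can use the createTime watermark.
from pyspark.sql.functions import col, window, sum as sum_

agg_df = (
    streaming_df  # the stream with .withWatermark("createTime", "10 seconds")
    .groupBy(col("provinceId"), window(col("createTime"), "1 hour", "30 minutes"))
    .agg(sum_("payAmount").alias("totalPayAmount"))
    .select("window.start", "window.end", "provinceId", "totalPayAmount")
)

query = (
    agg_df.writeStream
    .format("kafka")
    .option("checkpointLocation", "checkpoint")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "sink")
    .start()
)
```

Here append mode is satisfied because the groupBy operates on the watermarked createTime column directly, rather than on columns resolved through the temp view.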
[jira] [Commented] (SPARK-40782) Upgrade Jackson-databind to 2.13.4.1
[ https://issues.apache.org/jira/browse/SPARK-40782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834801#comment-17834801 ] Ramakrishna commented on SPARK-40782:
-
Hi, this still seems to be an issue as a transitive dependency in hadoop. The scanner reports:

com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | HIGH | installed: 2.12.7 | fixed: 2.12.7.1, 2.13.4.2 | jackson-databind: deep wrapper array nesting wrt `UNWRAP_SINGLE_VALUE_ARRAYS` | (org.apache.hadoop_hadoop-client-runtime-3.3.4.jar)

Is there a fix for this?

> Upgrade Jackson-databind to 2.13.4.1
>
> Key: SPARK-40782
> URL: https://issues.apache.org/jira/browse/SPARK-40782
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.4.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Minor
> Fix For: 3.3.1, 3.4.0
>
> #3590: Add check in primitive value deserializers to avoid deep wrapper array
> nesting wrt `UNWRAP_SINGLE_VALUE_ARRAYS` [CVE-2022-42003]
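To see where the flagged jackson-databind classes come from in a downstream build, the standard Maven dependency plugin can help. Note that hadoop-client-runtime is a shaded jar, so scanners flag the copy embedded inside it rather than a separate artifact, and a plain version pin in your own pom will not patch those embedded classes:

```shell
# Show which artifacts (if any) pull jackson-databind in directly; the copy
# inside hadoop-client-runtime-3.3.4.jar is shaded and will not appear here.
mvn dependency:tree -Dincludes=com.fasterxml.jackson.core:jackson-databind
```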
[jira] [Comment Edited] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
[ https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819249#comment-17819249 ] Ramakrishna edited comment on SPARK-21595 at 2/21/24 1:34 PM:
--
[~Rakesh_Shah] How did you manage to solve this? I am getting this in my streaming query; it does aggregations similar to other streaming queries in the same job. However it fails and I get:

{"timestamp":"21/02/2024 07:11:35","logLevel":"ERROR","class":"MapOutputTracker","thread":"Executor task launch worker for task 25.0 in stage 2.1 (TID 75)","message":"Missing an output location for shuffle 5 partition 35"}

Can you please help? [~tejasp] Can you please help? My spark version is 3.4.0

was (Author: hande):
[~Rakesh_Shah] How did you manage to solve this? I am getting this in my streaming query; it does aggregations similar to other streaming queries in the same job. However it fails and I get:

{"timestamp":"21/02/2024 07:11:35","logLevel":"ERROR","class":"MapOutputTracker","thread":"Executor task launch worker for task 25.0 in stage 2.1 (TID 75)","message":"Missing an output location for shuffle 5 partition 35"}

Can you please help? [~tejasp] Can you please help?

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2
> breaks existing workflow
> -
>
> Key: SPARK-21595
> URL: https://issues.apache.org/jira/browse/SPARK-21595
> Project: Spark
> Issue Type: Bug
> Components: Documentation, PySpark
> Affects Versions: 2.2.0
> Environment: pyspark on linux
> Reporter: Stephan Reiling
> Assignee: Tejas Patil
> Priority: Minor
> Labels: documentation, regression
> Fix For: 2.2.1, 2.3.0
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
>     'association_idx',
>     sqlf.row_number().over(
>         Window.orderBy('uid1', 'uid2')
>     )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates
> one large window for the whole dataframe to sort over.
> In spark 2.1 this works without problem; in spark 2.2 this fails either with
> an out of memory exception or a too many open files exception, depending on
> memory settings (which is what I tried first to fix this).
> Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2
> creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of
> 64.1 MB to disk (0 time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of
> 64.1 MB to disk (1 time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill
> threshold of 4096 rows, switching to
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> Which allowed me to track down the issue.
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold 2097152
> {code}
> I got it to work again, and with the same performance as spark 2.1.
> I have workflows where I use windowing functions that do not fail, but took a
> performance hit due to the excessive spilling when using the default of 4096.
> I think to make it easier to track down these issues this config variable
> should be included in the configuration documentation.
> Maybe 4096 is too small a default value?
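The file counts in the quoted report line up with the spill threshold: if each threshold crossing produces roughly one spill file (a simplification of what UnsafeExternalSorter actually does), a 450M-row window sort spills about rows/threshold times. A back-of-the-envelope sketch:

```python
# Rough estimate of spill-file count for a single large window sort,
# assuming one spill file per threshold crossing (a simplification).
import math

rows = 450_000_000  # size of the reporter's dataframe


def spill_files(rows: int, threshold: int) -> int:
    """One spill file each time the in-memory buffer hits the threshold."""
    return math.ceil(rows / threshold)


print(spill_files(rows, 4096))       # Spark 2.2 default -> 109864 (~110,000 files)
print(spill_files(rows, 2_097_152))  # reporter's override -> 215
```

The estimate reproduces both observations in the issue: >110,000 spill files with the 4096 default, and file counts back in the low hundreds after raising `spark.sql.windowExec.buffer.spill.threshold` to 2097152 (at the cost of more executor memory).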
[jira] [Comment Edited] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746516#comment-17746516 ] Ramakrishna edited comment on SPARK-44152 at 7/24/23 4:06 PM:
--
Hello [~sdehaes] It should work if you copy the jar to the /usr/local/bin folder of your docker container. It worked for us.

was (Author: hande):
Hello [~sdehaes] It should work if you copy the jar to the /usr/local/bin folder. It worked for us.

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main"
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> I have a spark application that is deployed using k8s and it is of version
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my
> application jar is built on spark 3.4.0.
> However, while deploying, I get this error:
>
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*
>
> I have this in the deployment.yaml of the app:
>
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>
> and I have not changed anything related to that. I see that some code has
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the
> same issue as me? Should the path be specified in a different way?
[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739627#comment-17739627 ] Ramakrishna commented on SPARK-44152:
-
Hi [~iainm] Yes, I probably spent a bit longer understanding what was happening, because from the error it did not look like a permission issue; it just says the jar was not found. More detailed documentation would help, especially for migrating from 3.3.2 to 3.4.0.

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main"
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> I have a spark application that is deployed using k8s and it is of version
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my
> application jar is built on spark 3.4.0.
> However, while deploying, I get this error:
>
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*
>
> I have this in the deployment.yaml of the app:
>
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>
> and I have not changed anything related to that. I see that some code has
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the
> same issue as me? Should the path be specified in a different way?
[jira] [Commented] (SPARK-44152) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736404#comment-17736404 ] Ramakrishna commented on SPARK-44152:
-
[~gurwls223] Is this an issue in spark 3.4.0? At least I am facing this issue, with all other constraints remaining the same.

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main"
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44152
> URL: https://issues.apache.org/jira/browse/SPARK-44152
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> I have a spark application that is deployed using k8s and it is of version
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my
> application jar is built on spark 3.4.0.
> However, while deploying, I get this error:
>
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*
>
> I have this in the deployment.yaml of the app:
>
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>
> and I have not changed anything related to that. I see that some code has
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the
> same issue as me? Should the path be specified in a different way?
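Per the comments above, what worked was baking the jar into the image and pointing mainApplicationFile at it with an absolute path. A hedged sketch of the relevant fragments; the image name and the spark-on-k8s-operator-style CRD layout are assumptions for illustration, not from the original report:

```yaml
# Dockerfile side (conceptually): COPY target/spark-assembly-1.0.jar /usr/local/bin/
# SparkApplication fragment; note the absolute local:/// path instead of the
# scheme-relative "local:spark-assembly-1.0.jar" from the report.
spec:
  image: my-registry/my-spark-app:3.4.0   # assumption: your image tag
  mainApplicationFile: "local:///usr/local/bin/spark-assembly-1.0.jar"
```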
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135:

Description:

I have a spark application that is deployed using k8s and it is of version 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my application jar is built on spark 3.4.0.
However, while deploying, I get this error:

*{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*

I have this in the deployment.yaml of the app:

*mainApplicationFile: "local:spark-assembly-1.0.jar"*

and I have not changed anything related to that. I see that some code has changed in spark 3.4.0 core's source code regarding jar location. Has it really changed the functionality? Is there anyone who is facing the same issue as me? Should the path be specified in a different way?

was: (same description, with trailing whitespace removed)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main"
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Ramakrishna
> Priority: Blocker
>
> I have a spark application that is deployed using k8s and it is of version
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my
> application jar is built on spark 3.4.0.
> However, while deploying, I get this error:
>
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*
>
> I have this in the deployment.yaml of the app:
>
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>
> and I have not changed anything related to that. I see that some code has
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the
> same issue as me? Should the path be specified in a different way?
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135:

Priority: Blocker (was: Critical)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main"
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Ramakrishna
> Priority: Blocker
>
> I have a spark application that is deployed using k8s and it is of version
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my
> application jar is built on spark 3.4.0.
> However, while deploying, I get this error:
>
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*
>
> I have this in the deployment.yaml of the app:
>
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>
> and I have not changed anything related to that. I see that some code has
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the
> same issue as me? Should the path be specified in a different way?
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135:

Issue Type: Bug (was: Improvement)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main"
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Ramakrishna
> Priority: Critical
>
> I have a spark application that is deployed using k8s and it is of version
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my
> application jar is built on spark 3.4.0.
> However, while deploying, I get this error:
>
> *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}*
>
> I have this in the deployment.yaml of the app:
>
> *mainApplicationFile: "local:spark-assembly-1.0.jar"*
>
> and I have not changed anything related to that. I see that some code has
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the
> same issue as me? Should the path be specified in a different way?
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135:

Description:

I have a spark application that is deployed using k8s and it is of version 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my application jar is built on spark 3.4.0.
However, while deploying, I get this error:

{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}

I have this in the deployment.yaml of the app:

{{mainApplicationFile: "local:spark-assembly-1.0.jar"}}

and I have not changed anything related to that. I see that some code has changed in spark 3.4.0 core's source code regarding jar location. Has it really changed the functionality? Is there anyone who is facing the same issue as me? Should the path be specified in a different way?

was: (same description, with stray empty {{}} markers removed)

> Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main"
> java.nio.file.NoSuchFileException: , although jar is present in the location
> ---
>
> Key: SPARK-44135
> URL: https://issues.apache.org/jira/browse/SPARK-44135
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Ramakrishna
> Priority: Critical
>
> I have a spark application that is deployed using k8s and it is of version
> 3.3.2. Recently there were some vulnerabilities in spark 3.3.2.
> I changed my dockerfile to download 3.4.0 instead of 3.3.2, and my
> application jar is built on spark 3.4.0.
> However, while deploying, I get this error:
>
> {{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}
>
> I have this in the deployment.yaml of the app:
>
> {{mainApplicationFile: "local:spark-assembly-1.0.jar"}}
>
> and I have not changed anything related to that. I see that some code has
> changed in spark 3.4.0 core's source code regarding jar location.
> Has it really changed the functionality? Is there anyone who is facing the
> same issue as me? Should the path be specified in a different way?
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135: Description: I have a Spark application deployed on Kubernetes, running version 3.3.2. Recently some vulnerabilities were reported in Spark 3.3.2, so I changed my Dockerfile to download 3.4.0 instead of 3.3.2; my application jar is also built against Spark 3.4.0. However, while deploying I get this error: *{{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}}* I have this in the app's deployment.yaml: *mainApplicationFile: "local:spark-assembly-1.0.jar"* and I have not changed anything related to that. I see that some code regarding jar location has changed in Spark 3.4.0 core's source. Has the functionality really changed? Is anyone else facing the same issue? Should the path be specified in a different way?
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135: Component/s: Spark Core (was: Shuffle)
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135: Affects Version/s: (was: 3.2.0)
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135: Description: I have a Spark application deployed on Kubernetes, running version 3.3.2. Recently some vulnerabilities were reported in Spark 3.3.2, so I changed my Dockerfile to download 3.4.0 instead of 3.3.2; my application jar is also built against Spark 3.4.0. However, while deploying I get this error: {{Exception in thread "main" java.nio.file.NoSuchFileException: /spark-assembly-1.0.jar}} I have this in the app's deployment.yaml: {{mainApplicationFile: "local:spark-assembly-1.0.jar"}} and I have not changed anything related to that. I see that some code regarding jar location has changed in Spark 3.4.0 core's source. Has the functionality really changed? Is anyone else facing the same issue? Should the path be specified in a different way?
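A note on the path format asked about above: in Spark-on-Kubernetes deployments the `local:` scheme conventionally carries an absolute path inside the container image, i.e. `local:///...` rather than `local:jar-name`. A hedged sketch of the manifest entry under that assumption follows; the `/opt/spark/jars/` location is illustrative, not taken from the report, and whether this resolves the 3.4.0 regression described here is not confirmed.

```yaml
# Hypothetical SparkApplication fragment; only the mainApplicationFile line
# relates to the issue, and the in-image jar path is an assumed example.
spec:
  mainApplicationFile: "local:///opt/spark/jars/spark-assembly-1.0.jar"
```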
[jira] [Updated] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
[ https://issues.apache.org/jira/browse/SPARK-44135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-44135: Description: (was: In our production environment, _finalizeShuffleMerge_ processing took longer (p90 around 20s) than other RPC requests. This is because _finalizeShuffleMerge_ invokes IO operations such as truncate and file open/close. More importantly, processing _finalizeShuffleMerge_ can block other critical lightweight messages such as authentications, which can cause authentication timeouts as well as fetch failures. Those timeouts and fetch failures affect the stability of Spark job executions.)
[jira] [Created] (SPARK-44135) Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location
Ramakrishna created SPARK-44135: --- Summary: Upgrade to spark 3.4.0 from 3.3.2 gives Exception in thread "main" java.nio.file.NoSuchFileException: , although jar is present in the location Key: SPARK-44135 URL: https://issues.apache.org/jira/browse/SPARK-44135 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.2.0, 3.4.0 Reporter: Ramakrishna In our production environment, _finalizeShuffleMerge_ processing took longer (p90 around 20s) than other RPC requests. This is because _finalizeShuffleMerge_ invokes IO operations such as truncate and file open/close. More importantly, processing _finalizeShuffleMerge_ can block other critical lightweight messages such as authentications, which can cause authentication timeouts as well as fetch failures. Those timeouts and fetch failures affect the stability of Spark job executions.
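The blocking pattern described above (slow file IO handled on the same thread as lightweight RPC messages) is commonly mitigated by offloading the heavyweight handler to a dedicated pool. Below is a minimal sketch of that pattern only, not Spark's actual shuffle-server code; every name in it is illustrative.

```python
import concurrent.futures

# Heavyweight finalizeShuffleMerge-style work goes to its own pool, so the
# RPC thread stays free to answer lightweight messages such as auth.
finalize_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_message(msg: dict, results: list) -> None:
    if msg["type"] == "finalizeShuffleMerge":
        # Blocking IO (truncate, file open/close) runs off the RPC thread.
        finalize_pool.submit(lambda: results.append(("finalized", msg["id"])))
    else:
        # Lightweight messages are answered inline, never queued behind slow IO.
        results.append(("handled", msg["type"]))
```

With this split, an authentication request is acknowledged immediately even while several merge finalizations are still grinding through file operations in the pool.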
[jira] [Commented] (SPARK-41298) Getting Count on data frame is giving the performance issue
[ https://issues.apache.org/jira/browse/SPARK-41298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644164#comment-17644164 ] Ramakrishna commented on SPARK-41298: - Can someone please check this behavior and update me as soon as possible? > Getting Count on data frame is giving the performance issue > --- > > Key: SPARK-41298 > URL: https://issues.apache.org/jira/browse/SPARK-41298 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.4.4 > Reporter: Ramakrishna > Priority: Major > > We are invoking the below query on Teradata: > 1) Dataset<Row> df = spark.read().format("jdbc"). . . load(); > 2) long count = df.count(); > When we execute df.count(), Spark internally issues the query below on Teradata, which wastes a lot of CPU on Teradata, and the DBAs are raising concerns about it. > Query: SELECT 1 FROM ()SPARK_SUB_TAB > Response: 1 1 1 1 1 .. 1 > Is this expected behavior from Spark, or is it a bug?
[jira] [Updated] (SPARK-41298) Getting Count on data frame is giving the performance issue
[ https://issues.apache.org/jira/browse/SPARK-41298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-41298: Description: We are invoking the below query on Teradata: 1) Dataset<Row> df = spark.read().format("jdbc"). . . load(); 2) long count = df.count(); When we execute df.count(), Spark internally issues the query below on Teradata, which wastes a lot of CPU on Teradata, and the DBAs are raising concerns about it. Query: SELECT 1 FROM ()SPARK_SUB_TAB Response: 1 1 1 1 1 .. 1 Is this expected behavior from Spark, or is it a bug?
[jira] [Created] (SPARK-41298) Getting Count on data frame is giving the performance issue
Ramakrishna created SPARK-41298: --- Summary: Getting Count on data frame is giving the performance issue Key: SPARK-41298 URL: https://issues.apache.org/jira/browse/SPARK-41298 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Ramakrishna We are invoking the below query on Teradata: 1) Dataset<Row> df = spark.read().format("jdbc"). . . load(); 2) long count = df.count(); When we execute df.count(), Spark internally issues the query below on Teradata, which wastes a lot of CPU on Teradata, and the DBAs are raising concerns about it. Query: SELECT 1 FROM ()SPARK_SUB_TAB Response: 1 1 1 1 1 .. 1 Is this expected behavior from Spark, or is it a bug?
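The `SELECT 1 FROM (...) SPARK_SUB_TAB` pattern the DBAs observed is how Spark counts a JDBC dataframe: column pruning reduces the plan to a constant-1 column, one row per source row is fetched, and the rows are counted on the Spark side. One common workaround, sketched here with a made-up helper name, is to push the aggregation into the database and read back a single row instead:

```python
def count_pushdown_query(table_query: str) -> str:
    """Wrap a source query so the database computes the row count itself,
    rather than Spark pulling one constant-1 row per source row."""
    return f"(SELECT COUNT(*) AS cnt FROM ({table_query}) t) c"

# The wrapped string would then be passed as the JDBC "dbtable" option,
# e.g. spark.read.format("jdbc").option("dbtable", wrapped).load(),
# and the single returned row carries the count.
print(count_pushdown_query("SELECT * FROM orders"))
```

This trades a full-table constant scan for one server-side aggregate, which is usually far cheaper for the database.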
[jira] [Resolved] (SPARK-41070) Performance issue when Spark SQL connects with TeraData
[ https://issues.apache.org/jira/browse/SPARK-41070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna resolved SPARK-41070. - Resolution: Done > Performance issue when Spark SQL connects with TeraData > > > Key: SPARK-41070 > URL: https://issues.apache.org/jira/browse/SPARK-41070 > Project: Spark > Issue Type: Question > Components: Spark Core, SQL > Affects Versions: 2.4.4 > Reporter: Ramakrishna > Priority: Major > > We are connecting to Teradata from Spark SQL with the below API: > {color:#ff8b00}Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionProperties);{color} > We are facing an issue: when we execute the above logic on a large table with a million rows, we see the extra query below executing every time, and it results in a performance hit on the DB. > We got this information from the DBA; we don't have any logs on the Spark SQL side. > SELECT 1 FROM ONE_MILLION_ROWS_TABLE; > |1| |1| |1| |1| |1| |1| |1| |1| |1| |1| > Can you please clarify why this query executes, and whether there is any chance this type of query is issued from our own code while checking the row count from the dataframe? > Please provide your inputs on this.
[jira] [Updated] (SPARK-41170) Performance issue when Spark SQL connects with TeraData
[ https://issues.apache.org/jira/browse/SPARK-41170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-41170: Summary: Performance issue when Spark SQL connects with TeraData (was: CLONE - Performance issue when Spark SQL connects with TeraData )
[jira] [Created] (SPARK-41170) CLONE - Performance issue when Spark SQL connects with TeraData
Ramakrishna created SPARK-41170: --- Summary: CLONE - Performance issue when Spark SQL connects with TeraData Key: SPARK-41170 URL: https://issues.apache.org/jira/browse/SPARK-41170 Project: Spark Issue Type: Question Components: Spark Core, SQL Affects Versions: 2.4.4 Reporter: Ramakrishna We are connecting to Teradata from Spark SQL with the below API: {color:#ff8b00}Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionProperties);{color} We are facing an issue: when we execute the above logic on a large table with a million rows, we see the extra query below executing every time, and it results in a performance hit on the DB. We got this information from the DBA; we don't have any logs on the Spark SQL side. SELECT 1 FROM ONE_MILLION_ROWS_TABLE; |1| |1| |1| |1| |1| |1| |1| |1| |1| |1| Can you please clarify why this query executes, and whether there is any chance this type of query is issued from our own code while checking the row count from the dataframe? Please provide your inputs on this.
[jira] [Commented] (SPARK-41070) Performance issue when Spark SQL connects with TeraData
[ https://issues.apache.org/jira/browse/SPARK-41070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635223#comment-17635223 ] Ramakrishna commented on SPARK-41070: - I converted the issue to a question.
[jira] [Reopened] (SPARK-41070) Performance issue when Spark SQL connects with TeraData
[ https://issues.apache.org/jira/browse/SPARK-41070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna reopened SPARK-41070: -
[jira] [Updated] (SPARK-41070) Performance issue when Spark SQL connects with TeraData
[ https://issues.apache.org/jira/browse/SPARK-41070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-41070: Issue Type: Question (was: Bug)
[jira] [Commented] (SPARK-41070) Performance issue when Spark SQL connects with TeraData
[ https://issues.apache.org/jira/browse/SPARK-41070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635083#comment-17635083 ] Ramakrishna commented on SPARK-41070: - Do I need to raise a ticket for this?
[jira] [Updated] (SPARK-41070) Performance issue when Spark SQL connects with TeraData
[ https://issues.apache.org/jira/browse/SPARK-41070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-41070: Component/s: SQL
[jira] [Updated] (SPARK-41070) Performance issue when Spark SQL connects with TeraData
[ https://issues.apache.org/jira/browse/SPARK-41070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna updated SPARK-41070: Description: We are connecting to Teradata from Spark SQL with the below API: {color:#ff8b00}Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionProperties);{color} We are facing an issue: when we execute the above logic on a large table with a million rows, we see the extra query below executing every time, and it results in a performance hit on the DB. We got this information from the DBA; we don't have any logs on the Spark SQL side. SELECT 1 FROM ONE_MILLION_ROWS_TABLE; |1| |1| |1| |1| |1| |1| |1| |1| |1| |1| Can you please clarify why this query executes, and whether there is any chance this type of query is issued from our own code while checking the row count from the dataframe? Please provide your inputs on this.
[jira] [Created] (SPARK-41070) Performance issue when Spark SQL connects with TeraData
Ramakrishna created SPARK-41070:
-----------------------------------
             Summary: Performance issue when Spark SQL connects with TeraData
                 Key: SPARK-41070
                 URL: https://issues.apache.org/jira/browse/SPARK-41070
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: Ramakrishna

We are connecting to Teradata from Spark SQL with the API below:

Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionProperties);

When we execute this logic on a large table with a million rows, we see the extra query below executed every time, which results in a performance hit on the database. We got this information from the DBA; we don't have any logs on the Spark SQL side.

SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
|1|
|1|
|1|
|1|
|1|
|1|
|1|
|1|
|1|
|1|

Can you please clarify why this query is executed, or whether there is any chance that such a query is issued from our own code while checking the row count of the DataFrame?

Please provide your inputs on this.
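As background for the report above: Spark's JDBC source accepts a parenthesized subquery with an alias anywhere a table name is expected, which is one common way to keep whatever probe queries Spark issues bounded instead of letting them touch the whole table. A minimal sketch, assuming a Teradata-reachable JDBC URL; the helper name and the `spark_probe` alias are illustrative, not part of Spark's API:

```python
# Sketch: wrap the target table in a bounded subquery so queries Spark
# issues against the "table" run over a restricted result set.
# bounded_table_query is a hypothetical helper, not a Spark function.
def bounded_table_query(table, predicate):
    return "(SELECT * FROM {} WHERE {}) AS spark_probe".format(table, predicate)

query = bounded_table_query("ONE_MILLION_ROWS_TABLE", "1=0")
# spark.read.jdbc(connectionUrl, query, connectionProperties) would then
# resolve the schema against an empty result set rather than the full table.
```

Whether this avoids the specific `SELECT 1 FROM ...` probe seen by the DBA depends on which internal check issues it (e.g. a dialect's table-existence query), so this is a workaround sketch rather than a confirmed fix.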
[jira] [Created] (SPARK-13585) addPyFile behavior change between 1.6 and before
Santhosh Gorantla Ramakrishna created SPARK-13585:
-----------------------------------------------------
             Summary: addPyFile behavior change between 1.6 and before
                 Key: SPARK-13585
                 URL: https://issues.apache.org/jira/browse/SPARK-13585
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.0
            Reporter: Santhosh Gorantla Ramakrishna
            Priority: Minor

addPyFile in earlier versions would remove the .py file if it already existed. In 1.6, it throws an exception: "__.py exists and does not match contents of __.py". This might be because the underlying Scala code takes an overwrite parameter that defaults to false when called from Python:

private def copyFile(
    url: String,
    sourceFile: File,
    destFile: File,
    fileOverwrite: Boolean,
    removeSourceFile: Boolean = false): Unit = {

It would be good if addPyFile took a parameter to set the overwrite flag, defaulting to false.
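The behavior change described above can be sketched in plain Python. This is an illustration of the reported semantics under the assumption stated in the report (an overwrite flag defaulting to false), not Spark's actual implementation; `copy_file` is a hypothetical stand-in for the Scala `copyFile`:

```python
import os
import shutil

def copy_file(source, dest, overwrite=False):
    """Copy source to dest, refusing to clobber a differing existing file
    unless overwrite=True -- mirroring the exception addPyFile raises in 1.6."""
    if os.path.exists(dest):
        with open(source, "rb") as s, open(dest, "rb") as d:
            if s.read() == d.read():
                return  # identical file already present; nothing to do
        if not overwrite:
            raise RuntimeError(
                "%s exists and does not match contents of %s" % (dest, source))
        os.remove(dest)
    shutil.copyfile(source, dest)
```

With `overwrite=True` this reproduces the pre-1.6 behavior of replacing the stale file; with the default `overwrite=False` it reproduces the 1.6 exception.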